Motivation
I want a few DiGraph datasets that look like an “interesting” dependency graph (for some definition of “interesting”), to play around with graph algorithms.
I need one graph small enough for quick iteration and testing, and another large enough for performance to matter, but still small enough to process locally on a single machine.
OSS dependencies datasets from deps.dev
deps.dev is a nice service to query dependencies between OSS packages across multiple ecosystems. They publish BigQuery datasets that can be used for batch analysis!
Google Cloud docs on using BigQuery public datasets: https://cloud.google.com/bigquery/public-data
deps.dev dataset page on the BigQuery datasets marketplace: https://console.cloud.google.com/marketplace/product/bigquery-public-data/deps-dev
I wrote this BigQuery query to get all direct dependency edges for packages in an ecosystem Sys (inspired by the “Dependent count” section of the sample queries):
This query with Sys='PYPI' processed 9.94 GB and produced 1,087,123 edges, which is a fairly small graph, for fast loading and iteration when playing with the data later. Since the data is smaller than 1 GB, I could export it as a single file to Google Drive (from the “Save Results” menu in BigQuery), download from Google Drive, and upload to GitHub.
This query does the same but with Sys='NPM', which is a significantly larger ecosystem than PyPI. The query processed close to 1 TB (!) and produced 22,699,790 edges, which should be interesting enough for large-ish data analysis (still small enough to do locally, but large enough that performance matters). I could not export it to Google Drive (the result is larger than the 1 GB single-file export limit), so I saved the results as a new BigQuery table (from the “Save Results” menu in BigQuery), and then exported that table to a Google Cloud Storage bucket (export docs) by providing a path that ends with a * (i.e., gs://itamaro_depsdev_datasets/npm_deps1/20240804/*). The export created 10 shards of about 105 MB each.
I downloaded the sharded dataset from GCS:
I can’t upload these as-is to GitHub because of GitHub’s file size limits (files over 100 MB must use Git LFS, which I’d rather avoid), so I merged them and re-sharded into smaller shards that GitHub can handle (https://github.com/itamaro/fun-with-digraphs/tree/main/data/deps.dev/npm/deps1/20240804):
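The merge-and-re-shard step can be sketched roughly like this in Python. This is not necessarily the exact command I ran, just a minimal illustration. It assumes every input shard is a CSV file with the same header line and that no field contains an embedded newline (reasonable for an edge list of package names and versions). The function name reshard_csv, the shard-NNNNN.csv naming scheme, and the 50 MB default limit are all made up for this example.

```python
from pathlib import Path


def reshard_csv(input_paths, output_dir, max_bytes=50 * 1024 * 1024, prefix="shard"):
    """Merge CSV shards that share a header and rewrite them as smaller shards.

    Every output shard repeats the header line and stays at or under max_bytes
    (unless a single row plus the header is already larger than the limit).
    Splitting is line-based, so fields must not contain embedded newlines.
    Returns the list of output paths, in order.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []   # output shard paths created so far
    header = None  # header line taken from the first input shard
    out = None     # currently open output file
    size = 0       # bytes written to the current output shard

    for input_path in input_paths:
        with open(input_path, newline="", encoding="utf-8") as src:
            first = src.readline()
            if header is None:
                header = first
            elif first != header:
                raise ValueError(f"{input_path} has a different header: {first!r}")
            for line in src:
                n = len(line.encode("utf-8"))
                if out is None or size + n > max_bytes:
                    # Start a new shard: close the previous one, write the header.
                    if out is not None:
                        out.close()
                    path = output_dir / f"{prefix}-{len(written):05d}.csv"
                    out = open(path, "w", newline="", encoding="utf-8")
                    out.write(header)
                    size = len(header.encode("utf-8"))
                    written.append(path)
                out.write(line)
                size += n

    if out is not None:
        out.close()
    return written
```

Calling it with the sorted list of downloaded shard files and an output directory produces files that each stay under the chosen limit, which only needs to be comfortably below GitHub's 100 MB cap.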
Patent citation network dataset from SNAP
This one is not a “dependency graph” (in the sense of software dependencies), but has a similar structure and might be interesting to play with and compare.
SNAP stands for “Stanford Network Analysis Project” (https://snap.stanford.edu/index.html), and provides lots of network datasets.
The Patent citations network dataset contains 3,774,768 patents and 16,518,948 edges that form a directed graph, with an edge from patent X to patent Y meaning that patent X cites patent Y.
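To make the edge direction concrete, here is a tiny Python sketch that builds successor and predecessor maps from (citing, cited) pairs. The patent IDs "X", "Y" and "Z" are made up, and build_digraph is a hypothetical helper for illustration only, not part of SNAP or of any library.

```python
from collections import defaultdict


def build_digraph(edges):
    """Build successor and predecessor maps from (source, target) pairs."""
    successors = defaultdict(set)    # node -> nodes it points to (patents it cites)
    predecessors = defaultdict(set)  # node -> nodes pointing to it (patents citing it)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    return successors, predecessors


# Made-up example: X cites Y and Z, and Z cites Y.
edges = [("X", "Y"), ("X", "Z"), ("Z", "Y")]
successors, predecessors = build_digraph(edges)

cited_by_x = successors["X"]   # patents that X cites
citing_y = predecessors["Y"]   # patents that cite Y
```

In this representation, the in-degree of a patent (the size of its predecessor set) is its citation count.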
As with the npm dataset, the single-file dataset exceeds the 100 MB file size limit, so I went through a similar sharding exercise. It’s also in a space-separated format, so I used this opportunity to convert it to CSV. Processed dataset here: https://github.com/itamaro/fun-with-digraphs/tree/main/data/snap/patents-cit
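The conversion to CSV can be sketched along these lines in Python. This is a minimal sketch, not necessarily what I actually ran. It assumes each data line holds exactly two whitespace-separated node IDs and that lines starting with # are comments to skip. The function name edge_list_to_csv and the column names "from" and "to" are hypothetical choices for this example.

```python
import csv


def edge_list_to_csv(src_path, dst_path, header=("from", "to")):
    """Convert a whitespace-separated edge list into a two-column CSV file.

    Blank lines and lines starting with '#' are skipped.
    Returns the number of edges written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(header)
        for line in src:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # split() with no arguments handles both spaces and tabs;
            # unpacking raises ValueError if a line does not have two fields.
            source, target = line.split()
            writer.writerow((source, target))
            count += 1
    return count
```

The resulting CSV can then be split into shards the same way as the npm dataset.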
Source: J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.