forked from fatiando/pooch
JOSS paper about Pooch (fatiando#116)
This is the paper we will submit to the Journal of Open Source Software. All authors have reviewed the paper and agreed to be listed on it. Fixes fatiando#112
Showing 2 changed files with 236 additions and 0 deletions.
@@ -0,0 +1,78 @@
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and
          Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and
          Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

@article{scikit-image,
  title={scikit-image: image processing in Python},
  author={Van der Walt, Stefan and Sch{\"o}nberger, Johannes L and
          Nunez-Iglesias, Juan and Boulogne, Fran{\c{c}}ois and Warner,
          Joshua D and Yager, Neil and Gouillart, Emmanuelle and Yu, Tony},
  journal={PeerJ},
  volume={2},
  pages={e453},
  year={2014},
  publisher={PeerJ Inc.},
  doi={10.7717/peerj.453}
}

@software{metpy,
  title={MetPy: A {Python} Package for Meteorological Data},
  author={May, Ryan M. and Arms, Sean C. and Marsh, Patrick and Bruning, Eric and Leeman, John R.
          and Goebbert, Kevin and Thielen, Jonathan E. and Bruck, Zachary},
  organization={Unidata},
  year={2008 - 2019},
  version={0.11.1},
  doi={10.5065/D6WW7G29},
  url={https://github.com/Unidata/MetPy},
  address={Boulder, Colorado}
}

@article{verde,
  title={Verde: Processing and gridding spatial data using Green's functions},
  doi={10.21105/joss.00957},
  url={https://doi.org/10.21105/joss.00957},
  year={2018},
  month=oct,
  publisher={The Open Journal},
  volume={3},
  number={30},
  pages={957},
  author={Leonardo Uieda},
  journal={Journal of Open Source Software}
}

@misc{rockhound,
  doi={10.5281/ZENODO.3086002},
  url={https://zenodo.org/record/3086002},
  author={Uieda, Leonardo and Soler, Santiago R.},
  language={en},
  title={Rockhound: Download geophysical models/datasets and load them in Python},
  publisher={Zenodo},
  year={2019}
}

@misc{icepack,
  doi = {10.5281/ZENODO.3542092},
  url = {https://zenodo.org/record/3542092},
  author = {Shapero, Daniel and Lilien, David and Ham, David A. and Hoffman, Andrew},
  title = {icepack/icepack: icepack: glacier flow modeling with the finite element method in Python},
  publisher = {Zenodo},
  year = {2019}
}

@misc{predictatops,
  doi = {10.5281/ZENODO.1450596},
  url = {https://zenodo.org/record/1450596},
  author = {Gosses, Justin},
  title = {JustinGOSSES/predictatops: v0.0.4},
  publisher = {Zenodo},
  year = {2019}
}
@@ -0,0 +1,158 @@
---
title: "Pooch: A friend to fetch your data files"
tags:
  - python
authors:
  - name: Leonardo Uieda
    orcid: 0000-0001-6123-9515
    affiliation: 1
  - name: Santiago Rubén Soler
    orcid: 0000-0001-9202-5317
    affiliation: "2,3"
  - name: Rémi Rampin
    orcid: 0000-0002-0524-2282
    affiliation: 4
  - name: Hugo van Kemenade
    orcid: 0000-0001-5715-8632
    affiliation: 5
  - name: Matthew Turk
    orcid: 0000-0002-5294-0198
    affiliation: 6
  - name: Daniel Shapero
    orcid: 0000-0002-3651-0649
    affiliation: 7
  - name: Anderson Banihirwe
    orcid: 0000-0001-6583-571X
    affiliation: 8
  - name: John Leeman
    orcid: 0000-0002-3624-1821
    affiliation: 9
affiliations:
  - name: Department of Earth, Ocean and Ecological Sciences, School of Environmental Sciences, University of Liverpool, UK
    index: 1
  - name: Instituto Geofísico Sismológico Volponi, Universidad Nacional de San Juan, Argentina
    index: 2
  - name: CONICET, Argentina
    index: 3
  - name: New York University, USA
    index: 4
  - name: Independent (Non-affiliated)
    index: 5
  - name: University of Illinois at Urbana-Champaign, USA
    index: 6
  - name: Polar Science Center, University of Washington Applied Physics Lab, USA
    index: 7
  - name: The US National Center for Atmospheric Research, USA
    index: 8
  - name: Leeman Geophysical, USA
    index: 9
date: 02 December 2019
bibliography: paper.bib
---

# Summary

Scientific software is usually created to analyze, model, and visualize data.
As such, many software libraries include sample datasets in their distributions
for use in documentation, tests, benchmarks, and workshops.
The usual approach is to include smaller datasets in the GitHub repository
directly and package them with the source and binary distributions
(e.g., scikit-learn [@scikit-learn] and scikit-image [@scikit-image] do this).
Larger datasets require writing code to download the files from a remote server
to the user's computer.
The same problem is faced by scientists using version control to manage their
research projects.
As data files increase in size, it becomes infeasible to store them in GitHub
repositories.
While downloading a data file over HTTP can be done easily with modern Python
libraries, it is not trivial to manage a set of files, keep them updated, and
check for corruption.
Instead of scientists and library authors recreating the same code, it would be
best to have a minimalistic and easy-to-set-up tool for fetching and maintaining
data files.

Pooch is a Python library that fills this gap.
It manages a data *registry* by downloading files from one or more remote
servers and storing them in a local data cache.
Pooch is written in pure Python and has minimal dependencies.
The integrity of downloads is verified by comparing the file's SHA256 hash with
the one stored in the data registry.
This is also the mechanism used to detect if a file needs to be re-downloaded
due to an update in the registry.
Pooch is meant to be a drop-in replacement for the custom download code that
users have already written (or are planning to write).
In the ideal scenario, the end user of a software package should not need to know that
Pooch is being used.
Setup is as easy as calling a single function (`pooch.create`), including
setting up an environment variable for overriding the data cache path and
versioning the downloads so that multiple versions of the same package can
coexist on the same machine.
For example, this is the code required to set up a module
`datasets.py` that uses Pooch to manage data downloads:

```python
import pandas
import pooch

# Get the version string from the project
from . import version

# Create a new instance of pooch.Pooch
GOODBOY = pooch.create(
    # Cache path using the default for the operating system
    path=pooch.os_cache("myproject"),
    # Base URL of the remote data server (for example, on GitHub)
    base_url="https://github.com/me/myproject/raw/{version}/data/",
    # PEP 440 compliant version number (added to path and base_url)
    version=version,
    # An environment variable that overrides the path
    env="MYPROJECT_DATA_DIR",
)
# Load the registry from a simple text file.
# Each line has: file_name sha256 [url]
GOODBOY.load_registry("registry.txt")


def fetch_some_data():
    # Get the path to the data file in the local cache
    # If it's not there or needs updating, download it
    fname = GOODBOY.fetch("some-data.csv")
    # Load it with NumPy/pandas/xarray/etc.
    data = pandas.read_csv(fname)
    return data
```

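To make the registry and hash check described above more concrete, here is a minimal standard-library sketch. It is not Pooch's actual implementation: the function names `parse_registry`, `file_sha256`, and `needs_download` are hypothetical, and the parser assumes whitespace-separated fields in the `file_name sha256 [url]` layout (so file names without spaces).

```python
import hashlib


def parse_registry(text):
    """Map each file name to a (sha256, url) pair; url is None if absent."""
    registry = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"Expected 'file_name sha256 [url]', got: {line!r}")
        file_name, sha256 = fields[0], fields[1].lower()
        url = fields[2] if len(fields) == 3 else None
        registry[file_name] = (sha256, url)
    return registry


def file_sha256(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as data_file:
        for chunk in iter(lambda: data_file.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def needs_download(path, known_hash):
    """True if the file is missing or its hash differs from the registry."""
    try:
        return file_sha256(path) != known_hash.lower()
    except FileNotFoundError:
        return True
```
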
Pooch is designed to be extended: users can plug in custom download functions
and post-download processing functions.
For example, a custom download function could fetch files from a
password-protected FTP server (the default is HTTP/HTTPS or anonymous FTP) and
a processing function could decrypt a file using a user-defined password once
the download is completed.
We include ready-made download functions for HTTP and FTP (including basic
authentication) as well as processing functions for unpacking archives (zip or
tar) and decompressing files (gzip, lzma, and bzip2).

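As a rough illustration of a post-download processing function, the sketch below decompresses a gzip file next to the downloaded copy. It assumes the hook is called as `processor(fname, action, pooch)` with `action` being one of `"download"`, `"update"`, or `"fetch"`, and that it is passed to `fetch` through a `processor` argument; the paper does not state this calling convention, so treat it as an assumption to check against the Pooch documentation. The name `gunzip_once` and the `.decomp` suffix are hypothetical.

```python
import gzip
import os
import shutil


def gunzip_once(fname, action, pooch):
    """
    Decompress a gzip file after download and return the path to the result.

    fname: path to the downloaded file in the local cache.
    action: assumed to be "download", "update", or "fetch" (file already
        present and up to date).
    pooch: the instance that called the processor (unused in this sketch).
    """
    decompressed = fname + ".decomp"
    # Only redo the work when the download is new or was refreshed,
    # or when the decompressed copy is missing
    if action in ("download", "update") or not os.path.exists(decompressed):
        with gzip.open(fname, "rb") as source:
            with open(decompressed, "wb") as target:
                shutil.copyfileobj(source, target)
    return decompressed


# Assumed usage with the GOODBOY instance from the example above:
# fname = GOODBOY.fetch("some-data.csv.gz", processor=gunzip_once)
```
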
To the best of our knowledge, the only other Python software with some
overlapping functionality is [Intake](https://github.com/intake/intake).
While Intake is powerful and can be used to manage large data archives,
we argue that Pooch has a simpler setup and meets the
specific needs of scientific software authors and individual scientists.
For example, Pooch does not require users to change their data loading code to
fit into a plug-in structure; it only provides the file path to the
user.

The Pooch API is stable and has been field-tested by other projects:
MetPy [@metpy], Verde [@verde], RockHound [@rockhound], predictatops
[@predictatops], and icepack [@icepack].
Pooch is also being implemented as the download manager for scikit-image
([GitHub pull request number 3945](https://github.com/scikit-image/scikit-image/pull/3945)).


# Acknowledgements

We would like to thank all of the volunteers who have dedicated their time and
energy to build the open-source ecosystem on which our work relies.
The order of authors is based on number of commits to the GitHub repository.
A full list of all contributors to the project can be found on the
[GitHub repository](https://github.com/fatiando/pooch/graphs/contributors).


# References