JOSS paper about Pooch (fatiando#116)
This is the paper we will submit to the Journal of Open Source Software. 
All authors have reviewed the paper and agreed to be listed on it.

Fixes fatiando#112
leouieda committed Dec 2, 2019
1 parent 3db0ea2 commit 5f2800e
Showing 2 changed files with 236 additions and 0 deletions.
78 changes: 78 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,78 @@
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and
Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and
Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}

@article{scikit-image,
title={scikit-image: image processing in Python},
author={Van der Walt, Stefan and Sch{\"o}nberger, Johannes L and
Nunez-Iglesias, Juan and Boulogne, Fran{\c{c}}ois and Warner,
Joshua D and Yager, Neil and Gouillart, Emmanuelle and Yu, Tony},
journal={PeerJ},
volume={2},
pages={e453},
year={2014},
publisher={PeerJ Inc.},
doi={10.7717/peerj.453}
}

@software{metpy,
title={MetPy: A {Python} Package for Meteorological Data},
author={May, Ryan M. and Arms, Sean C. and Marsh, Patrick and Bruning, Eric and Leeman, John R.
and Goebbert, Kevin and Thielen, Jonathan E. and Bruck, Zachary},
organization={Unidata},
year={2008 - 2019},
version={0.11.1},
doi={10.5065/D6WW7G29},
url={https://github.com/Unidata/MetPy},
address={Boulder, Colorado}
}

@article{verde,
title={Verde: Processing and gridding spatial data using Green's functions},
doi={10.21105/joss.00957},
url={https://doi.org/10.21105/joss.00957},
year={2018},
month=oct,
publisher={The Open Journal},
volume={3},
number={30},
pages={957},
author={Leonardo Uieda},
journal={Journal of Open Source Software}
}

@misc{rockhound,
doi={10.5281/ZENODO.3086002},
url={https://zenodo.org/record/3086002},
author={Uieda, Leonardo and Soler, Santiago R.},
language={en},
title={Rockhound: Download geophysical models/datasets and load them in Python},
publisher={Zenodo},
year={2019}
}

@misc{icepack,
doi={10.5281/ZENODO.3542092},
url={https://zenodo.org/record/3542092},
author={Shapero, Daniel and Lilien, David and Ham, David A. and Hoffman, Andrew},
title={icepack/icepack: icepack: glacier flow modeling with the finite element method in Python},
publisher={Zenodo},
year={2019}
}

@misc{predictatops,
doi={10.5281/ZENODO.1450596},
url={https://zenodo.org/record/1450596},
author={Gosses, Justin},
title={JustinGOSSES/predictatops: v0.0.4},
publisher={Zenodo},
year={2019}
}
158 changes: 158 additions & 0 deletions paper/paper.md
@@ -0,0 +1,158 @@
---
title: "Pooch: A friend to fetch your data files"
tags:
  - python
authors:
  - name: Leonardo Uieda
    orcid: 0000-0001-6123-9515
    affiliation: 1
  - name: Santiago Rubén Soler
    orcid: 0000-0001-9202-5317
    affiliation: "2,3"
  - name: Rémi Rampin
    orcid: 0000-0002-0524-2282
    affiliation: 4
  - name: Hugo van Kemenade
    orcid: 0000-0001-5715-8632
    affiliation: 5
  - name: Matthew Turk
    orcid: 0000-0002-5294-0198
    affiliation: 6
  - name: Daniel Shapero
    orcid: 0000-0002-3651-0649
    affiliation: 7
  - name: Anderson Banihirwe
    orcid: 0000-0001-6583-571X
    affiliation: 8
  - name: John Leeman
    orcid: 0000-0002-3624-1821
    affiliation: 9
affiliations:
  - name: Department of Earth, Ocean and Ecological Sciences, School of Environmental Sciences, University of Liverpool, UK
    index: 1
  - name: Instituto Geofísico Sismológico Volponi, Universidad Nacional de San Juan, Argentina
    index: 2
  - name: CONICET, Argentina
    index: 3
  - name: New York University, USA
    index: 4
  - name: Independent (Non-affiliated)
    index: 5
  - name: University of Illinois at Urbana-Champaign, USA
    index: 6
  - name: Polar Science Center, University of Washington Applied Physics Lab, USA
    index: 7
  - name: The US National Center for Atmospheric Research, USA
    index: 8
  - name: Leeman Geophysical, USA
    index: 9
date: 02 December 2019
bibliography: paper.bib
---

# Summary

Scientific software is usually created to analyze, model, and visualize data.
As such, many software libraries include sample datasets in their distributions
for use in documentation, tests, benchmarks, and workshops.
The usual approach is to include smaller datasets in the GitHub repository
directly and package them with the source and binary distributions
(e.g., scikit-learn [@scikit-learn] and scikit-image [@scikit-image] do this).
Larger datasets require writing code to download the files from a remote server
to the user's computer.
The same problem is faced by scientists using version control to manage their
research projects.
As data files increase in size, it becomes infeasible to store them in GitHub
repositories.
While downloading a data file over HTTP can be done easily with modern Python
libraries, it is not trivial to manage a set of files, keep them updated, and
check for corruption.
Rather than having scientists and library authors each rewrite the same code,
it would be better to have a minimal, easy-to-set-up tool for fetching and
maintaining data files.

Pooch is a Python library that fills this gap.
It manages a data *registry* by downloading files from one or more remote
servers and storing them in a local data cache.
Pooch is written in pure Python and has minimal dependencies.
The integrity of downloads is verified by comparing the file's SHA256 hash with
the one stored in the data registry.
This is also the mechanism used to detect if a file needs to be re-downloaded
due to an update in the registry.
Pooch is meant to be a drop-in replacement for the custom download code that
users have already written (or are planning to write).
In the ideal scenario, the end-user of a software package should not need to know that
Pooch is being used.
Setup requires calling a single function (`pooch.create`), which also
configures an environment variable for overriding the data cache path and
versions the downloads so that multiple versions of the same package can
coexist on the same machine.
For example, this is the code required to set up a module
`datasets.py` that uses Pooch to manage data downloads:

```python
import pandas
import pooch

# Get the version string from the project
from . import version

# Create a new instance of pooch.Pooch
GOODBOY = pooch.create(
    # Cache path using the default for the operating system
    path=pooch.os_cache("myproject"),
    # Base URL of the remote data server (for example, on GitHub)
    base_url="https://github.com/me/myproject/raw/{version}/data/",
    # PEP 440 compliant version number (added to path and base_url)
    version=version,
    # An environment variable that overrides the path
    env="MYPROJECT_DATA_DIR",
)
# Load the registry from a simple text file.
# Each line has: file_name sha256 [url]
GOODBOY.load_registry("registry.txt")


def fetch_some_data():
    # Get the path to the data file in the local cache
    # If it's not there or needs updating, download it
    fname = GOODBOY.fetch("some-data.csv")
    # Load it with NumPy/pandas/xarray/etc.
    data = pandas.read_csv(fname)
    return data
```
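
The hash check described above can be illustrated with a minimal sketch that
uses only the Python standard library.
This is an illustration under our own assumptions and not Pooch's actual
implementation: the helper names `sha256_of` and `required_action` are
hypothetical, as are the three action labels they return.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at ``path``."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so that large files do not need to fit in memory
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def required_action(path, known_hash):
    """Decide what to do with a cached file (hypothetical helper).

    Returns "download" if the file is missing, "update" if its hash
    differs from the registry entry, and "fetch" if it can be used as is.
    """
    path = Path(path)
    if not path.exists():
        return "download"
    # Hex digests are compared in lowercase to tolerate registry casing
    if sha256_of(path) != known_hash.lower():
        return "update"
    return "fetch"
```

In this sketch, a single comparison covers both corrupted downloads and files
whose registry entry has changed: either way the local hash no longer matches
and the file is downloaded again.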

Pooch is designed to be extended: users can plug in custom download functions
and post-download processing functions.
For example, a custom download function could fetch files from a
password-protected FTP server (the default is HTTP/HTTPS or anonymous FTP) and
a processing function could decrypt a file using a user-defined password once
the download is completed.
We include ready-made download functions for HTTP and FTP (including basic
authentication) as well as processing functions for unpacking archives (zip or
tar) and decompressing files (gzip, lzma, and bzip2).
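
As a rough illustration of a post-download processing function, the sketch
below decompresses a gzip file with the standard library.
We assume here that a processor is a callable that receives the path of the
fetched file, a string describing what was done (`"download"`, `"update"`, or
`"fetch"`), and the calling `Pooch` instance, and that it returns the path the
caller should load.
The name `gunzip_processor` is hypothetical, and in practice the ready-made
decompression processor should be preferred.

```python
import gzip
import os
import shutil


def gunzip_processor(fname, action, pooch):
    """Decompress a gzip file after download and return the new path.

    Assumed hook signature: ``fname`` is the path of the fetched file,
    ``action`` is "download", "update", or "fetch", and ``pooch`` is the
    calling Pooch instance (unused in this sketch).
    """
    fname = str(fname)
    # Drop a trailing ".gz"; otherwise append a suffix to avoid overwriting
    if fname.endswith(".gz"):
        unpacked = fname[: -len(".gz")]
    else:
        unpacked = fname + ".decomp"
    # Only redo the work when the download is new or the output is missing
    if action in ("download", "update") or not os.path.exists(unpacked):
        with gzip.open(fname, "rb") as source, open(unpacked, "wb") as target:
            shutil.copyfileobj(source, target)
    return unpacked
```

Under that assumption, the function would be passed to the fetch call from the
earlier example, for instance as
`GOODBOY.fetch("some-data.csv.gz", processor=gunzip_processor)`, and the
returned path would point to the decompressed file.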

To the best of our knowledge, the only other Python software with some
overlapping functionality is [Intake](https://github.com/intake/intake).
While Intake is powerful and can be used to manage large data archives,
we argue that Pooch has a simpler setup and meets the
specific needs of scientific software authors and individual scientists.
For example, Pooch does not require users to change their data loading code to
fit into a plug-in structure; it only provides the user with the path to the
downloaded file.

The Pooch API is stable and has been field-tested by other projects:
MetPy [@metpy], Verde [@verde], RockHound [@rockhound], predictatops
[@predictatops], and icepack [@icepack].
Pooch is also being adopted as the download manager for scikit-image
([GitHub pull request number 3945](https://github.com/scikit-image/scikit-image/pull/3945)).


# Acknowledgements

We would like to thank all of the volunteers who have dedicated their time and
energy to build the open-source ecosystem on which our work relies.
The order of authors is based on the number of commits to the GitHub repository.
A full list of all contributors to the project can be found on the
[GitHub repository](https://github.com/fatiando/pooch/graphs/contributors).


# References
