Refactor the documentation into more dedicated tutorials #237

Merged 28 commits on Jun 7, 2021
280c844
Move the list of docs links to a separate page
leouieda May 9, 2021
0f0fea1
Include contact info from README in index
leouieda Jun 2, 2021
ba95715
Use a dedicated page for compatibility notes
leouieda Jun 2, 2021
822073d
Move and rename the beginner tutorial
leouieda Jun 2, 2021
8157cb5
Remove the testing section from the install
leouieda Jun 2, 2021
bb43890
Simplify the retrieve tutorial
leouieda Jun 2, 2021
b4465c7
Create a tutorial about using Pooch in a package
leouieda Jun 2, 2021
e784ae5
Tweaks to the retrieve tutorial title
leouieda Jun 2, 2021
a522a8c
Tutorial about sample data management
leouieda Jun 2, 2021
c31d4a4
Create a page about hashes
leouieda Jun 2, 2021
caec7af
Tutorial about setting environment variables
leouieda Jun 2, 2021
996b1ed
Tutorial about registry files
leouieda Jun 2, 2021
38b699f
Tutorial about multiple URLs
leouieda Jun 2, 2021
0a85865
Tutorial on download protocols
leouieda Jun 2, 2021
d7d913c
Put a warning about subdirectories in the registry
leouieda Jun 2, 2021
162d131
Tutorial on progress bars
leouieda Jun 2, 2021
b3db7a9
Include download retry in basic setup
leouieda Jun 2, 2021
bb54f70
Move tutorial on creating registry from remote
leouieda Jun 2, 2021
c613e81
Tutorial on logging
leouieda Jun 2, 2021
f95539b
Tutorial on authentication
leouieda Jun 2, 2021
409e739
Fix wrong link to retrieve
leouieda Jun 2, 2021
f8dbcad
Dedicated pages for auth, unpacking, and decompressing
leouieda Jun 2, 2021
0f5a66c
Add note about install commands running in a terminal
leouieda Jun 3, 2021
4259950
Clarify links and description of optional dependencies
leouieda Jun 3, 2021
d172958
Link to other pages from the retrieve tutorial
leouieda Jun 3, 2021
7bc5105
Better links to other sections
leouieda Jun 3, 2021
93be11c
Consistent naming of sections for processors
leouieda Jun 3, 2021
fafea08
Include citation reference in front page
leouieda Jun 3, 2021
Create a page about hashes
leouieda committed Jun 2, 2021
commit c31d4a487d52631278d6286398ff2590f400f827
47 changes: 0 additions & 47 deletions doc/advanced.rst
@@ -48,53 +48,6 @@ result in a failed download.
Requires Pooch >= 1.3.0.


Bypassing the hash check
------------------------

Sometimes we might not know the hash of the file or it could change on the
server periodically. In these cases, we need a way of bypassing the hash check.
One way of doing that is with Python's ``unittest.mock`` module. It defines the
object ``unittest.mock.ANY`` which passes all equality tests made against it.
To bypass the check, we can set the hash value to ``unittest.mock.ANY`` when
specifying the ``registry`` argument for :func:`pooch.create`.

In this example, we want to use Pooch to download a list of weather stations
around Australia. The file with the stations is in an FTP server and we want to
store it locally in separate folders for each day that the code is run. The
problem is that the ``stations.zip`` file is updated on the server instead of
creating a new one, so the hash check would fail. This is how you can solve
this problem:

.. code:: python

    import datetime
    import unittest.mock
    import pooch

    # Get the current date to store the files in separate folders
    CURRENT_DATE = datetime.datetime.now().date()

    GOODBOY = pooch.create(
        # Convert the date to a string so it can be joined to the cache path
        path=pooch.os_cache("bom_daily_stations") / str(CURRENT_DATE),
        base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/",
        # Use ANY for the hash value to ignore the checks
        registry={
            "stations.zip": unittest.mock.ANY,
        },
    )

Because the hash check always passes, Pooch will only download the file once.
When running again at a different date, the file will be downloaded again
because the local cache folder changed and the file is no longer present in it.
If you omit ``CURRENT_DATE`` from the cache path, then Pooch will only fetch
the files once, unless they are deleted from the cache.

.. note::

    If this script is run over a period of time, your cache directory will
    increase in size, as the files are stored in daily subdirectories.


Create registry file from remote files
--------------------------------------

97 changes: 97 additions & 0 deletions doc/hashes.rst
@@ -0,0 +1,97 @@
.. _hashes:

Hashes: Calculating and bypassing
=================================

Pooch uses cryptographic hashes to check if files are up-to-date or possibly
corrupted:

* If a file exists in the local folder, Pooch will check that its hash matches
  the one in the registry. If it doesn't, we'll assume that it needs to be
  updated.
* If a file needs to be updated or doesn't exist, Pooch will download it from
  the remote source and check the hash. If the hash doesn't match, an exception
  is raised to warn of possible file corruption.
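
The two rules above can be sketched in plain Python. This is only an
illustration using the standard library: ``sha256_of`` and ``file_action`` are
hypothetical helpers that are not part of Pooch, and the actual implementation
may differ.

.. code:: python

    import hashlib
    from pathlib import Path


    def sha256_of(path):
        """Return the SHA256 hex digest of the file at ``path``."""
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()


    def file_action(path, known_hash):
        """Decide what to do with a local file (hypothetical helper)."""
        path = Path(path)
        if not path.exists():
            # Nothing in the local folder yet: download it
            return "download"
        if sha256_of(path) != known_hash:
            # The local copy doesn't match the registry: assume it's outdated
            return "update"
        # The local copy matches the registry: use it as is
        return "fetch"

After a ``"download"`` or ``"update"``, the same comparison would be made on
the newly downloaded file, and a mismatch would be treated as an error instead
of triggering another download.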

Calculating hashes
------------------

You can generate hashes for your data files using ``openssl`` in the terminal:

.. code:: bash

    $ openssl sha256 data/c137.csv
    SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d

Or using the :func:`pooch.file_hash` function (which is a convenient way of
calling Python's :mod:`hashlib`):

.. code:: python

    import pooch
    print(pooch.file_hash("data/c137.csv"))
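
For reference, a rough standard-library equivalent is sketched below. The
``sha256_file`` function and its ``chunksize`` argument are assumptions made
for this example and are not Pooch's actual source code. The file is read in
chunks so that large data files don't need to fit in memory.

.. code:: python

    import hashlib


    def sha256_file(fname, chunksize=65536):
        """Return the SHA256 hex digest of a file, reading it in chunks."""
        hasher = hashlib.sha256()
        with open(fname, "rb") as fin:
            # Read fixed-size chunks until an empty bytes object marks the end
            for chunk in iter(lambda: fin.read(chunksize), b""):
                hasher.update(chunk)
        return hasher.hexdigest()

Calling ``sha256_file("data/c137.csv")`` should give the same digest as the
``openssl`` command above.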


Specifying the hash algorithm
-----------------------------

By default, Pooch uses `SHA256 <https://en.wikipedia.org/wiki/SHA-2>`__
hashes.
Other hash methods that are available in :mod:`hashlib` can also be used:

.. code:: python

    import pooch
    print(pooch.file_hash("data/c137.csv", alg="sha512"))

In this case, you can specify the hash algorithm in the **registry** by
prepending it to the hash, for example ``"md5:0hljc7298ndo2"`` or
``"sha512:803o3uh2pecb2p3829d1bwouh9d"``.
Pooch will understand this and use the appropriate method.
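
To make the ``"algorithm:hash"`` convention concrete, here is a minimal sketch
of how such a string could be interpreted. Both ``split_known_hash`` and
``hash_matches_data`` are hypothetical names used only for illustration, and
we assume that a hash without a prefix means SHA256 (the default mentioned
above).

.. code:: python

    import hashlib


    def split_known_hash(known_hash, default_alg="sha256"):
        """Split "alg:digest" into its parts (hypothetical helper)."""
        if ":" in known_hash:
            alg, digest = known_hash.split(":", 1)
        else:
            # No prefix given: fall back to the default algorithm
            alg, digest = default_alg, known_hash
        return alg.lower(), digest.lower()


    def hash_matches_data(data, known_hash):
        """Check some bytes against a hash that may carry a prefix."""
        alg, digest = split_known_hash(known_hash)
        # hashlib.new builds a hasher from the algorithm name
        return hashlib.new(alg, data).hexdigest() == digest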


Bypassing the hash check
------------------------

Sometimes we might not know the hash of the file or it could change on the
server periodically.
To bypass the check, we can set the hash value to ``None`` when specifying the
``registry`` argument for :func:`pooch.create`
(or the ``known_hash`` in :func:`pooch.retrieve`).

In this example, we want to use Pooch to download a list of weather stations
around Australia:

* The file with the stations is on an FTP server and we want to store it
  locally in separate folders for each day that the code is run.
* The problem is that the ``stations.zip`` file is overwritten on the server
  instead of being published as a new file, so the hash check would fail.

This is how you can solve this problem:

.. code:: python

    import datetime
    import pooch

    # Get the current date to store the files in separate folders
    CURRENT_DATE = datetime.datetime.now().date()

    GOODBOY = pooch.create(
        # Convert the date to a string so it can be joined to the cache path
        path=pooch.os_cache("bom_daily_stations") / str(CURRENT_DATE),
        base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/",
        # Set the hash to None to bypass the hash check
        registry={
            "stations.zip": None,
        },
    )

When running this same code again at a different date, the file will be
downloaded again because the local cache folder changed and the file is no
longer present in it.
If you omit ``CURRENT_DATE`` from the cache path, then Pooch will only fetch
the files once, unless they are deleted from the cache.
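
The effect of the dated folder can be seen without downloading anything. The
sketch below uses only the standard library and a hypothetical ``daily_cache``
helper (not part of Pooch) to show that the cache path differs from one day to
the next, which is why the file is missing, and therefore downloaded again, on
a new date.

.. code:: python

    import datetime
    from pathlib import Path


    def daily_cache(base, date):
        """Return the cache folder for a given date (hypothetical helper)."""
        # A date has to be converted to a string before joining it to a path
        return Path(base) / date.isoformat()


    monday = daily_cache("bom_daily_stations", datetime.date(2021, 6, 7))
    tuesday = daily_cache("bom_daily_stations", datetime.date(2021, 6, 8))

    # Different folders, so Tuesday's run won't find Monday's file
    print(monday != tuesday)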

.. attention::

    If this script is run over a period of time, your cache directory will
    increase in size, as the files are stored in daily subdirectories.
1 change: 1 addition & 0 deletions doc/index.rst
@@ -32,6 +32,7 @@
    :hidden:
    :caption: Training your Pooch

    hashes.rst
    intermediate.rst
    advanced.rst
    downloaders.rst