diff --git a/README.rst b/README.rst index c5155cde..64cbc6d3 100644 --- a/README.rst +++ b/README.rst @@ -25,7 +25,7 @@ Part of the `Fatiando a Terra `__ project :alt: Digital Object Identifier for the JOSS paper :target: https://doi.org/10.21105/joss.01943 -.. placeholder-for-doc-index +.. placeholder-doc-index-start About @@ -165,6 +165,9 @@ Contacting Us ask questions and leave comments. +.. placeholder-doc-index-end + + Citing Pooch ------------ @@ -226,31 +229,5 @@ License ------- This is free software: you can redistribute it and/or modify it under the terms -of the **BSD 3-clause License**. A copy of this license is provided in -`LICENSE.txt `__. - - -Documentation for other versions --------------------------------- - -* `Development `__ (reflects the *master* branch on - Github) -* `Latest release `__ -* `v1.3.0 `__ -* `v1.2.0 `__ -* `v1.1.1 `__ -* `v1.1.0 `__ -* `v1.0.0 `__ -* `v0.7.1 `__ -* `v0.7.0 `__ -* `v0.6.0 `__ -* `v0.5.2 `__ -* `v0.5.1 `__ -* `v0.5.0 `__ -* `v0.4.0 `__ -* `v0.3.1 `__ -* `v0.3.0 `__ -* `v0.2.1 `__ -* `v0.2.0 `__ -* `v0.1.1 `__ -* `v0.1 `__ +of the `BSD 3-clause License `__. +A copy of this license is provided with distributions of the software. diff --git a/doc/advanced.rst b/doc/advanced.rst deleted file mode 100644 index c4f16382..00000000 --- a/doc/advanced.rst +++ /dev/null @@ -1,170 +0,0 @@ -.. _advanced: - -Advanced tricks -=============== - -These are more advanced things that can be done for specific use cases. **Most -projects will not require these**. - - -Adjusting the logging level ---------------------------- - -Pooch will log events like downloading a new file, updating an existing one, or -unpacking an archive by printing to the terminal. You can change how verbose -these events are by getting the event logger from pooch and changing the -logging level: - -.. 
code:: python - - logger = pooch.get_logger() - logger.setLevel("WARNING") - -Most of the events from Pooch are logged at the info level; this code says that -you only care about warnings or errors, like inability to create the data -cache. The event logger is a :class:`logging.Logger` object, so you can use -that class's methods to handle logging events in more sophisticated ways if you -wish. - - -Retry failed downloads ----------------------- - -When downloading data repeatedly, like in continuous integration, failures can -occur due to sporadic network outages or other factors outside of our control. -In these cases, it can be frustrating to have entire jobs fail because a single -download was not successful. - -Pooch allows you to specify a number of times to retry the download in case of -failure by setting ``retry_if_failed`` in :func:`pooch.create`. This setting -will be valid for all downloads attempted with :meth:`pooch.Pooch.fetch`. The -download can fail because the file hash doesn't match the known hash (due to a -partial download, for example) or because of network errors coming from -:mod:`requests`. Other errors (file system permission errors, etc) will still -result in a failed download. - -.. note:: - - Requires Pooch >= 1.3.0. - - -Bypassing the hash check ------------------------- - -Sometimes we might not know the hash of the file or it could change on the -server periodically. In these cases, we need a way of bypassing the hash check. -One way of doing that is with Python's ``unittest.mock`` module. It defines the -object ``unittest.mock.ANY`` which passes all equality tests made against it. -To bypass the check, we can set the hash value to ``unittest.mock.ANY`` when -specifying the ``registry`` argument for :func:`pooch.create`. - -In this example, we want to use Pooch to download a list of weather stations -around Australia. 
The file with the stations is in an FTP server and we want to -store it locally in separate folders for each day that the code is run. The -problem is that the ``stations.zip`` file is updated on the server instead of -creating a new one, so the hash check would fail. This is how you can solve -this problem: - -.. code:: python - - import datetime - import unittest.mock - import pooch - - # Get the current data to store the files in separate folders - CURRENT_DATE = datetime.datetime.now().date() - - GOODBOY = pooch.create( - path=pooch.os_cache("bom_daily_stations") / CURRENT_DATE, - base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/", - # Use ANY for the hash value to ignore the checks - registry={ - "stations.zip": unittest.mock.ANY, - }, - ) - -Because hash check is always ``True``, Pooch will only download the file once. -When running again at a different date, the file will be downloaded again -because the local cache folder changed and the file is no longer present in it. -If you omit ``CURRENT_DATE`` from the cache path, then Pooch will only fetch -the files once, unless they are deleted from the cache. - -.. note:: - - If this script is run over a period of time, your cache directory will - increase in size, as the files are stored in daily subdirectories. - - -Create registry file from remote files --------------------------------------- - -If you want to create a registry file for a large number of data files that are -available for download but you don't have their hashes or any local copies, -you must download them first. Manually downloading each file -can be tedious. However, we can automate the process using -:func:`pooch.retrieve`. Below, we'll explore two different scenarios. - -If the data files share the same base url, we can use :func:`pooch.retrieve` -to download them and then use :func:`pooch.make_registry` to create the -registry: - -.. 
code:: python - - import os - - # Names of the data files - filenames = ["c137.csv", "cronen.csv", "citadel.csv"] - - # Base url from which the data files can be downloaded from - base_url = "https://www.some-data-hosting-site.com/files/" - - # Create a new directory where all files will be downloaded - directory = "data_files" - os.makedirs(directory) - - # Download each data file to data_files - for fname in filenames: - path = pooch.retrieve( - url=base_url + fname, known_hash=None, fname=fname, path=directory - ) - - # Create the registry file from the downloaded data files - pooch.make_registry("data_files", "registry.txt") - -If each data file has its own url, the registry file can be manually created -after downloading each data file through :func:`pooch.retrieve`: - -.. code:: python - - import os - - # Names and urls of the data files. The file names are used for naming the - # downloaded files. These are the names that will be included in the registry. - fnames_and_urls = { - "c137.csv": "https://www.some-data-hosting-site.com/c137/data.csv", - "cronen.csv": "https://www.some-data-hosting-site.com/cronen/data.csv", - "citadel.csv": "https://www.some-data-hosting-site.com/citadel/data.csv", - } - - # Create a new directory where all files will be downloaded - directory = "data_files" - os.makedirs(directory) - - # Create a new registry file - with open("registry.txt", "w") as registry: - for fname, url in fnames_and_urls.items(): - # Download each data file to the specified directory - path = pooch.retrieve( - url=url, known_hash=None, fname=fname, path=directory - ) - # Add the name, hash, and url of the file to the new registry file - registry.write( - f"{fname} {pooch.file_hash(path)} {url}\n" - ) - -.. warning:: - - Notice that there are **no checks for download integrity** (since we don't - know the file hashes before hand). Only do this for trusted data sources - and over a secure connection. 
If you have access to file hashes/checksums, - **we highly recommend using them** to set the ``known_hash`` argument. diff --git a/doc/api/index.rst b/doc/api/index.rst index 7da9be46..dda8b574 100644 --- a/doc/api/index.rst +++ b/doc/api/index.rst @@ -1,7 +1,7 @@ .. _api: -API Reference -============= +List of functions and classes (API) +=================================== .. automodule:: pooch diff --git a/doc/authentication.rst b/doc/authentication.rst new file mode 100644 index 00000000..f92fb045 --- /dev/null +++ b/doc/authentication.rst @@ -0,0 +1,82 @@ +.. _authentication: + +Authentication +============== + +HTTP authentication +------------------- + +Use the :class:`~pooch.HTTPDownloader` class directly to provide login +credentials to HTTP servers that require basic authentication. For example: + +.. code:: python + + from pooch import HTTPDownloader + + + def fetch_protected_data(): + """ + Fetch a file from a server that requires authentication + """ + # Let the downloader know the login credentials + download_auth = HTTPDownloader(auth=("my_username", "my_password")) + fname = GOODBOY.fetch("some-data.csv", downloader=download_auth) + data = pandas.read_csv(fname) + return data + +It's probably not a good idea to hard-code credentials in your code. One way +around this is to ask users to set their own credentials through environment +variables. The download code could look something like so: + +.. 
code:: python + + import os + + + def fetch_protected_data(): + """ + Fetch a file from a server that requires authentication + """ + # Get the credentials from the user's environment + username = os.environ.get("SOMESITE_USERNAME") + password = os.environ.get("SOMESITE_PASSWORD") + # Let the downloader know the login credentials + download_auth = HTTPDownloader(auth=(username, password)) + fname = GOODBOY.fetch("some-data.csv", downloader=download_auth) + data = pandas.read_csv(fname) + return data + + +FTP/SFTP with authentication +---------------------------- + +Pooch also comes with the :class:`~pooch.FTPDownloader` and +:class:`~pooch.SFTPDownloader` downloaders that can be used +when files are distributed over FTP or SFTP (secure FTP). + +.. note:: + + To download files over SFTP, + `paramiko `__ needs to be installed. + + +Sometimes the FTP server doesn't support anonymous FTP and needs authentication +or uses a non-default port. +In these cases, pass in the downloader class explicitly (works with both FTP +and SFTP): + +.. code:: python + + import os + + + def fetch_c137(): + """ + Load the C-137 sample data as a pandas.DataFrame (over FTP this time). + """ + username = os.environ.get("MYDATASERVER_USERNAME") + password = os.environ.get("MYDATASERVER_PASSWORD") + download_ftp = pooch.FTPDownloader(username=username, password=password) + fname = GOODBOY.fetch("c137.csv", downloader=download_ftp) + data = pandas.read_csv(fname) + return data diff --git a/doc/beginner.rst b/doc/beginner.rst deleted file mode 100644 index bb17064a..00000000 --- a/doc/beginner.rst +++ /dev/null @@ -1,208 +0,0 @@ -.. _beginner: - -Beginner tricks -=============== - -This section covers the minimal setup required to use Pooch to manage your data -collection. We highly recommend looking at the :ref:`intermediate` tutorial as -well after you're done with this one. - -.. note:: - - If you're only looking to download a single file, see :ref:`retrieve` - instead. 
- - -The problem ------------ - -You develop a Python library called ``plumbus`` for analysing data emitted by -interdimensional portals. You want to distribute sample data so that your users -can easily try out the library by copying and pasting from the docs. You want -to have a ``plumbus.datasets`` module that defines functions like -``fetch_c137()`` that will return the data loaded as a -:class:`pandas.DataFrame` for convenient access. - - -Assumptions ------------ - -We'll set up Pooch to solve your data distribution needs. -In this example, we'll work with the follow assumptions: - -1. Your sample data are in a folder of your GitHub repository. -2. You use git tags to mark releases of your project in the history. -3. Your project has a variable that defines the version string. -4. The version string contains an indicator that the current commit is not a - release (like ``'v1.2.3+12.d908jdl'`` or ``'v0.1+dev'``). - -Other use cases can also be handled (see :ref:`intermediate`). -For now, let's say that this is the layout of your repository on GitHub: - -.. code-block:: none - - doc/ - ... - data/ - README.md - c137.csv - cronen.csv - plumbus/ - __init__.py - ... - datasets.py - setup.py - ... - -The sample data are stored in the ``data`` folder of your repository. - - -Basic setup ------------ - -Pooch can download and cache your data files to the users' computer -automatically. This is what the ``plumbus/datasets.py`` file would look like: - -.. code:: python - - """ - Load sample data. - """ - import pandas - import pooch - - from . 
import version # The version string of your project - - - POOCH = pooch.create( - # Use the default cache folder for the OS - path=pooch.os_cache("plumbus"), - # The remote data is on Github - base_url="https://github.com/rick/plumbus/raw/{version}/data/", - version=version, - # If this is a development version, get the data from the master branch - version_dev="master", - # The registry specifies the files that can be fetched - registry={ - # The registry is a dict with file names and their SHA256 hashes - "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", - "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", - }, - ) - - - def fetch_c137(): - """ - Load the C-137 sample data as a pandas.DataFrame. - """ - # The file will be downloaded automatically the first time this is run - # returns the file path to the downloaded file. Afterwards, Pooch finds - # it in the local cache and doesn't repeat the download. - fname = POOCH.fetch("c137.csv") - # The "fetch" method returns the full path to the downloaded data file. - # All we need to do now is load it with our standard Python tools. - data = pandas.read_csv(fname) - return data - - - def fetch_cronen(): - """ - Load the Cronenberg sample data as a pandas.DataFrame. - """ - fname = POOCH.fetch("cronen.csv") - data = pandas.read_csv(fname) - return data - - -The ``POOCH`` returned by :func:`pooch.create` is an instance of the -:class:`~pooch.Pooch` class. The class contains the data registry (files, URLs, -hashes, etc) and handles downloading files from the registry using the -:meth:`~pooch.Pooch.fetch` method. - -When the user calls ``plumbus.datasets.fetch_c137()`` for the first time, the -data file will be downloaded and stored in the local storage. In this case, -we're using :func:`pooch.os_cache` to set the local folder to the default cache -location for your OS. You could also provide any other path if you prefer. 
The -download is only performed once and after that Pooch knows to only return the -path to the already downloaded file. - -The setup shown here is the minimum required to use Pooch if your package -follows the assumptions laid out above. Pooch also supports downloading files -from multiple sources (including FTP), and more. See the :ref:`intermediate` -tutorial and the documentation for :func:`pooch.create` and :func:`pooch.Pooch` -for more options. - - -Hashes ------- - -Pooch uses `SHA256 `__ hashes by default -to check if files are up-to-date or possibly corrupted: - -* If a file exists in the local folder, Pooch will check that its hash matches - the one in the registry. If it doesn't, we'll assume that it needs to be - updated. -* If a file needs to be updated or doesn't exist, Pooch will download it from - the remote source and check the hash. If the hash doesn't match, an exception - is raised to warn of possible file corruption. - -You can generate hashes for your data files using ``openssl`` in the terminal: - -.. code:: bash - - $ openssl sha256 data/c137.csv - SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d - -Or using the :func:`pooch.file_hash` function (which is a convenient way of -calling Python's :mod:`hashlib`): - -.. code:: python - - import pooch - print(pooch.file_hash("data/c137.csv")) - -Alternative hashing algorithms supported by :mod:`hashlib` can be used as well: - -.. code:: python - - import pooch - print(pooch.file_hash("data/c137.csv", alg="sha512")) - -In this case, you can specify the hash algorithm in the registry by prepending -it to the hash, for example ``"md5:0hljc7298ndo2"`` or -``"sha512:803o3uh2pecb2p3829d1bwouh9d"``. Pooch will understand this and use -the appropriate method. - - -Versioning ----------- - -The files from different version of your project will be kept in separate -folders to make sure they don't conflict with each other. 
This way, you can -safely update data files while maintaining backward compatibility. For example, -if ``path=".plumbus"`` and ``version="v0.1"``, the data folder will be -``.plumbus/v0.1``. - -When your project updates, Pooch will automatically setup a separate folder for -the new data files based on the given version string. The remote URL will also -be updated. Notice that there is a format specifier ``{version}`` in the URL -that Pooch substitutes for you. - -Versioning is optional and can be ignored by omitting the ``version`` and -``version_dev`` arguments or setting them to ``None``. - - -Where to go from here ---------------------- - -Pooch has more features for handling different download protocols, handling -large registries, downloading from multiple sources, and more. Check out the -:ref:`intermediate` and :ref:`advanced` for more information. - -You can also customize the download itself (adding authentication, progress -bars, etc) and apply post-download steps (unzipping an archive, decompressing a -file, etc) through its :ref:`downloaders` and :ref:`processors`. - -If you any questions, please feel free to ask on our -`Slack chatroom `__ or by opening an -`issue on GitHub `__. diff --git a/doc/compatibility.rst b/doc/compatibility.rst new file mode 100644 index 00000000..6e1b4a22 --- /dev/null +++ b/doc/compatibility.rst @@ -0,0 +1,50 @@ +.. _compatibility: + +Compatibility notes +=================== + +Pooch version compatibility +--------------------------- + +We try to retain backwards compatibility whenever possible. Major breaking +changes to the Pooch API will be marked by a major release and deprecation +warnings will be issued in previous releases to give developers ample time to +adapt. + +If there are any backwards incompatible changes, they will be listed below: + +.. 
list-table:: + :widths: 20 10 70 + + * - **Version introduced** + - **Severity** + - **Notes** + * - v1.0.0 + - Low + - We replaced use of ``warnings`` with the ``logging`` module for all + messages issued by Pooch. This allows messages to be logged with + different priorities and lets the user filter out log messages or silence + Pooch entirely. **Users who relied on Pooch issuing warnings will need + to update to capturing logs instead.** The vast majority of users are + unaffected. + + + +.. _python-versions: + +Python version compatibility +---------------------------- + +If you require support for older Python versions, please pin Pooch to the +following releases to ensure compatibility: + +.. list-table:: + :widths: 40 60 + + * - **Python version** + - **Last compatible Pooch release** + * - 2.7 + - 0.6.0 + * - 3.5 + - 1.2.0 + diff --git a/doc/decompressing.rst b/doc/decompressing.rst new file mode 100644 index 00000000..e9f4ae86 --- /dev/null +++ b/doc/decompressing.rst @@ -0,0 +1,56 @@ +.. _decompressing: + +Decompressing +============= + +If you have a compressed file that is not an archive (zip or tar), you can use +:class:`pooch.Decompress` to decompress it after download. + +For example, large binary files can be compressed with ``gzip`` to reduce +download times but will need to be decompressed before loading, which can be +slow. +You can trade storage space for speed by keeping a decompressed copy of the +file: + +.. code:: python + + from pooch import Decompress + + def fetch_compressed_file(): + """ + Load a large binary file that has been gzip compressed. + """ + # Pass in the processor to decompress the file on download + fname = GOODBOY.fetch("large-binary-file.npy.gz", processor=Decompress()) + # The file returned is the decompressed version which can be loaded by + # numpy + data = numpy.load(fname) + return data + +:class:`pooch.Decompress` returns ``"large-binary-file.npy.gz.decomp"`` as the +decompressed file name by default.
+You can change this behaviour by passing a file name instead: + +.. code:: python + + import os + from pooch import Decompress + + def fetch_compressed_file(): + """ + Load a large binary file that has been gzip compressed. + """ + # Pass in the processor to decompress the file on download + fname = GOODBOY.fetch("large-binary-file.npy.gz", + processor=Decompress(name="a-different-file-name.npy"), + ) + # The file returned is now named "a-different-file-name.npy" + data = numpy.load(fname) + return data + +.. warning:: + + Passing in ``name`` can cause existing data to be lost! + For example, if a file already exists with the specified name it will be + overwritten with the new decompressed file content. + **Use this option with caution.** diff --git a/doc/downloaders.rst b/doc/downloaders.rst index 4f3d0e4f..5213da0c 100644 --- a/doc/downloaders.rst +++ b/doc/downloaders.rst @@ -1,14 +1,15 @@ .. _downloaders: -Downloaders -=========== +Downloaders: Customizing the download +===================================== By default, :meth:`pooch.Pooch.fetch` and :meth:`pooch.retrieve` will detect the download protocol from the given URL (HTTP, FTP, or SFTP) and use the -appropriate download method. Sometimes this is not enough: some servers require -logins, redirections, or other non-standard operations. To get around this, you -can pass a ``downloader`` argument to :meth:`~pooch.Pooch.fetch` and -:meth:`~pooch.retrieve`. +appropriate download method. +Sometimes this is not enough: some servers require logins, redirections, or +other non-standard operations. +To get around this, use the ``downloader`` argument of +:meth:`~pooch.Pooch.fetch` and :meth:`~pooch.retrieve`. Downloaders are Python *callable objects* (like functions or classes with a ``__call__`` method) and must have the following format: @@ -32,94 +33,21 @@ Downloaders are Python *callable objects* (like functions or classes with a ''' ... 
- Pooch provides downloaders for HTTP, FTP, and SFTP that support authentication -and optionally printing progress bars. See :ref:`api` for a list of available -downloaders. - - -HTTP authentication -------------------- - -Use the :class:`~pooch.HTTPDownloader` class directly to provide login -credentials to HTTP servers that require basic authentication. For example: - -.. code:: python - - from pooch import HTTPDownloader - - - def fetch_protected_data(): - """ - Fetch a file from a server that requires authentication - """ - # Let the downloader know the login credentials - download_auth = HTTPDownloader(auth=("my_username", "my_password")) - fname = GOODBOY.fetch("some-data.csv", downloader=download_auth) - data = pandas.read_csv(fname) - return data - -It's probably not a good idea to hard-code credentials in your code. One way -around this is to ask users to set their own credentials through environment -variables. The download code could look something like so: - -.. code:: python - - import os - - - def fetch_protected_data(): - """ - Fetch a file from a server that requires authentication - """ - # Get the credentials from the user's environment - username = os.environ.get("SOMESITE_USERNAME") - password = os.environ.get("SOMESITE_PASSWORD") - # Let the downloader know the login credentials - download_auth = HTTPDownloader(auth=(username, password)) - fname = GOODBOY.fetch("some-data.csv", downloader=download_auth) - data = pandas.read_csv(fname) - return data - - -FTP/SFTP with authentication ----------------------------- - -Pooch also comes with the :class:`~pooch.FTPDownloader` and -:class:`~pooch.SFTPDownloader` downloaders that can be used -when files are distributed over FTP or SFTP (secure FTP). - -However, sometimes the FTP server doesn't support anonymous FTP and needs -authentication or uses a non-default port. In these cases, pass in the -downloader class explicitly (works with both FTP and SFTP): - -.. 
code:: python - - import os - - - def fetch_c137(): - """ - Load the C-137 sample data as a pandas.DataFrame (over FTP this time). - """ - username = os.environ.get("MYDATASERVER_USERNAME") - password = os.environ.get("MYDATASERVER_PASSWORD") - download_ftp = pooch.FTPDownloader(username=username, password=password) - fname = GOODBOY.fetch("c137.csv", downloader=download_ftp) - data = pandas.read_csv(fname) - return data +and optionally printing progress bars. +See :ref:`api` for a list of available downloaders. -.. note:: +Common uses of downloaders include: - To download files over SFTP, the package `paramiko - `__ needs to be installed. +* Passing :ref:`login credentials ` to HTTP and FTP servers +* Printing :ref:`progress bars ` -Custom downloaders ------------------- +Creating your own downloaders +----------------------------- If your use case is not covered by our downloaders, you can implement your own. -:meth:`pooch.Pooch.fetch` and :meth:`pooch.retrive` will accept any *callable +:meth:`pooch.Pooch.fetch` and :func:`pooch.retrieve` will accept any *callable obejct* that has the signature specified above. As an example, consider the case in which the login credentials need to be provided to a site that is redirected from the original download URL: diff --git a/doc/hashes.rst b/doc/hashes.rst new file mode 100644 index 00000000..2860cd21 --- /dev/null +++ b/doc/hashes.rst @@ -0,0 +1,97 @@ +.. _hashes: + +Hashes: Calculating and bypassing +================================= + +Pooch uses cryptographic hashes to check if files are up-to-date or possibly +corrupted: + +* If a file exists in the local folder, Pooch will check that its hash matches + the one in the registry. If it doesn't, we'll assume that it needs to be + updated. +* If a file needs to be updated or doesn't exist, Pooch will download it from + the remote source and check the hash. If the hash doesn't match, an exception + is raised to warn of possible file corruption. 
+ +Calculating hashes +------------------ + +You can generate hashes for your data files using ``openssl`` in the terminal: + +.. code:: bash + + $ openssl sha256 data/c137.csv + SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d + +Or using the :func:`pooch.file_hash` function (which is a convenient way of +calling Python's :mod:`hashlib`): + +.. code:: python + + import pooch + print(pooch.file_hash("data/c137.csv")) + + +Specifying the hash algorithm +----------------------------- + +By default, Pooch uses `SHA256 `__ +hashes. +Other hash methods that are available in :mod:`hashlib` can also be used: + +.. code:: python + + import pooch + print(pooch.file_hash("data/c137.csv", alg="sha512")) + +In this case, you can specify the hash algorithm in the **registry** by +prepending it to the hash, for example ``"md5:0hljc7298ndo2"`` or +``"sha512:803o3uh2pecb2p3829d1bwouh9d"``. +Pooch will understand this and use the appropriate method. + + +Bypassing the hash check +------------------------ + +Sometimes we might not know the hash of the file or it could change on the +server periodically. +To bypass the check, we can set the hash value to ``None`` when specifying the +``registry`` argument for :func:`pooch.create` +(or the ``known_hash`` in :func:`pooch.retrieve`). + +In this example, we want to use Pooch to download a list of weather stations +around Australia: + +* The file with the stations is in an FTP server and we want to store it + locally in separate folders for each day that the code is run. +* The problem is that the ``stations.zip`` file is updated on the server + instead of creating a new one, so the hash check would fail. + +This is how you can solve this problem: + +.. 
code:: python + + import datetime + import pooch + + # Get the current date to store the files in separate folders + CURRENT_DATE = datetime.datetime.now().date() + + GOODBOY = pooch.create( + path=pooch.os_cache("bom_daily_stations") / str(CURRENT_DATE), + base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/", + registry={ + "stations.zip": None, + }, + ) + +When running this same code again at a different date, the file will be +downloaded again because the local cache folder changed and the file is no +longer present in it. +If you omit ``CURRENT_DATE`` from the cache path, then Pooch will only fetch +the files once, unless they are deleted from the cache. + +.. attention:: + + If this script is run over a period of time, your cache directory will + increase in size, as the files are stored in daily subdirectories. diff --git a/doc/index.rst b/doc/index.rst index df32d31a..2d2035fa 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -13,44 +13,63 @@ Pooch is a part of the `Fatiando a Terra `_ project. +.. admonition:: Using Pooch for your research? + + Please consider :ref:`citing it ` in your publications. + Citations help us get credit for all the effort we put into this project. + + .. include:: ../README.rst - :start-after: placeholder-for-doc-index + :start-after: placeholder-doc-index-start + :end-before: placeholder-doc-index-end .. toctree:: - :maxdepth: 2 + :maxdepth: 0 :hidden: :caption: Getting Started install.rst retrieve.rst - citing.rst + multiple-files.rst + sample-data.rst .. toctree:: - :maxdepth: 2 + :maxdepth: 0 :hidden: :caption: Training your Pooch - beginner.rst - intermediate.rst - advanced.rst + hashes.rst + user-defined-cache.rst + registry-files.rst + multiple-urls.rst + protocols.rst + logging.rst downloaders.rst processors.rst + authentication.rst + progressbars.rst + unpacking.rst + decompressing.rst .. 
toctree:: - :maxdepth: 2 + :maxdepth: 0 :hidden: - :caption: Reference Documentation + :caption: Reference api/index.rst + versions.rst + compatibility.rst changes.rst + citing.rst .. toctree:: - :maxdepth: 2 + :maxdepth: 0 :hidden: :caption: Getting help and contributing - Join the community - How to contribute - Code of Conduct - Source code on GitHub - The Fatiando a Terra project + Join the Community + Code of Conduct + How to Contribute + Source Code on GitHub + Authors + Fatiando a Terra diff --git a/doc/install.rst b/doc/install.rst index 60daa420..4d09655a 100644 --- a/doc/install.rst +++ b/doc/install.rst @@ -17,6 +17,11 @@ Python distributions to ensure you have all dependencies installed and the Installing Anaconda does not require administrative rights to your computer and doesn't interfere with any other Python installations in your system. +.. note:: + + The commands below should be executed in a terminal. On Windows, use the + "Anaconda Prompt" app or ``cmd.exe`` if you're not using Anaconda. + Installing with conda --------------------- @@ -60,43 +65,7 @@ Required: Optional: -* `tqdm `__: Required to print a download - progress bar (see :ref:`tqdm-progressbar` or :ref:`custom-progressbar`). -* `paramiko `__: Required for SFTP - downloads (see :class:`pooch.SFTPDownloader`). - - -Testing your install --------------------- - -.. note:: - - This step is optional. - -We ship a full test suite with the package. -To run the tests, you'll need to install some extra dependencies first: - -* `pytest `__ - -After that, you can test your installation by running the following inside a -Python interpreter or Jupyter notebook:: - - import pooch - pooch.test() - - -.. 
_python-versions: - -Python version compatibility ----------------------------- - -If you require support for older Python versions, please pin Pooch to the -following releases to ensure compatibility: - -+----------------+-------------------------------+ -| Python version | Last compatible Pooch release | -+----------------+-------------------------------+ -| 2.7 | 0.6.0 | -+----------------+-------------------------------+ -| 3.5 | 1.2.0 | -+----------------+-------------------------------+ +* `tqdm `__: For printing a download + progress bar. See :ref:`progressbars`. +* `paramiko `__: For SFTP downloads. See + :class:`pooch.SFTPDownloader`. diff --git a/doc/intermediate.rst b/doc/intermediate.rst deleted file mode 100644 index dc2f7389..00000000 --- a/doc/intermediate.rst +++ /dev/null @@ -1,340 +0,0 @@ -.. _intermediate: - -Intermediate tricks -=================== - -This section covers intermediate configuration that, while not strictly -necessary, you might want to consider using on your project. In particular, -allowing users to **control the local storage location** and **registry files** -are **recommended** for most projects. - - -User-defined local storage location ------------------------------------ - -In the above example, the location of the local storage in the users' computer -is hard-coded. There is no way for them to change it to something else. To -avoid being a tyrant, you can allow the user to define the ``path`` argument -using an environment variable: - -.. 
code:: python - - POOCH = pooch.create( - # This is still the default - path=pooch.os_cache("plumbus"), - base_url="https://github.com/rick/plumbus/raw/{version}/data/", - version=version, - version_dev="master", - registry={ - "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", - "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", - }, - # The name of an environment variable that can overwrite the path - env="PLUMBUS_DATA_DIR", - ) - -In this case, if the user defines the ``PLUMBUS_DATA_DIR`` environment -variable, we'll use its value instead of ``path``. Pooch will still append the -value of ``version`` to the path, so the value of ``PLUMBUS_DATA_DIR`` should -not include a version number. - - -Registry files (dealing with large registries) ----------------------------------------------- - -If your project has a large number of data files, it can be tedious to list -them in a dictionary. In these cases, it's better to store the file names and -hashes in a file and use :meth:`pooch.Pooch.load_registry` to read them: - -.. code:: python - - import os - import pkg_resources - - POOCH = pooch.create( - # Use the default cache folder for the OS - path=pooch.os_cache("plumbus"), - # The remote data is on Github - base_url="https://github.com/rick/plumbus/raw/{version}/data/", - version=version, - # If this is a development version, get the data from the master branch - version_dev="master", - # We'll load it from a file later - registry=None, - ) - # Get registry file from package_data - registry_file = pkg_resources.resource_stream("plumbus", "registry.txt") - # Load this registry file - POOCH.load_registry(registry_file) - -In this case, the ``registry.txt`` file is in the ``plumbus/`` package -directory and should be shipped with the package (see below for instructions). -We use `pkg_resources `__ -to access the ``registry.txt``, giving it the name of our Python package. - -The contents of ``registry.txt`` are: - -.. 
code-block:: none - - c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc - cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w - -A specific hashing algorithm can be enforced, if a checksum for a file is -prefixed with ``alg:``, e.g. - -.. code-block:: none - - c137.csv sha1:e32b18dab23935bc091c353b308f724f18edcb5e - cronen.csv md5:b53c08d3570b82665784cedde591a8b0 - - -To make sure the registry file is shipped with your package, include the -following in your ``MANIFEST.in`` file: - -.. code-block:: none - - include plumbus/registry.txt - -And the following entry in the ``setup`` function of your ``setup.py`` file: - -.. code:: python - - setup( - ... - package_data={"plumbus": ["registry.txt"]}, - ... - ) - -From Pooch v1.2.0 the registry file can also contain line comments, prepended -with a ``#``, e.g.: - -.. code-block:: none - - # C-137 sample data - c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc - # Cronenberg sample data - cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w - -.. note:: - - Make sure you set the Pooch version in your ``setup.py`` to >=1.2.0 when - using comments as earlier versions cannot handle them: - ``install_requires = [..., "pooch>=1.2.0", ...]`` - - -Creating a registry file ------------------------- - -If you have many data files, creating the registry and keeping it updated can -be a challenge. Function :func:`pooch.make_registry` will create a registry -file with all contents of a directory. For example, we can generate the -registry file for our fictitious project from the command-line: - -.. code:: bash - - $ python -c "import pooch; pooch.make_registry('data', 'plumbus/registry.txt')" - - -File-specific URLs ------------------- - -You can set a custom download URL for individual files with the ``urls`` -argument of :func:`pooch.create` or :class:`pooch.Pooch`. It should be a -dictionary with the file names as keys and the URLs for downloading the files -as values. 
For example, say we have a ``citadel.csv`` file that we want to -download from ``https://www.some-data-hosting-site.com`` instead: - -.. code:: python - - # The basic setup is the same - POOCH = pooch.create( - path=pooch.os_cache("plumbus"), - base_url="https://github.com/rick/plumbus/raw/{version}/data/", - version=version, - version_dev="master", - registry={ - "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", - "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", - # Still include the file in the registry - "citadel.csv": "893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc", - }, - # Now specify custom URLs for some of the files in the registry. - urls={ - "citadel.csv": "https://www.some-data-hosting-site.com/files/citadel.csv", - }, - ) - -Notice that versioning of custom URLs is not supported (since they are assumed -to be data files independent of your project) and the file name will not be -appended automatically to the URL (in case you want to change the file name in -local storage). - -Custom URLs can be used along side ``base_url`` or you can omit ``base_url`` -entirely by setting it to an empty string (``base_url=""``). However, doing so -requires setting a custom URL for every file in the registry. - -You can also include custom URLs in a registry file by adding the URL for a -file to end of the line (separated by a space): - -.. code-block:: none - - c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc - cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w - citadel.csv 893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc https://www.some-data-hosting-site.com/files/citadel.csv - -:meth:`pooch.Pooch.load_registry` will automatically populate the ``urls`` -attribute. This way, custom URLs don't need to be set in the code. In fact, the -module code doesn't change at all: - -.. 
code:: python - - # Define the Pooch exactly the same (urls is None by default) - POOCH = pooch.create( - path=pooch.os_cache("plumbus"), - base_url="https://github.com/rick/plumbus/raw/{version}/data/", - version=version, - version_dev="master", - registry=None, - ) - # If custom URLs are present in the registry file, they will be set - # automatically. - POOCH.load_registry(os.path.join(os.path.dirname(__file__), "registry.txt")) - - -Download protocols ------------------- - -Pooch supports the HTTP, FTP, and SFTP protocols by default. It will detect the -correct protocol from the URL and use the appropriate download method. For -example, if our data were hosted on an FTP server instead of GitHub, we could -use the following setup: - -.. code:: python - - POOCH = pooch.create( - path=pooch.os_cache("plumbus"), - # Use an FTP server instead of HTTP. The rest is all the same. - base_url="ftp://garage-basement.org/{version}/", - version=version, - version_dev="master", - registry={ - "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", - "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", - }, - ) - - - def fetch_c137(): - """ - Load the C-137 sample data as a pandas.DataFrame (over FTP this time). - """ - fname = POOCH.fetch("c137.csv") - data = pandas.read_csv(fname) - return data - -You can even specify custom functions for the download or login credentials for -authentication. See :ref:`downloaders` for more information. - -.. note:: - - To download files over SFTP, the package `paramiko - `__ needs to be installed. - - -Subdirectories --------------- - -You can have data files in subdirectories of the remote data store. These files -will be saved to the same subdirectories in the local storage folder. Note, -however, that the names of these files in the registry **must use Unix-style -separators** (``'/'``) even on Windows. We will handle the appropriate -conversions. - - -.. 
_tqdm-progressbar: - -Printing a download progress bar with ``tqdm`` ----------------------------------------------- - -The :class:`~pooch.HTTPDownloader` can use `tqdm `__ -to print a download progress bar. This is turned off by default but can be -enabled using: - -.. code:: python - - from pooch import HTTPDownloader - - - def fetch_large_data(): - """ - Fetch a large file from a server and print a progress bar. - """ - download = HTTPDownloader(progressbar=True) - fname = POOCH.fetch("large-data-file.h5", downloader=download) - data = h5py.File(fname, "r") - return data - -The resulting progress bar will be printed to stderr and should look something -like this: - -.. code:: - - 100%|█████████████████████████████████████████| 336/336 [...] - -.. note:: - - ``tqdm`` is not installed by default with Pooch. You will have to install - it separately in order to use this feature. - - -.. _custom-progressbar: - -Using custom progress bars --------------------------- - -.. note:: - - At the moment, this feature is only available for - :class:`pooch.HTTPDownloader`. - -Alternatively, you can pass an arbitrary object that behaves like a progress -that implements the ``update``, ``reset``, and ``close`` methods. ``update`` -should accept a single integer positional argument representing the current -completion (in bytes), while ``reset`` and ``update`` do not take any argument -beside ``self``. The object must also have a ``total`` attribute that can be set -from outside the class. -In other words, the custom progress bar needs to behave like a ``tqdm`` progress bar. -Here's a minimal working example of such a custom "progress display" class - -.. 
code:: python - - import sys - - class MinimalProgressDisplay: - def __init__(self, total): - self.count = 0 - self.total = total - - def __repr__(self): - return str(self.count) + "/" + str(self.total) - - def render(self): - print(f"\r{self}", file=sys.stderr, end="") - - def update(self, i): - self.count = i - self.render() - - def reset(self): - self.count = 0 - - def close(self): - print("", file=sys.stderr) - - -An instance of this class can now be passed to an ``HTTPDownloader`` as - -.. code:: python - - pbar = MinimalProgressDisplay(total=None) - download = HTTPDownloader(progressbar=pbar) diff --git a/doc/logging.rst b/doc/logging.rst new file mode 100644 index 00000000..6dddbe78 --- /dev/null +++ b/doc/logging.rst @@ -0,0 +1,27 @@ +.. _logging: + +Logging and verbosity +===================== + +Pooch uses the :mod:`logging` module to print messages about downloads and +:ref:`processor ` execution. + +Adjusting the logging level +--------------------------- + +Pooch will log events like downloading a new file, updating an existing one, or +unpacking an archive by printing to the terminal. +You can change how verbose these events are by getting the event logger from +pooch and changing the logging level: + +.. code:: python + + logger = pooch.get_logger() + logger.setLevel("WARNING") + +Most of the events from Pooch are logged at the info level; this code says that +you only care about warnings or errors, like inability to create the data +cache. +The event logger is a :class:`logging.Logger` object, so you can use that +class's methods to handle logging events in more sophisticated ways if you +wish. diff --git a/doc/multiple-files.rst b/doc/multiple-files.rst new file mode 100644 index 00000000..5831e14b --- /dev/null +++ b/doc/multiple-files.rst @@ -0,0 +1,121 @@ +.. 
_beginner: + +Fetching files from a registry +============================== + +If you need to manage the download of multiple files from one or more +locations, then this section is for you! + +Setup +----- + +In the following example we'll assume that: + +1. You have several data files served from the same base URL (for example, + ``"https://www.somewebpage.org/science/data"``). +2. You know the file names and their + `hashes `__. + +We will use :func:`pooch.create` to set up our download manager: + +.. code:: python + + import pooch + + + odie = pooch.create( + # Use the default cache folder for the operating system + path=pooch.os_cache("my-project"), + base_url="https://www.somewebpage.org/science/data/", + # The registry specifies the files that can be fetched + registry={ + "temperature.csv": "sha256:19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", + "gravity-disturbance.nc": "sha256:1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", + }, + ) + +The return value (``odie``) is an instance of :class:`pooch.Pooch`. +It contains all of the information needed to fetch the data files in our +**registry** and store them in the specified cache folder. + +.. note:: + + The Pooch **registry** is a mapping of file names and their associated + hashes (and optionally download URLs). + +.. tip:: + + If you don't know the hash or are otherwise unable to obtain it, it is + possible to bypass the check. This is **not recommended** for general use, + only if it can't be avoided. See :ref:`hashes`. + + +.. attention:: + + You can have data files in **subdirectories** of the remote data store + (URL). + These files will be saved to the same subdirectories in the local storage + folder. + + However, the names of these files in the registry **must use Unix-style + separators** (``'/'``) **even on Windows**. + Pooch will handle the appropriate conversions. + + +Downloading files +----------------- + +To download one of our data files and load it with `xarray +`__: + +.. 
code:: python + + import xarray as xr + + + file_path = odie.fetch("gravity-disturbance.nc") + # Standard use of xarray to load a netCDF file (.nc) + data = xr.open_dataset(file_path) + +The call to :meth:`pooch.Pooch.fetch` will check if the file already exists in +the cache folder. + +If it doesn't: + +1. The file is downloaded and saved to the cache folder. +2. The hash of the downloaded file is compared against the one stored in the + registry to make sure the file isn't corrupted. +3. The function returns the absolute path to the file on your computer. + +If it does: + +1. Check if its hash matches the one in the registry. +2. If it does, no download happens and the file path is returned. +3. If it doesn't, the file is downloaded once more to get an updated version on + your computer. + +Why use this method? +-------------------- + +With :class:`pooch.Pooch`, you can centralize the information about the URLs, +hashes, and files in a single place. +Once the instance is created, it can be used to fetch individual files without +repeating the URL and hash everywhere. + +A good way to use this is to place the call to :func:`pooch.create` in a Python +module (a ``.py`` file). +Then you can ``import`` the module in ``.py`` scripts or Jupyter notebooks and +use the instance to fetch your data. +This way, you don't need to define the URLs or hashes in multiple +scripts/notebooks. + +Customizing the download +------------------------ + +The :meth:`pooch.Pooch.fetch` method supports all of Pooch's +:ref:`downloaders ` and :ref:`processors `. +You can use HTTP, FTP, and SFTP +(even with :ref:`authentication `), +:ref:`decompress files `, +:ref:`unpack archives `, +show :ref:`progress bars `, and more with a bit of configuration. diff --git a/doc/multiple-urls.rst b/doc/multiple-urls.rst new file mode 100644 index 00000000..5b0cbfbb --- /dev/null +++ b/doc/multiple-urls.rst @@ -0,0 +1,82 @@ +.. 
_multipleurls: + +Multiple download URLs +====================== + +You can set different download URLs for individual files with the ``urls`` +argument of :func:`pooch.create`. +It should be a dictionary with the file names as keys and the URLs for +downloading the files as values. + +For example, say we have a ``citadel.csv`` file that we want to download from +``https://www.some-data-hosting-site.com`` instead: + +.. code:: python + + # The basic setup is the same + POOCH = pooch.create( + path=pooch.os_cache("plumbus"), + base_url="https://github.com/rick/plumbus/raw/{version}/data/", + version=version, + version_dev="master", + registry={ + "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", + "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", + # Still include the file in the registry + "citadel.csv": "893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc", + }, + # Now specify custom URLs for some of the files in the registry. + urls={ + "citadel.csv": "https://www.some-data-hosting-site.com/files/citadel.csv", + }, + ) + +When ``POOCH.fetch("citadel.csv")`` is called, the download will be from the +specified URL instead of the ``base_url``. +The file name will not be appended automatically to the URL in case you want to +change the file name in local storage. + +.. attention:: + + **Versioning of custom URLs is not supported** since they are assumed to be + data files independent of your project. + The file will **still be placed in a versioned cache folder**. + + +.. tip:: + + Custom URLs can be used alongside ``base_url`` or you can omit + ``base_url`` entirely by setting it to an empty string (``base_url=""``). + **Doing so requires setting a custom URL for every file in the registry**. + +Usage with registry files +------------------------- + +You can also include custom URLs in a :ref:`registry file ` by +adding the URL for a file to the end of the line (separated by a space): + +.. 
code-block:: none + + c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc + cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w + citadel.csv 893yprofwjndcwhx9c0ehp3ue9gcwoscjwdfgh923e0hwhcwiyc https://www.some-data-hosting-site.com/files/citadel.csv + +:meth:`pooch.Pooch.load_registry` will automatically populate the ``urls`` +attribute. +This way, custom URLs don't need to be set in the code. +In fact, the module code doesn't change at all: + +.. code:: python + + # Define the Pooch exactly the same (urls is None by default) + POOCH = pooch.create( + path=pooch.os_cache("plumbus"), + base_url="https://github.com/rick/plumbus/raw/{version}/data/", + version=version, + version_dev="master", + registry=None, + ) + # If custom URLs are present in the registry file, they will be set + # automatically. + POOCH.load_registry(os.path.join(os.path.dirname(__file__), "registry.txt")) + diff --git a/doc/processors.rst b/doc/processors.rst index 6c83dc8a..4f946b90 100644 --- a/doc/processors.rst +++ b/doc/processors.rst @@ -1,18 +1,20 @@ .. _processors: -Processors -========== +Processors: Post-download actions +================================= Post-download actions sometimes need to be taken on downloaded files -(unzipping, conversion to a more efficient format, etc). If these actions are -time or memory consuming, it might be worth doing them only once after the file -is downloaded. This is a way of trading disk space for computation time. +(unzipping, conversion to a more efficient format, etc). +If these actions are time or memory consuming, it might be worth doing them +only once after the file is downloaded. +This is a way of trading disk space for computation time. :meth:`pooch.Pooch.fetch` and :func:`pooch.retrieve` accept the ``processor`` argument to handle these situations. Processors are Python *callable objects* (like functions or classes with a ``__call__`` method) that are executed after a file is downloaded to perform -these actions. 
They must have the following format: +these actions. +They must have the following format: .. code:: python @@ -40,139 +42,21 @@ these actions. They must have the following format: The processor is executed after a file download is attempted (whether the download actually happens or not) and before returning the path to the -downloaded file. The processor lets us intercept the returned path, perform -actions, and possibly return a different path. +downloaded file. +The processor lets us intercept the returned path, perform actions, and +possibly return a different path. Pooch provides built-in processors for common tasks, like decompressing files and unpacking tar and zip archives. See the :ref:`api` for a full list. +Common use cases for processors include: -Unpacking archives ------------------- +* :ref:`Unpacking archives ` to load individual members +* :ref:`Decompressing ` files -Let's say our data file is actually a zip (or tar) archive with a collection of files. -We may want to store an unpacked version of the archive or extract just a -single file from it. We can do both operations with the :class:`pooch.Unzip` -and :class:`pooch.Untar` processors. -For example, to extract a single file from a zip archive: - -.. code:: python - - from pooch import Unzip - - - def fetch_zipped_file(): - """ - Load a large zipped sample data as a pandas.DataFrame. 
- """ - # Extract the file "actual-data-file.txt" from the archive - unpack = Unzip(members=["actual-data-file.txt"]) - # Pass in the processor to unzip the data file - fnames = GOODBOY.fetch("zipped-data-file.zip", processor=unpack) - # Returns the paths of all extract members (in our case, only one) - fname = fnames[0] - # fname is now the path of the unzipped file ("actual-data-file.txt") - # which can be loaded by pandas directly - data = pandas.read_csv(fname) - return data - -By default, the :class:`~pooch.Unzip` processor (and similarly the -:class:`~pooch.Untar` processor) will create a new folder in the same location -as the downloaded archive file, and give it the same name as the archive file -with the suffix ``.unzip`` (or ``.untar``) appended. If you want to change the -location of the unpacked files, you can provide a parameter ``extract_dir`` to -the processor to tell it where you want to unpack the files: - -.. code:: python - - from pooch import Untar - - - def fetch_and_unpack_tar_file(): - """ - Unpack a file from a tar archive to a custom subdirectory in the cache. - """ - # Extract a single file from the archive, to a specific location - unpack_to_custom_dir = Untar(members=["actual-data-file.txt"], - extract_dir="custom_folder") - # Pass in the processor to untar the data file - fnames = GOODBOY.fetch("tarred-data-file.tar.gz", processor=unpack) - # Returns the paths of all extract members (in our case, only one) - fname = fnames[0] - return fname - - -To extract all files into a folder and return the path to each file, simply -omit the ``members`` parameter: - -.. code:: python - - def fetch_zipped_archive(): - """ - Load all files from a zipped archive. - """ - fnames = GOODBOY.fetch("zipped-archive.zip", processor=Unzip()) - return fnames - -Use :class:`pooch.Untar` to do the exact same for tar archives (with optional -compression). 
- - -Decompressing -------------- - -If you have a compressed file that is not an archive (zip or tar), you can use -:class:`pooch.Decompress` to decompress it after download. For example, large -binary files can be compressed with ``gzip`` to reduce download times but will -need to be decompressed before loading, which can be slow. You can trade -storage space for speed by keeping a decompressed copy of the file: - -.. code:: python - - from pooch import Decompress - - def fetch_compressed_file(): - """ - Load a large binary file that has been gzip compressed. - """ - # Pass in the processor to decompress the file on download - fname = GOODBOY.fetch("large-binary-file.npy.gz", processor=Decompress()) - # The file returned is the decompressed version which can be loaded by - # numpy - data = numpy.load(fname) - return data - -:class:`pooch.Decompress` returns ``"large-binary-file.npy.gz.decomp"`` as the -decompressed file name by default. You can change this behaviour by passing a -file name instead: - -.. code:: python - - import os - from pooch import Decompress - - def fetch_compressed_file(): - """ - Load a large binary file that has been gzip compressed. - """ - # Pass in the processor to decompress the file on download - fname = GOODBOY.fetch("large-binary-file.npy.gz", - processor=Decompress(name="a-different-file-name.npy"), - ) - # The file returned is now named "a-different-file-name.npy" - data = numpy.load(fname) - return data - -.. warning:: - - Passing in ``name`` can cause existing data to be lost! For example, if - a file already exists with the specified name it will be overwritten with - the new decompressed file content. **Use this option with caution.** - - -Custom processors ------------------ +Creating your own processors +---------------------------- Let's say we want to implement the :class:`pooch.Unzip` processor ourselves to extract a single file from the archive. 
We could do that with the following diff --git a/doc/progressbars.rst b/doc/progressbars.rst new file mode 100644 index 00000000..7f64784c --- /dev/null +++ b/doc/progressbars.rst @@ -0,0 +1,104 @@ +.. _progressbars: + +Printing progress bars +====================== + +.. _tqdm-progressbar: + +Using ``tqdm`` progress bars +---------------------------- + +The :class:`~pooch.HTTPDownloader` can use +`tqdm `__ to print a download progress bar. +This is turned off by default but can be enabled using: + +.. code:: python + + # Assuming you have a pooch.Pooch instance setup + POOCH = pooch.create( + ... + ) + + fname = POOCH.fetch( + "large-data-file.h5", + downloader=pooch.HTTPDownloader(progressbar=True), + ) + +The resulting progress bar will be printed to the standard error stream +(STDERR) and should look something like this: + +.. code:: + + 100%|█████████████████████████████████████████| 336/336 [...] + +.. note:: + + ``tqdm`` is not installed by default with Pooch. You will have to install + it separately in order to use this feature. + + +.. _custom-progressbar: + +Using custom progress bars +-------------------------- + +.. note:: + + At the moment, this feature is only available for + :class:`pooch.HTTPDownloader`. + +Alternatively, you can pass an arbitrary object that behaves like a progress +bar and implements the ``update``, ``reset``, and ``close`` methods: + +* ``update`` should accept a single integer positional argument representing + the current completion (in bytes). +* ``reset`` and ``close`` do not take any arguments besides ``self``. + +The object must also have a ``total`` attribute that can be set from outside +the class. +In other words, the custom progress bar needs to behave like a ``tqdm`` +progress bar. + +Here's a minimal working example of such a custom "progress display" class: + +.. 
code:: python + + import sys + + class MinimalProgressDisplay: + def __init__(self, total): + self.count = 0 + self.total = total + + def __repr__(self): + return str(self.count) + "/" + str(self.total) + + def render(self): + print(f"\r{self}", file=sys.stderr, end="") + + def update(self, i): + self.count = i + self.render() + + def reset(self): + self.count = 0 + + def close(self): + print("", file=sys.stderr) + + +An instance of this class can now be passed to an ``HTTPDownloader`` as: + +.. code:: python + + # Assuming you have a pooch.Pooch instance setup + POOCH = pooch.create( + ... + ) + + minimal_progress = MinimalProgressDisplay(total=None) + + fname = POOCH.fetch( + "large-data-file.h5", + downloader=pooch.HTTPDownloader(progressbar=minimal_progress), + ) diff --git a/doc/protocols.rst b/doc/protocols.rst new file mode 100644 index 00000000..3eba610a --- /dev/null +++ b/doc/protocols.rst @@ -0,0 +1,42 @@ +.. _protocols: + +Download protocols +================== + +Pooch supports the HTTP, FTP, and SFTP protocols by default. +It will **automatically detect** the correct protocol from the URL and use the +appropriate download method. + +.. note:: + + To download files over SFTP, + `paramiko `__ needs to be installed. + +For example, if our data were hosted on an FTP server, we could use the +following setup: + +.. code:: python + + POOCH = pooch.create( + path=pooch.os_cache("plumbus"), + # Use an FTP server instead of HTTP. The rest is all the same. + base_url="ftp://garage-basement.org/{version}/", + version=version, + version_dev="master", + registry={ + "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", + "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", + }, + ) + + + def fetch_c137(): + """ + Load the C-137 sample data as a pandas.DataFrame (over FTP this time). 
+ """ + fname = POOCH.fetch("c137.csv") + data = pandas.read_csv(fname) + return data + +You can even specify custom functions for the download or login credentials for +**authentication**. See :ref:`downloaders` for more information. diff --git a/doc/registry-files.rst b/doc/registry-files.rst new file mode 100644 index 00000000..571a146e --- /dev/null +++ b/doc/registry-files.rst @@ -0,0 +1,178 @@ +.. _registryfiles: + +Registry files +============== + +Usage +----- + +If your project has a large number of data files, it can be tedious to list +them in a dictionary. In these cases, it's better to store the file names and +hashes in a file and use :meth:`pooch.Pooch.load_registry` to read them. + +.. code:: python + + import os + import pkg_resources + + POOCH = pooch.create( + path=pooch.os_cache("plumbus"), + base_url="https://github.com/rick/plumbus/raw/{version}/data/", + version=version, + version_dev="main", + # We'll load it from a file later + registry=None, + ) + # Get registry file from package_data + registry_file = pkg_resources.resource_stream("plumbus", "registry.txt") + # Load this registry file + POOCH.load_registry(registry_file) + +In this case, the ``registry.txt`` file is in the ``plumbus/`` package +directory and should be shipped with the package (see below for instructions). +We use `pkg_resources `__ +to access the ``registry.txt``, giving it the name of our Python package. + +Registry file format +-------------------- + +Registry files are light-weight text files that specify a file's name and hash. +In our example, the contents of ``registry.txt`` are: + +.. code-block:: none + + c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc + cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w + +A specific hashing algorithm can be enforced, if a checksum for a file is +prefixed with ``alg:``: + +.. 
code-block:: none + + c137.csv sha1:e32b18dab23935bc091c353b308f724f18edcb5e + cronen.csv md5:b53c08d3570b82665784cedde591a8b0 + +From Pooch v1.2.0 the registry file can also contain line comments, prepended +with a ``#``: + +.. code-block:: none + + # C-137 sample data + c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc + # Cronenberg sample data + cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w + +.. attention:: + + Make sure you set the Pooch version in your ``setup.py`` to >=1.2.0 when + using comments as earlier versions cannot handle them: + ``install_requires = [..., "pooch>=1.2.0", ...]`` + + +Packaging registry files +------------------------ + +To make sure the registry file is shipped with your package, include the +following in your ``MANIFEST.in`` file: + +.. code-block:: none + + include plumbus/registry.txt + +And the following entry in the ``setup`` function of your ``setup.py`` file: + +.. code:: python + + setup( + ... + package_data={"plumbus": ["registry.txt"]}, + ... + ) + + +Creating a registry file +------------------------ + +If you have many data files, creating the registry and keeping it updated can +be a challenge. Function :func:`pooch.make_registry` will create a registry +file with all contents of a directory. For example, we can generate the +registry file for our fictitious project from the command-line: + +.. code:: bash + + $ python -c "import pooch; pooch.make_registry('data', 'plumbus/registry.txt')" + + +Create registry file from remote files +-------------------------------------- + +If you want to create a registry file for a large number of data files that are +available for download but you don't have their hashes or any local copies, +you must download them first. Manually downloading each file +can be tedious. However, we can automate the process using +:func:`pooch.retrieve`. Below, we'll explore two different scenarios. 
+ +If the data files share the same base URL, we can use :func:`pooch.retrieve` +to download them and then use :func:`pooch.make_registry` to create the +registry: + +.. code:: python + + import os + + # Names of the data files + filenames = ["c137.csv", "cronen.csv", "citadel.csv"] + + # Base URL from which the data files can be downloaded + base_url = "https://www.some-data-hosting-site.com/files/" + + # Create a new directory where all files will be downloaded + directory = "data_files" + os.makedirs(directory) + + # Download each data file to data_files + for fname in filenames: + path = pooch.retrieve( + url=base_url + fname, known_hash=None, fname=fname, path=directory + ) + + # Create the registry file from the downloaded data files + pooch.make_registry("data_files", "registry.txt") + +If each data file has its own URL, the registry file can be manually created +after downloading each data file through :func:`pooch.retrieve`: + +.. code:: python + + import os + + # Names and URLs of the data files. The file names are used for naming the + # downloaded files. These are the names that will be included in the registry. + fnames_and_urls = { + "c137.csv": "https://www.some-data-hosting-site.com/c137/data.csv", + "cronen.csv": "https://www.some-data-hosting-site.com/cronen/data.csv", + "citadel.csv": "https://www.some-data-hosting-site.com/citadel/data.csv", + } + + # Create a new directory where all files will be downloaded + directory = "data_files" + os.makedirs(directory) + + # Create a new registry file + with open("registry.txt", "w") as registry: + for fname, url in fnames_and_urls.items(): + # Download each data file to the specified directory + path = pooch.retrieve( + url=url, known_hash=None, fname=fname, path=directory + ) + # Add the name, hash, and URL of the file to the new registry file + registry.write( + f"{fname} {pooch.file_hash(path)} {url}\n" + ) + +.. 
warning:: + + Notice that there are **no checks for download integrity** (since we don't + know the file hashes beforehand). Only do this for trusted data sources + and over a secure connection. If you have access to file hashes/checksums, + **we highly recommend using them** to set the ``known_hash`` argument. diff --git a/doc/retrieve.rst b/doc/retrieve.rst index 04c9b284..454d6be3 100644 --- a/doc/retrieve.rst +++ b/doc/retrieve.rst @@ -1,21 +1,12 @@ .. _retrieve: -Retrieving a data file -====================== +Retrieving a single data file +============================= -A common task in data analysis workflows is downloading the data from a -publicly available source. This could be done manually (which can't be easily -reproduced) or programmatically using :mod:`urllib` or :mod:`requests` (which -can require a non-trivial amount of code). Ideally, we should -be checking that the downloaded file is not corrupted with a known -`checksum `__. +Basic usage +----------- - -Getting started ---------------- - -Pooch is designed to simplify all of these tasks (and more). If you're only -looking to download one or two data files only, Pooch offers the +If you only want to download one or two data files, use the :func:`pooch.retrieve` function: .. code-block:: python @@ -23,62 +14,81 @@ looking to download one or two data files only, Pooch offers the import pooch - # Download the file and save it locally. - fname = pooch.retrieve( + file_path = pooch.retrieve( # URL to one of Pooch's test files url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt", - # Pooch will check the MD5 checksum of the downloaded file against the - # given value to make sure it haven't been corrupted. You can use other - # hashes by specifying different algorithm names (sha256, sha1, etc). 
known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e", ) -The file is stored locally, by default in a folder called ``pooch`` in the -default cache location of your operating system (see :func:`pooch.os_cache`). -The function returns the full path to the downloaded data file, which you can -then pass to pandas, numpy, xarray, etc, to load into memory. +The code above will: + +1. Check if the file from this URL already exists in Pooch's default cache + folder (see :func:`pooch.os_cache`). +2. If it doesn't, the file is downloaded and saved to the cache folder. +3. The MD5 `hash `__ + is compared against the ``known_hash`` to make sure the file isn't + corrupted. +4. The function returns the absolute path to the file on your computer. -Running this code a second time will not trigger a download since the file -already exists. So you can place this function call at the start of your script -or Jupyter notebook without having to worry about repeat downloads. Anyone -getting a copy of your code should also get the correct data file the first -time they run it. +If the file already exists on your machine, Pooch will check if its MD5 hash +matches the ``known_hash``: -If the file is updated on the server and ``known_hash`` is set to the checksum -of the new file, Pooch will automatically detect that the file needs to be -updated and download the new version. +* If it does, no download happens and the file path is returned. +* If it doesn't, the file is downloaded once more to get an updated version on + your computer. -.. note:: +Since the download happens only once, you can place this function call at the +start of your script or Jupyter notebook without having to worry about repeat +downloads. +Anyone getting a copy of your code should also get the correct data file the +first time they run it. - The :func:`pooch.retrieve` function is useful when you have one or two - files to download. 
**If you need to manage the download and caching of - several files** (for example, if you're developing a Python package or for - large data analysis projects), then you should start using the full - capabilities of the :class:`pooch.Pooch` class. It can handle sandboxing - data for different package versions, allow users to set the download - locations, and more. +.. seealso:: - See :ref:`beginner` and :ref:`intermediate` to get started. + You can use **different hashes** by specifying different algorithm names: + ``sha256:XXXXXX``, ``sha1:XXXXXX``, etc. See :ref:`hashes`. Unknown file hash ----------------- If you don't know the hash of the file, you can set ``known_hash=None`` to -bypass the check. If this is the case, :func:`~pooch.retrieve` will print a log -message with the SHA256 hash of the downloaded file. **It's highly recommended -that you copy and paste this hash into your code** and use it as the -``known_hash``. +bypass the check. +:func:`~pooch.retrieve` will print a log message with the SHA256 hash of the +downloaded file. +**It's highly recommended that you copy and paste this hash into your code +and use it as the** ``known_hash``. -That way, the next time your code is run (by you or someone -else) you can guarantee that the exact same file is downloaded. This is a way -to help make sure the results of your code are reproducible. +.. tip:: + + Setting the ``known_hash`` guarantees that the next time your code is run + (by you or someone else) the exact same file is downloaded. This helps + make the results of your code **reproducible**. Customizing the download ------------------------ -Function :func:`pooch.retrieve` has support for all of Pooch's -:ref:`downloaders ` and :ref:`processors `. You can -use HTTP, FTP, and SFTP (with or without authentication), decompress files, unpack -archives, show progress bars, and more with a bit of configuration. 
+The :func:`pooch.retrieve` function supports all of Pooch's +:ref:`downloaders ` and :ref:`processors `. +You can use HTTP, FTP, and SFTP +(even with :ref:`authentication `), +:ref:`decompress files `, +:ref:`unpack archives `, +show :ref:`progress bars `, and more with a bit of configuration. + + +When not to use ``retrieve`` +---------------------------- + +If you need to manage the download and caching of several files from one or +more sources, then you should start using the full capabilities of the +:class:`pooch.Pooch` class. +It can handle sandboxing +data for different package versions, allow users to set the download +locations, and more. + +The classic example is a **Python package that contains several sample +datasets** for use in testing and documentation. + +See :ref:`beginner` and :ref:`intermediate` to get started. diff --git a/doc/sample-data.rst b/doc/sample-data.rst new file mode 100644 index 00000000..9e1e8ef9 --- /dev/null +++ b/doc/sample-data.rst @@ -0,0 +1,177 @@ +.. _intermediate: + +Manage a package's sample data +============================== + +In this section, we'll use Pooch to manage the download of a Python package's +sample datasets. +The setup will be very similar to what we saw in :ref:`beginner` (please read +that tutorial first). +This time, we'll also use some other features that make our lives a bit easier. + +The problem +----------- + +In this example, we'll work with the following assumptions: + +* You develop a Python library called ``plumbus`` for analysing data emitted by + interdimensional portals. +* You want to distribute sample data so that your users can easily try out the + library by copying and pasting from the documentation. +* You want to have a ``plumbus.datasets`` module that defines functions like + ``fetch_c137()`` that will return the data loaded as a + :class:`pandas.DataFrame` for convenient access. 
+* Your sample data are in a folder of your GitHub repository but you don't want + to include the data files with your source and wheel distributions because of + their size. +* You use git tags to mark releases of your project. +* Your project has a variable that defines the version string. +* The version string contains an indicator that the current commit is not a + release (like ``'v1.2.3+12.d908jdl'`` or ``'v0.1+dev'``). + +For now, let's say that this is the layout of your repository on GitHub: + +.. code-block:: none + + doc/ + ... + data/ + README.md + c137.csv + cronen.csv + plumbus/ + __init__.py + ... + datasets.py + setup.py + ... + +The sample data are stored in the ``data`` folder of your repository. + +.. seealso:: + + Pooch can handle different use cases as well, like: FTP/SFTP, authenticated + HTTP, multiple URLs, decompressing and unpacking archives, etc. + See the tutorials under "Training your Pooch" and the documentation for + :func:`pooch.create` and :func:`pooch.Pooch` for more options. + + +Basic setup +----------- + +This is what the ``plumbus/datasets.py`` file would look like: + +.. code:: python + + """ + Load sample data. + """ + import pandas + import pooch + + from . import version # The version string of your project + + + BRIAN = pooch.create( + # Use the default cache folder for the operating system + path=pooch.os_cache("plumbus"), + # The remote data is on Github + base_url="https://github.com/rick/plumbus/raw/{version}/data/", + version=version, + # If this is a development version, get the data from the "main" branch + version_dev="main", + registry={ + "c137.csv": "sha256:19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", + "cronen.csv": "sha256:1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", + }, + ) + + + def fetch_c137(): + """ + Load the C-137 sample data as a pandas.DataFrame. + """ + # The file will be downloaded automatically the first time this is run + # returns the file path to the downloaded file. 
Afterwards, Pooch finds + # it in the local cache and doesn't repeat the download. + fname = BRIAN.fetch("c137.csv") + # The "fetch" method returns the full path to the downloaded data file. + # All we need to do now is load it with our standard Python tools. + data = pandas.read_csv(fname) + return data + + + def fetch_cronen(): + """ + Load the Cronenberg sample data as a pandas.DataFrame. + """ + fname = BRIAN.fetch("cronen.csv") + data = pandas.read_csv(fname) + return data + + +The ``BRIAN`` variable captures the value returned by :func:`pooch.create`, +which is an instance of the :class:`~pooch.Pooch` class. The class contains the +data registry (files, URLs, hashes, etc) and handles downloading files from the +registry using the :meth:`~pooch.Pooch.fetch` method. +When the user calls ``plumbus.datasets.fetch_c137()`` for the first time, the +data file will be downloaded and stored in the local storage. + +.. tip:: + + We're using :func:`pooch.os_cache` to set the local folder to the default + cache location for the user's operating system. You could also provide any + other path if you prefer. + + +Versioning +---------- + +The files from different versions of your project will be kept in separate +folders to make sure they don't conflict with each other. This way, you can +safely update data files while maintaining backward compatibility. For example, +if ``path=".plumbus"`` and ``version="v0.1"``, the data folder will be +``.plumbus/v0.1``. + +When your project updates, Pooch will automatically set up a separate folder for +the new data files based on the given version string. The remote URL will also +be updated. Notice that there is a format specifier ``{version}`` in the URL +that Pooch substitutes for you. + +**Versioning is optional** and can be ignored by omitting the ``version`` and +``version_dev`` arguments or setting them to ``None``. 
+ + +Retry failed downloads +---------------------- + +When downloading data repeatedly, like in continuous integration, failures can +occur due to sporadic network outages or other factors outside of our control. +In these cases, it can be frustrating to have entire jobs fail because a single +download was not successful. + +Pooch allows you to specify a number of times to retry the download in case of +failure by setting ``retry_if_failed`` in :func:`pooch.create`. This setting +will be valid for all downloads attempted with :meth:`pooch.Pooch.fetch`. The +download can fail because the file hash doesn't match the known hash (due to a +partial download, for example) or because of network errors coming from +:mod:`requests`. Other errors (file system permission errors, etc) will still +result in a failed download. + +.. note:: + + Requires Pooch >= 1.3.0. + + + +Where to go from here +--------------------- + +Pooch has more features for handling different download protocols, handling +large registries, downloading from multiple sources, and more. Check out the +tutorials under "Training your Pooch" for more information. + +You can also customize the download itself (adding authentication, progress +bars, etc) and apply post-download steps (unzipping an archive, decompressing a +file, etc) through its :ref:`downloaders ` and +:ref:`processors `. diff --git a/doc/unpacking.rst b/doc/unpacking.rst new file mode 100644 index 00000000..7beee837 --- /dev/null +++ b/doc/unpacking.rst @@ -0,0 +1,76 @@ +.. _unpacking: + +Unpacking archives +================== + +Let's say our data file is actually a zip (or tar) archive with a collection of +files. +We may want to store an unpacked version of the archive or extract just a +single file from it. +We can do both operations with the :class:`pooch.Unzip` and +:class:`pooch.Untar` processors. + +For example, to extract a single file from a zip archive: + +.. 
code:: python + + from pooch import Unzip + + + def fetch_zipped_file(): + """ + Load a large zipped sample data file as a pandas.DataFrame. + """ + # Extract the file "actual-data-file.txt" from the archive + unpack = Unzip(members=["actual-data-file.txt"]) + # Pass in the processor to unzip the data file + fnames = GOODBOY.fetch("zipped-data-file.zip", processor=unpack) + # Returns the paths of all extracted members (in our case, only one) + fname = fnames[0] + # fname is now the path of the unzipped file ("actual-data-file.txt") + # which can be loaded by pandas directly + data = pandas.read_csv(fname) + return data + +By default, the :class:`~pooch.Unzip` processor (and similarly the +:class:`~pooch.Untar` processor) will create a new folder in the same location +as the downloaded archive file, and give it the same name as the archive file +with the suffix ``.unzip`` (or ``.untar``) appended. + +If you want to change the location of the unpacked files, you can provide a +parameter ``extract_dir`` to the processor to tell it where you want to unpack +the files: + +.. code:: python + + from pooch import Untar + + + def fetch_and_unpack_tar_file(): + """ + Unpack a file from a tar archive to a custom subdirectory in the cache. + """ + # Extract a single file from the archive, to a specific location + unpack_to_custom_dir = Untar(members=["actual-data-file.txt"], + extract_dir="custom_folder") + # Pass in the processor to untar the data file + fnames = GOODBOY.fetch("tarred-data-file.tar.gz", processor=unpack_to_custom_dir) + # Returns the paths of all extracted members (in our case, only one) + fname = fnames[0] + return fname + + +To extract all files into a folder and return the path to each file, omit the +``members`` parameter: + +.. code:: python + + def fetch_zipped_archive(): + """ + Load all files from a zipped archive. 
+ """ + fnames = GOODBOY.fetch("zipped-archive.zip", processor=Unzip()) + return fnames + +Use :class:`pooch.Untar` to do exactly the same for tar archives (with optional +compression). diff --git a/doc/user-defined-cache.rst b/doc/user-defined-cache.rst new file mode 100644 index 00000000..d33de97d --- /dev/null +++ b/doc/user-defined-cache.rst @@ -0,0 +1,34 @@ +.. _environmentvariable: + +User-defined cache location +--------------------------- + +The location of the local storage cache on the user's computer +is usually hard-coded when we call :func:`pooch.create`. +There is no way for them to change it to something else. + +To avoid being a tyrant, you can allow the user to define the cache location +using an environment variable: + +.. code:: python + + BRIAN = pooch.create( + # This is still the default + path=pooch.os_cache("plumbus"), + base_url="https://github.com/rick/plumbus/raw/{version}/data/", + version=version, + version_dev="master", + registry={ + "c137.csv": "19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc", + "cronen.csv": "1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w", + }, + # The name of an environment variable that can override the path + env="PLUMBUS_DATA_DIR", + ) + +In this case, if the user defines the ``PLUMBUS_DATA_DIR`` environment +variable, Pooch will use its value instead of ``path``. + +Pooch will still append the value of ``version`` to the path, so the value of +``PLUMBUS_DATA_DIR`` should not include a version number. 
+ diff --git a/doc/versions.rst b/doc/versions.rst new file mode 100644 index 00000000..89c718ea --- /dev/null +++ b/doc/versions.rst @@ -0,0 +1,28 @@ +Documentation for other versions +-------------------------------- + +Use the links below to access documentation for specific versions +(when in doubt, use the **latest release**): + +* `Latest release `__ +* `Development `__ + (reflects the current development branch on GitHub) +* `v1.3.0 `__ +* `v1.2.0 `__ +* `v1.1.1 `__ +* `v1.1.0 `__ +* `v1.0.0 `__ +* `v0.7.1 `__ +* `v0.7.0 `__ +* `v0.6.0 `__ +* `v0.5.2 `__ +* `v0.5.1 `__ +* `v0.5.0 `__ +* `v0.4.0 `__ +* `v0.3.1 `__ +* `v0.3.0 `__ +* `v0.2.1 `__ +* `v0.2.0 `__ +* `v0.1.1 `__ +* `v0.1 `__ +