Bdbag Versions

Big Data Bag Utilities

v1.7.2

2 months ago

Release Notes

Minor feature addition and bugfixes

  • Introducing support for bag idempotency, or reproducible bags. A reproducible bag is a bag that has content-equivalence (in both payload and metadata, including manifests) to another bag created at a different time with the same content, structure, bagging tool, and profile (if used). When this bag creation and bag archive mode is enabled, two separately created bags (or bag archive files) with content-equivalence will hash equally, whether the hash is calculated on the bytes of the resultant archive file or on the equivalently ordered set of individual file hashes of the bag's contents. See the API Guide for additional information.
  • PR: #59 Only require the external package importlib_metadata for Python < 3.8. This module is already included as importlib.metadata in Python versions 3.8 and above.
  • Fix issue with the HTTP fetch handler where the bearer-token auth header, which is stripped on redirects, was not restored to the cached requests session after the redirect.
  • Remove dependency on deprecated distutils and distutils.util.strtobool function.
  • The is_bag API function will no longer attempt to instantiate a Bag object on non-directories.
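
The second form of equivalence described in the first item, a hash over the equivalently ordered set of individual file hashes, can be illustrated with the standard library alone. This is a minimal sketch of the idea and not bdbag's implementation: the function name ordered_content_hash, the choice of SHA-256, and the line format fed to the combined hash are all assumptions made for illustration.

```python
import hashlib
import os


def ordered_content_hash(root):
    """Hash a directory tree by combining per-file hashes in sorted path order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so the result does not depend on the platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append((rel, digest))

    combined = hashlib.sha256()
    # Sorting removes any dependence on filesystem enumeration order.
    for rel, digest in sorted(entries):
        combined.update(f"{digest}  {rel}\n".encode("utf-8"))
    return combined.hexdigest()
```

Two directory trees with the same relative paths and the same file bytes produce the same value regardless of when or in what order the files were written.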

v1.7.1

9 months ago

Release Notes

Bugfix Release

Fix an issue where packaging.parse threw InvalidVersion in the upgrade_config() function when parsing the informational version string VERSION that bdbag sets when it is running in a "frozen" environment (e.g., one built with cx_Freeze). In such cases, VERSION is set to something like 1.7.1-frozen, which is not PEP 440 compliant. This was not an issue in previous releases because the implementation used pkg_resources.parse_version, which was not as strict.

The code in upgrade_config() now parses the PEP 440 compliant version returned by distribution("bdbag").version from importlib_metadata instead of the global string VERSION, which is still used elsewhere for purely informational and descriptive purposes.

Note that this bug only affects bdbag when it is running in a frozen environment. Otherwise, this release is functionally equivalent to 1.7.0.
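
The failure mode can be reproduced without bdbag: a strict version parser rejects the -frozen suffix, so the informational string and the comparable version have to be kept apart. The sketch below uses a deliberately simplified release-only pattern in place of the full PEP 440 grammar; parse_release is a hypothetical helper for illustration, not a bdbag or packaging function.

```python
import re

# Simplified: accepts only plain releases such as "1.7.1". The full PEP 440
# grammar (epochs, pre/post/dev releases, local versions) is much broader.
_RELEASE = re.compile(r"\d+(\.\d+)*")


def parse_release(version):
    """Return the release as a tuple of ints, or raise ValueError."""
    if not _RELEASE.fullmatch(version):
        raise ValueError(f"not a plain release version: {version!r}")
    return tuple(int(part) for part in version.split("."))


informational = "1.7.1-frozen"  # display string, as described above
comparable = "1.7.1"            # what distribution metadata would report

try:
    parse_release(informational)
except ValueError as exc:
    print("rejected:", exc)

# Comparisons are done only on the strictly parseable version.
print(parse_release(comparable) > parse_release("1.7.0"))
```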

v1.7.0

9 months ago

Release Notes

  • PR: #54: Add support for passing a local profile path for profile validation. Thanks to Bernhard Hampel-Waffenthal for the contribution.
  • #40: Replace deprecated use of pkg_resources with importlib-metadata and packaging.
  • Fix issue with HTTP fetch transport where bearer-token auth gets stripped from the session on a legitimate redirect but not restored for any potential new request on that same URL-bound session.
  • Unpin tzlocal unless Python<3.
  • Support for Python 3.5 and 3.6 has been dropped. Python 3.7 compatibility is deprecated but still officially supported in this release.

v1.6.4

1 year ago

Release Notes

Added Google Cloud Storage fetch handler for handling gs:// URLs in fetch.txt.

Note that this is a soft dependency and you must install the gcloud CLI on the system where you will be running bdbag in order for this handler to function.

Enabling "requester pays":

This handler supports the requester pays usage pattern by allowing the billable project_id to be specified in the auth_params object for a corresponding keychain.json entry for a matching gs:// URI pattern.

For example, to configure (and allow) requester pays for a GS bucket, you would add a keychain.json entry similar to the following:

{
    "uri": "gs://gcs-bdbag-integration-testing/",
    "auth_type": "gcs-credentials",
    "auth_params": {
        "project_id": "bdbag-204999",
        "allow_requester_pays": true
    }
}

You can also explicitly disallow requester pays at the client-side in the following ways:

  • Set allow_requester_pays to false
  • Omit the allow_requester_pays field.
  • Omit the project_id field.
  • Omit the auth_params object entirely.

Note that if you do any of the above, data retrieval requests to buckets which have requester pays enabled will fail. The use case for this configuration option is to ensure that you don't pay for requests when requester pays is disabled on the bucket. Per the following GCS documentation:

Important: Buckets that have Requester Pays disabled still accept requests that include a billing project, 
and charges are applied to the billing project supplied in the request. 
Consider any billing implications prior to including a billing project in all of your requests.
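
The client-side rules listed above can be sketched as a small function that decides whether a billing project should be sent at all. This is an illustration of the described behavior under stated assumptions, not the handler's actual code; billing_project is a hypothetical name.

```python
import json


def billing_project(entry):
    """Return the project to bill, or None when requester pays is disallowed."""
    # A missing auth_params object disallows requester pays.
    params = entry.get("auth_params") or {}
    # A missing or false allow_requester_pays field disallows it as well.
    if params.get("allow_requester_pays") is not True:
        return None
    # Without a project_id there is nothing to bill, so treat it as disallowed.
    return params.get("project_id") or None


entry = json.loads("""
{
    "uri": "gs://gcs-bdbag-integration-testing/",
    "auth_type": "gcs-credentials",
    "auth_params": {
        "project_id": "bdbag-204999",
        "allow_requester_pays": true
    }
}
""")
print(billing_project(entry))
```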

IMPORTANT NOTE:

At the time of this writing, when using the gcloud CLI from Google Cloud SDK 416.0.0 or earlier, it is still possible to be billed for bucket usage even if you have disallowed requester pays for a given bucket in keychain.json. This is because the gcloud init process requires that you specify a default project_id, and this project ID is subsequently stored as quota_project_id in the application_default_credentials.json file used by the GCS APIs (which the bdbag fetch handler uses). If this value is present, it is passed on all GCS API calls as a fallback, even when no project is explicitly passed to the API call. This can be worked around by removing quota_project_id from application_default_credentials.json.
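
The workaround can be scripted. The following is a minimal sketch that assumes the credentials file is ordinary JSON with quota_project_id as a top-level key; remove_quota_project is a hypothetical helper, and the path to the file must be supplied by the caller because its location depends on the platform and gcloud configuration.

```python
import json
import shutil


def remove_quota_project(credentials_path):
    """Remove quota_project_id from a credentials file; return True if changed."""
    with open(credentials_path, "r", encoding="utf-8") as f:
        credentials = json.load(f)
    if "quota_project_id" not in credentials:
        return False
    # Keep a backup copy next to the original before rewriting it.
    shutil.copy2(credentials_path, credentials_path + ".bak")
    del credentials["quota_project_id"]
    with open(credentials_path, "w", encoding="utf-8") as f:
        json.dump(credentials, f, indent=2)
    return True
```

The function leaves the file untouched when the key is absent, so it is safe to run more than once.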

Using service account credentials:

It is also possible to specify a service_account_credentials_file, which is a file path referencing a service account credentials JSON file provided by Google Cloud Storage. For example:

{
    "uri": "gs://bdbag-dev/",
    "auth_type": "gcs-credentials",
    "auth_params": {
        "project_id": "bdbag-204400",
        "service_account_credentials_file": "/home/bdbag/bdbag-204400-41babdd46e24.json"
    }
}

v1.6.3

2 years ago

Release Notes

Bugfix release and dependency update.

  • Fix bug in bdbag_api.validate() where underlying BagError exceptions were not being propagated correctly.
  • Add an environment marker to setup.py for the python-requests dependency. This marker specifies that no version greater than requests-2.25.1 be used in Python 3.5 environments, due to underlying incompatibilities between the requests dependency chain and Python 3.5 starting with requests-2.26.0. Reported in issue #47.

Note that bdbag support for Python 3.5 is planned to be dropped in the 1.7.0 release.
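
For readers unfamiliar with environment markers, the following hypothetical setup.py excerpt shows the general shape of such a constraint. Only the 2.25.1 cap for Python 3.5 comes from the note above; the exact specifier strings used by bdbag are not reproduced here and the second entry is purely illustrative.

```python
# Hypothetical dependency list using PEP 508 environment markers.
install_requires = [
    # Cap requests on interpreters older than 3.6 (that is, Python 3.5).
    'requests<=2.25.1; python_version < "3.6"',
    # Leave requests uncapped everywhere else (illustrative assumption).
    'requests; python_version >= "3.6"',
]
```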

v1.6.2

2 years ago

Release Notes

  • Set "User-Agent" header for HTTP fetch handler (via python-requests) to "bdbag/{version} (requests/{version})".
  • Added sha1 support for bdbag_utils function create-rfm-from-url-list. See PR #46.
  • Fix issues with unicode handling in fetch.txt, RO metadata.json, keychain.json, and remote-file-manifest JSON files.
  • Fix issues with over-escaping (URL-encoding) of filenames and URLs in fetch.txt and RO metadata.json. Per the spec, only CR, LF, whitespace, and literal percent should be encoded.
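
The minimal escaping rule in the last item can be sketched as follows. This is not bdbag's code: encode_fetch_field is a hypothetical name, and treating "whitespace" as space and tab is an assumption.

```python
def encode_fetch_field(value):
    """Percent-encode only percent signs, CR, LF, space, and tab."""
    # The percent sign must be handled first so that the escapes added by the
    # later replacements are not themselves re-encoded.
    replacements = (
        ("%", "%25"),
        ("\r", "%0D"),
        ("\n", "%0A"),
        (" ", "%20"),
        ("\t", "%09"),
    )
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


print(encode_fetch_field("data/my file 100%.txt"))
```

Every other character, including non-ASCII text and reserved URL characters such as & and ?, passes through unchanged, which is what avoids the over-escaping.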

v1.6.1

3 years ago

Release Notes

  • #41: Add support for regex patterns in filter_dict. See PR #42.
  • Add -frozen qualifier suffix (when applicable) to version strings returned by get_distribution.
  • Pinned setuptools_scm<6.0 due to it dropping support for Python 2.7/3.5 which we will still support for a little while longer.
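
How a -frozen qualifier might be attached is sketched below. It relies on the sys.frozen attribute that freezers such as cx_Freeze set at runtime; qualified_version is a hypothetical helper and not the function bdbag uses.

```python
import sys


def qualified_version(base_version):
    """Append "-frozen" when running inside a frozen executable."""
    # sys.frozen is absent in a regular interpreter, hence the default.
    if getattr(sys, "frozen", False):
        return base_version + "-frozen"
    return base_version


print(qualified_version("1.6.1"))
```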

v1.6.0

3 years ago

Release Notes

Minor feature release with bugfixes and dependency updates.

  • Implement #37: Support external fetch transports via plug-in architecture.
  • Added --output-path CLI (and corresponding API) argument for specifying output path for extracted archives.
  • Added a bypass_ssl_cert_verification configuration option for the https fetch handler so that SSL certificate verification can be disabled either globally (not recommended) or for a whitelisted set of URL paths, which are matched as simple substrings against a bag's fetch.txt URLs.
  • Update the --validate-profile CLI argument so that it can take an optional keyword argument, bag-only, which can be used to bypass the otherwise automatic profile serialization validation, and therefore is suitable to use on extracted bag directories.
  • Fixed issue with archive_bag API function not including empty directories when creating zip format archives.
  • Modified extract_bag API function to accurately include the bag root directory path of the extracted bag archive in the return value. Previously, this value could have wound up being different from the file archive base name; for example if the archive file was renamed or was created in such a way that the base file name never matched the archived bag directory root.
  • Refactored bagit-profile support. This module is no longer "vendored" internally and is now a proper external dependency intended to be pulled from PyPI. The Profile class is patched internally, as needed. This dependency is currently pinned to 1.3.1.
  • Updated bdbag-profile.json and bdbag-ro-profile.json to leverage newer features of bagit-profile version 1.3. Loosened "Manifests-Required" to only require md5 for both profiles.
  • Pinned bagit-python dependency version to 1.8.1.
  • Added Python 3.8 and 3.9 support to setup.py metadata and Travis builds.
  • Dropped Python 3.4 support.
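
The empty-directory fix for zip archives touches a general pitfall: walking a tree and adding only files silently drops directories that contain nothing. The sketch below shows one way to keep them with the standard zipfile module. It is an illustration and not the archive_bag implementation; zip_tree is a hypothetical name.

```python
import os
import zipfile


def zip_tree(root, archive_path):
    """Zip a directory tree, keeping directories that contain nothing."""
    root = os.path.normpath(root)
    parent = os.path.dirname(root)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, parent))
            if not dirnames and not filenames:
                # ZipFile.write() stores a directory as an entry ending in "/".
                zf.write(dirpath, os.path.relpath(dirpath, parent))
```

The archive path should be outside the tree being archived so the archive does not try to include itself.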

v1.5.6

4 years ago

Release Notes

Bugfix release with minor feature addition.

  • Fix #34: New file hashes for existing manifest entries generated from remote-file-manifests don't get updated in bags.
  • Fix #36: Directory paths with a trailing slash passed to "archive_bag" result in a malformed archive name.
  • Added update_keychain API function in auth/keychain.py for programmatic add/update/delete of keychain entries.
  • Added Python 3.7 support to setup.py metadata and Travis builds.
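
The release note for #36 does not state the cause, but one common way a trailing slash produces a bad archive name is that os.path.basename returns an empty string for such paths. The sketch below shows that behavior and a typical guard; it is an assumption about the class of bug, not the actual fix, and archive_base_name is a hypothetical helper.

```python
import os


def archive_base_name(bag_path):
    """Derive a base name that is the same with or without a trailing slash."""
    # os.path.basename("bags/mybag/") returns "", so normalize the path first.
    return os.path.basename(os.path.normpath(bag_path))


print(repr(os.path.basename("bags/mybag/")))
print(archive_base_name("bags/mybag/"))
```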

v1.5.5

4 years ago

Release Notes

Bugfix release.

  • Ensure tag file manifest entries for additional tag files use the denormalized path separator (Unix-style /), as payload file manifest entries do.
  • Return result bag path from the materialize() function.
  • Don't use strict mode when guessing mime types to allow for user-extended types.
  • Dropped Python 3.3 support.
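
The effect of non-strict MIME type guessing can be seen with the standard mimetypes module. In this sketch the type name and the .xbag extension are made up for illustration, and a private MimeTypes registry is used so that nothing global is modified.

```python
import mimetypes

# Use a private registry so the example does not alter global state.
registry = mimetypes.MimeTypes()
# Register a made-up, user-extended type as a non-standard mapping.
registry.add_type("application/x-example-bag", ".xbag", strict=False)

# Strict mode consults only the standard mappings and misses the extension.
strict_guess, _ = registry.guess_type("sample.xbag", strict=True)
# Non-strict mode also consults the non-standard mappings.
lenient_guess, _ = registry.guess_type("sample.xbag", strict=False)

print(strict_guess, lenient_guess)
```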