
Figshare — Arkivum Integration Overview


This document describes how Figshare stores content and how it can push files to Arkivum storage for preservation. The integration ensures that Figshare’s Institutional Partners retain ownership of their files and that those files are preserved in the long term, even in exceptional situations such as service outages or Figshare ceasing to exist. An optional workflow that evicts files from Figshare storage is also described.

The Figshare Storage Workflow

Files uploaded to Figshare follow a mostly linear workflow with the following steps:

  1. The user uploads files via the web interface (Edit item page), the FTP uploader, or the Figshare REST API.
  2. The file is immediately stored on Figshare’s temporary storage and is available for download.
  3. The MD5 checksum of the file is computed and stored for future integrity checks.
  4. Two operations are performed in parallel:
       • A preview of the file is generated and stored on a separate Figshare storage instance.
       • The file is copied from the temporary storage to its final storage location, which can be supplied either by Figshare (e.g., Amazon Simple Storage Service) or by the institution. This document refers to that location as the Figshare final storage, regardless of the actual technical implementation.
  5. The file is mirrored on one or more third-party storage solutions. This document assumes that Arkivum is the chosen solution and that this step is mandatory, although it is optional in the generic Figshare implementation.
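The checksum in step 3 can be illustrated with a minimal Python sketch. It is not Figshare's actual implementation; the function names `md5_checksum` and `is_intact` and the chunk size are assumptions made for this example. It reads the file in chunks so that large files do not have to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, stored_checksum):
    """Compare a freshly computed checksum with the one stored at upload time."""
    return md5_checksum(path) == stored_checksum.lower()
```

A later integrity check would call `is_intact(path, stored_checksum)` with the checksum recorded at upload time.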

All of these steps are carried out by a separate Figshare system that runs asynchronous tasks. This system is monitored and logged, and failed operations are retried to ensure that the workflow completes. The file’s entry in Figshare’s database also records its current state in the workflow.
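The retry and state-tracking behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the state names in `FileState`, the dictionary used as a file record, and the `run_workflow` function are hypothetical and do not reflect Figshare's internal task system. For simplicity the sketch runs the steps one after another, whereas the real workflow performs some of them in parallel.

```python
from enum import Enum


class FileState(Enum):
    # Hypothetical state names; the actual values used by Figshare are not documented here.
    UPLOADED = "uploaded"
    CHECKSUMMED = "checksummed"
    STORED = "stored"
    MIRRORED = "mirrored"
    FAILED = "failed"


def run_workflow(file_record, steps, max_attempts=3):
    """Run each (state, action) pair in order, retrying failed actions.

    file_record is a dict with "state" and "log" keys; each action is a
    callable that takes the record and raises an exception on failure.
    """
    for state, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action(file_record)
            except Exception as exc:
                # Log the failure and let the loop retry the same step.
                file_record["log"].append(f"{state.value}: attempt {attempt} failed: {exc}")
            else:
                # Record the new workflow state once the step succeeds.
                file_record["state"] = state
                break
        else:
            # Every attempt failed: mark the record and stop the workflow.
            file_record["state"] = FileState.FAILED
            return file_record
    return file_record
```

Storing the state on the record after every successful step lets a monitoring process see how far a given file has progressed.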

The steps above are summarised in Fig. 1.

Fig. 1: The Figshare file lifecycle

Figshare-Arkivum Workflow

When an Arkivum integration is used, a number of extra operations are required for it to function correctly. Specifically, Figshare performs the following Arkivum REST API calls:

  1. POST /files/<folder> {form-data}: upload the file to the Arkivum on-premises appliance;
  2. POST /api/2/files/release/<path>: release the file for ingestion and replication, allowing it to be transferred from the on-premises appliance to the Arkivum data center;
  3. (optional) GET /api/2/files/release/<id>: get the status of the ingestion/replication operation (scheduled, processing, complete, or failed). This can be used to decide on further operations (e.g. removal from Figshare’s final storage).
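The sequence of calls above could be driven by code along the following lines. This is a hedged sketch, not Arkivum's client library: the endpoint paths are taken from the list above, but the base URL, the helper names, and the `get_status` callable (assumed to perform the GET request and return one of the four status strings) are assumptions made for this example. The response format of the real API is not described in this document.

```python
import time
from urllib.parse import quote

# Statuses after which polling can stop, as listed in step 3.
TERMINAL_STATUSES = ("complete", "failed")


def upload_url(base_url, folder):
    """Build the URL for POST /files/<folder> on the appliance."""
    return f"{base_url.rstrip('/')}/files/{quote(folder.strip('/'))}"


def release_url(base_url, path_or_id):
    """Build the URL for POST or GET /api/2/files/release/<path or id>."""
    return f"{base_url.rstrip('/')}/api/2/files/release/{quote(str(path_or_id).strip('/'))}"


def wait_for_replication(get_status, release_id, poll_seconds=60, max_polls=10, sleep=time.sleep):
    """Poll the release status until it is complete or failed, or polls run out.

    get_status is a hypothetical callable that issues the GET request and
    returns "scheduled", "processing", "complete" or "failed".
    """
    status = None
    for attempt in range(max_polls):
        status = get_status(release_id)
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return status
```

Passing the status lookup in as a callable keeps the polling logic independent of the HTTP library and of the authentication method, neither of which is specified here.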

Once these steps have completed successfully, Arkivum can ensure the availability and integrity of the files, independent of the state of the file on Figshare’s final storage.

Figshare Storage Eviction

Please note: there may be once-off fees for setting up an eviction process.

Once a file is fully replicated by Arkivum (indicated by a green status), it can be evicted from both the local appliance and the Figshare final storage, if required. One use case for this is making more data available through Figshare than the contracted Figshare storage can hold.

Such files can be evicted using a least recently used (LRU) strategy, which works as follows:

  • Once a certain used-storage threshold (e.g., 90%) is reached, an asynchronous cleanup task is launched.
  • This task iterates through the files on the Figshare storage.
  • The least recently downloaded files that have the green status on Arkivum are deleted from Figshare storage.
  • The process continues until a halting condition is reached. The condition is configurable, with the following options available:
       • a certain proportion of the Figshare storage is free (e.g. 60%);
       • no files remain that have gone without a download for longer than a predefined time delta (e.g. 10 days before the current date);
       • no files remain that were uploaded longer ago than a predefined time delta (e.g. 1 year before the current date).
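A minimal sketch of the LRU selection described above is shown below, using the free-space halting condition. It is an illustration only: the `StoredFile` record, the `select_for_eviction` function, and the default thresholds (taken from the examples in the list) are assumptions, and the code only selects files rather than deleting them.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StoredFile:
    name: str
    size: int                  # size in the same unit as the storage capacity
    last_downloaded: datetime
    replicated: bool           # True when Arkivum reports the green status


def select_for_eviction(files, capacity, trigger=0.90, target_free=0.60):
    """Return the files to evict, least recently downloaded first.

    Nothing is selected until used storage reaches the trigger threshold.
    Selection stops once the target proportion of storage is free or no
    replicated files remain.
    """
    used = sum(f.size for f in files)
    if used < trigger * capacity:
        return []
    evicted = []
    # Only fully replicated files are candidates; oldest download comes first.
    candidates = sorted((f for f in files if f.replicated), key=lambda f: f.last_downloaded)
    for candidate in candidates:
        if capacity - used >= target_free * capacity:
            break
        evicted.append(candidate)
        used -= candidate.size
    return evicted
```

The time-based halting conditions could be added in the same loop by stopping as soon as the next candidate's download or upload date falls inside the configured time delta.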

When choosing the halting condition, consider the impact on end users. If a file is not present on the Figshare final storage when a user requests it for download from an item page, the following operations must be carried out in the background before the file can be downloaded:

  1. Retrieve the file from the Arkivum datacenter and make it available on the Arkivum local appliance.
  2. Copy the file from the Arkivum appliance to the Figshare final storage.

Once these are completed, the file can be downloaded immediately, and it is marked with the current timestamp so that the LRU algorithm works correctly. The process can take from a few minutes to a few hours, depending on the size of the file(s) and on factors such as source internet speed and other network conditions.

If a user requests a file which is not present in the Figshare storage, the following message will be displayed: "The file/s for this item are located in archival storage. The download will take some time to retrieve, please check back later." The user can close the browser window and check later whether the file(s) are available for download.
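The download behaviour for evicted files can be sketched as follows. This is not Figshare's code: the dictionary used as a file record, the `schedule_retrieval` callable (assumed to start the two background retrieval steps), and both function names are hypothetical. Updating the timestamp on every download is also an assumption, made because eviction is based on the least recently downloaded files.

```python
from datetime import datetime, timezone

ARCHIVED_MESSAGE = (
    "The file/s for this item are located in archival storage. "
    "The download will take some time to retrieve, please check back later."
)


def handle_download(file_record, schedule_retrieval, now=None):
    """Serve the file if it is on final storage, otherwise start a retrieval."""
    now = now or datetime.now(timezone.utc)
    if file_record["on_final_storage"]:
        # Assumption: each download refreshes the timestamp used by the LRU cleanup.
        file_record["last_downloaded"] = now
        return {"status": "ready", "message": None}
    if not file_record.get("retrieval_pending"):
        # Start the background retrieval only once per evicted file.
        schedule_retrieval(file_record["id"])
        file_record["retrieval_pending"] = True
    return {"status": "pending", "message": ARCHIVED_MESSAGE}


def complete_retrieval(file_record, now=None):
    """Mark a file as restored and stamp it with the current time."""
    file_record["on_final_storage"] = True
    file_record["retrieval_pending"] = False
    file_record["last_downloaded"] = now or datetime.now(timezone.utc)
```

Tracking a pending flag means repeated requests for the same file, for example from a user who checks back several times, do not trigger duplicate retrievals.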

Using Arkivum as Institutional Storage

The workflow above applies when the institution uses Figshare storage (usually Amazon Simple Storage Service) or its own local storage infrastructure as the final repository storage, with Arkivum serving only as the preservation solution. If Arkivum is also used as the Figshare storage (i.e. as both the final repository storage and the preservation solution), the second step in the retrieval workflow above no longer applies: after the file is copied from the Arkivum datacenter to the local appliance, Figshare redirects any download request directly to the appliance.

Requirements for integrating Arkivum with Figshare

In order to implement any of the workflows above, the institution must provide Figshare with the following:

  • Access to the Arkivum appliance: if your Arkivum system is self-hosted, you will need to give Figshare access to the appliance. This usually means whitelisting certain IP ranges used by Figshare systems so that they can use Arkivum’s REST API. If your system is hosted by Arkivum, Arkivum can provide Figshare with access on your behalf.
  • Access to the institutional storage: this applies when the institution does not use Figshare storage; the exact requirements will be discussed on a case-by-case basis.
  • Preference on the eviction policy: this will depend on the contracted storage, internal regulations, and any other preferences of the institution. The trigger condition for the cleanup procedure must be defined too. There may be a once-off fee for the eviction configuration.

The points above can be agreed on during the implementation of Figshare for Institutions or at any point after it is complete, preferably once the storage decisions are finalised.
