Skip to content

Latest commit

 

History

History
369 lines (283 loc) · 13.7 KB

README.md

File metadata and controls

369 lines (283 loc) · 13.7 KB

E-ARK IP validation and manipulation tool and library

This project provides a command-line interface and Java library to validate and manipulate OAIS Information Packages of different formats: E-ARK (version 1 & 2 in alpha stage), BagIt, Hungarian type 4 SIP.

The E-ARK Information Packages are maintained by the Digital Information LifeCycle Interoperability Standards Board (DILCIS Board). DILCIS Board is an international group of experts committed to maintain and sustain maintain a set of interoperability specifications which allow for the transfer, long-term preservation, and reuse of digital information regardless of the origin or type of the information.

More specifically, the DILCIS Board maintains specifications initially developed within the E-ARK Project (02.2014 - 01.2017):

  • Common Specification for Information Packages
  • E-ARK Submission Information Package (SIP)
  • E-ARK Archival Information Package (AIP)
  • E-ARK Dissemination Information Package (DIP)

The DILCIS Board collaborates closely with the Swiss Federal Archives in regard to the maintenance of the SIARD (Software Independent Archiving of Relational Databases) specification.

For more information about the E-ARK Information Packages specifications, please visit http://www.dilcis.eu/

Installation

Requirements

  • Java (>= 1.8)

Download the latest release to use as a tool or check below how to use it as a Java Library.

Usage

You can use the commons-ip as a command-line tool or a Java library.

Use as a command-line tool

To use commons-ip validator, need to download the latest release of the commons-ip tool. Then you can execute in the command-line using the following options:

  • validate, [REQUIRED] this option is for the CLI to know that is to perform the validation of a SIP.
  • -i, [REQUIRED] Path(s) to the SIP(s) archive file or folder(s).
  • -o, [OPTIONAL] Path to the Directory where you want to save the validation report.

Examples:

java -jar commons-ip-2.X.Y.jar validate -i sip1.zip
java -jar commons-ip-2.X.Y.jar validate -i sip1.zip sip2.zip -o output/

Output Example

The report generated by the validator is in JSON format and has the following structure:

{ 
  "header" : {
    "title" : "Validation Report CSIP",
    "specifications" : 
    [ 
      {
        "id" : "CSIP-2.0.4",
        "url" : "https://github.com/DILCISBoard/E-ARK-CSIP/releases/tag/v2.0.4"
      }, 
      {
        "id" : "SIP-2.0.4",
        "url" : "https://github.com/DILCISBoard/E-ARK-SIP/releases/tag/v2.0.4"
      } 
    ],
    "version_commons_ip" : "2.0.0-alpha3-SNAPSHOT",
    "date" : "2021-09-24T13:30:26.203+01:00",
    "path" : "${HOME}/Full-EARK-SIP.zip"
  },
  "validation" :
  [
    {
      "specification" : "CSIP-2.0.4",
      "id" : "CSIPSTR1",
      "name" : "CSIP Information Package folder structure",
      "location" : "",
      "description" : "Any Information Package MUST be included within a  single physical root folder (known as the “Information Package root folder”). For packages presented in an archive format, see CSIPSTR3, the archive MUST unpack to a single root folder.",
      "cardinality" : "",
      "level" : "MUST",
      "testing" : 
        {
          "outcome" : "PASSED",
          "issues" : [ ],
          "warnings" : [ ],
          "notes" : [ ]
        }
    },
    {
      "specification" : "CSIP-2.0.4",
      "id" : "CSIP32",
      "name" : "Digital provenance metadata",
      "location" : "mets/amdSec/digiprovMD",
      "description" : "For recording information about preservation the standard PREMIS is used. It is mandatory to include one <digiprovMD> element for each piece of PREMIS metadata. The use if PREMIS in METS is following the recommendations in the 2017 version of PREMIS in METS Guidelines.",
      "cardinality" : "0..n",
      "level" : "SHOULD",
      "testing" : 
      {
        "outcome" : "FAILED",
        "issues" : [ ],
        "warnings" : [ "It is mandatory to include one <digiprovMD> element in Root METS.xml for each piece of PREMIS metadata" ],
        "notes" : [ ]
      }
    }  
  ],
  "summary" : 
    {
      "success" : 120,
      "warnings" : 3,
      "errors" : 5,
      "skipped" : 33,
      "notes" : 8,
      "result" : "INVALID"
    }
}

Use as a Java Library

  • Using Maven
  1. Add the following repository
<repository>
  <id>KEEPS-Artifacts</id>
  <name>KEEP Artifacts-releases</name>
  <url>https://artifactory.keep.pt/artifactory/keep</url>
</repository>
  1. Add the following dependency
<dependency>
  <groupId>org.roda-project</groupId>
  <artifactId>commons-ip2</artifactId>
  <version>2.0.0-alpha1</version>
</dependency>

If using Java 11 or greater, you might need the following extra dependencies:

<dependency>
 <groupId>javax.xml.bind</groupId>
 <artifactId>jaxb-api</artifactId>
 <version>2.3.0-b170201.1204</version>
</dependency>
<dependency>
 <groupId>javax.activation</groupId>
 <artifactId>activation</artifactId>
 <version>1.1</version>
</dependency>
<dependency>
 <groupId>org.glassfish.jaxb</groupId>
 <artifactId>jaxb-runtime</artifactId>
 <version>2.3.0-b170127.1453</version>
</dependency>
  • Not using maven, download Commons IP latest jar, each of Commons IP dependencies (see pom.xml to know which dependencies/versions) and add them to your project classpath.

Write some code

  • Create a full E-ARK SIP
// 1) instantiate E-ARK SIP object
SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED(), IPContentInformationType.getMIXED());
sip.addCreatorSoftwareAgent("RODA Commons IP", "2.0.0");

// 1.1) set optional human-readable description
sip.setDescription("A full E-ARK SIP");

// 1.2) add descriptive metadata (SIP level)
IPDescriptiveMetadata metadataDescriptiveDC = new IPDescriptiveMetadata(
new IPFile(Paths.get("src/test/resources/eark/metadata_descriptive_dc.xml")),
new MetadataType(MetadataTypeEnum.DC), null);
sip.addDescriptiveMetadata(metadataDescriptiveDC);

// 1.3) add preservation metadata (SIP level)
IPMetadata metadataPreservation = new IPMetadata(
new IPFile(Paths.get("src/test/resources/eark/metadata_preservation_premis.xml")));
sip.addPreservationMetadata(metadataPreservation);

// 1.4) add other metadata (SIP level)
IPFile metadataOtherFile = new IPFile(Paths.get("src/test/resources/eark/metadata_other.txt"));
// 1.4.1) optionally one may rename file final name
metadataOtherFile.setRenameTo("metadata_other_renamed.txt");
IPMetadata metadataOther = new IPMetadata(metadataOtherFile);
sip.addOtherMetadata(metadataOther);

// 1.5) add xml schema (SIP level)
sip.addSchema(new IPFile(Paths.get("src/test/resources/eark/schema.xsd")));

// 1.6) add documentation (SIP level)
sip.addDocumentation(new IPFile(Paths.get("src/test/resources/eark/documentation.pdf")));

// 1.7) set optional RODA related information about ancestors
sip.setAncestors(Arrays.asList("b6f24059-8973-4582-932d-eb0b2cb48f28"));

// 1.8) add an agent (SIP level)
IPAgent agent = new IPAgent("Agent Name", "OTHER", "OTHER ROLE", CreatorType.INDIVIDUAL, "OTHER TYPE", "",
   IPAgentNoteTypeEnum.SOFTWARE_VERSION);
sip.addAgent(agent);

// 1.9) add a representation (status will be set to the default value, i.e.,
// ORIGINAL)
IPRepresentation representation1 = new IPRepresentation("representation 1");
sip.addRepresentation(representation1);

// 1.9.1) add a file to the representation
IPFile representationFile = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile.setRenameTo("data.pdf");
representation1.addFile(representationFile);

// 1.9.2) add a file to the representation and put it inside a folder
// called 'def' which is inside a folder called 'abc'
IPFile representationFile2 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile2.setRelativeFolders(Arrays.asList("abc", "def"));
representation1.addFile(representationFile2);

// 1.10) add a representation & define its status
IPRepresentation representation2 = new IPRepresentation("representation 2");
representation2.setStatus(new RepresentationStatus(REPRESENTATION_STATUS_NORMALIZED));
sip.addRepresentation(representation2);

// 1.10.1) add a file to the representation
IPFile representationFile3 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile3.setRenameTo("data3.pdf");
representation2.addFile(representationFile3);

// 2) build SIP, providing an output directory
Path zipSIP = sip.build(tempFolder);

Note: SIP implements the Observer Pattern. This way, if one wants to be notified of SIP build progress, one just needs to implement SIPObserver interface and register itself in the SIP. Something like (just presenting some of the events):

public class WhoWantsToBuildSIPAndBeNotified implements SIPObserver{

  public void buildSIP(){
    ...
    SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED());
    sip.addObserver(this);
    ...
  }

  @Override
  public void sipBuildPackagingStarted(int totalNumberOfFiles) {
    ...
  }

  @Override
  public void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed) {
    ...
  }
}
  • Parse a full E-ARK SIP
// 1) invoke static method parse and that's it
SIP earkSIP = EARKSIP.parse(zipSIP);

Development

In this sections are some relevant notes about Commons IP development.

XML Beans

XML Beans are used by Commons IP to manipulate METS files using Java code.

Details

Some changes were made to XML Schemas in order to be able to compile XML Schemas into Java classes using XJC as well as to be able to validate a XML file against its XML Schema without Internet connections.

The changes are:

  • src/main/resources/schemas2/mets1_12.xsd XLink Schema location made local (and respectively file available locally)
  • src/main/resources/schemas2/mets1_12.xjb Bindings file created to deal with attribute name conflict between METS and XLink Schemas

After Java classes are created, some changes were made to produce METS XML files well defined in terms of namespaces. Namely:

  • src/main/java/org/roda_project/commons_ip2/mets_v1_12/beans/package-info.java Annotations for corretly generate namespaces were added. Following is the before & then the after:
@javax.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED)

and the after

@javax.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED, xmlns = {
  @javax.xml.bind.annotation.XmlNs(prefix = "", namespaceURI = "http://www.loc.gov/METS/"),
  @javax.xml.bind.annotation.XmlNs(prefix = "xsi", namespaceURI = "http://www.w3.org/2001/XMLSchema-instance"),
  @javax.xml.bind.annotation.XmlNs(prefix = "csip", namespaceURI = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"),
  @javax.xml.bind.annotation.XmlNs(prefix = "sip", namespaceURI = "https://DILCIS.eu/XML/METS/SIPExtensionMETS"),
  @javax.xml.bind.annotation.XmlNs(prefix = "xlink", namespaceURI = "http://www.w3.org/1999/xlink")})

How to generate/update XML Beans

xjc -d src/main/java/ -p "org.roda_project.commons_ip2.mets_v1_12.beans" src/main/resources/schemas2/mets1_12.xsd -b src/main/resources/schemas2/mets1_12.xjb

IANA Media Types

The IANA Media Types list is required to perform SIP Validation. The list is located in the folder and named as follows:

/src/main/resources/controlledVocabularies/IANA_MEDIA_TYPES.txt

How to generate/update

To update IANA media types list, in commons-ip root directory run the following command:

./scripts/ianaMediaTypes_parser.sh

The command executes a script that downloads all IANA Media Types (Application, Audio, Font, Image, Message, Model, Multipart, Text, Video) from https://www.iana.org/assignments/media-types/${iana_file}.csv . Note that this downloads different .csv files then creates a .txt file with all IANA media types appended to the new file.

Extend List of IANA Media Types

The IANA media types list from https://www.iana.org/assignments/media-types/${iana_file}.csv was extended with the following mimetypes:

Image
  • image/jpeg
  • image/gif
Text
  • text/plain
Video
  • video/mpeg

Update Extended list of IANA Media Types

*To add/remove new values to the list add a new line with the IANA Media Type in scripts/ExtendedIANAMediaTypes.txt

*Run the script again

Commercial support

For more information or commercial support, contact KEEP SOLUTIONS.

Further reading

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

Credits

  • Hélder Silva (KEEP SOLUTIONS)

License

LGPLv3