2025 June 17
Evolving the preprint evaluation world with Sciety
This post is based on an interview with the Sciety team at eLife.
Now, assuming XMP is a good idea - and I think on balance it is (as blogged earlier), why are we not seeing any metadata published in scholarly media files? The only drawbacks that occur to me are:
Hard to model - the rigid, “simple” XMP data model both complicates and constrains the RDF data model
So, putting the RDF issue aside for the moment (as if RDF didn’t have problems of its own - XML, URI, etc.) let’s just look at the options for writing the stuff. (Btw, I’m not referencing any tools or toolkits. This is just in the round.) There are various means of publishing metadata in XMP:
**Sidecar**
: XMP can be produced as standalone files - see [XMP Specification, (Sept. ’05)][3], p. 36. (These are called “sidecar” files if the file has the same name as the main document and is in the same directory.) The only things needed to produce these files are a text editor and a good grasp of the XMP serialization. A template will do for that. The main problem with a standalone file is that it does not travel with the media file and so risks being left behind.
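The "template plus text editor" approach can also be scripted. What follows is a minimal sketch, not a definitive implementation: the `.xmp` extension for the sidecar, the helper names `sidecar_path` and `write_sidecar`, and the choice of `dc:identifier` as the only property are all assumptions made for illustration.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# Illustrative template: a single Dublin Core identifier and nothing else.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>{identifier}</dc:identifier>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""


def sidecar_path(media_path):
    """Same directory, same base name, with an assumed '.xmp' extension."""
    return Path(media_path).with_suffix(".xmp")


def write_sidecar(media_path, identifier):
    """Fill in the template and write it next to the media file."""
    target = sidecar_path(media_path)
    # escape() guards against '&', '<' and '>' in the identifier.
    target.write_text(TEMPLATE.format(identifier=escape(identifier)),
                      encoding="utf-8")
    return target
```

Calling `write_sidecar("article.pdf", "doi:10.5555/example")` (a made-up DOI) would leave an `article.xmp` beside the PDF. Nothing ties the two files together beyond the shared name, which is exactly the weakness described above.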
One note here: although not standalone as such, the [Mars][4] format (the draft XML formalization for PDF) discloses its metadata in an independent XMP file, “metadata.xml”, under the “META-INF/” directory. For distribution the whole directory structure is packaged up as a zip file, so the XMP is embedded in a “.mars” file. Accessed directly from the zip file or from the unpackaged directory, however, the XMP can be manipulated just like any other XML document.
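To illustrate the point about zip-packaged XMP, here is a rough sketch that pulls the metadata file out of such a package with nothing but the Python standard library. The member path follows the description above; the function name `read_packaged_xmp` is hypothetical, and the sketch assumes the package is an ordinary, unencrypted zip.

```python
import zipfile
import xml.etree.ElementTree as ET

# Member path as described in the text; everything else here is illustrative.
XMP_MEMBER = "META-INF/metadata.xml"


def read_packaged_xmp(package_path):
    """Open the zip package and parse its XMP file as plain XML."""
    with zipfile.ZipFile(package_path) as package:
        with package.open(XMP_MEMBER) as member:
            return ET.parse(member).getroot()
```

The returned element can then be queried or modified like any other XML tree, with no media-format-specific tooling involved.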
**Embedded**
: This is the normal means of distributing XMP - embedded within the media file. Some graphics formats are essentially linear (JPEG, PNG, GIF) and it is relatively straightforward to add in an XMP packet. Other formats (PDF, TIFF) have internal cross-referencing and are more difficult to deal with.
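Whatever the container, an embedded packet is delimited by `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, so a crude byte scan can often locate it. The sketch below assumes a UTF-8 packet that is stored uncompressed and unencrypted in the file, which will not hold for every media file; `find_xmp_packet` is an illustrative name, not part of any toolkit.

```python
def find_xmp_packet(data: bytes):
    """Return (start, end) byte offsets of the first XMP packet, or None.

    Assumes a UTF-8 packet stored as plain bytes in the file.
    """
    start = data.find(b"<?xpacket begin=")
    if start == -1:
        return None
    trailer = data.find(b"<?xpacket end=", start)
    if trailer == -1:
        return None
    close = data.find(b"?>", trailer)
    if close == -1:
        return None
    # Offset just past the closing '?>' of the trailer.
    return start, close + 2
```

Reading a packet this way is the easy half. Writing one back is where the linear and cross-referenced formats part ways, because changing the packet length shifts every byte offset after it.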
**Embedded + Sidecar**
: One possible method for dealing with the difficulty of writing XMP is to note that some media (especially PDFs) already have embedded XMP packets. As noted earlier, much if not all of the metadata in these XMP packets will be workflow-related and thus dispensable for final-form products where authoritative work-related metadata is desired. These packets may or may not be writeable; writeable packets include additional padding whitespace. Even for read-only packets there is much (if not all) that can be discarded, and sometimes also unnecessary bulk (e.g. default namespace declarations which are never used). _The bottom line is that any legacy XMP packet may typically be 2-3K in size and, just as in transplanting a cell nucleus, the XMP packet innards can be deftly substituted with minimal XMP packet content, say 1K in size, which would be guaranteed to fit with suitable padding._ A packet that size would be sufficient to provide at minimum for a DOI and for a reference to additional metadata, e.g. a more complete standalone XMP packet. The two forms can coexist.
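The "transplant" idea can be sketched as a same-length substitution: the new packet is padded with whitespace until it occupies exactly the bytes of the old one, so nothing else in the file moves. This is a minimal sketch under stated assumptions: the packet's byte span is already known, the packet is UTF-8 and stored uncompressed, and `transplant_packet` is a hypothetical helper name.

```python
# Standard UTF-8 packet header (with byte-order mark) and writeable trailer.
HEADER = b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
TRAILER = b'<?xpacket end="w"?>'


def transplant_packet(data: bytes, span, new_body: bytes) -> bytes:
    """Replace the packet at span=(start, end) with a padded minimal packet.

    The result has the same length as the input, so byte offsets elsewhere
    in the file (e.g. a PDF cross-reference table) remain valid.
    """
    start, end = span
    room = (end - start) - len(HEADER) - len(new_body) - len(TRAILER)
    if room < 0:
        raise ValueError("replacement packet does not fit in the old one")
    # Whitespace padding fills the space left over by the smaller body.
    packet = HEADER + new_body + b" " * room + TRAILER
    return data[:start] + packet + data[end:]
```

Because the overall length is preserved, the "difficult" formats need no structural rewriting. The cost is that the new content can never be larger than the packet it replaces, which is why the minimal packet carries little more than a DOI and a pointer to fuller metadata.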
The third option here allows embedding a minimal XMP packet into “difficult” packaging structures while pointing to a fully-formed XMP packet elsewhere. The “simple” packaging structures may include a fully-formed XMP packet while also possibly referencing extended metadata sources as per my previous post [here][4].
Highlighting our community in Colombia
2025 June 05