|
1 | 1 | <!-- |
2 | | -SPDX-FileCopyrightText: 2022 German Aerospace Center (DLR) |
| 2 | +SPDX-FileCopyrightText: 2025 German Aerospace Center (DLR) |
3 | 3 |
|
4 | 4 | SPDX-License-Identifier: CC-BY-SA-4.0 |
5 | 5 | --> |
6 | 6 |
|
7 | 7 | <!-- |
8 | | -SPDX-FileContributor: Michael Meinel |
| 8 | +SPDX-FileContributor: Stephan Druskat <[email protected]> |
9 | 9 | --> |
10 | 10 |
|
11 | | -# HERMES Data Model |
| 11 | +# Data model |
12 | 12 |
|
13 | | -*hermes* uses an internal data model to store the output of the different stages. |
14 | | -All the data is collected in a directory called `.hermes` located in the root of the project directory. |
| 13 | +`hermes`' internal data model acts like a contract between `hermes` and plugins. |
| 14 | +It is based on [**JSON-LD (JSON Linked Data)**](https://json-ld.org/), and |
| 15 | +the public API simplifies interaction with the data model through Python code. |
15 | 16 |
|
16 | | -You should not need to interact with this data directly. |
17 | | -Instead, use {class}`hermes.model.context.HermesContext` and respective subclasses to access the data in a consistent way. |
| 17 | +Consequently, the output of the different `hermes` commands is valid JSON-LD, serialized as JSON and cached in |
| 18 | +subdirectories of the `.hermes/` directory, which is created in the root of the project directory. |
18 | 19 |
|
| 20 | +The cache is purely for internal use; do not interact with its data directly. |
19 | 21 |
|
20 | | -## Harvest Data |
| 22 | +Depending on whether you develop a plugin for `hermes`, or you develop `hermes` itself, you need to know either [_some_](#json-ld-for-plugin-developers), |
| 23 | +or _quite a few_ things about JSON-LD. |
21 | 24 |
|
22 | | -The data of the havesters is cached in the sub-directory `.hermes/harvest`. |
23 | | -Each harvester has a separate cache file to allow parallel harvesting. |
24 | | -The cache file is encoded in JSON and stored in `.hermes/harvest/HARVESTER_NAME.json` |
25 | | -where `HARVESTER_NAME` corresponds to the entry point name. |
| 25 | +The following sections provide documentation of the data model. |
| 26 | +They aim to help you get started with `hermes` plugin and core development, |
| 27 | +even if you have no previous experience with JSON-LD. |
26 | 28 |
|
27 | | -{class}`hermes.model.context.HermesHarvestContext` encapsulates these harvester caches. |
| 29 | +## The data model for plugin developers |
| 30 | + |
| 31 | +If you develop a plugin for `hermes`, you will only need to work with a single Python class and the public API |
| 32 | +it provides: {class}`hermes.model.SoftwareMetadata`. |
| 33 | + |
| 34 | +To work with this class, you need to know _some_ things about JSON-LD. |
| 35 | + |
| 36 | +### JSON-LD for plugin developers |
| 37 | + |
| 38 | +```{attention} |
| 39 | +Work in progress. |
| 40 | +``` |
| 41 | + |
| 42 | + |
| 43 | +### Working with the `hermes` data model in plugins |
| 44 | + |
| 45 | +> **Goal** |
| 46 | +> Understand how plugins access the `hermes` data model and interact with it. |
| 47 | +
|
| 48 | +`hermes` aims to hide as much of the data model as possible behind a public API, |
| 49 | +so that plugin developers do not have to deal with some of the more complex features of JSON-LD. |
| 50 | + |
| 51 | +#### Model instances in different types of plugin |
| 52 | + |
| 53 | +You can extend `hermes` with plugins for three different commands: `harvest`, `curate`, `deposit`. |
| 54 | + |
| 55 | +The commands differ in how they work with instances of the data model. |
| 56 | + |
| 57 | +- `harvest` plugins _create_ a single new model instance and return it. |
| 58 | +- `curate` plugins are passed a single existing model instance (the output of `process`), |
| 59 | +and return a single model instance. |
| 60 | +- `deposit` plugins are passed a single existing model instance (the output of `curate`), |
| 61 | +and return a single model instance. |
| 62 | + |
| 63 | +#### How plugins work with the API |
| 64 | + |
| 65 | +```{important} |
| 66 | +Plugins access the data model _exclusively_ through the API class {class}`hermes.model.SoftwareMetadata`. |
| 67 | +``` |
| 68 | + |
| 69 | +The following sections show how this class works. |
| 70 | + |
| 71 | +##### Creating a data model instance |
| 72 | + |
| 73 | +Model instances are primarily created in `harvest` plugins, but other plugins may also create one |
| 74 | +as a target to map existing data into. |
| 75 | + |
| 76 | +To create a new model instance, initialize {class}`hermes.model.SoftwareMetadata`: |
| 77 | + |
| 78 | +```{code-block} python |
| 79 | +:caption: Initializing a default data model instance |
| 80 | +from hermes.model import SoftwareMetadata |
| 81 | +
|
| 82 | +data = SoftwareMetadata() |
| 83 | +``` |
| 84 | + |
| 85 | +`SoftwareMetadata` objects initialized without arguments provide the default _context_ |
| 86 | +(see [_JSON-LD for plugin developers_](#json-ld-for-plugin-developers)). |
| 87 | +You can therefore use terms from the schemas included in the default context to describe software metadata. |
| 88 | + |
| 89 | +Terms from [_CodeMeta_](https://codemeta.github.io/terms/) can be used without a prefix: |
| 90 | + |
| 91 | +```{code-block} python |
| 92 | +:caption: Using terms from the default schema |
| 93 | +data["readme"] = ... |
| 94 | +``` |
| 95 | + |
| 96 | +Terms from [_Schema.org_](https://schema.org/) can be used with the prefix `schema`: |
| 97 | + |
| 98 | +```{code-block} python |
| 99 | +:caption: Using terms from a non-default schema |
| 100 | +data["schema:copyrightNotice"] = ... |
| 101 | +``` |
| 102 | + |
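A prefixed term such as `schema:copyrightNotice` is what JSON-LD calls a compact IRI. The following is a minimal sketch of how such a term could be resolved to a full IRI. It is not the `hermes` implementation: `PREFIXES` and `expand_term` are hypothetical names, and the namespace IRI for `schema` is an assumption based on the expanded property names shown in the internal model later on this page.

```python
# Hypothetical prefix table (assumption, not taken from hermes itself).
PREFIXES = {"schema": "http://schema.org/"}


def expand_term(term: str, prefixes: dict[str, str]) -> str:
    """Expand a compact IRI such as "schema:name" to a full IRI.

    Terms without a known prefix are returned unchanged; a real JSON-LD
    processor would resolve them through the active context instead.
    """
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + suffix
    return term


print(expand_term("schema:copyrightNotice", PREFIXES))
# prints: http://schema.org/copyrightNotice
print(expand_term("readme", PREFIXES))
# prints: readme
```

Unprefixed terms are deliberately left alone here, because resolving them requires the full default context rather than a simple prefix lookup.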
| 103 | +You can also use other linked data vocabularies. To do this, you need to identify them with a prefix and register them |
| 104 | +with the data model by passing it `extra_vocabs` as a `dict` mapping prefixes to URLs where the vocabularies are |
| 105 | +provided as JSON-LD: |
| 106 | + |
| 107 | +```{code-block} python |
| 108 | +:caption: Injecting additional schemas |
| 109 | +from hermes.model import SoftwareMetadata |
| 110 | +
|
| 111 | +# Contents served at https://bar.net/schema.jsonld: |
| 112 | +# { |
| 113 | +# "@context": |
| 114 | +# { |
| 115 | +# "name": "https://schema.org/name" |
| 116 | +# } |
| 117 | +# } |
| 118 | +
|
| 119 | +data = SoftwareMetadata(extra_vocabs={"foo": "https://bar.net/schema.jsonld"}) |
| 120 | +
|
| 121 | +data["foo:name"] = ... |
| 122 | +``` |
| 123 | + |
| 124 | +##### Adding data |
| 125 | + |
| 126 | +Once you have an instance of {class}`hermes.model.SoftwareMetadata`, you can add data to it, |
| 127 | +i.e., metadata that describes software: |
| 128 | + |
| 129 | +```{code-block} python |
| 130 | +:caption: Setting data values |
| 131 | +data["name"] = "My Research Software" # A simple "Text"-type value |
| 132 | +# → Simplified model representation : { "name": [ "My Research Software" ] } |
| 133 | +# Cf. "Accessing data" below |
| 134 | +data["author"] = {"name": "Shakespeare"} # An object value that uses terms available in the defined context |
| 135 | +# → Simplified model representation : { "name": [ "My Research Software" ], "author": [ { "name": [ "Shakespeare" ] } ] } |
| 136 | +# Cf. "Accessing data" below |
| 137 | +``` |
| 138 | + |
| 139 | +##### Accessing data |
| 140 | + |
| 141 | +You need to be able to access data in the data model instance to add, edit or remove data. |
| 142 | +Data can be accessed by using term strings, similar to how values in Python `dict`s are accessed by keys. |
| 143 | + |
| 144 | +```{important} |
| 145 | +When you access data from a data model instance, |
| 146 | +it will always be returned in a **list**-like object! |
| 147 | +``` |
| 148 | + |
| 149 | +The reason for providing data in list-like objects is that JSON-LD, in its expanded form, represents all property values as arrays. |
| 150 | +Even if you add "single value" data to a `hermes` data model instance via the API, the underlying JSON-LD model |
| 151 | +will treat it as an array, i.e., a list-like object: |
| 152 | + |
| 153 | +```{code-block} python |
| 154 | +:caption: Internal data values are arrays |
| 155 | +data["name"] = "My Research Software" # → [ "My Research Software" ] |
| 156 | +data["author"] = {"name": "Shakespeare"} # → [ { "name": [ "Shakespeare" ] } ] |
| 157 | +``` |
| 158 | + |
| 159 | +Therefore, you access data in the same way you would access data from a Python `list`: |
| 160 | + |
| 161 | +1. You access single values using indices, e.g., `data["name"][0]`. |
| 162 | +2. You can use a list-like API to interact with data objects, e.g., |
| 163 | +`data["name"].append("Hamilton")`, `data["name"].extend(["Hamilton", "Knuth"])`, `for name in data["name"]: ...`, etc. |
| 164 | + |
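To make the list-wrapping behaviour more tangible, here is a small stand-in written in plain Python. It is only an illustration under stated assumptions: `ListWrappingDict` is a hypothetical class, not part of `hermes`, and it only mimics the behaviour described above for simple, non-nested values.

```python
class ListWrappingDict(dict):
    """Hypothetical stand-in that stores every value as a list.

    This is NOT the hermes implementation; it only mimics the behaviour
    described above for plain (non-nested) values.
    """

    def __setitem__(self, key, value):
        # Wrap single values so that reads always return a list.
        if not isinstance(value, list):
            value = [value]
        super().__setitem__(key, value)


data = ListWrappingDict()
data["name"] = "My Research Software"
print(data["name"])       # prints: ['My Research Software']
print(data["name"][0])    # prints: My Research Software

# The stored value is a regular list, so the usual list API applies.
data["name"].append("Hamilton")
data["name"].extend(["Knuth"])
print(len(data["name"]))  # prints: 3
```

The point of the sketch is the access pattern: a value assigned as a single string is read back as a one-element list, so `[0]` is needed to get the string itself.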
| 165 | +##### Interacting with data |
| 166 | + |
| 167 | +The following longer example shows different ways that you can interact with `SoftwareMetadata` objects and the data API. |
| 168 | + |
| 169 | +```{code-block} python |
| 170 | +:caption: Building the data model |
| 171 | +from hermes.model import SoftwareMetadata |
| 172 | +
|
| 173 | +# Create the model object with the default context |
| 174 | +data = SoftwareMetadata() |
| 175 | +
|
| 176 | +# Let's create author metadata for our software! |
| 177 | +# Below each line of code, the value of `data["author"]` is given. |
| 178 | +
|
| 179 | +data["author"] = {"name": "Shakespeare"} |
| 180 | +# → [{'name': ['Shakespeare']}] |
| 181 | +
|
| 182 | +data["author"].append({"name": "Hamilton"}) |
| 183 | +# [{'name': ['Shakespeare']}, {'name': ['Hamilton']}] |
| 184 | +
|
| 185 | +data["author"][0]["email"] = "[email protected]" |
| 186 | +# [{'name': ['Shakespeare'], 'email': ['[email protected]']}, {'name': ['Hamilton']}] |
| 187 | +
|
| 188 | +data["author"][1]["email"].append("[email protected]") |
| 189 | +# [{'name': ['Shakespeare'], 'email': ['[email protected]']}, {'name': ['Hamilton'], 'email': ['[email protected]']}] |
| 190 | +
|
| 191 | +data["author"][1]["email"].extend(["[email protected]", "[email protected]"]) |
| 192 | +# [ |
| 193 | +#     {'name': ['Shakespeare'], 'email': ['[email protected]']}, |
| 194 | +#     {'name': ['Hamilton'], 'email': ['[email protected]', '[email protected]', '[email protected]']} |
| 195 | +# ] |
| 196 | +``` |
| 197 | + |
| 198 | +The example continues to show how to iterate through data. |
| 199 | + |
| 200 | +```{code-block} python |
| 201 | +:caption: for-loop, containment check |
| 202 | +for i, author in enumerate(data["author"], start=1): |
| 203 | +    if author["name"][0] in ["Shakespeare", "Hamilton"]: |
| 204 | +        print(f"Author {i} has expected name.") |
| 205 | +    else: |
| 206 | +        raise ValueError("Unexpected author name found!", author["name"][0]) |
| 207 | +
|
| 208 | +# Mock output: |
| 209 | +# $> Author 1 has expected name. |
| 210 | +# $> Author 2 has expected name. |
| 211 | +``` |
| 212 | + |
| 213 | +```{code-block} python |
| 214 | +:caption: Value check |
| 215 | +for email in data["author"][0]["email"]: |
| 216 | +    if email.endswith(".edu"): |
| 217 | +        print("Shakespeare has an email address at an educational institution.") |
| 218 | +    else: |
| 219 | +        print("Cannot confirm affiliation with educational institution for Shakespeare.") |
| 220 | +
|
| 221 | +# Mock output |
| 222 | +# $> Cannot confirm affiliation with educational institution for Shakespeare. |
| 223 | +``` |
| 224 | + |
| 225 | +```{code-block} python |
| 226 | +:caption: Value check and list comprehension |
| 227 | +if all(["hamilton" in email for email in data["author"][1]["email"]]): |
| 228 | +    print("Author has only emails with their name in it.") |
| 229 | +
|
| 230 | +# Mock output |
| 231 | +# $> Author has only emails with their name in it. |
| 232 | +``` |
| 233 | + |
| 234 | +The example continues to show how to assert data values. |
| 235 | + |
| 236 | +As mentioned in the [introduction to the data model](#data-model), |
| 237 | +`hermes` uses a JSON-LD-like internal data model. |
| 238 | +The API class {class}`hermes.model.SoftwareMetadata` hides many |
| 239 | +of the more complex aspects of JSON-LD and makes it easy to work |
| 240 | +with the data model. |
| 241 | + |
| 242 | +Because the API class hides the internal model objects, |
| 243 | +the values it returns behave as you would expect from plain |
| 244 | +Python data: |
| 245 | + |
| 246 | +```{code-block} python |
| 247 | +:caption: Naive containment assertion against plain Python data |
| 248 | +:emphasize-lines: 5,13 |
| 249 | +try: |
| 250 | +    assert ( |
| 251 | +        {'name': ['Shakespeare'], 'email': ['[email protected]']} |
| 252 | +        in |
| 253 | +        data["author"] |
| 254 | +    ) |
| 255 | +    print("The author was found!") |
| 256 | +except AssertionError: |
| 257 | +    print("The author could not be found.") |
| 258 | +    raise |
| 259 | +
|
| 260 | +# Mock output |
| 261 | +# $> The author was found! |
| 262 | +# |
| 263 | +# |
| 264 | +# Internal Model from data["author"]: |
| 265 | +# {'@list': [ |
| 266 | +#     { |
| 267 | +#         'http://schema.org/name': [{'@value': 'Shakespeare'}], |
| 268 | +#         'http://schema.org/email': [{'@value': '[email protected]'}] |
| 269 | +#     }, |
| 270 | +#     { |
| 271 | +#         'http://schema.org/name': [{'@value': 'Hamilton'}], |
| 272 | +#         'http://schema.org/email': [ |
| 273 | +#             {'@list': [ |
| 274 | +#                 {'@value': '[email protected]'}, {'@value': '[email protected]'}, {'@value': '[email protected]'} |
| 275 | +#             ]} |
| 276 | +#         ] |
| 277 | +#     }] |
| 278 | +# } |
| 280 | +``` |
| 281 | + |
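To illustrate the gap between the internal model and the plain-Python view that the API exposes, the following sketch converts an expanded, JSON-LD-like structure into simple lists and dictionaries. It rests on several assumptions: `simplify` is a hypothetical helper and not how `hermes` does it, the email addresses are invented placeholders, and shortening property IRIs to their last path segment is a simplification that ignores the context.

```python
def simplify(node):
    """Recursively turn an expanded JSON-LD-like structure into plain data."""
    if isinstance(node, list):
        items = []
        for item in node:
            simple = simplify(item)
            # Flatten {'@list': [...]} containers into the parent list.
            if isinstance(item, dict) and "@list" in item:
                items.extend(simple)
            else:
                items.append(simple)
        return items
    if isinstance(node, dict):
        if "@value" in node:
            return node["@value"]
        if "@list" in node:
            return simplify(node["@list"])
        # Keep only the last path segment of each property IRI as the key.
        return {
            key.rsplit("/", 1)[-1]: simplify(value)
            for key, value in node.items()
        }
    return node


# Invented sample data shaped like the internal model shown above.
internal = {
    "@list": [
        {
            "http://schema.org/name": [{"@value": "Shakespeare"}],
            "http://schema.org/email": [{"@value": "will@example.org"}],
        },
        {
            "http://schema.org/name": [{"@value": "Hamilton"}],
            "http://schema.org/email": [
                {"@list": [{"@value": "h1@example.org"}, {"@value": "h2@example.org"}]}
            ],
        },
    ]
}

authors = simplify(internal)
print(authors[0])
# prints: {'name': ['Shakespeare'], 'email': ['will@example.org']}
print({"name": ["Hamilton"], "email": ["h1@example.org", "h2@example.org"]} in authors)
# prints: True
```

Once the structure is reduced to plain lists and dictionaries, an ordinary `in` check works, which is the behaviour the containment assertion above relies on.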
| 282 | +--- |
| 283 | + |
| 284 | +## See Also |
| 285 | + |
| 286 | +- API reference: {class}`hermes.model.SoftwareMetadata` |