Documentation
This algorithm is derived from the original one implemented by Cognizone, and credit goes to them. It was improved in significant ways:
- Used a layered visitor-pattern architecture for more modularity
- Used a sampling technique to work with large datasets
- Improved the NodeShape derivation algorithm to exclude certain types when entities have multiple types
- Added counting of entities and properties
The algorithm works best if the dataset:
- Uses one and only one rdf:type value per entity (although the algorithm can be smart enough to exclude some types, see below)
- Contains only data, not the RDFS/OWL model
SHACL generation algorithm
The algorithm follows these steps to generate the SHACL:
- Find all types in the dataset.
  Relies on this SPARQL query.
  Generates one sh:NodeShape for each type, with sh:targetClass set to the type.
- For each found type, find all properties used on instances of this type.
  Relies on this SPARQL query.
  Generates one sh:PropertyShape for each property on the type, with an sh:path set to this property.
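As an illustration of the first two steps, here is a minimal Python sketch that works on an in-memory list of triples instead of SPARQL queries. The `discover_shapes` function, the tuple representation of triples, and the `ex:` data are assumptions made for this example, as is the choice to skip `rdf:type` when listing properties.

```python
RDF_TYPE = "rdf:type"

def discover_shapes(triples):
    """Return {type: sorted list of properties used on instances of that type}."""
    # Step 1: collect the types of every typed subject.
    types_by_subject = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_by_subject.setdefault(s, set()).add(o)
    # Each distinct type stands for one node shape (sh:targetClass = the type).
    shapes = {t: set() for ts in types_by_subject.values() for t in ts}
    # Step 2: each property used on an instance stands for one property shape (sh:path).
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # assumption of this sketch: rdf:type itself is not listed
        for t in types_by_subject.get(s, ()):
            shapes[t].add(p)
    return {t: sorted(props) for t, props in shapes.items()}

# Hypothetical sample data.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", "rdf:type", "foaf:Organization"),
    ("ex:acme", "foaf:name", "ACME"),
]
shapes = discover_shapes(triples)
```

In this sketch an entity with several types contributes its properties to every one of those types, which is why a single rdf:type value per entity gives the cleanest result.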
- For each property shape previously found, determine its node kind (IRI or Literal).
  Relies on this SPARQL query, this one, and this one.
  Generates the sh:nodeKind constraint on the property shape accordingly.
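The node kind decision can be sketched as a lookup on the kinds of values observed for a property. The `node_kind` function and the `(kind, value)` pair representation are hypothetical. How the actual algorithm treats properties with mixed value kinds is not stated here, so the combined SHACL node kinds below are an assumption.

```python
# Maps the set of observed value kinds to a SHACL node kind.
NODE_KINDS = {
    frozenset({"iri"}): "sh:IRI",
    frozenset({"bnode"}): "sh:BlankNode",
    frozenset({"literal"}): "sh:Literal",
    # Assumption: mixed cases fall back to the combined SHACL node kinds.
    frozenset({"iri", "bnode"}): "sh:BlankNodeOrIRI",
    frozenset({"iri", "literal"}): "sh:IRIOrLiteral",
    frozenset({"bnode", "literal"}): "sh:BlankNodeOrLiteral",
}

def node_kind(values):
    """values: iterable of (kind, value) pairs, kind in {"iri", "bnode", "literal"}.

    Returns a SHACL node kind, or None when no single constraint applies
    (no values at all, or all three kinds mixed).
    """
    kinds = frozenset(kind for kind, _ in values)
    return NODE_KINDS.get(kinds)
```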
- For each property shape previously found with an sh:nodeKind of IRI or BlankNode, determine the types of the property values.
  Relies on this SPARQL query.
  Generates the sh:class constraint on the property shape accordingly. If more than one class is found, the algorithm determines whether some can be removed:
  - If one class is a superset of all other classes found (indicating that the dataset types instances redundantly, e.g. assigning both skos:Concept and a subclass of skos:Concept to entities), but is also a superset of other classes, then this superset class (e.g. skos:Concept) is removed from the list, and only the most precise class(es) are kept.
  - If one class is a superset of all other classes found, and is not a superset of any other class, then only the superset class is kept, and the more precise classes are removed from the list.
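The two class-removal rules can be read as follows, under the assumption that "superset" means that the instances of one class include all instances of another. The `reduce_classes` function and the `instances_by_class` map (class to its set of instances in the whole dataset, assumed non-empty for every class) are hypothetical names for this sketch, which is one possible reading rather than the actual implementation.

```python
def reduce_classes(found, instances_by_class):
    """Reduce the classes found on a property's values.

    found: set of classes observed on the property values.
    instances_by_class: {class: set of its instances} for the whole dataset.
    """
    def is_superset(a, b):
        return a != b and instances_by_class[a] >= instances_by_class[b]

    for candidate in sorted(found):
        others = found - {candidate}
        if others and all(is_superset(candidate, o) for o in others):
            # Does the candidate also cover classes that were NOT found on the values?
            covers_outside = any(
                is_superset(candidate, c)
                for c in instances_by_class
                if c not in found
            )
            if covers_outside:
                return others       # rule 1: too generic, keep the precise classes
            return {candidate}      # rule 2: keep only the superset class
    return set(found)               # no class covers all the others: keep them all

# Hypothetical data: skos:Concept also covers ex:Language, so it is dropped.
concepts = {
    "skos:Concept": {"a", "b", "c", "d"},
    "ex:Country": {"a", "b"},
    "ex:Language": {"c", "d"},
}
rule1 = reduce_classes({"skos:Concept", "ex:Country"}, concepts)

# Hypothetical data: ex:Agent covers exactly the classes found, so it is kept alone.
agents = {"ex:Agent": {"p1", "o1"}, "ex:Person": {"p1"}, "ex:Org": {"o1"}}
rule2 = reduce_classes({"ex:Agent", "ex:Person", "ex:Org"}, agents)
```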
- For each property shape previously found with an sh:nodeKind of Literal, determine the datatype and languages of the property values.
  Relies on this SPARQL query, and this one.
  Generates the sh:datatype and sh:languageIn constraints on the property shape accordingly.
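A minimal sketch of the datatype and language step might look like this. The `literal_constraints` function and the `(lexical, datatype, lang)` triple representation are hypothetical, and emitting sh:datatype only when a single datatype is observed is an assumption of this example.

```python
def literal_constraints(literals):
    """literals: iterable of (lexical, datatype, lang); lang is None when untagged."""
    literals = list(literals)
    datatypes = {dt for _, dt, _ in literals}
    languages = sorted({lang for _, _, lang in literals if lang})
    constraints = {}
    # Assumption: a datatype constraint is only meaningful if all values share it.
    if len(datatypes) == 1:
        constraints["sh:datatype"] = next(iter(datatypes))
    if languages:
        constraints["sh:languageIn"] = languages
    return constraints
```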
- For each property shape previously found, determine the cardinalities of the property.
  Relies on this SPARQL query, and this one.
  This step can only detect minimum and maximum cardinalities of 1.
  Generates the sh:minCount and sh:maxCount constraints on the property shape accordingly.
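The cardinality check can be sketched as two questions: does every instance have at least one value, and does every instance have at most one? The `cardinalities` function and the in-memory triple list are hypothetical; the actual algorithm relies on SPARQL queries.

```python
from collections import Counter

def cardinalities(instances, triples, prop):
    """Return the sh:minCount / sh:maxCount constraints (only ever 1) for prop."""
    # Number of values of prop per subject.
    counts = Counter(s for s, p, _ in triples if p == prop)
    result = {}
    if instances and all(counts[i] >= 1 for i in instances):
        result["sh:minCount"] = 1   # no instance lacks the property
    if instances and all(counts[i] <= 1 for i in instances):
        result["sh:maxCount"] = 1   # no instance has more than one value
    return result

# Hypothetical sample data.
people = ["ex:a", "ex:b"]
data = [
    ("ex:a", "foaf:name", "A"), ("ex:b", "foaf:name", "B"),
    ("ex:a", "foaf:mbox", "a@example.org"),
    ("ex:a", "ex:tag", "x"), ("ex:a", "ex:tag", "y"), ("ex:b", "ex:tag", "z"),
]
```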
- For each property shape previously found, list the values of the property if it has a limited number of possible values.
  Relies on this SPARQL query.
  This is done only if the property has three distinct values or fewer.
  Generates an sh:in or sh:hasValue constraint on the property shape accordingly.
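A short sketch of the value-listing step, assuming that sh:hasValue is used when there is exactly one distinct value and sh:in when there are two or three. That split is not spelled out above, and the `value_constraint` function is a hypothetical name.

```python
def value_constraint(values, max_distinct=3):
    """Return an sh:hasValue or sh:in constraint, or {} if there are too many values."""
    distinct = sorted(set(values))
    if not distinct or len(distinct) > max_distinct:
        return {}
    if len(distinct) == 1:
        # Assumption: a single possible value is expressed with sh:hasValue.
        return {"sh:hasValue": distinct[0]}
    return {"sh:in": distinct}
```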
- For each node shape previously found, determine whether one of its property shapes is a label of the entity.
  If a property skos:prefLabel, foaf:name, dcterms:title, schema:name or rdfs:label (in this order) is found, mark it as a label. Otherwise, try to find a literal property of datatype xsd:string or rdf:langString with an sh:minCount of 1; if only one is found, mark it as a label.
  Generates a dash:propertyRole with a dash:LabelRole value accordingly.
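The label selection rule can be sketched as a priority list followed by a fallback. The `find_label` function and the dict representation of property shapes are hypothetical.

```python
# Well-known label properties, in priority order.
LABEL_PROPERTIES = ["skos:prefLabel", "foaf:name", "dcterms:title", "schema:name", "rdfs:label"]
STRING_DATATYPES = {"xsd:string", "rdf:langString"}

def find_label(property_shapes):
    """property_shapes: list of dicts with "sh:path" and optional "sh:datatype", "sh:minCount".

    Returns the path to mark with dash:LabelRole, or None.
    """
    paths = {ps["sh:path"] for ps in property_shapes}
    for candidate in LABEL_PROPERTIES:
        if candidate in paths:
            return candidate
    # Fallback: a mandatory string property, but only if it is unambiguous.
    fallback = [
        ps["sh:path"]
        for ps in property_shapes
        if ps.get("sh:datatype") in STRING_DATATYPES and ps.get("sh:minCount") == 1
    ]
    return fallback[0] if len(fallback) == 1 else None
```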
- If requested, for each node shape and property shape previously found, count the number of instances of node shapes, the number of occurrences of property shapes, and the number of distinct values.
  This currently works only with sh:targetClass target definitions, but can easily be extended to handle other target definitions.
  Generates a void:Dataset, void:classPartition, and void:propertyPartition with a dcterms:conformsTo pointing to the corresponding shapes.
  Stores the counts in the void:entities, void:triples, or void:distinctObjects properties.
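The counting step can be sketched for a single sh:targetClass as follows. The mapping of instance counts to void:entities, occurrence counts to void:triples, and distinct value counts to void:distinctObjects is the presumed reading of the description above. The `count_statistics` function and the nested dict layout are hypothetical.

```python
def count_statistics(triples, target_class):
    """Count instances of target_class and, per property, occurrences and distinct values."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    objects_by_property = {}
    for s, p, o in triples:
        if s in instances and p != "rdf:type":
            objects_by_property.setdefault(p, []).append(o)
    return {
        "void:entities": len(instances),  # class partition count
        "properties": {                   # one property partition per property
            p: {"void:triples": len(objs), "void:distinctObjects": len(set(objs))}
            for p, objs in objects_by_property.items()
        },
    }

# Hypothetical sample data.
sample = [
    ("ex:a", "rdf:type", "foaf:Person"), ("ex:b", "rdf:type", "foaf:Person"),
    ("ex:a", "ex:country", "ex:FR"), ("ex:b", "ex:country", "ex:FR"),
    ("ex:a", "foaf:name", "Alice"), ("ex:b", "foaf:name", "Bob"),
]
stats = count_statistics(sample, "foaf:Person")
```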