Generate SHACL Profile

This algorithm derives a set of SHACL constraints from an RDF dataset. It can work either from an uploaded RDF dataset or from an online SPARQL endpoint. Detailed documentation is available below.

  Dataset

You can select multiple files. Supported extensions: .rdf, .ttl, .n3, .trig. Other extensions will be treated as RDF/XML. You can also upload zip files.
URL of an RDF file. The same extensions as for file upload are supported.
Supported syntaxes: Turtle, RDF/XML, JSON-LD, TriG, TriX, N-Quads. We recommend Turtle.

  SPARQL endpoint

Must be a publicly accessible SPARQL endpoint, preferably without "too much" data (avoid trying it with DBpedia or Wikidata; it will not work).

  Options

Whether rdfs:label and sh:name should be derived on the shapes. Uncheck this if you don't need them or if you have other means to retrieve them (e.g. from an OWL file).
/!\ Takes time! Runs additional queries to count the number of targets of each node shape, and the number of occurrences and of distinct values of each property shape. Stores these counts on void:classPartition and void:propertyPartition, using the void:entities, void:triples and void:distinctObjects predicates.

Documentation

This algorithm was derived from this original one implemented by Cognizone here. Credits to them. It was improved in significant ways:

  • Used a layered visitor-pattern architecture for more modularity
  • Used a sampling technique to work with large datasets
  • Improved the NodeShape derivation algorithm to exclude certain types when entities have multiple types
  • Added counting of entities and properties

The algorithm works best if the dataset:

  • Uses one and only one rdf:type value per entity (although the algorithm is smart enough to exclude some types; see below)
  • Contains only data, not the RDFS/OWL model
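
For example, a small dataset like the following is a good candidate: each entity carries exactly one rdf:type, and the ontology itself is not included (the ex: URIs are purely illustrative):

@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

ex:place_1 a ex:Place ;
	rdfs:label "Paris"@fr ;
	skos:broader ex:place_2 .

ex:place_2 a ex:Place ;
	rdfs:label "France"@fr .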

SHACL generation algorithm

The algorithm follows these steps to generate the SHACL; a sketch of the resulting shapes is given after the list:

  1. Find all types in the dataset. Relies on this SPARQL query (a sketch of this kind of query is given after this list). Generates one sh:NodeShape for each type, with sh:targetClass set to the type.
  2. For each type found, find all properties used on instances of this type. Relies on this SPARQL query. Generates one sh:PropertyShape for each property on the type, with sh:path set to this property.
  3. For each property shape previously found, determine its node kind (IRI or Literal). Relies on this SPARQL query, this one, and this one. Generates the sh:nodeKind constraint on the property shape accordingly.
  4. For each property shape previously found with a sh:nodeKind IRI or BlankNode, determine the types of the property values. Relies on this SPARQL query. Generates the sh:class constraint on the property shape accordingly. If more than one class is found, the algorithm determines if some can be removed:
    • If one class is a superclass of all the other classes found (indicating that the dataset types its instances redundantly, e.g. assigning both skos:Concept and a subclass of skos:Concept to entities), but is also a superclass of other classes, then this superclass (e.g. skos:Concept) is removed from the list, and only the most precise class(es) are kept.
    • If one class is a superclass of all the other classes found, and is not a superclass of other classes, then only this superclass is kept, and the other, more precise classes are removed from the list.
  5. For each property shape previously found with a sh:nodeKind Literal, determine the datatype and languages of the property values. Relies on this SPARQL query, and this one. Generates the sh:datatype and sh:languageIn constraints on the property shape accordingly.
  6. For each property shape previously found, determine the cardinalities of the property. Relies on this SPARQL query, and this one. Only minimum and maximum cardinalities of 1 can be detected, that is, whether the property is mandatory and/or single-valued (a query sketch is given after this list). Generates the sh:minCount and sh:maxCount constraints on the property shape accordingly.
  7. For each property shape previously found, list the values of the property if it has a limited number of possible values. Relies on this SPARQL query. This is done only if the property has 3 distinct values or fewer. Generates an sh:in or sh:hasValue constraint on the property shape accordingly.
  8. For each node shape previously found, determine if one of its property shapes is a label of the entity. If a property skos:prefLabel, foaf:name, dcterms:title, schema:name or rdfs:label is found (in this order of preference), it is marked as a label. Otherwise, the algorithm tries to find a literal property with datatype xsd:string or rdf:langString and an sh:minCount of 1; if exactly one is found, it is marked as a label. Generates a dash:propertyRole with the value dash:LabelRole accordingly.
  9. If requested, for each node shape and property shape previously found, count the number of instances of each node shape, the number of occurrences of each property shape, and the number of distinct values. This currently works only with the sh:targetClass target definition, but can easily be extended to deal with other target definitions. Generates a void:Dataset with void:classPartition and void:propertyPartition entries, each with a dcterms:conformsTo pointing to the corresponding shape. Stores the counts in the void:entities, void:triples or void:distinctObjects properties (a sketch of such counting queries is given after the statistics example below).
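
As an illustration of steps 1 and 2, the queries the algorithm relies on are roughly of the following shape (a minimal sketch; the real queries may add sampling limits, and <http://example.org/Place> is an illustrative type):

# Step 1 (sketch): list all types used in the dataset
SELECT DISTINCT ?type WHERE {
  ?s a ?type .
}

# Step 2 (sketch): list all properties used on instances of one type
SELECT DISTINCT ?property WHERE {
  ?s a <http://example.org/Place> ;
     ?property ?o .
}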
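
Step 6 can similarly be sketched with ASK queries, under the assumption that only cardinalities of 1 are detected (URIs are again illustrative):

# sh:maxCount 1 can be generated if no instance has two different values for the property
ASK {
  ?s a <http://example.org/Place> ;
     <http://example.org/status> ?v1 , ?v2 .
  FILTER(?v1 != ?v2)
}

# sh:minCount 1 can be generated if no instance is missing the property
ASK {
  ?s a <http://example.org/Place> .
  FILTER NOT EXISTS { ?s <http://example.org/status> ?v }
}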
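
Putting steps 1 to 8 together, the generated shapes look roughly like the following. This is a hand-written sketch, not the output of a real run: the ex: classes, properties and values are illustrative.

@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dash: <http://datashapes.org/dash#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .

# Step 1: one NodeShape per type found in the dataset
ex:shapes_Place
	a              sh:NodeShape ;
	rdfs:label     "Place" ;            # only if label/name derivation is enabled
	sh:targetClass ex:Place ;
	sh:property    ex:shapes_Place_label , ex:shapes_Place_broader , ex:shapes_Place_status .

# Steps 2 to 8: one PropertyShape per property used on instances of the type
ex:shapes_Place_label
	a                 sh:PropertyShape ;
	sh:path           rdfs:label ;      # step 2
	sh:nodeKind       sh:Literal ;      # step 3
	sh:datatype       rdf:langString ;  # step 5
	sh:languageIn     ( "en" "fr" ) ;   # step 5
	sh:minCount       1 ;               # step 6
	dash:propertyRole dash:LabelRole .  # step 8

ex:shapes_Place_broader
	a           sh:PropertyShape ;
	sh:path     skos:broader ;
	sh:nodeKind sh:IRI ;                # step 3
	sh:class    ex:Place ;              # step 4
	sh:maxCount 1 .                     # step 6

ex:shapes_Place_status
	a           sh:PropertyShape ;
	sh:path     ex:status ;
	sh:nodeKind sh:IRI ;
	sh:in       ( ex:Open ex:Closed ) . # step 7: 3 distinct values or fewer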

Modelling of dataset statistics

Here is an example of how statistics are expressed:

@prefix void:  <http://rdfs.org/ns/void#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .

# The dataset being analyzed
<https://xxx/sparql>
	a                    void:Dataset ;
	# one partition is created per NodeShape
	void:classPartition  <https://xxx/partition_Place> ;
	# Total number of triples in the Dataset
	void:triples         "11963716"^^xsd:int ;
	# A pointer to the URI of the shapes graph being used to generate these statistics
	sh:suggestedShapesGraph
	<https://xxx/shapes/> .

# A "Node Shape partition", that is, a partition of the entire dataset corresponding to all
# targets of one NodeShape
<https://xxx/partition_Place>
	# Link to the NodeShape
	dct:conformsTo          <https://xxx/shapes/Place> ;
	# When the NodeShape actually targets instances of a class, the partition we are describing is 
	# actually a class partition, and we can indicate the class here
	void:class              <https://www.ica.org/standards/RiC/ontology#Place> ;
	# Total number of targets of that shape in the dataset
	void:entities           "4551"^^xsd:int ;
	# One property partition is created per property shape in the node shape
	void:propertyPartition  <https://xxx/partition_Place_label> , <https://xxx/partition_Place_sameAs> .

# A "Property Shape partition", that is, a sub-partition of a "Node Shape partition" corresponding to all
# triples matching the path of the property
<https://xxx/partition_Place_label>
	# a link to the property shape
	dct:conformsTo        <https://xxx/shapes/Place_label> ;
	# number of distinct values of the property shape
	void:distinctObjects  "17330"^^xsd:int ;
	# when the property shape has a simple path consisting of a single predicate, we can repeat it here
	# and our partition is actually a real property partition
	void:property         <http://www.w3.org/2000/01/rdf-schema#label> ;
	# number of triples corresponding to the property shape
	void:triples          "17567"^^xsd:int .

<https://xxx/partition_Place_sameAs>
	dct:conformsTo        <https://xxx/shapes/Place_sameAs> ;
	void:distinctObjects  "14847"^^xsd:int ;
	void:property         <http://www.w3.org/2002/07/owl#sameAs> ;
	void:triples          "14854"^^xsd:int .
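
The counts above can be produced with simple aggregate queries. The following is a minimal sketch (the actual queries may differ; the class and property URIs are taken from the example above):

# void:entities of a class partition: number of targets of the NodeShape
SELECT (COUNT(DISTINCT ?s) AS ?entities) WHERE {
  ?s a <https://www.ica.org/standards/RiC/ontology#Place> .
}

# void:triples and void:distinctObjects of a property partition
SELECT (COUNT(?o) AS ?triples) (COUNT(DISTINCT ?o) AS ?distinctObjects) WHERE {
  ?s a <https://www.ica.org/standards/RiC/ontology#Place> ;
     <http://www.w3.org/2000/01/rdf-schema#label> ?o .
}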