SPARQL query strategy • cybedtools

The queries operate against the unified RDF graph produced by scripts/030-load-rdf-graph.R and, for query execution, the combined N-Triples file produced by scripts/025-export-ntriples.R.

Design choices

The package supports two kinds of question:

Within-framework questions (such as “which task statements in NICE involve cloud or container technologies?”).
Across-framework questions (such as “how do role and element counts vary by jurisdiction or sector?”).

The cybed: base vocabulary is the selection path for both. A single helper call targeting cybed:OrganizingUnit returns comparable parent-level bindings across NICE, DCWF, SFIA, ECSF, Cyber.org K-12, CSTA, CSEC2017, and DigComp 2.2; targeting cybed:Role restricts to the workforce-framework subset (NICE / DCWF / ECSF); targeting cybed:RoleElement returns atomic content nodes (parents, Subpoints, and Examples).

What these queries surface

Three findings the package’s analytical layer produces directly from the eight-framework graph:

Element density per framework varies by ~12x with Cyber.org K-12 / CSTA pedagogical Examples included (NICE 51.6 elements per work role, DigComp 4.2 per competence area). Without Examples the spread widens to ~49x because Cyber.org K-12’s 116 cells contain only 123 numbered standards. Per-unit density is a comparison aid across heterogeneous denominators, not a quality claim.
Jurisdictional element coverage is dominated by US frameworks (NICE, DCWF, Cyber.org K-12, CSTA) by an order of magnitude over EU frameworks (ECSF, DigComp), reflecting design-philosophy differences (ECSF profile-level by intent, DigComp citizen-self-assessment by intent) rather than corpus completeness.
The five highest-element-load NICE work roles concentrate disproportionate competency specification (Security Control Assessment 307, Secure Systems Development 232, Cybersecurity Architecture 219, Defensive Cybersecurity 206, Systems Security Management 204).

See the cross-framework-analysis vignette for the worked R that produces them.

Why single-BGP queries with R-side joins

The librdf C library that rdflib wraps exhibits poor performance and silent zero-row results on conjunctive triple patterns at this graph’s scale. Multi-pattern SPARQL joins via shared variables hang for many minutes. Multi-property selects on a single subject silently return no rows. Single basic graph patterns (one triple match per SPARQL call) execute fast and correctly.

The package’s discipline is therefore:

SPARQL queries are single basic graph patterns. One triple match per call.
Joins, multi-property assembly, and aggregation happen in R via dplyr.

This is implemented by the helpers in R/sparql-helpers.R:

sparql_pairs(rdf, predicate) returns subject-object pairs for all triples with a given predicate.
sparql_subjects(rdf, predicate, object) returns subjects of triples whose predicate and object are fixed.

Domain-level helpers compose these primitives:

framework_metadata(rdf) returns a tibble of (framework, name, jurisdiction, sector, specificity).
organizing_unit_framework_bindings(rdf) returns (unit, unit_name, framework, framework_name) for every framework’s top-level enumerated unit. The cross-framework cut.
role_framework_bindings(rdf) returns (role, role_name, framework, framework_name) restricted to workforce frameworks where cybed:Role is asserted (NICE / DCWF / ECSF).
element_framework_bindings(rdf) returns (element, framework, framework_name) for every cybed:RoleElement, including Subpoints and Examples.
example_framework_bindings(rdf) returns (example, framework, framework_name) for the cybed:Example subset (Cyber.org K-12 and CSTA Clarification scaffolding).
role_element_bindings(rdf) returns (role, element) for every cybed:hasElement triple. Excludes Examples (which are reachable only via cybed:hasExample).

Each domain helper makes one to four single-BGP queries and joins the results via dplyr left-joins or semi-joins.

Query families

Family A: Structural

A1. Framework metadata inventory. One row per framework with jurisdiction, sector, specificity. Uses framework_metadata().
A2. Organizing units per framework and elements per framework. Aggregated in R from organizing_unit_framework_bindings() and element_framework_bindings(). For workforce-restricted aggregations, swap in role_framework_bindings().
A3. Element density. Elements per top-level unit per framework, joining role_element_bindings() with organizing_unit_framework_bindings(). Surfaces the cross-framework structural-density spread (with-examples count): NICE around 51.6 per work role, DCWF around 39.8 per work role, ECSF around 32.5 per role profile, CSTA around 10.2 per level-x-concept cell, SFIA around 5.6 per skill, CSEC2017 around 5.0 per Knowledge Area, Cyber.org K-12 around 4.3 per grade-band-x-sub-concept cell, DigComp around 4.2 per competence area.
A4. Missing required properties. Quality control. Surface RoleElement subjects without cybed:elementText (use sparql_subjects() and dplyr anti_join).

Family B: Cross-framework pivots

B1. Element volume by jurisdiction. Join element_framework_bindings() to framework_metadata()$jurisdiction and aggregate.
B2. Element volume by sector. Same shape, on framework_metadata()$sector.
B3. Element volume by specificity. Same shape, on framework_metadata()$specificity.
B4. Framework-vs-framework structural comparison. Filter role_framework_bindings() to two frameworks and compare role counts, element-per-role distributions.

The runner (scripts/040-run-sparql.R) implements six named analyses (q10 through q15) that map onto these families and write one CSV per analysis to data/processed/query-results/.

Implementation notes

The package’s analytical queries live in R, not in .rq files. The inst/queries/ directory is reserved for user-supplied custom queries. See its README.md.
Direct SPARQL access remains available via rdflib::rdf_query(rdf, query_text) for users who need it. Stick to single basic graph patterns for reliability.
For aggregation, never use COUNT, GROUP BY, or HAVING in SPARQL. librdf’s SPARQL 1.1 aggregate support is unreliable. Aggregate in dplyr.
A future release may add an Apache Jena Fuseki backend for full SPARQL 1.1 support. This would relax the single-BGP discipline for users running against a Fuseki endpoint.