Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON schemas for extracts export validation function #1075

Merged
merged 3 commits into from
Sep 26, 2022
Merged

Conversation

tidoust
Copy link
Member

@tidoust tidoust commented Sep 25, 2022

This creates JSON schemas for all extracts and files generated by Reffy. A new getSchemaValidationFunction function is now exported by Reffy to validate data against the schemas. It is typically intended for use in Webref to validate curated data before publication.

Some minor adjustments made along the way to make data more consistent and keep schemas relatively simple:

  • The dfns extractor now always returns a heading property, defaulting to the top of the page
  • The events extractor now gets rid of null properties altogether
  • The events post-processor no longer outputs null href.

Existing tests on extractors were updated to also check the schema of the extracted data. More tests could be added to check post-processors, and the crawler as a whole.

Note this update also includes a minor bug fix in the sorting code of the events post-processor so that it may run on not-yet-curated events extracts.

This is based on (and intends to replace the current form of) w3c/webref#731

This creates JSON schemas for all extracts and files generated by Reffy. A new
`getSchemaValidationFunction` function is now exported by Reffy to validate data
against the schemas. It is typically intended for use in Webref to validate
curated data before publication.

Some minor adjustments made along the way to make data more consistent and keep
schemas relatively simple:
- The dfns extractor now always returns a `heading` property, defaulting to the
top of the page
- The events extractor now gets rid of null properties altogether
- The events post-processor no longer outputs null href.

Existing tests on extractors were updated to also check the schema of the
extracted data. More tests could be added to check post-processors, and the
crawler as a whole.

Note this update also includes a minor bug fix in the sorting code of the events
post-processor so that it may run on not-yet-curated events extracts.
tidoust added a commit to w3c/webref that referenced this pull request Sep 26, 2022
Linked to w3c/reffy#1075

This will only work once a version of Reffy has been released that exposes the
appropriate schema validation function.
Copy link
Member

@dontcallmedom dontcallmedom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gosh, another amazing PR :)

a few suggestions, but feel free to merge as is an consider them as input for later :)

try {
schema = require(path.join(schemasFolder, schemaFile));
}
catch (err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this throw?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It threw in the first version I had, but then it seemed a bit rude to throw when the caller requested an unknown schema. Typically, returning null allows to avoid a try... catch in Webref:
https://github.com/w3c/webref/pull/731/files#diff-128accda0c348bbd52246995174cc7409be8cf4308f3a1389b5d9e47a556cb55R15-R18

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should distinguish between situations where we know validation doesn't make sense (as in the webref example) from situations where the requested schema name doesn't make sense (e.g. because there was a typo in the name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how the code is supposed to distinguish between these two situations (in both cases, all the code could tell is that the schema file does not exist). Why does it matter in practice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it matters - to avoid a situation where you think you're all good because you didn't get a complaint when in fact you made a type in your schema name.

how - I assume we know the patterns and the names that can be applied to it - anything that doesn't match that should throw (but my assumption may very well be wrong :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying not to hardcode the list of names so that we don't have to update that piece of code when we add more extraction tools and post-processing modules. The code can tell which names are correct for extraction modules by looking at property in src/browserlib/reffy.json. There is no automatic way to tell what files post-processing modules that operate at the crawl level may generate though.

I'll take the liberty to merge as-is. If you think that deserves an update, the floor is yours :)

src/lib/util.js Outdated Show resolved Hide resolved
schemas/common.json Outdated Show resolved Hide resolved
schemas/common.json Outdated Show resolved Hide resolved
schemas/browserlib/extract-dfns.json Outdated Show resolved Hide resolved
schemas/files/index.json Show resolved Hide resolved
schemas/postprocessing/events.json Outdated Show resolved Hide resolved
schemas/postprocessing/idlparsed.json Outdated Show resolved Hide resolved
- Drop now useless `nullableurl` construct
- Fix RegExp for interface names
- Add new common `interfacetype` and `extensiontype` definitions, used in
idlparsed and idlnames-parsed.
- Adjust postprocessing/events to use common `interfaces` definition
- Make list of allowed dfns types explicit
@@ -1026,14 +1026,31 @@ function getInterfaceTreeInfo(iface, interfaces) {
* if the requested schema does not exist.
*/
function getSchemaValidationFunction(schemaName) {
// Helper function that selects the right schema file from the given
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may question-mark as much as you want, it actually does improve readability :)

@tidoust tidoust merged commit 5991f32 into main Sep 26, 2022
@tidoust tidoust deleted the schemas branch September 26, 2022 17:48
tidoust added a commit that referenced this pull request Sep 26, 2022
New feature:
- Add JSON schemas for extracts and export validation function (#1075)

Bug fix:
- [CLI] Allow running post-processors without "--output" (#1074)

Dependencies bumped:
- Bump web-specs from 2.25.0 to 2.26.0 (#1076)
tidoust added a commit to w3c/webref that referenced this pull request Sep 27, 2022
This makes use of the new schema validation function in Reffy to make sure that
the curated data Webref produces follow expected scheams, see:
  w3c/reffy#1075

This replaces #731 and fixes #657.

Schemas, notably those that deal with parsed IDL structures, could go deeper
into details. To be improved over time.

Tests are run against the curated version of data. That is not necessary for
extracts that aren't actually curated (dfns, headings, ids, links, refs), just
more convenient not to have branching logic in the test code.
tidoust added a commit to w3c/webref that referenced this pull request Sep 27, 2022
This makes use of the new schema validation function in Reffy to make sure that
the curated data Webref produces follow expected scheams, see:
  w3c/reffy#1075

This replaces #731 and fixes #657.

Schemas, notably those that deal with parsed IDL structures, could go deeper
into details. To be improved over time.

Tests are run against the curated version of data. That is not necessary for
extracts that aren't actually curated (dfns, headings, ids, links, refs), just
more convenient not to have branching logic in the test code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants