Add JSON schemas for extracts, export validation function #1075
Conversation
This creates JSON schemas for all extracts and files generated by Reffy. A new `getSchemaValidationFunction` function is now exported by Reffy to validate data against the schemas. It is typically intended for use in Webref to validate curated data before publication.

Some minor adjustments were made along the way to make data more consistent and keep schemas relatively simple:
- The dfns extractor now always returns a `heading` property, defaulting to the top of the page.
- The events extractor now gets rid of null properties altogether.
- The events post-processor no longer outputs a null `href`.

Existing tests on extractors were updated to also check the schema of the extracted data. More tests could be added to check post-processors, and the crawler as a whole.

Note this update also includes a minor bug fix in the sorting code of the events post-processor so that it may run on not-yet-curated events extracts.

This is based on (and intends to replace the current form of) w3c/webref#731
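For illustration, a minimal usage sketch of the new export; the import path, schema name, extract file, and the shape of the validator's result are assumptions, not something this description specifies:

```js
// Hypothetical consumer code, for illustration only.
// Assumptions: the function is exposed from Reffy's main entry point, and the
// returned validator reports errors (or a falsy value when the data is valid).
const { getSchemaValidationFunction } = require('reffy');

const dfnsExtract = require('./dfns/html.json');  // hypothetical extract file
const validateDfns = getSchemaValidationFunction('dfns');
const errors = validateDfns(dfnsExtract);
if (errors) {
  console.error('dfns extract does not match its schema:', errors);
}
```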
Linked to w3c/reffy#1075. This will only work once a version of Reffy has been released that exposes the appropriate schema validation function.
gosh, another amazing PR :)
a few suggestions, but feel free to merge as is and consider them as input for later :)
try {
  schema = require(path.join(schemasFolder, schemaFile));
}
catch (err) {
shouldn't this throw?
It threw in the first version I had, but then it seemed a bit rude to throw when the caller requested an unknown schema. Typically, returning `null` makes it possible to avoid a `try... catch` in Webref:
https://github.com/w3c/webref/pull/731/files#diff-128accda0c348bbd52246995174cc7409be8cf4308f3a1389b5d9e47a556cb55R15-R18
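A rough sketch of the caller-side pattern that the `null` return enables (hypothetical names and import path, not the actual Webref code):

```js
// Sketch only: an unknown or schema-less type yields null, so the caller can
// simply skip validation instead of wrapping the lookup in try...catch.
const { getSchemaValidationFunction } = require('reffy');

function validateIfPossible(type, data) {
  const validate = getSchemaValidationFunction(type);
  if (!validate) {
    return null;  // no schema applies to this type, nothing to validate
  }
  return validate(data);
}
```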
maybe this should distinguish between situations where we know validation doesn't make sense (as in the webref example) and situations where the requested schema name doesn't make sense (e.g. because there was a typo in the name)
I'm not sure I understand how the code is supposed to distinguish between these two situations (in both cases, all the code could tell is that the schema file does not exist). Why does it matter in practice?
why it matters - to avoid a situation where you think you're all good because you didn't get a complaint when in fact you made a typo in your schema name.
how - I assume we know the patterns and the names that can be applied to it - anything that doesn't match that should throw (but my assumption may very well be wrong :)
I was trying not to hardcode the list of names so that we don't have to update that piece of code when we add more extraction tools and post-processing modules. The code can tell which names are correct for extraction modules by looking at `property` in `src/browserlib/reffy.json`. There is no automatic way to tell what files post-processing modules that operate at the crawl level may generate, though.
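For reference, a sketch of what such a check could look like inside `getSchemaValidationFunction` for extraction modules, assuming `src/browserlib/reffy.json` is an array of module descriptors that each carry a `property` name (both the shape of that file as read here and the check itself are assumptions; the check is not implemented in this PR):

```js
// Sketch, not part of this PR: derive the known extract names for extraction
// modules from reffy.json and reject anything else.
// Assumptions: reffy.json is an array of objects with a "property" field, and
// the file sits in a browserlib folder next to this code.
const path = require('path');
const modules = require(path.join(__dirname, 'browserlib', 'reffy.json'));
const knownNames = modules.map(mod => mod.property);

if (!knownNames.includes(schemaName)) {
  throw new Error(`Unknown schema name "${schemaName}"`);
}
```

As noted above, a check like this would still not cover files that crawl-level post-processing modules may generate.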
I'll take the liberty to merge as-is. If you think that deserves an update, the floor is yours :)
- Drop now useless `nullableurl` construct
- Fix RegExp for interface names
- Add new common `interfacetype` and `extensiontype` definitions, used in idlparsed and idlnames-parsed
- Adjust postprocessing/events to use common `interfaces` definition
- Make list of allowed dfns types explicit
@@ -1026,14 +1026,31 @@ function getInterfaceTreeInfo(iface, interfaces) {
 * if the requested schema does not exist.
 */
function getSchemaValidationFunction(schemaName) {
  // Helper function that selects the right schema file from the given
you may question-mark as much as you want, it actually does improve readability :)
This makes use of the new schema validation function in Reffy to make sure that the curated data Webref produces follows expected schemas, see: w3c/reffy#1075

This replaces #731 and fixes #657.

Schemas, notably those that deal with parsed IDL structures, could go deeper into details. To be improved over time.

Tests are run against the curated version of data. That is not necessary for extracts that aren't actually curated (dfns, headings, ids, links, refs), just more convenient not to have branching logic in the test code.
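A sketch of what such a test could look like on the Webref side; the test framework, file paths, and the shape of the validator's result are all assumptions here:

```js
// Hypothetical test shape, for illustration only.
const assert = require('assert');
const { getSchemaValidationFunction } = require('reffy');

describe('curated dfns extracts', () => {
  it('match the dfns schema', () => {
    const validate = getSchemaValidationFunction('dfns');
    const data = require('../curated/dfns/html.json');  // hypothetical path
    const errors = validate(data);
    assert.ok(!errors, `Validation errors: ${JSON.stringify(errors, null, 2)}`);
  });
});
```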