Support defining schema in STRUCT type for mongodb cdc #19982

Open
hzxa21 opened this issue Jan 2, 2025 · 2 comments

hzxa21 commented Jan 2, 2025

Currently, data ingested from MongoDB CDC is parsed as the JSONB type. See example here.

Recently, users have requested a strongly typed schema for data ingested by MongoDB CDC, which would mean supporting a strongly typed STRUCT definition in the CREATE TABLE statement for MongoDB CDC. Example:

CREATE TABLE mongodb_source_table (
    payload STRUCT<
        foo STRUCT<
            foo1 JSONB,
            foo2 BIGINT
        >,
        bar STRUCT<
            bar1 STRUCT<
                bar1_1 VARCHAR,
                bar1_2 VARCHAR,
                bar1_3 STRUCT<
                    bar1_3_array STRUCT<
                        item1 DOUBLE,
                        item2 BIGINT,
                        ...
                    >[]
                >,
         ...
) WITH (
  connector = 'mongodb-cdc',
  ...
);

We already support parsing JSONB into a struct via jsonb_populate_record, so we can borrow that implementation here as well. The expected behavior is:

  • If the fields of a document ingested from MongoDB are exactly the same as those of the defined struct, the document will be parsed and ingested successfully.
  • If the fields of a document ingested from MongoDB are a superset of those of the defined struct, the document will be parsed and ingested successfully, with the extra fields ignored.
  • If the fields of a document ingested from MongoDB are a subset of those of the defined struct, the document will be parsed and ingested successfully, with NULL filled in for the missing fields.
  • If the value of a given field in a document ingested from MongoDB cannot be converted to the type defined in the struct (e.g. text cannot be cast to int), the whole document will be skipped by filling in NULL for the struct.
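
The four rules above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `Json`, `Ty`, `Datum`, `obj`, `convert` and `parse_payload` are hypothetical names made up for this sketch, they cover only a small subset of types, and they do not reflect RisingWave's actual parser or the real jsonb_populate_record implementation.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a parsed JSON document (not RisingWave's type).
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Number(f64),
    Text(String),
    Object(BTreeMap<String, Json>),
}

/// Hypothetical subset of the column types used in the CREATE TABLE example.
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    BigInt,
    Varchar,
    Struct(Vec<(String, Ty)>),
}

/// Typed value produced by the conversion.
#[derive(Debug, Clone, PartialEq)]
pub enum Datum {
    Null,
    BigInt(i64),
    Varchar(String),
    Struct(Vec<Datum>),
}

/// Small helper for building JSON objects in examples.
pub fn obj(pairs: &[(&str, Json)]) -> Json {
    Json::Object(pairs.iter().map(|(k, v)| (k.to_string(), v.clone())).collect())
}

/// Convert one JSON value to the declared type. `Err` means "cannot cast".
pub fn convert(value: &Json, ty: &Ty) -> Result<Datum, String> {
    match (value, ty) {
        (Json::Null, _) => Ok(Datum::Null),
        // Only whole numbers are accepted for BIGINT in this sketch.
        (Json::Number(n), Ty::BigInt) if n.fract() == 0.0 => Ok(Datum::BigInt(*n as i64)),
        (Json::Text(s), Ty::Varchar) => Ok(Datum::Varchar(s.clone())),
        (Json::Object(fields), Ty::Struct(schema)) => {
            let mut out = Vec::with_capacity(schema.len());
            for (name, field_ty) in schema {
                match fields.get(name) {
                    // Subset case: a declared field is missing, so fill in NULL.
                    None => out.push(Datum::Null),
                    // A nested failure propagates up through `?`.
                    Some(v) => out.push(convert(v, field_ty)?),
                }
            }
            // Superset case: keys not in the schema are never visited, so they are ignored.
            Ok(Datum::Struct(out))
        }
        (v, t) => Err(format!("cannot convert {:?} to {:?}", v, t)),
    }
}

/// Top-level entry: any conversion failure turns the whole struct into NULL.
pub fn parse_payload(doc: &Json, schema: &Ty) -> Datum {
    convert(doc, schema).unwrap_or(Datum::Null)
}
```

With a schema of `foo2 BIGINT, bar1_1 VARCHAR`, a document carrying only `foo2` would come out as a struct with NULL in the second position, while a document whose `foo2` is a string would make `parse_payload` return NULL for the whole struct.
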

github-actions bot added this to the release-2.3 milestone Jan 2, 2025

@xiangjinwu
Contributor

The conversion rules here may need to differ from those of jsonb_populate_record. In other words, we should AVOID changing jsonb_populate_record's behavior to fit the MongoDB use case, since that function is supposed to stay PostgreSQL compatible.

It can be a dialect here instead:

impl JsonParseOptions {
pub const CANAL: JsonParseOptions = JsonParseOptions {
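
To illustrate the dialect idea, here is a rough, self-contained sketch of per-dialect conversion options. Everything in it is hypothetical: `ConvertOptions`, its `unwrap_number_long` flag, the `POSTGRES` and `MONGODB` constants, `Value` and `to_bigint` are invented for this example and are not the fields or API of the real JsonParseOptions. The one outside fact it relies on is that MongoDB's canonical Extended JSON writes 64-bit integers as `{"$numberLong": "<digits>"}`; whether the CDC payload actually arrives in that form is an assumption.

```rust
/// Hypothetical per-dialect switches; not the real `JsonParseOptions` fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ConvertOptions {
    /// Assumed flag: accept `{"$numberLong": "<digits>"}` for BIGINT columns.
    pub unwrap_number_long: bool,
}

impl ConvertOptions {
    /// Strict rules, leaving the PostgreSQL-compatible behavior untouched.
    pub const POSTGRES: ConvertOptions = ConvertOptions { unwrap_number_long: false };
    /// Looser rules that would apply only to the MongoDB CDC source.
    pub const MONGODB: ConvertOptions = ConvertOptions { unwrap_number_long: true };
}

/// Tiny JSON stand-in, just enough for this example.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Number(i64),
    Text(String),
    Object(Vec<(String, Value)>),
}

/// Convert a value to BIGINT under the given dialect; `None` means "cannot cast".
pub fn to_bigint(value: &Value, opts: ConvertOptions) -> Option<i64> {
    match value {
        Value::Number(n) => Some(*n),
        // Only the MongoDB dialect unwraps the Extended JSON wrapper object.
        Value::Object(fields) if opts.unwrap_number_long => match fields.as_slice() {
            [(key, Value::Text(digits))] if key == "$numberLong" => digits.parse().ok(),
            _ => None,
        },
        _ => None,
    }
}
```

The point of the design is that the same input can convert under `ConvertOptions::MONGODB` and be rejected under `ConvertOptions::POSTGRES`, so the MongoDB-specific rule never leaks into the PostgreSQL-compatible function.
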

ClSlaid commented Jan 10, 2025

Do we really need the payload field? I would prefer that:

  • users just define the inner fields directly.
  • a strong_schema option is turned on in the connector configuration.
