Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Most BCP improvements from NCBI, as of May 2024 #558

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

ucko
Copy link
Contributor

@ucko ucko commented May 30, 2024

This PR roughly corresponds to the primarily BCP-related commits from #555, adding 02aa63c per dependency considerations and omitting a couple of commits that can receive individual PRs, which I'll open shortly.

@ucko
Copy link
Contributor Author

ucko commented Jun 4, 2024

I've force-pushed a corrected version superseding #570.

@ucko
Copy link
Contributor Author

ucko commented Jun 6, 2024

Meanwhile, I'm looking to extend this series with BCP-specific tuneups superseding #573. To that end, I wanted to compare behavior for three cases: normal BCP in, BCP into an alternate version of the table where column types are all VARCHAR(64) (to confirm that conversion to VARCHAR(n) is OK when truncation definitely won't be needed), and BCP into an alternate version where they're all VARCHAR(1) (to see what happens when truncation definitely will be needed), using ctlib/unittests/blk_in.c and dblib/unittests/bcp.c as starting points.

On the DBLIB front, I found that all three cases fully succeeded when built against this branch, but VARCHAR(1) failed for at least one column (details TBD) when using proprietary libraries, so more strictness may actually be in order. On the CTLIB front, something appears to be off altogether, as even the initial run with the usual column types fails with

Open Client Message: 0x608010 0x6259b0
number 0x1010104 layer 1 origin 1 severity 1 number 4
msgstring: blk_describe(): blk layer: user error: Parameter colnum has an illegal value of 1
osstring: (null)
blk_describe(1) failed
Open Client Message: 0x608010 0x6259b0
number 0x1010104 layer 1 origin 1 severity 1 number 4
msgstring: blk_describe(): blk layer: user error: Parameter colnum has an illegal value of 2
osstring: (null)
blk_describe(2) failed
[...]

Any idea what I might be missing? I took care to predefine SYB_LP64 and verified that only FreeTDS-specific headers come from FreeTDS.

UPDATE: Those errors turned out to stem from the use of CS_VERSION_100 and BLK_VERSION_100; advancing the requested version helped.

@ucko
Copy link
Contributor Author

ucko commented Jun 12, 2024

Here are my summarized comparison results (comparing against version 16.0 of Sybase's client libraries), where *B is (VAR)BINARY; *C is (VAR)CHAR; B, C, VB, and VC are respectively BINARY, CHAR, VARBINARY, and VARCHAR; *DT is (SMALL)DATETIME; *$ is (SMALL)MONEY; and # is any numerical type -- BIT, any variant of INT, REAL, FLOAT, DECIMAL, or NUMERIC:

CTLIB BCP truncation differences w/CS layer messages ignored (see below for corrected version):

FROM\TO	*B(1)	*B(64)	*C(1)	*C(64)
C(10)	a	a	b	OK
VC(10)	a	a	c	OK
B(10)	b	OK	c/d	OK
VB(10)	e	OK	e	OK
*DT	f/g	g	c	OK
*$/#	c/h	OK	e/h	OK
  • a)
    • Sybase: clearly expects hex digits; behavior depends on what it sees first:
      • non-hex char among first 2n: rowxfer fails w/syntax error
      • excess char when length is implicit [C(n)]: xfer noisily truncates but OK(!)
      • excess char when length is given [VC(n)]: rowxfer fails w/overflow error
    • FreeTDS: rowxfer silently fails in all those cases
  • b) silently truncates under Sybase (but OK), fails under FreeTDS
  • c) noisily truncates in rowxfer under Sybase (but OK), fails under FreeTDS
  • d) B(n) -> *C(m): silently truncates under Sybase when 4 <= m < 2n; the cutoff of 4 suggests a leading 0x, but VB(n) -> *C(2n) is OK
  • e) rowxfer fails under both, (overflow) error only under Sybase
  • f) S: bind fails (type mismatch), xfer OK (!); F: xfer fails; (g) if NULL
  • g) rowxfer always OK, but Sybase bind fails (type mismatch)
  • h) BIT is always OK, as is TINYINT -> *BINARY(1)

DBLIB BCP truncation differences (also corrected below):

FROM\TO	*B(1)	*B(64)	*C(1)	*C(64)
*C(10)	a	a	b	OK
B(10)	c	OK	b/d	OK
VB(10)	b	OK	b/d	OK
*DT/*$	b	OK	b	OK
#	b/e	OK/f	b/e	OK/f
  • a)
    • Sybase: syntax error if non-hex char among first 2n, otherwise overflow error as appropriate
    • FreeTDS: syntax error if non-hex char present at all, otherwise silently truncates
  • b) conversion overflow error (20049) from Sybase, silent truncation by FreeTDS
  • c) oversized row error (20074) from Sybase, silent truncation by FreeTDS
  • d) *B(n) -> *C(m): silently truncates under Sybase when 4 <= m < 2n
  • e) BIT is always OK, as is TINYINT -> *BINARY(1)
  • f) Sybase crashes with heap corruption for DECIMAL and NUMERIC

@ucko
Copy link
Contributor Author

ucko commented Jun 12, 2024

I'll turn back to other PRs for now, but do still plan to add a proper replacement for #573 at some point.

@ucko
Copy link
Contributor Author

ucko commented Jun 18, 2024

Ah, blk_in (like most ctlib unit tests) doesn't register a handler for messages from the CS layer, just from the CT layer and the server; it looks like resurveying with one in place will make a difference, at least when running against FreeTDS.

@ucko
Copy link
Contributor Author

ucko commented Jun 20, 2024

CTLIB BCP truncation differences w/CS layer messages printed, and more corner cases examined

General notes

  • This behavior is consistent between 15.7 and 16.0 libraries when using {CS,BLK}_VERSION_125 and up.
    • blk_describe fails altogether under {CS,BLK}_VERSION_120 and below.
    • blk_rowxfer started allowing noisy truncation [(1, 2, 1, 42)] sometime between 12.5 (which issued that same error but then failed) and 15.7; I don't have any installations with intermediate versions, but even 16.0 is a decade old. (At any rate, all other cases, including those that encounter silent truncation, behave the same across versions, modulo 12.5's more limited data type support.)
  • LC and LB are LONGCHAR and LONGBINARY, which act like base versions; L?C and L?B are short for C/LC and B/LB respectively; *DT now additionally covers DATE, TIME, and BIG(DATE)TIME; L and USH are LONG and USHORT, which do not act like *INT.
  • UNICHAR is out of scope per broader differences -- Sybase expects even-length input, which it inserts into binary columns as is modulo possible truncation (but does attempt to convert when inserting into string columns), whereas FreeTDS treats it like (LONG)CHAR.
  • DECIMAL and NUMERIC also work differently, but only in the context of binary columns -- Sybase stores just the main body (and drops trailing NUL bytes as usual), whereas FreeTDS stores the whole thing (space permitting), complete with trailing NUL bytes per its usual practice.
FROM\TO	*B(1)	*B(64)	*C(1)	*C(64)
L?C(10)	a	a	b	OK
VC(10)	a	a	c	OK
L?B(10)	b	OK	c	OK
VB(10)	d	OK	d	OK
*DT	e/f	f	c	g
*$/#	c/h	OK	d/h	i
L/USH	j	j	j	j
  • a) Both expect hex digits with optional leading 0x (and effectively prepend a 0 to the body if the length is odd), but their behavior when something is off otherwise varies.
    • Sybase: clearly expects hex digits; behavior depends on what it sees first:
      • non-hex char among first 2n(+2): rowxfer fails w/syntax error
        (1, 2, 1, 26) - blk_rowxfer(): blk layer: internal BLK-Library error: Failed in conversion routine - syntax error. col = # row = #.
      • excess char w/CS_NULLTERM [C(n)/LC(n)]: xfer noisily truncates but OK(!); if the accompanying full length is odd, effectively prepends a zero and discards the last digit even if the initial run of hex digits has even length.
        (1, 2, 1, 42) - blk_rowxfer(): blk layer: internal BLK-Library error: Data truncated while doing local character set conversion. col = # row = #.
      • excess char when length is restated [VC(n)]: rowxfer fails w/overflow error
        (1, 2, 1, 25) - blk_rowxfer(): blk layer: internal BLK-Library error: Failed in conversion routine - condition overflow. col = # row = #.
    • FreeTDS:
      • any other syntax anywhere: rowxfer fails w/syntax error
        (2, 4, 1, 24) - cs_convert: cslib user api layer: common library error: The conversion/operation was stopped due to a syntax error in the source field.
      • excess char otherwise: rowxfer fails w/overflow error
        (2, 4, 1, 36) - cs_convert: cslib user api layer: common library error: The result is truncated because the conversion/operation resulted in overflow.
  • b) silently truncates under Sybase (but OK), fails under FreeTDS -- (2, 4, 1, 36) as above
  • c) noisily truncates in rowxfer under Sybase (but OK): (1, 2, 1, 42) as above; NB: truncates to even length when starting with binary
    fails under FreeTDS: (2, 4, 1, 36) as above
  • d) rowxfer overflow error - (1, 2, 1, 25) under Sybase, (2, 4, 1, 36) under FTDS
  • e)
    • Sybase: bind fails (type mismatch), blk_rowxfer OK (!), storing raw content (quite possibly truncated)
      (1, 1, 1, 10) - blk_bind(): blk layer: user error: a table column of type # cannot be bound to a program variable of type #.
    • FreeTDS: blk_rowxfer fails [(2, 4, 1, 36) as above]; (f) if NULL
  • f) rowxfer always OK, but Sybase bind fails (type mismatch as above)
  • g) OK, but hour padding differs (space with Sybase, 0 with FreeTDS) and FreeTDS prepends Jan 01 1900 for pure (BIG)TIME values and appends 12:00AM for pure DATE values
  • h) BIT is always OK, as is (U)TINYINT -> *BINARY(1)
  • i) OK, but only Sybase trims trailing zeroes from fractional MONEY(4) values; OTOH, only Sybase formats CS_REAL (float) values with excess precision.
  • j) Same as (e) [length 1, not NULL]/(f) for Sybase; under FreeTDS, rowxfer fails with a type mismatch (unless NULL)
    (2, 1, 1, 16) - cs_convert: cslib user api layer: external error: Conversion between # and # datatypes is not supported.

Notably, all errors under Sybase officially come from the CT/BLK layer, whereas FreeTDS only ever issues CS-layer errors that offer less context and may be likelier to slip through the cracks. That should be possible to remedy by letting _cs_convert take an optional pointer to a BCP-error variable and populate it in lieu of reporting an error directly; this pointer's presence would also result in behavior tweaks.

@ucko
Copy link
Contributor Author

ucko commented Jul 1, 2024

Corrected DBLIB truncation differences (modulo formatting details along the above lines and FreeTDS's support for additional client types):

FROM\TO	*B(1)	*B(64)	*C(1)	*C(64)
*C(10)	a	a	b	OK
*B(10)	b	OK	c	OK
*DT/*$	c	OK	c	OK
#	c/d	OK/e	c/d	OK/e
  • a) Both expect hex digits with optional leading 0x (and effectively prepend a 0 to the body if the length is odd), but their behavior when something is off otherwise varies.
    • Sybase: syntax error (20050) if non-hex char among first 2n(+2), otherwise overflow error (20049) as appropriate
    • FreeTDS: syntax error (20050) if non-hex char present at all, otherwise silently truncates
  • b) oversized row error (20074) from Sybase, silent truncation by FreeTDS
  • c) overflow error (20049) from Sybase, silent truncation by FreeTDS
  • d) BIT is always OK, as is TINYINT -> *BINARY(1)
  • e) Sybase crashes with heap corruption for DECIMAL and NUMERIC

@ucko ucko force-pushed the ncbi-2024-05-bcp-improvements branch from 7b71bdb to 468812f Compare July 3, 2024 19:27
ucko added 12 commits August 28, 2024 13:38
When the user erroneously supplies NULL data for a column which does not
use a nullable type and is not otherwise known to be nullable, cleanly
fail.  Pressing on anyway (in hopes that a server-side default would
take effect?) was liable to trigger legitimate assertion failures in
tds_checks.c.  To that end, take the null_error callback and invoke it
before bailing unless it is itself NULL.

Signed-off-by: Aaron M. Ucko <[email protected]>
* Read defaults, send them back explicitly as needed, and clean up
  their storage afterwards.
* tds5_bcp_add_variable_columns:
** Send spaces in lieu of fully empty SYB(VAR)CHAR data.
** Send correct textpos values in all cases.
** When writing the adjustment table, avoid a potential Sybase "row
   format error" by eliding the (effective) column count when it would
   have repeated the largest adjustment table entry.
* tds_deinit_bcpinfo: Free and reset current_row for transfers in.
  Avoid potentially trying to interpret fragments of outgoing TDS
  data as BLOB pointers.

Signed-off-by: Aaron M. Ucko <[email protected]>
Under these protocol versions, MS SQL Server describes these columns
as having NVARCHAR(n) types rather than traditional DATETIME types,
due to DATETIME's range limitations.  (Likewise for TIME, which
however still needs to come in as a preconverted UTF-16 string to
compensate for combined API and protocol limitations.)

- ctlib/blk.c: Perform character conversion on strings converted from
  fixed-width types such as CS_DATETIME_TYPE when a comparison of
  column-size limits appears to indicate an encoding difference.
- tds/config.c: Adjust STD_DATETIME_FMT to match MS's (ISO-ish)
  tastes; in particular, ensure that the entire date portion fits
  within the first 10 characters.

Signed-off-by: Aaron M. Ucko <[email protected]>
* Extend _cs_convert to take an additional handle argument (allowed to be
  NULL) to provide for the possibility of replacing the caller's original
  output buffer with the one tdsconvert allocated itself (as it always
  does for blobs), rather than copying the blob data.
* blk.c (_blk_get_col_data): Provide a non-NULL handle, both for
  efficiency's sake and because tds_alloc_bcp_column_data imposes a 4 KiB
  cap with the expectation that callers will take this approach.  (dblib
  already does.)

Signed-off-by: Aaron M. Ucko <[email protected]>
To that end, factor a static _blk_clean_desc function out of blk_done's
CS_BLK_ALL case.

Signed-off-by: Aaron M. Ucko <[email protected]>
* tds.h (struct tds_bcpinfo): Add more fields to enable the necessary
  bookkeeping.
* ctlib/blk.c: Implement blk_textfer; adjust other functions accordingly.
* tds/{bulk,data}.c: Allow for deferred, and possibly piecemeal, blob
  submission; in particular, make tds_bcp_send_record resumable.

Signed-off-by: Aaron M. Ucko <[email protected]>
* Allow _cs_convert to accept an explicit TDS destination type; direct
  it to copy cres.dta if that destination type is SYBMS{DATE,TIME}*.
* _blk_get_col_data: Don't give _cs_convert_not_client any data to
  convert (blindly in the wrong direction!); rather, if it indicates
  that the column holds a new MS type, supply an explicit TDS
  destination type to _cs_convert.
* _cs_dt_crack_v2: Formally populate datetzone from dr.timezone (still
  always 0 in practice).

Signed-off-by: Aaron M. Ucko <[email protected]>
Adjust existing failure modes' boundaries and introduce the
possibility of proceeding with truncated input with or without a
message (with the latter occurring in some cases that require no
conversion).  As for messages, arrange to emit them via _ctclient_msg
rather than _csclient_msg with the same (English) wording and context
as Sybase yields, modulo formal punctuation differences and the usual
discrepancies in layer and origin numbering.

To that end:
* Add an internal BLK_CONV_STATUS enum type whose values double as
  (new) message numbers.
* Cover those numbers in _ct_get_user_api_layer_error.
* Have _cs_convert additionally take a (quite possibly NULL) pointer
  to a bulk-conversion-status value to populate as appropriate.
* Keep a running count of rows sent for reporting purposes
  (determining the column number to cite by a linear scan of the
  bindinfo's columns array).

Signed-off-by: Aaron M. Ucko <[email protected]>
The destination view may well have an INSTEAD OF INSERT trigger that
makes meaningful use of the supplied values.

Signed-off-by: Aaron M. Ucko <[email protected]>
Store a copy of the hints; clean it up in _blk_clean_desc.

Signed-off-by: Aaron M. Ucko <[email protected]>
Cast away harmless discrepancies, mostly by taking advantage of
TDS_PUT_BYTE.

Signed-off-by: Aaron M. Ucko <[email protected]>
@ucko ucko force-pushed the ncbi-2024-05-bcp-improvements branch from 468812f to 8739305 Compare August 28, 2024 17:42
@ucko
Copy link
Contributor Author

ucko commented Aug 28, 2024

My latest version of this series adds a commit to address a corner case I recently encountered by (re)allowing BCP into computed columns in the context of a FIRE_TRIGGERS hint, since an INSTEAD OF INSERT trigger may well be in play.

@freddy77
Copy link
Contributor

Picking up a commit to replace one work I was doing, specifically ae519f8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants