Unable to use packed sequence for GridSearchCV #1083
Thanks for the detailed report and the reproducer. I could get the code to work with a few small changes. Here is the important part:

from skorch.helper import SliceDataset

X_sl = SliceDataset(dataset, idx=0)
y_sl = SliceDataset(dataset, idx=1)

cross_validate = 4

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    # train_split=ValidSplit(2),
    train_split=False,
    iterator_train__shuffle=False,
    iterator_train__batch_size=2,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

pipe = Pipeline([('net', mynet)])

gs = GridSearchCV(
    estimator=pipe,
    cv=cross_validate,
    param_grid={},
    refit=False,
    n_jobs=1,
)

Notable changes:
Thank you for the quick reply. I can get it to work with GroupKFold and a default int cv. StratifiedGroupKFold may still be important/useful in some cases; I can skip it for now but will probably need it in the future. In fact, I found two workarounds, but I am not sure if they are valid:

class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)

    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)

y_sl = MySliceDataset(dataset, idx=1)
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

If my workaround is correct, maybe SliceDataset could add a check for whether the elements of X have consistent lengths: if they do, use the previous behaviour; if not, use np.concatenate. That way skorch could support more data formats.

However, I did discover another issue: if I switch to StratifiedKFold, it returns a different error: "ValueError: Found input variables with inconsistent numbers of samples: [4, 70, 4]". The GridSearchCV API says: "For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used." Therefore, I went back to where the sequences are generated, assigned gt as int, and tried both cv=2 and cv=StratifiedKFold(n_splits=n_splits) when calling GridSearchCV. Surprisingly, cv=2 works but StratifiedKFold does not. Aren't they equivalent, or does skorch make it fall back to KFold? I tried defining ScoredClassifierNet(NeuralNetClassifier) as well and got the same outcome. If it falls back to KFold, is there a way to support StratifiedKFold?
gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64)  # where gt was generated in the previous for-loop

# this works
gs = GridSearchCV(estimator=pipe, cv=2,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

# this does not work
gs = GridSearchCV(estimator=pipe, cv=StratifiedKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])
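A minimal sketch of the length-consistency check suggested earlier in this comment (hypothetical, not part of skorch; the helper name _consistent_lengths is made up for illustration):

import numpy as np

def _consistent_lengths(items):
    # True if all elements share the same shape, so they can be stacked directly
    first = np.asarray(items[0]).shape
    return all(np.asarray(item).shape == first for item in items)

# inside a SliceDataset.__array__ override:
#     X = [self[i] for i in range(len(self))]
#     if np.isscalar(X[0]) or _consistent_lengths(X):
#         return np.asarray(X, dtype=dtype)
#     return np.asarray(np.concatenate(X), dtype=dtype)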
In fact, I identified another problem when enforce_sorted=False. Let me first summarize what works (not 100% sure if it is valid):

import numpy as np
from imblearn.pipeline import Pipeline
from skorch import NeuralNetClassifier, NeuralNetRegressor, NeuralNet
from skorch.scoring import loss_scoring
from torch import nn
import torch
from torch.nn.utils.rnn import pack_sequence, unpack_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from skorch.helper import SliceDataset

class ScoredRegressorNet(NeuralNetRegressor):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

# Define a custom dataset
class SequenceDataset(Dataset):
    # sequence and labels are lists
    def __init__(self, sequence, labels):
        self.sequence = sequence
        self.labels = labels

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, idx):
        return self.sequence[idx], self.labels[idx]

class MyRNN(nn.Module):
    def __init__(self, input_size, **rnn_kwargs):
        super(MyRNN, self).__init__()
        self.rnn = nn.RNN(input_size=input_size, **rnn_kwargs, batch_first=True)
        self.hidden_state = None

    def forward(self, x):
        y, hs = self.rnn(x)
        # w/o unpack and concat, skorch cannot compute the loss during the fit_loop
        y = unpack_sequence(y)
        y = torch.concat(y)
        y = y.squeeze()  # added to remove the warning when calling the loss function
        return y
n_seq = 5
nFeatures = 2
seq = [None]*n_seq
gt = [None]*n_seq
high = 5
for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64)  # generate sequences with different lengths
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)

dataset = SequenceDataset(seq, gt)
mdl = MyRNN(input_size=nFeatures, hidden_size=1, bidirectional=False).double()

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data])
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)

    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)

X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)
mynet.fit(X_sl, y=y_sl)  # this runs

nGroups = n_seq
NGroupHoldOut = 1
n_splits = np.ceil(nGroups/NGroupHoldOut).astype(np.int32)
cross_validate = StratifiedGroupKFold(n_splits=n_splits)
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=cross_validate,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

Issue 1

From my previous reply, StratifiedKFold does not work, although it is supposed to be equivalent to cv=int when training a classifier, and cv=int does work.

from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold, GroupKFold, StratifiedKFold
class ScoredClassifierNet(NeuralNetClassifier):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64)  # generate sequences with different lengths
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64)

dataset = SequenceDataset(seq, gt)
X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)

mynet = ScoredClassifierNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

# this works
gs = GridSearchCV(estimator=pipe, cv=n_splits,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

# this does not work
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=StratifiedKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

Issue 2

The above test relies on the sequence lengths being properly sorted beforehand, which may not always be the case. If I use pack_sequence(..., enforce_sorted=False), which handles the sorting problem, the error becomes completely different: when skorch calls get_len(batch[0]), it raises "ValueError: Dataset does not have consistent lengths." The reason is that with enforce_sorted=True two of the 4 fields of the PackedSequence are None, so get_len() effectively uses the length of the first field, which is the actual X needed. With enforce_sorted=False none of the fields are None, so get_len() collects the lengths of all 4 fields, which are inconsistent and cause the error.
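For reference, a small standalone check of the PackedSequence fields involved here (a minimal sketch; the example tensors are arbitrary):

import torch
from torch.nn.utils.rnn import pack_sequence

a = torch.rand(5, 2)
b = torch.rand(3, 2)

packed_sorted = pack_sequence([a, b])                           # enforce_sorted=True (default)
packed_unsorted = pack_sequence([b, a], enforce_sorted=False)   # lengths not in decreasing order

# A PackedSequence is a 4-field namedtuple: (data, batch_sizes, sorted_indices, unsorted_indices)
print(packed_sorted.sorted_indices)     # None
print(packed_unsorted.sorted_indices)   # tensor([1, 0])
print(len(packed_unsorted))             # 4 -> number of fields, not the number of samples
print(len(packed_unsorted[0]))          # 8 -> total time steps in .data across both sequences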
To reproduce this error:

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data], enforce_sorted=False)
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1+i), nFeatures), dtype=torch.float64)  # make the length order different, so enforce_sorted=False is required
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)

dataset = SequenceDataset(seq, gt)

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)

# nothing works
mynet.fit(X_sl, y=y_sl)

pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=n_splits,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl, y=y_sl,
       groups=[x for x in range(n_seq)])

I did try to play with something like
Maybe a workaround is manually sorting beforehand and unsorting later when everything is done. However, this will not work if iterator_train__shuffle=True. If skorch plans to support sequence data, one option is (1) to make get_len() check whether the input is a PackedSequence instance and act accordingly, and (2) to update SliceDataset along the lines of what I suggested in the previous reply. I don't know what skorch's plan is; if my suggestion is correct/worth considering, should I revise skorch and open a PR?
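A minimal sketch of the "sort beforehand, unsort afterwards" idea from the paragraph above (an assumption-laden illustration based on the seq/gt lists from the snippets, not tested against skorch internals):

# Sort sequences by decreasing length so pack_sequence(enforce_sorted=True) is valid,
# and keep the permutation so per-sequence results can be mapped back afterwards.
order = sorted(range(len(seq)), key=lambda i: len(seq[i]), reverse=True)
seq_sorted = [seq[i] for i in order]
gt_sorted = [gt[i] for i in order]

# ... build the dataset / run fit and predict on the sorted data ...

# Invert the permutation: results_sorted[inverse[i]] corresponds to the original seq[i].
inverse = [0] * len(order)
for new_pos, old_pos in enumerate(order):
    inverse[old_pos] = new_pos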
I think this is a valid approach for your problem. I'm not sure if this should be added to SliceDataset by default.
It's not quite clear how you expect the split to be performed in this case. Stratified splits are meant for classification tasks, ensuring that each split has roughly the same distribution of classes. Since the target here consists of sequences, there is no obvious way to split the folds in a stratified fashion.
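If a single class label per sequence is available, stratification could at least be made well defined at the sequence level. A rough self-contained sketch of that idea (sequence_labels and groups here are made-up toy values, not taken from the snippets above):

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

sequence_labels = np.array([0, 1, 0, 1, 0])   # one class label per sequence
groups = np.arange(len(sequence_labels))      # one group per sequence

cv = StratifiedGroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(np.zeros((len(sequence_labels), 1)), sequence_labels, groups=groups):
    print("train:", train_idx, "test:", test_idx)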
AFAICT, in this snippet, you're not re-creating X_sl and y_sl from the new dataset.
So far, I could replicate this error.
I added a special check to get_len:

def get_len(data):
    if isinstance(data, torch.nn.utils.rnn.PackedSequence):
        return len(data)

This allowed the grid search to run without errors, but it did result in this warning:
This is a strong indicator that the loss is not calculated correctly, as torch will most likely broadcast the tensor to 78x78; this needs some extra handling to be fixed properly.
Hmm, I can't really replicate this. When I use your code with
I have updated my code in the previous comment, since some code needed for reproduction was indeed missing, and I updated MyRNN to address the warning issue you mentioned.

Maybe add another helper dataset with a different name, and explain in a tutorial which scenarios it is needed for?

In fact, in my real data, I plan to assign each sequence one unique label: every time point from the same sequence has the same label, but different sequences may have different labels. Probably I will just need GroupKFold. In any case, I was expecting that any partition method from scikit-learn should work together with skorch; how exactly a specific partition handles my case is something I will need to check/think about. (Scikit partition reference.) The question is more about why cv=int and cv=StratifiedKFold(int) lead to different results, while the scikit-learn documentation reads like they should be the same when the estimator is a classifier (I used NeuralNetClassifier) and y is binary/multiclass (I used int). I am not sure whether the DataLoader/Dataset from skorch interferes with that check, whether NeuralNetClassifier is not recognized as a classifier, or whether something else is going on.
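One way to see which splitter an integer cv would be converted into (a sketch; check_cv is the sklearn helper that GridSearchCV relies on for this conversion, and is_classifier shows how an estimator is categorized — the y_int values are toy data):

import numpy as np
from sklearn.base import is_classifier
from sklearn.model_selection import check_cv

y_int = np.array([0, 1, 2, 0, 1])              # multiclass labels, one per sample
print(check_cv(2, y_int, classifier=True))     # StratifiedKFold(n_splits=2, ...)
print(check_cv(2, y_int, classifier=False))    # KFold(n_splits=2, ...)
# is_classifier(pipe) tells you whether the pipeline itself is treated as a classifier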
Thanks for pointing that out; I forgot to paste the part that reassigns X_sl/y_sl when writing the comment.
In fact, what I did (and what skorch should do) is:

if isinstance(data, torch.nn.utils.rnn.PackedSequence):
    return len(data[0])

data is a PackedSequence object, which always has 4 indexable fields. Only the 0th one refers to the sequence data; the others are the batch sizes and the sorted/unsorted indices. I updated n_seq to 5 instead of 4, and you should still see that it has only 4 elements. As a reminder, this only works with the MySliceDataset defined for y, as far as I can recall. I am not sure if it will work when someone only needs one prediction per sequence.
I am not sure why your check function works. Regarding that warning, I added y = y.squeeze() in the forward method of MyRNN, and now there is no more warning about the loss function.

My fault, I forgot pipe = Pipeline([('net', mynet)]) when writing the comment; this caused the old collate_fn to be used.
At this point, it's really hard for me to still follow the discussion as many threads are going on at the same time and the code snippets are not self-contained and rely on previous code. So ideally, could you summarize your remaining issues and create a self-contained script for exactly those?
In general, for
Sorry for the confusion. The remaining issue for me is that I am unsure which train-test partition strategies are supported in my case. Even if there is no error during execution, is there an easy way to trace whether the partition fits my needs? I rewrote the code below after some debugging and revisions. Now I have two sets of nets and datasets, one for regression and the other for classification, and I tested several partition strategies: cv=int, KFold, GroupKFold, StratifiedGroupKFold, and StratifiedKFold.

Now cv=int and StratifiedKFold for classification lead to the same result (I previously had a bug that made them differ). I think StratifiedKFold does not work because when sklearn performs its checks, it finds that X has 5 sequences while y has more than 5 samples (the total number of samples over all sequences), since the helper datasets behave differently. Maybe that is fair, as you mentioned that it is difficult to define the partition in this case. However, this raises a concern about how the other partitions are done. For instance, is KFold partitioned at the sequence level or the sample level, given that StratifiedKFold checks down to the total number of samples? I guess they are done at the sequence level (based on X), but I am not sure how to verify this.

import numpy as np
from imblearn.pipeline import Pipeline
from skorch import NeuralNetClassifier, NeuralNetRegressor
from skorch.scoring import loss_scoring
from torch import nn
import torch
from torch.nn.utils.rnn import pack_sequence, unpack_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold, KFold, GroupKFold, StratifiedKFold
from skorch.helper import SliceDataset

class ScoredRegressorNet(NeuralNetRegressor):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

# Define a custom dataset
class SequenceDataset(Dataset):
    # sequence and labels are lists
    def __init__(self, sequence, labels):
        self.sequence = sequence
        self.labels = labels

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, idx):
        return self.sequence[idx], self.labels[idx]

class MyRNN(nn.Module):
    def __init__(self, input_size, **rnn_kwargs):
        super(MyRNN, self).__init__()
        self.rnn = nn.RNN(input_size=input_size, **rnn_kwargs, batch_first=True)
        self.hidden_state = None

    def forward(self, x):
        y, hs = self.rnn(x)
        # w/o unpack and concat, skorch cannot compute the loss during the fit_loop
        y = unpack_sequence(y)
        y = torch.concat(y)
        y = y.squeeze()  # added to remove the warning when calling the loss function
        return y
n_seq = 5
nFeatures = 2
seq = [None]*n_seq
gt_reg = [None]*n_seq
gt_clf = [None]*n_seq
high = 5
for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64)  # generate sequences with different lengths
    gt_reg[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)
    gt_clf[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64)

mdl = MyRNN(input_size=nFeatures, hidden_size=1, bidirectional=False).double()

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data])
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)

    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)
class ScoredClassifierNet(NeuralNetClassifier):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

dataset_reg = SequenceDataset(seq, gt_reg)
dataset_clf = SequenceDataset(seq, gt_clf)
X_sl_reg = SliceDataset(dataset_reg, idx=0)
y_sl_reg = MySliceDataset(dataset_reg, idx=1)
X_sl_clf = SliceDataset(dataset_clf, idx=0)
y_sl_clf = MySliceDataset(dataset_clf, idx=1)

mynet_reg = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

mynet_clf = ScoredClassifierNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size=2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

nGroups = n_seq
NGroupHoldOut = 1
n_splits = np.ceil(nGroups/NGroupHoldOut).astype(np.int32)
pipe_reg = Pipeline([('net', mynet_reg)])
pipe_clf = Pipeline([('net', mynet_clf)])
''' Regression '''

''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=n_splits,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=KFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=GroupKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=StratifiedGroupKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' does not work '''
gs = GridSearchCV(estimator=pipe_reg, cv=StratifiedKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])
# Classification

''' does not work '''
gs = GridSearchCV(estimator=pipe_clf, cv=n_splits,
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_clf, y=y_sl_clf,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=KFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_clf, y=y_sl_clf,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=GroupKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_clf, y=y_sl_clf,
       groups=[x for x in range(n_seq)])

''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=StratifiedGroupKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_clf, y=y_sl_clf,
       groups=[x for x in range(n_seq)])

''' does not work '''
gs = GridSearchCV(estimator=pipe_clf, cv=StratifiedKFold(n_splits=n_splits),
                  param_grid={},
                  refit=False, n_jobs=1
                  )
gs.fit(X_sl_clf, y=y_sl_clf,
       groups=[x for x in range(n_seq)])

Regarding the solved issues, let me know if there is any need for me to push code, or maybe just leave this here for others with similar needs to read in the future. Thanks for the help.
Thanks for summarizing the current state.
This should be expected, right? Stratification requires that the targets be classes.
This is somewhat good news, since that is the behavior we expected.
Yes, it can be quite difficult to figure out what exactly goes on under the hood in sklearn. Here is a suggestion for how you could check this:

from sklearn.model_selection import KFold

class MyKFold(KFold):
    def split(self, X, y=None, groups=None):
        # same logic as KFold but with debugger enabled
        splits = super().split(X, y=y, groups=groups)
        for train_idx, test_idx in splits:
            import pdb; pdb.set_trace()
            yield train_idx, test_idx

...
grid_search = GridSearchCV(..., cv=MyKFold(5))

Here we define a custom split function for the cv and place a debugger inside so that we can inspect the splits. Then we pass this object to the grid search.
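Alternatively, a minimal non-interactive check (a sketch based on the variables from the summary script above; it just prints which indices end up in each fold):

cv = GroupKFold(n_splits=n_splits)
for fold, (train_idx, test_idx) in enumerate(cv.split(X_sl_reg, groups=range(n_seq))):
    # If the indices only range over 0..n_seq-1, the split happens at the sequence level,
    # not at the level of individual time steps.
    print(f"fold {fold}: train={train_idx}, test={test_idx}")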
For now, let's just leave it as is for future reference. As mentioned previously, we can think of adding explicit support for packed sequences.
Dear devs or whoever can help,

I have sequences of different lengths and am therefore thinking of using packed sequences to feed them to an RNN (or something similar). What I have tried/figured out is that I need to use a Dataset to pack the sequences and unpack them inside the RNN model to make fit() work, but when it is time to apply GridSearchCV (partitioning by grouping some sequences), there seems to be no way to make it work. The code below is what I have for now.
importing modules
Defining some classes
Generate random data
Define NeuralNet
Fitting data
If I feed y as None, there is an error from StratifiedGroupKFold:
"ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead." because the y being fed to StratifiedGroupKFold is None.
If I feed y as gt, StratifiedGroupKFold also fails because gt is a list, and if I concatenate gt it does not work either, because the lengths of X and y are not consistent.
To do another test with the default cv, I tried
It returns
"ValueError: No y-values are given (y=None). You must implement your own DataLoader for training (and your validation) and supply it using the iterator_train and iterator_valid parameters respectively."
At this point, I am not sure what the workaround is, as I do not fully understand skorch. Does it work with packed sequences in any other form for what I need?
Thank you in advance!