Unable to use packed sequence for GridSearchCV #1083

Open
nafraw opened this issue Dec 28, 2024 · 8 comments

Comments

@nafraw

nafraw commented Dec 28, 2024

Dear Dev or whoever can help,

I have sequences of different lengths and am therefore thinking of using packed sequences to feed them to an RNN (or something similar). What I tried/figured out is that I need a Dataset to pack the sequences and to unpack them inside the RNN model so that fit() works, but when it comes to applying GridSearchCV (partitioning by grouping some sequences), there seems to be no way to work around it. The code below is what I have so far.

Importing modules

import numpy as np
from imblearn.pipeline import Pipeline

from skorch import NeuralNetClassifier, NeuralNetRegressor, NeuralNet
from skorch.scoring import loss_scoring
from skorch.dataset import ValidSplit
from torch import nn
import torch
from torch.nn.utils.rnn import pack_sequence, unpack_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold

Defining some classes

class ScoredRegressorNet(NeuralNetRegressor):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

# Define a custom dataset
class SequenceDataset(Dataset):
    # sequence and labels are lists
    def __init__(self, sequence, labels):
        self.sequence = sequence
        self.labels = labels

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, idx):
        return self.sequence[idx], self.labels[idx]

class MyRNN(nn.Module):
    def __init__(self, input_size, **rnn_kwargs):
        super(MyRNN, self).__init__()
        self.rnn = nn.RNN(input_size = input_size, **rnn_kwargs, batch_first=True)
        self.hidden_state = None

    def forward(self, x):
        y, hs = self.rnn(x)
        # w/o unpack and concat, skorch cannot compute loss during the fit_loop
        y = unpack_sequence(y)
        y = torch.concat(y)
        return y

Generate random data

n_seq = 4
nFeatures = 2
seq = [None]*n_seq
gt = [None]*n_seq
high = 5
for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64) # generate sequence with different lengths
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)
dataset = SequenceDataset(seq, gt)

Define NeuralNet

mdl = MyRNN(input_size=nFeatures, hidden_size=1, bidirectional=False).double()

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data])
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=ValidSplit(2),
    iterator_train__shuffle=False,
    iterator_train__batch_size = 2,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

Fitting data

mynet.fit(dataset, y=None) # this runs


nGroups = n_seq
NGroupHoldOut = 1
n_splits=np.ceil(nGroups/NGroupHoldOut).astype(np.int32)
cross_validate = StratifiedGroupKFold(n_splits=n_splits)
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=cross_validate, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
# Neither of the calls below works
gs.fit(dataset, y=None, 
       groups=[x for x in range(n_seq)])
gs.fit(dataset, y=gt, 
       groups=[x for x in range(n_seq)])


If I feed y as None, there is an error from StratifiedGroupKFold,
"ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.", because the y passed to StratifiedGroupKFold is None.

If I feed y as gt, StratifiedGroupKFold also fails because gt is a list, and if I concatenate gt, it does not work either because the lengths of X and y are not consistent.

To do another test with default cv, I tried

gs = GridSearchCV(estimator=pipe, cv=2, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(dataset, y=None, 
       groups=[x for x in range(n_seq)])

It returns
"ValueError: No y-values are given (y=None). You must implement your own DataLoader for training (and your validation) and supply it using the iterator_train and iterator_valid parameters respectively."

At this point, I am not sure what the workaround is, as I do not fully understand skorch. Does skorch work with packed sequences in any other form for what I need?

Thank you in advance!

@BenjaminBossan
Collaborator

Thanks for the detailed report and the reproducer. I could get the code to work with a few small changes. Here is the important part:

from skorch.helper import SliceDataset

X_sl = SliceDataset(dataset, idx=0)
y_sl = SliceDataset(dataset, idx=1)

cross_validate = 4
mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    #train_split=ValidSplit(2),
    train_split=False,
    iterator_train__shuffle=False,
    iterator_train__batch_size = 2,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(
    estimator=pipe,
    cv=cross_validate, 
    param_grid={},
    refit=False,
    n_jobs=1,
)

Notable changes:

  • Use SliceDataset, a helper class from skorch, to make GridSearchCV play nicely with datasets (a short fit sketch follows this list).
  • Avoid StratifiedGroupKFold: I couldn't get this to work, not sure how important it is for you.
  • Set train_split=False: as GridSearchCV is already splitting the data into train and test, there is no need for the skorch-internal split (except for early stopping).
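
The snippet above does not show the actual fit call; presumably the search is then fit on the slice datasets. A minimal sketch, assuming the objects defined above:

# Hedged sketch: fit the grid search on the SliceDataset wrappers.
# With param_grid={} there is a single candidate; its CV score is still reported.
gs.fit(X_sl, y_sl)
print(gs.cv_results_["mean_test_score"])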

@nafraw
Author

nafraw commented Dec 28, 2024

Thank you for the quick reply. I can get it to work with GroupKFold and the default int. StratifiedGroupKFold may be important/useful in some cases; I guess I can skip it for now but will probably need it in the future. In fact, I found two workarounds, but I am not sure if they are valid:

  • In sklearn\model_selection_split.py:975, if I add y = np.concatenate(y) before y = np.asarray(y), the code runs without error. Obviously, this is not a smart hack.
  • Another workaround is to assign y_sl from another SliceDataset subclass that performs np.concatenate at the end of __array__(), since this is what happens when calling np.asarray():
class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)
    
    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)
y_sl = MySliceDataset(dataset, idx=1)
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])

If my workaround is correct, maybe SliceDataset could add a check for whether the elements of X have consistent lengths: if yes, keep the previous behavior; if not, use np.concatenate (a rough sketch follows below). This way, skorch may support more types of data formats.
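
A rough sketch of what I mean, as a hypothetical SliceDataset variant (the class name and the check are made up, not part of skorch):

import numpy as np
from skorch.helper import SliceDataset

class AutoConcatSliceDataset(SliceDataset):
    # Hypothetical: fall back to concatenation when elements have unequal lengths.
    def __array__(self, dtype=None):
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X, dtype=dtype)
        lengths = {len(x) for x in X}
        if len(lengths) == 1:
            # all elements have the same length: keep the original behavior
            return np.asarray(X, dtype=dtype)
        # variable-length elements (e.g. sequences): flatten into one array
        return np.asarray(np.concatenate(X), dtype=dtype)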

However, I did discover another issue: if I switch to StratifiedKFold, it returns another error, "ValueError: Found input variables with inconsistent numbers of samples: [4, 70, 4]". When I looked at the GridSearchCV API, it reads: "For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used."

Therefore, I went back to where I generated the sequences, assigned gt as int, and tried both cv=2 and cv=StratifiedKFold(n_splits=n_splits) when calling GridSearchCV. Surprisingly, cv=2 works but StratifiedKFold does not. Aren't they equivalent, or does skorch make it fall back to KFold? I tried defining ScoredClassifierNet(NeuralNetClassifier) as well and got the same outcome. If it falls back to KFold, is there a way to support StratifiedKFold?

from sklearn.model_selection import StratifiedKFold  # not imported above

gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64) # where gt was generated in the previous for-loop
# this works
gs = GridSearchCV(estimator=pipe, cv=2, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])
# this does not work
gs = GridSearchCV(estimator=pipe, cv=StratifiedKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])

@nafraw
Author

nafraw commented Dec 30, 2024

In fact, I identified another problem when using enforce_sorted=False. Let me first summarize what works (not 100% sure if it is valid):

import numpy as np
from imblearn.pipeline import Pipeline

from skorch import NeuralNetClassifier, NeuralNetRegressor, NeuralNet
from skorch.scoring import loss_scoring
from torch import nn
import torch
from torch.nn.utils.rnn import pack_sequence, unpack_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from skorch.helper import SliceDataset

class ScoredRegressorNet(NeuralNetRegressor):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

# Define a custom dataset
class SequenceDataset(Dataset):
    # sequence and labels are lists
    def __init__(self, sequence, labels):
        self.sequence = sequence
        self.labels = labels

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, idx):
        return self.sequence[idx], self.labels[idx]

class MyRNN(nn.Module):
    def __init__(self, input_size, **rnn_kwargs):
        super(MyRNN, self).__init__()
        self.rnn = nn.RNN(input_size = input_size, **rnn_kwargs, batch_first=True)
        self.hidden_state = None

    def forward(self, x):
        y, hs = self.rnn(x)
        # w/o unpack and concat, skorch cannot compute loss during the fit_loop
        y = unpack_sequence(y)
        y = torch.concat(y)
        y = y.squeeze() # added to remove warning when calling loss_function
        return y

n_seq = 5
nFeatures = 2
seq = [None]*n_seq
gt = [None]*n_seq
high = 5
for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64) # generate sequence with different lengths
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)
dataset = SequenceDataset(seq, gt)


mdl = MyRNN(input_size=nFeatures, hidden_size=1, bidirectional=False).double()

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data])
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)
    
    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)

X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)

mynet.fit(X_sl, y=y_sl) # this runs


nGroups = n_seq
NGroupHoldOut = 1
n_splits=np.ceil(nGroups/NGroupHoldOut).astype(np.int32)
cross_validate = StratifiedGroupKFold(n_splits=n_splits)
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=cross_validate, 
                  param_grid={},
                  refit=False, n_jobs=1
               )

gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])

Issue 1

From my previous reply: StratifiedKFold does not work, although it is supposed to be equivalent to cv=int when training a classifier, and cv=int actually works.

from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold, GroupKFold, StratifiedKFold
class ScoredClassifierNet(NeuralNetClassifier):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64) # generate sequence with different lengths
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64)
dataset = SequenceDataset(seq, gt)
X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)

mynet = ScoredClassifierNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

# this works
gs = GridSearchCV(estimator=pipe, cv=n_splits, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])
# this does not work
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=StratifiedKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])

Issue 2

The above test relies on the sequence lengths being properly sorted beforehand, which may not always be true. Using pack_sequence(..., enforce_sorted=False) deals with the sorting problem, but the error then becomes completely different: when skorch calls get_len(batch[0]), it raises "ValueError: Dataset does not have consistent lengths."

The reason is that with enforce_sorted=False, the PackedSequence object carries 4 non-None fields, whereas with enforce_sorted=True, two of the 4 fields are None, which forces get_len() to use the length of the first field only, the actual X needed. When none of the fields are None, get_len() gets the lengths of all 4 fields, which are inconsistent and cause the error.
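
For illustration, a small standalone snippet (independent of the reproducer) showing the field layout of a PackedSequence in the two cases:

import torch
from torch.nn.utils.rnn import pack_sequence

a, b = torch.rand(4, 2), torch.rand(3, 2)

packed_sorted = pack_sequence([a, b])  # enforce_sorted=True is the default
print(packed_sorted.sorted_indices, packed_sorted.unsorted_indices)  # None None

packed_unsorted = pack_sequence([b, a], enforce_sorted=False)
print(packed_unsorted.sorted_indices, packed_unsorted.unsorted_indices)  # tensor([1, 0]) tensor([1, 0])

# PackedSequence is a namedtuple; index 0 is the flattened data tensor,
# whose length is the total number of time steps across all sequences.
print(len(packed_unsorted[0]))  # 7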

To reproduce this error,

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data], enforce_sorted=False)
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1+i), nFeatures), dtype=torch.float64) # make length order different, so one must use enforce_sorted=False
    gt[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)
dataset = SequenceDataset(seq, gt)

mynet = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

X_sl = SliceDataset(dataset, idx=0)
y_sl = MySliceDataset(dataset, idx=1)

# nothing works
mynet.fit(X_sl, y=y_sl) 
pipe = Pipeline([('net', mynet)])
gs = GridSearchCV(estimator=pipe, cv=n_splits, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl, y=y_sl, 
       groups=[x for x in range(n_seq)])

I did try to play with a few things:

  • Calling pack_sequence inside the RNN model while feeding a list instead; this also raises an error from get_len(), for the same reason: inconsistent lengths of the elements in the list.
  • Adding another parent list with None appended; this results in another error when converting to tensor in skorch\net.py:1517.

Maybe a workaround is manually sorting beforehand and unsorting when everything is done later (a rough sketch below). However, this will not work if iterator_train__shuffle=True.
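
A rough sketch of the pre-sorting idea, assuming iterator_train__shuffle=False so that each batch stays in decreasing-length order; mapping predictions back to the original order is not shown:

# Hypothetical: sort the whole dataset by decreasing sequence length up front,
# so that pack_sequence can keep its default enforce_sorted=True.
order = sorted(range(len(seq)), key=lambda i: len(seq[i]), reverse=True)
seq_sorted = [seq[i] for i in order]
gt_sorted = [gt[i] for i in order]
dataset = SequenceDataset(seq_sorted, gt_sorted)
# Keep `order` around to un-sort the predictions afterwards.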

If skorch plans to support sequence data, one option is that (1) the get_len() function checks whether the object is an instance of PackedSequence and acts differently, and (2) SliceDataset is updated based on what I suggested in the previous reply.

I don't know what skorch's plan is. If my suggestion is correct/considered, do I need to revise skorch and push the changes?

@BenjaminBossan
Collaborator

  • Another workaround is to assign y_sl from another SliceDataset subclass that performs np.concatenate at the end of __array__(), since this is what happens when calling np.asarray()

I think this is a valid approach for your problem. I'm not sure if this should be added to SliceDataset by default, as I don't know if it would be the correct approach in all situations where users encounter this.

I do discover another issue, if I switch to StratifiedKFold

It's not quite clear how you expect the split to be performed in this case. Stratified splits are for classification tasks, ensuring that each split has roughly the same distribution of classes. Since the target consists of sequences in this case, there is no clear way how to split the folds in a stratified way.

Therefore, I went back to where I generated sequence by assigning gt as int

AFAICT, in this snippet, you're not using gt.

The above test relies on the sequence lengths being properly sorted beforehand, which may not always be true. Using pack_sequence(..., enforce_sorted=False) deals with the sorting problem, but the error then becomes completely different: when skorch calls get_len(batch[0]), it raises "ValueError: Dataset does not have consistent lengths."

The reason is that with enforce_sorted=False, the PackedSequence object carries 4 non-None fields

So far, I could replicate this.

If skorch plans to support sequences data, an option is that (1) the get_len() function should check whether the type is an object/instance of PackedSequence and act differently

I added a special check for PackedSequence:

def get_len(data):
    if isinstance(data, torch.nn.utils.rnn.PackedSequence):
        return len(data)

This allowed the grid search to run without errors, but it did result in this warning:

UserWarning: Using a target size (torch.Size([78])) that is different to the input size (torch.Size([78, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.

This is a strong indicator that the loss is not calculated correctly, as torch will most likely broadcast the tensor to 78x78; this requires some extra handling to fix.
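
A tiny standalone illustration of why the shape mismatch matters:

import torch

pred = torch.rand(78, 1)   # model output with a trailing singleton dimension
target = torch.rand(78)    # flat target
# Elementwise ops broadcast (78, 1) against (78,) to (78, 78),
# so the loss would be computed over the wrong pairs.
print((pred - target).shape)              # torch.Size([78, 78])
print((pred.squeeze(-1) - target).shape)  # torch.Size([78])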

but when enforce_sorted=True, two of the 4 fields become None, forcing get_len() to use the length of the first field, the actual X needed. When none of the fields are None, get_len() gets the lengths of all 4 fields, causing an error.

Hmm, I can't really replicate this. When I use your code with enforce_sorted=True, I get a PyTorch error when calling pack_sequence([d[0] for d in data], enforce_sorted=True):

RuntimeError: lengths array must be sorted in decreasing order when enforce_sorted is True. You can pass enforce_sorted=False to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.

@nafraw
Author

nafraw commented Jan 10, 2025

I have updated my code in the previous comment, as some code needed for reproduction was indeed missing, and I updated MyRNN to address the warning issue you mentioned.

  • Another workaround is to assign y_sl from another SliceDataset subclass that performs np.concatenate at the end of __array__(), since this is what happens when calling np.asarray()

I think this is a valid approach for your problem. I'm not sure if this should be added to SliceDataset by default, as I don't know if it would be the correct approach in all situations where users encounter this.

Maybe add another helper dataset with a different name, and explain in a tutorial what scenarios it would be needed for?

I do discover another issue, if I switch to StratifiedKFold

It's not quite clear how you expect the split to be performed in this case. Stratified splits are for classification tasks, ensuring that each split has roughly the same distribution of classes. Since the target consists of sequences in this case, there is no clear way how to split the folds in a stratified way.

In fact, in my real data, I plan to assign each sequence one unique label; every time point from the same sequence has the same label, but a different sequence may have a different label. Probably I will just need GroupKFold. In any case, I was expecting that any partitioning method from scikit-learn, together with skorch, would be able to handle it. How exactly a specific partitioning handles my case, I will need to check/think about. Scikit partition reference

The question is more about why cv=int and cv=StratifiedKFold(int) lead to different results, while the scikit-learn documentation reads as if they should be the same when the estimator is a classifier (I used NeuralNetClassifier) and y is binary/multiclass (I used int). I am not sure whether the DataLoader/Dataset from skorch corrupts the check, whether NeuralNetClassifier is not considered a classifier, or something else.

AFAICT, in this snippet, you're not using gt.

Thanks for pointing that out; I forgot to paste the part that reassigns X_sl/y_sl when writing the comment.

I added a special check for PackedSequence:

def get_len(data):
    if isinstance(data, torch.nn.utils.rnn.PackedSequence):
        return len(data)

In fact, what I did (and what skorch should do) is

    if isinstance(data, torch.nn.utils.rnn.PackedSequence):
        return len(data[0])

data is a PackedSequence object, which always has 4 indexable elements. Only the 0th one refers to the sequence data; the others are the batch sizes and the sorted/unsorted indices. I updated n_seq = 5 instead of 4, and you should still see that the object has only 4 elements.

As a reminder, this only works together with the MySliceDataset defined for y, as far as I can recall. I am not sure if it will work when someone only needs one prediction per sequence.

This allowed the grid search to run without errors, but it did result in this warning:

UserWarning: Using a target size (torch.Size([78])) that is different to the input size (torch.Size([78, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.

This is a strong indicator that the loss is not calculated correctly, as torch will most likely broadcast the tensor to 78x78; this requires some extra handling to fix.

I am not sure why your check function works. As for the warning, I added y = y.squeeze() in the forward of MyRNN, and now there is no more warning regarding the loss function.

Hmm, I can't really replicate this. When I use your code with enforce_sorted=True, I get a PyTorch error when calling pack_sequence([d[0] for d in data], enforce_sorted=True):

My fault. I forgot pipe = Pipeline([('net', mynet)]) when writing the comment, which caused the old collate_fn to be used.

@BenjaminBossan
Collaborator

At this point, it's really hard for me to follow the discussion, as many threads are going on at the same time and the code snippets are not self-contained but rely on previous code. So ideally, could you summarize your remaining issues and create a self-contained script for exactly those?

The question is more about why cv=int and cv=StratifiedKFold(int) lead to different results, while the scikit-learn documentation reads as if they should be the same when the estimator is a classifier (I used NeuralNetClassifier) and y is binary/multiclass (I used int).

In general, for StratifiedKFold to be used, sklearn needs to check y. We're relying on duck typing here to convince sklearn to accept a dataset via SliceDataset, as sklearn does not support this class natively. This is not a cleanly defined interface; instead, we have to proceed by trial and error, and things can also change with newer versions. If at any point a check on y goes awry, we can quickly get into a situation where stratification won't work or we encounter an error.
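
For reference, the decision sklearn makes for an integer cv can be inspected with check_cv. A small self-contained illustration with a plain integer target (not the SliceDataset case):

import numpy as np
from sklearn.model_selection import check_cv
from sklearn.utils.multiclass import type_of_target

y = np.array([0, 1, 0, 1, 2, 2])
print(type_of_target(y))                 # 'multiclass'
# With a classifier and a binary/multiclass y, an integer cv resolves to StratifiedKFold:
print(check_cv(2, y, classifier=True))   # a StratifiedKFold instance
# Otherwise it falls back to plain KFold:
print(check_cv(2, y, classifier=False))  # a KFold instance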

@nafraw
Author

nafraw commented Jan 11, 2025

Sorry for the confusion. The remaining issue for me is that I am unsure whether those train/test partition strategies are supported in my case. Even if there is no error during execution, is there an easy way to trace whether the partitioning fits my needs?

I rewrote the code below after some debugging and revisions. Now I have two sets of nets and datasets, one for regression and the other for classification. I tested several partition strategies: cv=int, KFold, GroupKFold, StratifiedGroupKFold, and StratifiedKFold.

  • StratifiedKFold is the only one that does not work for regression.
  • cv=int and StratifiedKFold do not work for classification.

Now cv=int and StratifiedKFold for classification lead to the same result (I had a bug leading to different ones).

I think StratifiedKFold does not work because, when sklearn performs some checks, it finds that X has 5 sequences while y has more than 5 entries (the total number of samples across all sequences), because the helper datasets are different (a quick check of this is sketched after the script below). Maybe that is fair, as you mentioned it is difficult to define the partitioning in this case. However, it makes me wonder how the other partitions are done. For instance, is KFold partitioned at the sequence level or the sample level, given that StratifiedKFold checks down to the total number of samples? I guess they are done at the sequence level (based on X), but I am not sure how to verify or check this.

import numpy as np
from imblearn.pipeline import Pipeline

from skorch import NeuralNetClassifier, NeuralNetRegressor
from skorch.scoring import loss_scoring
from torch import nn
import torch
from torch.nn.utils.rnn import pack_sequence, unpack_sequence
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold, KFold, GroupKFold, StratifiedKFold
from skorch.helper import SliceDataset

class ScoredRegressorNet(NeuralNetRegressor):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)

# Define a custom dataset
class SequenceDataset(Dataset):
    # sequence and labels are lists
    def __init__(self, sequence, labels):
        self.sequence = sequence
        self.labels = labels

    def __len__(self):
        return len(self.sequence)

    def __getitem__(self, idx):
        return self.sequence[idx], self.labels[idx]

class MyRNN(nn.Module):
    def __init__(self, input_size, **rnn_kwargs):
        super(MyRNN, self).__init__()
        self.rnn = nn.RNN(input_size = input_size, **rnn_kwargs, batch_first=True)
        self.hidden_state = None

    def forward(self, x):
        y, hs = self.rnn(x)
        # w/o unpack and concat, skorch cannot compute loss during the fit_loop
        y = unpack_sequence(y)
        y = torch.concat(y)
        y = y.squeeze() # added to remove warning when calling loss_function
        return y

n_seq = 5
nFeatures = 2
seq = [None]*n_seq
gt_reg = [None]*n_seq
gt_clf = [None]*n_seq
high = 5
for i in range(n_seq):
    seq[i] = torch.rand((3*(n_seq+1-i), nFeatures), dtype=torch.float64) # generate sequence with different lengths
    gt_reg[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.float64)
    gt_clf[i] = torch.randint_like(seq[i][:,0], high, dtype=torch.int64)

mdl = MyRNN(input_size=nFeatures, hidden_size=1, bidirectional=False).double()

def my_collate_fn(data):
    # nSeq = len(data)
    seq = pack_sequence([d[0] for d in data])
    lab = [d[1] for d in data]
    lab = torch.concat(lab)
    return (seq, lab)

class MySliceDataset(SliceDataset):
    def __init__(self, dataset, idx=0, indices=None):
        super().__init__(dataset, idx, indices)
    
    def __array__(self, dtype=None):
        # This method is invoked when calling np.asarray(X)
        # https://numpy.org/devdocs/user/basics.dispatch.html
        X = [self[i] for i in range(len(self))]
        if np.isscalar(X[0]):
            return np.asarray(X)
        return np.asarray(np.concatenate(X), dtype=dtype)

class ScoredClassifierNet(NeuralNetClassifier):
    def score(self, X, y=None):
        return -loss_scoring(self, X, y)


dataset_reg = SequenceDataset(seq, gt_reg)
dataset_clf = SequenceDataset(seq, gt_clf)
X_sl_reg = SliceDataset(dataset_reg, idx=0)
y_sl_reg = MySliceDataset(dataset_reg, idx=1)
X_sl_clf = SliceDataset(dataset_clf, idx=0)
y_sl_clf = MySliceDataset(dataset_clf, idx=1)


mynet_reg = ScoredRegressorNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

mynet_clf = ScoredClassifierNet(
    mdl,
    max_epochs=5,
    lr=1e-3,
    batch_size = 2000,
    device="cpu",
    train_split=None,
    iterator_train__shuffle=False,
    iterator_train__collate_fn=my_collate_fn,
    iterator_valid__collate_fn=my_collate_fn,
    criterion=nn.SmoothL1Loss,
)

nGroups = n_seq
NGroupHoldOut = 1
n_splits=np.ceil(nGroups/NGroupHoldOut).astype(np.int32)
pipe_reg = Pipeline([('net', mynet_reg)])
pipe_clf = Pipeline([('net', mynet_clf)])

''' Regression '''
''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=n_splits, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=KFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=GroupKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_reg, cv=StratifiedGroupKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' does not work '''
gs = GridSearchCV(estimator=pipe_reg, cv=StratifiedKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_reg, y=y_sl_reg,
       groups=[x for x in range(n_seq)])

''' Classification '''
''' does not work '''
gs = GridSearchCV(estimator=pipe_clf, cv=n_splits, 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_clf, y=y_sl_clf, 
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=KFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_clf, y=y_sl_clf, 
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=GroupKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_clf, y=y_sl_clf, 
       groups=[x for x in range(n_seq)])
''' work '''
gs = GridSearchCV(estimator=pipe_clf, cv=StratifiedGroupKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_clf, y=y_sl_clf, 
       groups=[x for x in range(n_seq)])
''' does not work'''
gs = GridSearchCV(estimator=pipe_clf, cv=StratifiedKFold(n_splits=n_splits), 
                  param_grid={},
                  refit=False, n_jobs=1
               )
gs.fit(X_sl_clf, y=y_sl_clf, 
       groups=[x for x in range(n_seq)])
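
As mentioned above the script, a quick (hypothetical) check of what sklearn sees for X versus y, which matches the "inconsistent numbers of samples" error from StratifiedKFold:

# X exposes one entry per sequence, while the concatenating y wrapper
# exposes one entry per time step.
print(len(X_sl_clf))              # 5 (number of sequences)
print(len(np.asarray(y_sl_clf)))  # total number of time steps across all sequences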

For the solutions to the solved issues, let me know if there is a need for me to push any code, or maybe just leave this here for others with a similar need to read in the future.

Thanks for the help.

@BenjaminBossan
Collaborator

Thanks for summarizing the current state.

  • StratifiedKFold is the only one does not work for regression

This should be expected, right? Stratification requires that the targets be classes.

  • cv=int, and StratifiedKFold do not work for classification

Now cv=int and StratifiedKFold for classification lead to the same result (I had a bug leading to different ones).

This is somewhat good news, since we expected that cv=int should result in using StratifiedKFold for classification tasks.

However, it makes me wonder how the other partitions are done. For instance, is KFold partitioned at the sequence level or the sample level, given that StratifiedKFold checks down to the total number of samples? I guess they are done at the sequence level (based on X), but I am not sure how to verify or check this.

Yes, it can be quite difficult to figure out what exactly goes on under the hood in sklearn. Here is a suggestion for how you could check this:

from sklearn.model_selection import KFold

class MyKFold(KFold):
    def split(self, X, y=None, groups=None):
        # same logic as KFold but with debugger enabled
        splits = super().split(X, y=y, groups=groups)
        for train_idx, test_idx in splits:
            import pdb;pdb.set_trace()
            yield train_idx, test_idx

...

grid_search = GridSearchCV(..., cv=MyKFold(5))

Here we define a custom split function for the cv and place a debugger to be able to inspect the splits. Then we pass this object to the cv argument. This way, we can inspect how KFold would split the data. The same can of course be done for StratifiedKFold, GroupKFold, etc. Perhaps you can use this approach to verify that the splits correspond to your expectations.
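
If stepping through a debugger is inconvenient, a variant that simply prints the folds (hypothetical, reusing X_sl_reg and n_splits from your script) works just as well:

from sklearn.model_selection import KFold

# Print which sequence indices land in each fold; with a SliceDataset of
# length 5, the indices run over sequences, not individual time steps.
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=n_splits).split(X_sl_reg)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")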

For the solutions to the solved issues, let me know if there is a need for me to push any code, or maybe just leave this here for others with a similar need to read in the future.

For now, let's just leave it as is for future reference. As mentioned previously, we can think of adding explicit support for PackedSequence to get_len but I think that's about it.
