TPU VM Training Error (EOFError: marshal data too short) #1037

Open
nadhem-zmandar opened this issue Jul 13, 2022 · 1 comment

nadhem-zmandar commented Jul 13, 2022

Describe the bug
I am trying to fine-tune the mT5 model on a custom dataset on a TPU on GCP. I followed the process described in this repository carefully, but the script fails with the following error while importing TensorFlow.

2022-07-13 22:29:42.556669: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-13 22:29:42.556729: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/home/.local/bin/t5_mesh_transformer", line 5, in <module>
    from t5.models.mesh_transformer_main import console_entry_point
  File "/home/.local/lib/python3.9/site-packages/t5/__init__.py", line 17, in <module>
    import t5.data
  File "/home/.local/lib/python3.9/site-packages/t5/data/__init__.py", line 17, in <module>
    from t5.data.dataset_providers import *
  File "/home/.local/lib/python3.9/site-packages/t5/data/dataset_providers.py", line 28, in <module>
    import seqio
  File "/home/.local/lib/python3.9/site-packages/seqio/__init__.py", line 18, in <module>
    from seqio.dataset_providers import *
  File "/home/.local/lib/python3.9/site-packages/seqio/dataset_providers.py", line 34, in <module>
    from seqio import utils
  File "/home/.local/lib/python3.9/site-packages/seqio/utils.py", line 25, in <module>
    import tensorflow.compat.v2 as tf
  File "/home/.local/lib/python3.9/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/__init__.py", line 95, in <module>
    from tensorflow.python.data.experimental import service
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/service/__init__.py", line 387, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 26, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 31, in <module>
    from tensorflow.python.data.ops import iterator_ops
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 36, in <module>
    from tensorflow.python.training.saver import BaseSaverBuilder
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 51, in <module>
    from tensorflow.python.training.saving import saveable_object_util
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 20, in <module>
    from tensorflow.python.eager import def_function
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 75, in <module>
    from tensorflow.python.eager import function as function_lib
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 35, in <module>
    from tensorflow.python.eager import backprop
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 786, in exec_module
  File "<frozen importlib._bootstrap_external>", line 918, in get_code
  File "<frozen importlib._bootstrap_external>", line 587, in _compile_bytecode
EOFError: marshal data too short
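
Note on the traceback: `EOFError: marshal data too short` is raised when Python unmarshals cached bytecode that ends prematurely. Here it happens in `_compile_bytecode` while importing `tensorflow.python.eager.backprop`, which typically points at a truncated `.pyc` file (for example after an interrupted install or a full disk) rather than at the t5 flags. The following is a minimal sketch for locating such files, not something taken from this repository. It assumes Python 3.7+ (16-byte `.pyc` header) and must be run with the same interpreter that produced the cache; `find_broken_pyc` is a hypothetical helper name.

```python
import marshal
import sys
from pathlib import Path

# Size of the .pyc header on Python 3.7+ (magic, flags, then mtime+size or hash).
PYC_HEADER_SIZE = 16


def find_broken_pyc(root):
    """Return the .pyc files under root whose bytecode payload cannot be unmarshalled."""
    broken = []
    for pyc in sorted(Path(root).rglob("*.pyc")):
        data = pyc.read_bytes()
        try:
            # Skip the header and try to decode the code object that follows it.
            marshal.loads(data[PYC_HEADER_SIZE:])
        except (EOFError, ValueError, TypeError):
            broken.append(pyc)
    return broken


if __name__ == "__main__":
    # Each command-line argument is a directory to scan, e.g. a site-packages path.
    for root in sys.argv[1:]:
        for path in find_broken_pyc(root):
            print(path)
```

Deleting the files it reports (or the enclosing `__pycache__` directories) lets Python regenerate them on the next import; reinstalling the affected package is the other common remedy.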

To Reproduce
Steps to reproduce the behavior:

  1. Create a VM
  2. Create a TPU
  3. Create a bucket and upload the .txt corpus on which the model will be trained
  4. Install t5[gcp]: pip install t5[gcp]
  5. Set the environment variables referenced by the command in step 6
  6. Run the fine-tuning script:
t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="dataset.gin" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \
  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
  --gin_param="run.train_steps = 1010000" \
  --gin_file="learning_rate_schedules/constant_0_001.gin" \
  --gin_param="tokens_per_batch=512" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
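
The command above expands six shell variables (`TPU_NAME`, `PROJECT`, `ZONE`, `MODEL_DIR`, `DATA_DIR`, `TPU_SIZE`). As a rough preflight sketch, assuming those are the only ones needed, something like the following can confirm they are all set before launching; `missing_env_vars` is a hypothetical helper, not part of t5.

```python
import os

# Variable names taken from the t5_mesh_transformer command in this report.
REQUIRED_VARS = ("TPU_NAME", "PROJECT", "ZONE", "MODEL_DIR", "DATA_DIR", "TPU_SIZE")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Unset variables: " + ", ".join(missing))
```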

Expected behaviour
Training on the TPU should start.

Any help would be appreciated.

Thank you

@anas-zafar

Hi @nadhem-zmandar , were you able to resolve this?
