Test new tf on axlearn for EFA #2116
/ok to test 7665ae4
/ok to test e6dab45
/ok to test bac0bf6
olupton left a comment
LGTM if the test job passes
…Toolbox into sbosisio/test-axlearn-new-tf
/ok to test 9523b6d
Remove for MaxText on EKS as well
/ok to test c2a0937
fsdp: 2
tensor-parallel: 2
envs: |-
  OFI_NCCL_PROTOCOL=SENDRECV
Observed a significant regression for MaxText, from > 260 TFLOP/s/device to < 20 TFLOP/s/device, without the SENDRECV protocol. Is this expected?
Ah wait, I forgot to make sure MaxText can use the latest TF. Working on it.
…Toolbox into sbosisio/test-axlearn-new-tf
/ok to test 7b0d42b
/ok to test c56948a
…Toolbox into sbosisio/test-axlearn-new-tf
/ok to test 473c130
/ok to test 0e07e2f
/ok to test e376b7a
/ok to test 71cb8bc
/ok to test 16ab71d
/ok to test 9579fca
/ok to test b3a0666
/ok to test 7e90ea5
/ok to test b3595d1
/ok to test 968a62d
/ok to test 0663fa9
/ok to test bcc45cd
/ok to test f9c9dc0
It looks like updating to
Traceback (most recent call last):
File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
from axlearn.experiments.text.gpt import c4_trainer
File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
from axlearn.experiments.text.gpt import ( # pytype: disable=pyi-error
File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 49, in <module>
from axlearn.common.input_lm import lm_text_preprocessor
File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
import seqio
File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 19, in <module>
from seqio.dataset_providers import *
File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 40, in <module>
from seqio import metrics as metrics_lib
File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
from seqio import utils
File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 31, in <module>
from seqio.vocabularies import Vocabulary
File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
import tensorflow_text as tf_text
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 21, in <module>
from tensorflow_text.python import keras
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/__init__.py", line 21, in <module>
from tensorflow_text.python.keras.layers import *
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/__init__.py", line 22, in <module>
from tensorflow_text.python.keras.layers.tokenization_layers import *
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/tokenization_layers.py", line 24, in <module>
from tensorflow_text.python.ops import unicode_script_tokenizer
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/__init__.py", line 25, in <module>
from tensorflow_text.python.ops.bert_tokenizer import BertTokenizer
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/bert_tokenizer.py", line 28, in <module>
from tensorflow_text.python.ops import regex_split_ops
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/regex_split_ops.py", line 23, in <module>
gen_regex_split_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_regex_split_ops.so'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/_regex_split_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder10SetShapeFnESt8functionIFN4absl12lts_202501276StatusEPNS_15shape_inference16InferenceContextEEE
Here is a snapshot of the tensorflow and tensorflow-related packages:
Name: tensorflow-datasets
Version: 4.9.10
Summary: tensorflow/datasets is a library of datasets ready to use with TensorFlow.
Home-page: https://github.com/tensorflow/datasets
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: absl-py, array_record, dm-tree, etils, immutabledict, numpy, promise, protobuf, psutil, pyarrow, requests, simple_parsing, tensorflow-metadata, termcolor, toml, tqdm, wrapt
Required-by: seqio
---
Name: tensorflow-text
Version: 2.20.1
Summary: TF.Text is a TensorFlow library of text related ops, modules, and subgraphs.
Home-page: http://github.com/tensorflow/text
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: tensorflow
Required-by: seqio
---
Name: protobuf
Version: 6.33.6
Summary:
Home-page: https://developers.google.com/protocol-buffers/
Author: protobuf@googlegroups.com
Author-email: protobuf@googlegroups.com
License: 3-Clause BSD License
Location: /usr/local/lib/python3.12/dist-packages
Requires:
Required-by: google-api-core, google-cloud-storage-control, googleapis-common-protos, grain, grpc-google-iam-v1, grpcio-status, nsys-jax, orbax-checkpoint, proto-plus, tensorboard, tensorboardX, tensorflow-datasets, tensorflow-metadata, tf_nightly_cpu, xprof
---
Name: tf_nightly_cpu
Version: 2.22.0.dev20260530
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: absl-py, astunparse, flatbuffers, gast, google_pasta, grpcio, h5py, keras-nightly, libclang, ml_dtypes, numpy, opt_einsum, packaging, protobuf, requests, setuptools, six, termcolor, typing_extensions, wrapt
Required-by:
There is no good prebuilt
Traceback (most recent call last):
File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
from axlearn.experiments.text.gpt import c4_trainer
File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
from axlearn.experiments.text.gpt import ( # pytype: disable=pyi-error
File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 49, in <module>
from axlearn.common.input_lm import lm_text_preprocessor
File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
import seqio
File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 19, in <module>
from seqio.dataset_providers import *
File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 41, in <module>
from seqio import metrics as metrics_lib
File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
from seqio import utils
File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 29, in <module>
from seqio.vocabularies import Vocabulary
File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
import tensorflow_text as tf_text
File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 20, in <module>
from tensorflow_text.core.pybinds import tflite_registrar
ModuleNotFoundError: No module named 'tensorflow_text.core'
Starting local Bazel server and connecting to it...
INFO: Reading 'startup' options from /opt/text/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'run' from /opt/text/.bazelrc:
Inherited 'common' options: --announce_rc --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility --noenable_bzlmod --noincompatible_enable_cc_toolchain_resolution --noincompatible_enable_android_toolchain_resolution --experimental_repo_remote_exec --java_runtime_version=remotejdk_21
INFO: Reading rc options for 'run' from /opt/text/.bazelrc:
Inherited 'build' options: --repo_env=ML_WHEEL_TYPE=snapshot --repo_env=ML_WHEEL_BUILD_DATE= --repo_env=ML_WHEEL_VERSION_SUFFIX= --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --repo_env=USE_PYWRAP_RULES=True --copt=-DGRPC_BAZEL_BUILD --host_copt=-DGRPC_BAZEL_BUILD --action_env=GRPC_BAZEL_RUNTIME=1 --repo_env=PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb --action_env=PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb --repo_env=RULES_PYTHON_ENABLE_PYSTAR=0 --define=grpc_no_ares=true --features=-force_no_whole_archive --host_features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --@rules_python//python/config_settings:precompile=force_disabled
INFO: Found applicable config definition build:short_logs in file /opt/text/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /opt/text/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:linux in file /opt/text/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --define=PREFIX=/usr --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /opt/text/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Computing main repo mapping:
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/py/python_repo.bzl:82:14: !!!Using pywrap rules instead of directly creating .so objects!!!
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/py/python_repo.bzl:87:10:
=============================
Hermetic Python configuration:
Version: "3.12"
Kind: ""
Interpreter: "default" (provided by rules_python)
Requirements_lock label: "@//oss_scripts/pip_package:requirements_lock_3_12.txt"
=====================================
Computing main repo mapping:
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'icu' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'build_bazel_apple_support' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'pybind11' because it already exists.
Computing main repo mapping:
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/refs/tags/v5.28.3.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
Computing main repo mapping:
ERROR: Error computing the main repository mapping: Label '@@rules_ml_toolchain//cc/deps:cc_toolchain_deps.bzl' is invalid because 'cc/deps' is not a package; perhaps you meant to put the colon here: '@@rules_ml_toolchain//:cc/deps/cc_toolchain_deps.bzl'?
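The package snapshot earlier in this comment lists tensorflow-text 2.20.1 next to tf_nightly_cpu 2.22.0.dev20260530. Since tensorflow-text wheels ship compiled ops that are built against a specific TensorFlow minor release, a quick sanity check is to compare the major.minor parts of the two version strings. The sketch below assumes that a matching major.minor is the relevant criterion; the function names are hypothetical, and a match does not guarantee ABI compatibility with a nightly build.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.22.0.dev20260530'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_minor_release(tf_version: str, text_version: str) -> bool:
    """True when both packages report the same major.minor release."""
    return major_minor(tf_version) == major_minor(text_version)


if __name__ == "__main__":
    # Versions taken from the `pip show` output in this comment.
    print(same_minor_release("2.22.0.dev20260530", "2.20.1"))
```

With the versions from the snapshot this prints `False`, which is consistent with the load failure in the first traceback.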