Open
Labels
bug (Something isn't working)
Description
Example trace:
(RegistryActor pid=1683939) {"asctime":"2026-02-13 02:47:02,310","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(RegistryActor pid=1684325) {"asctime":"2026-02-13 02:47:08,227","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(pid=1685464) fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
(autoscaler +1m30s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(bundle_reservation_check_func pid=1684909) {"asctime":"2026-02-13 02:47:14,799","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/backends.py:21: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464) warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/modelopt/torch/utils/logging.py:115: UserWarning: Failed to import vllm plugin due to: AttributeError("module 'vllm.attention' has no attribute 'Attention'"). You may ignore this warning if you do not need this plugin.
(pid=1685464) warnings.warn(message, *args, **kwargs)
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/gpt/gpt_layer_specs.py:67: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464) warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py:63: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464) warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/vision/vit_layer_specs.py:30: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464) warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
2026-02-13 02:47:19.733 | INFO | skyrl.backends.skyrl_train.workers.worker:_initiate_actors:529 - Initializing process group for RayActorGroup
(raylet) warning: The `extra-build-dependencies` option is experimental and may change without warning. Pass `--preview-features extra-build-dependencies` to disable this warning.
(raylet) Installed 262 packages in 1.07s
(MegatronPolicyWorkerBase pid=1685464) {"asctime":"2026-02-13 02:47:45,571","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
This error seems to be harmless and goes away once training starts, but it is worth investigating why it has only recently started to appear.
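To help narrow down when and where the error shows up, here is a rough sketch of a filter that pulls only the metrics exporter failures out of a captured driver log and groups them by process. It assumes the `(Name pid=N) {json}` line shape seen in the trace above. The names `exporter_errors` and `PREFIX` are made up for illustration and are not part of Ray or SkyRL.

```python
import json
import re
import sys
from collections import Counter

# Assumed line shape, taken from the trace in this issue:
#   (ActorName pid=123) {"asctime": "...", "message": "...", ...}
# The actor name may be missing, as in "(pid=123) ...".
PREFIX = re.compile(
    r"^\((?P<source>[^)]*?)\s*pid=(?P<pid>\d+)\)\s+(?P<payload>\{.*\})\s*$"
)
NEEDLE = "Failed to establish connection to the metrics exporter agent"


def exporter_errors(lines):
    """Yield (source, pid, asctime) for each metrics exporter connection error."""
    for line in lines:
        match = PREFIX.match(line)
        if match is None:
            continue  # not a "(name pid=N) {json}" line
        try:
            record = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # braces present, but not a JSON log record
        if NEEDLE in record.get("message", ""):
            source = match.group("source") or "<unnamed>"
            yield source, int(match.group("pid")), record.get("asctime")


if __name__ == "__main__":
    # Usage: python filter_exporter_errors.py < driver.log
    hits = list(exporter_errors(sys.stdin))
    for source, pid, asctime in hits:
        print(f"{asctime}  {source} (pid={pid})")
    print(Counter(source for source, _, _ in hits))
```

Comparing the timestamps against the first training step would show whether the errors really stop once training begins, and the per-process counts would show whether every worker hits it or only some.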
cc: @tyler-griggs