ray emits logs about "failed to establish connection to the metrics exporter agent..." #1103

@erictang000

Description

Example trace:

(RegistryActor pid=1683939) {"asctime":"2026-02-13 02:47:02,310","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(RegistryActor pid=1684325) {"asctime":"2026-02-13 02:47:08,227","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(pid=1685464) fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
(autoscaler +1m30s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(bundle_reservation_check_func pid=1684909) {"asctime":"2026-02-13 02:47:14,799","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/backends.py:21: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464)   warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/modelopt/torch/utils/logging.py:115: UserWarning: Failed to import vllm plugin due to: AttributeError("module 'vllm.attention' has no attribute 'Attention'"). You may ignore this warning if you do not need this plugin.
(pid=1685464)   warnings.warn(message, *args, **kwargs)
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/gpt/gpt_layer_specs.py:67: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464)   warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/gpt/heterogeneous/heterogeneous_layer_specs.py:63: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464)   warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) /home/ray/.cache/uv/builds-v0/.tmpGvCpge/lib/python3.12/site-packages/megatron/core/models/vision/vit_layer_specs.py:30: UserWarning: Apex is not installed. Falling back to Torch Norm
(pid=1685464)   warnings.warn("Apex is not installed. Falling back to Torch Norm")
(pid=1685464) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
2026-02-13 02:47:19.733 | INFO     | skyrl.backends.skyrl_train.workers.worker:_initiate_actors:529 - Initializing process group for RayActorGroup
(raylet) warning: The `extra-build-dependencies` option is experimental and may change without warning. Pass `--preview-features extra-build-dependencies` to disable this warning.
(raylet) Installed 262 packages in 1.07s
(MegatronPolicyWorkerBase pid=1685464) {"asctime":"2026-02-13 02:47:45,571","levelname":"E","message":"Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14","filename":"core_worker_process.cc","lineno":825}

This seems to be harmless and goes away after training starts, but it is worth investigating why it has only recently started to appear.
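To help narrow down which processes emit the error and how often, here is a minimal sketch that tallies the message per actor from captured driver output. It assumes the lines keep the `(Name pid=N) {json}` shape shown in the trace above; `count_metrics_agent_errors` and the regex are hypothetical helpers for illustration, not part of Ray or SkyRL.

```python
import json
import re
from collections import Counter

# Assumed line shape, taken from the trace above: "(ActorName pid=1234) {...json...}".
# Lines without a JSON payload (warnings, autoscaler tips) simply do not match.
LINE_RE = re.compile(
    r"^\((?P<source>[^)]*?)\s*pid=(?P<pid>\d+)\)\s+(?P<payload>\{.*\})\s*$"
)
NEEDLE = "Failed to establish connection to the metrics exporter agent"


def count_metrics_agent_errors(lines):
    """Count metrics-agent connection errors per (source, pid) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        try:
            record = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            # Not a structured log record; ignore it.
            continue
        if not isinstance(record, dict):
            continue
        if NEEDLE in record.get("message", ""):
            source = match.group("source").strip() or "<unnamed>"
            counts[(source, int(match.group("pid")))] += 1
    return counts
```

Feeding it the saved stdout of a run (for example, `count_metrics_agent_errors(open(path))` with a log path of your choosing) gives a per-process tally that could be compared before and after training starts.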

cc: @tyler-griggs

Labels: bug
