fish-speech.tools.api_server --compile Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. #967

Open

Description

@corporate9601

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

Windows 10, Python 3.11, torch==2.6.0+cu126, latest Triton for Windows

Steps to Reproduce

I run the command:

python -m fish-speech.tools.api_server --listen 0.0.0.0:8080 --llama-checkpoint-path "checkpoints/fish-speech-1.5" --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" --decoder-config-name firefly_gan_vq --compile

✔️ Expected Behavior

I expect the fish-speech server to start and compile with Torch so that inference is fast (I need real-time TTS).

❌ Actual Behavior

INFO: Started server process [29352]
INFO: Waiting for application startup.
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:683 - Restored model from checkpoint
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:689 - Using DualARTransformer
2025-05-07 13:21:20.842 | INFO | fish_speech.models.text2semantic.inference:load_model:697 - Compiling function...
2025-05-07 13:21:20.907 | INFO | tools.server.model_manager:load_llama_model:99 - LLAMA model loaded.
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:445: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:630: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\finite_scalar_quantization.py:147: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\lookup_free_quantization.py:209: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
2025-05-07 13:21:23.808 | INFO | fish_speech.models.vqgan.inference:load_model:46 - Loaded model:
2025-05-07 13:21:23.809 | INFO | tools.server.model_manager:load_decoder_model:107 - Decoder model loaded.
2025-05-07 13:21:23.824 | INFO | fish_speech.models.text2semantic.inference:generate_long:790 - Encoded text: Hello world.
2025-05-07 13:21:23.826 | INFO | fish_speech.models.text2semantic.inference:generate_long:808 - Generating sentence 1/1 of sample 1/1
0%| | 0/1023 [00:00<?, ?it/s]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:03:51, 225.67s/it]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:04:06, 225.68s/it]
ERROR: Traceback (most recent call last):
File "D:\Python\Python311\Lib\site-packages\kui\asgi\lifespan.py", line 36, in call
await result
File "D:\2025\Call Center Agent X\fish-speech\tools\api_server.py", line 83, in initialize_app
app.state.model_manager = ModelManager(
^^^^^^^^^^^^^
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 65, in init
self.warm_up(self.tts_inference_engine)
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 121, in warm_up
list(inference(request, tts_inference_engine))
File "D:\2025\Call Center Agent X\fish-speech\tools\server\inference.py", line 25, in inference_wrapper
raise HTTPException(
baize.exceptions.HTTPException: (500, 'Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "D:\\2025\\Call Center Agent X\\fish-speech\\fish_speech\\models\\text2semantic\\inference.py", line 307, in decode_one_token_ar\n codebooks = torch.stack(codebooks, dim=0). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.')

ERROR: Application startup failed. Exiting.
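
For reference, the error text itself names two generic workarounds. Below is a minimal sketch of both, assuming the decode step is compiled with torch.compile(mode="reduce-overhead") (the mode that enables CUDA graphs). decode_one_token here is a hypothetical stand-in for decode_one_token_ar in fish_speech/models/text2semantic/inference.py, not the real signature:

```python
import torch

# Hypothetical stand-in for the compiled decode step; the real function is
# decode_one_token_ar in fish_speech/models/text2semantic/inference.py.
def decode_one_token(x: torch.Tensor) -> torch.Tensor:
    return x * 2

# "reduce-overhead" is the torch.compile mode that turns on CUDA graphs.
compiled_decode = torch.compile(decode_one_token, mode="reduce-overhead")

x = torch.randn(4, device="cuda")

# Workaround 1: mark the start of a new iteration before every invocation,
# so outputs of the previous graph replay are no longer treated as live.
torch.compiler.cudagraph_mark_step_begin()
out = compiled_decode(x)

# Workaround 2: clone the output outside the compiled region; the clone is
# a fresh tensor that later graph replays cannot overwrite.
out = compiled_decode(x).clone()
```

Applied to the generation loop, that would mean either calling cudagraph_mark_step_begin() once before each token step, or cloning the stacked codebooks tensor where it leaves the compiled function.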

Labels

bug (Something isn't working)