
Add Bilibili video subtitle/transcript extraction #271

Open
Afterimages wants to merge 2 commits into kepano:main from Afterimages:main

@Afterimages

Summary

This PR adds subtitle (transcript) extraction support for Bilibili video pages, following the same pattern as the existing YoutubeExtractor.

What it does

When a user visits a Bilibili video page (e.g. bilibili.com/video/BV...), the extractor now:

  • Fetches video metadata (title, author, description, publish date, cover image) via the Bilibili view API
  • Discovers available subtitle tracks via the player/v2 and player/wbi/v2 APIs (tries multiple endpoints for robustness)
  • Selects the best subtitle track based on language preference and track quality (prefers human-authored over AI-generated, respects zh-cn > zh > en fallback order)
  • Downloads and parses the subtitle JSON into a structured transcript, with intelligent line grouping (merges short adjacent lines, splits on long pauses) and CJK-aware text concatenation
  • Exposes the transcript as a transcript variable (alongside title, author, site, image, published, description, part, language) for downstream consumers like Obsidian Web Clipper
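
As a rough illustration of the track selection step, here is a minimal sketch. The `SubtitleTrack` shape, the `isAi` flag, and the function names are hypothetical and are not taken from `bilibili.ts` or from the Bilibili API response. The sketch also assumes language preference is applied first and human-authored tracks win within the same language; the actual priority between the two criteria may differ.

```typescript
// Hypothetical track shape for illustration only; the real API fields may differ.
interface SubtitleTrack {
  lang: string;   // language code, e.g. "zh-CN" or "en"
  isAi: boolean;  // true when the track is machine-generated
  url: string;    // location of the subtitle JSON
}

// Fallback order stated in the description: zh-cn > zh > en.
const LANGUAGE_PREFERENCE = ["zh-cn", "zh", "en"];

// Lower rank is better; languages outside the list rank last.
function languageRank(lang: string): number {
  const index = LANGUAGE_PREFERENCE.indexOf(lang.toLowerCase());
  return index === -1 ? LANGUAGE_PREFERENCE.length : index;
}

// Assumption: compare by language first, then prefer human-authored over AI.
function selectBestTrack(tracks: SubtitleTrack[]): SubtitleTrack | undefined {
  const sorted = [...tracks].sort((a, b) => {
    const byLanguage = languageRank(a.lang) - languageRank(b.lang);
    if (byLanguage !== 0) return byLanguage;
    return Number(a.isAi) - Number(b.isAi); // false (human) sorts before true (AI)
  });
  return sorted[0];
}
```

With this ordering, a human-authored zh-CN track is chosen over an AI-generated zh-CN track, and either is chosen over an English track.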

Screenshot

Transcript successfully extracted from a Bilibili video page, visible in the clipper's page variables:

[Screenshot: extracted transcript shown in the clipper's page variables]

Files changed

  • src/extractors/bilibili.ts — New BilibiliExtractor class with async transcript extraction, subtitle track discovery/selection, URL validation (restricted to .hdslb.com / .bilibili.com hosts), and an in-memory LRU transcript cache
  • src/extractor-registry.ts — Register BilibiliExtractor for bilibili.com URL patterns
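
For readers unfamiliar with the in-memory LRU cache mentioned above, the idea can be sketched with a `Map`, which iterates keys in insertion order. The `LruCache` class and its capacity parameter are hypothetical names for illustration; the cache in `bilibili.ts` may be structured differently.

```typescript
// Minimal LRU cache sketch (illustrative; not the PR's actual implementation).
// A Map iterates keys in insertion order, so the first key is the least recently used.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    // Delete first so an existing key moves to the most-recent position.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (the first key in iteration order).
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A transcript cache could key entries by video ID plus page number, so that revisiting the same page avoids refetching the subtitle JSON; that keying scheme is an assumption, not something stated in the PR.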

Design notes

  • Follows the existing extractor conventions: extends BaseExtractor, implements canExtract / extractAsync / prefersAsync
  • Reuses the shared buildTranscript utility for consistent HTML/text output
  • The subtitle content fetch does not set credentials: include, because the Bilibili subtitle CDN (aisubtitle.hdslb.com) returns Access-Control-Allow-Origin: *, and browsers reject a wildcard origin on credentialed requests; the URL already carries an auth_key parameter for authorization
  • Supports multi-page videos (respects the ?p= URL parameter)
  • Protocol-relative subtitle URLs (//i0.hdslb.com/...) are normalized to https://
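
The URL handling described in these notes (protocol-relative normalization, the host restriction, and the ?p= parameter) might look roughly like the sketch below. The function names are hypothetical, and two behaviors are assumptions rather than facts from the PR: rejecting anything that is not https after normalization, and defaulting to page 1 when ?p= is missing or invalid.

```typescript
// Host suffixes named in the PR description.
const ALLOWED_HOST_SUFFIXES = [".hdslb.com", ".bilibili.com"];

// Returns a normalized https URL, or null if the URL is malformed or not allowed.
// (Hypothetical helper; rejecting non-https schemes is an assumption.)
function normalizeSubtitleUrl(raw: string): string | null {
  // Protocol-relative URLs such as //i0.hdslb.com/... become https://...
  const withScheme = raw.startsWith("//") ? `https:${raw}` : raw;
  let parsed: URL;
  try {
    parsed = new URL(withScheme);
  } catch {
    return null; // not a parseable absolute URL
  }
  if (parsed.protocol !== "https:") return null;
  const host = parsed.hostname.toLowerCase();
  // The leading dot in each suffix prevents look-alikes such as evilhdslb.com.
  const allowed = ALLOWED_HOST_SUFFIXES.some((suffix) => host.endsWith(suffix));
  return allowed ? parsed.toString() : null;
}

// Reads the 1-based page index from ?p=; falls back to 1 (assumed default).
function getPageNumber(pageUrl: string): number {
  const raw = new URL(pageUrl).searchParams.get("p");
  const page = raw === null ? NaN : Number(raw);
  return Number.isInteger(page) && page >= 1 ? page : 1;
}
```

Matching on a dotted suffix rather than a substring is the important detail here: a plain `includes("hdslb.com")` check would also accept hosts controlled by a third party.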
