BigQuerySource.get_table_query_string() silently ignores query when table is also set #6200
Description
Summary
When both table and query are provided to BigQuerySource, get_table_query_string() always returns table, silently ignoring query. This makes it impossible to use a custom query (e.g., for deduplication) on a PushSource batch source, since PushSource requires table for offline writes via offline_write_batch().
Expected Behavior
When both table and query are set on a BigQuerySource:
- Reads (get_table_query_string()) should use query, since it is more specific and intentionally provided
- Writes (offline_write_batch()) should continue using .table directly as the write destination
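
The read/write split described above could look roughly like the sketch below. `SourceSketch` and `write_destination()` are hypothetical stand-ins written for this issue, not Feast's actual `BigQuerySource` or any existing Feast API; only the method name `get_table_query_string()` and the `table`/`query` attributes come from the real class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSketch:
    """Hypothetical stand-in for BigQuerySource; not Feast's actual class."""

    table: Optional[str] = None
    query: Optional[str] = None

    def get_table_query_string(self) -> str:
        # Proposed read path: a custom query takes precedence over the table.
        if self.query:
            return f"({self.query})"
        if self.table:
            return f"`{self.table}`"
        raise ValueError("either table or query must be set")

    def write_destination(self) -> str:
        # Hypothetical helper mirroring how offline_write_batch() reads .table.
        if not self.table:
            raise ValueError("table is required for writes")
        return self.table
```

With both fields set, reads would go through the parenthesized query while writes would still target the table. Sources that set only `table` keep their current behavior.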
Current Behavior
get_table_query_string() in bigquery_source.py always prefers table:

    def get_table_query_string(self) -> str:
        if self.table:
            return f"`{self.table}`"
        return f"({self.query})"

This means any custom query (e.g., deduplication logic) is silently ignored when table is also present.
Use Case
Streaming (push) sources often produce duplicate rows in BigQuery. The natural solution is:

    batch_source = BigQuerySource(
        name="my_batch_source",
        table="project.dataset.my_table",  # needed for push writes
        query="""
            SELECT * FROM `project.dataset.my_table`
            QUALIFY ROW_NUMBER() OVER (PARTITION BY entity_id, event_time) = 1
        """,  # needed for deduplicated reads
        timestamp_field="event_time",
    )
    push_source = PushSource(name="my_source", batch_source=batch_source)

But because get_table_query_string() ignores query when table is set, reads return duplicates. And removing table to force query usage breaks offline_write_batch(), which accesses .table directly (bigquery.py:449).
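
Because the table name appears in both `table` and `query`, one way to keep the two in sync is to derive the query from the table. The helper below is a hypothetical convenience, not part of Feast. It reproduces the deduplication query from the example above, and it only becomes useful once reads honor `query`.

```python
from typing import Sequence


def dedup_query(table: str, partition_by: Sequence[str]) -> str:
    """Build a read query that keeps one row per combination of partition keys."""
    if not partition_by:
        raise ValueError("partition_by must name at least one column")
    keys = ", ".join(partition_by)
    # Same shape as the query in the issue: no ORDER BY, so which duplicate
    # survives within a partition is left to BigQuery.
    return (
        f"SELECT * FROM `{table}`\n"
        f"QUALIFY ROW_NUMBER() OVER (PARTITION BY {keys}) = 1"
    )
```

The result of `dedup_query("project.dataset.my_table", ["entity_id", "event_time"])` could then be passed as `query`, with the same table string passed as `table`.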
Environment
- Feast version: 0.58.0 (also confirmed unresolved on 0.61.0 / current main)
- Offline store: BigQuery