
feat(knowledge): connectors, user exclusions, expanded tools & airtable integration #3230

Open
waleedlatif1 wants to merge 10 commits into staging from feat/kb

Conversation

@waleedlatif1
Collaborator

Summary

  • Knowledge base connectors: sync engine with SHA-256 change detection, connector registry, add/edit/delete UI
  • User-excluded documents persist across syncs — viewable and restorable from edit connector modal
  • Bulk delete paths set userExcluded for connector docs via SQL CASE
  • New knowledge tools: list_documents, list_chunks, delete_document, delete_chunk, update_chunk, list_tags
  • Airtable: list_bases and get_base_schema tools
  • Connector-aware delete confirmations in document and KB views
  • Refactored query hooks into hooks/queries/kb/ and hooks/queries/oauth/
  • DB migration: connector tables, sync logs, userExcluded column, document fields
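
The SHA-256 change detection mentioned above can be sketched as follows. This is an illustrative sketch, not the PR's actual implementation; the `contentHash` field and the `classifyDocument` helper are names invented for the example, assuming the engine compares a stored hash against a freshly computed one to decide whether to re-upload and re-process a document.

```typescript
import { createHash } from 'node:crypto'

// Hash a document's content; comparing hashes tells the sync engine
// whether the upstream document changed since the last sync.
function contentHash(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex')
}

// Per-document decision during a sync pass: new, changed, or unchanged.
function classifyDocument(
  incomingContent: string,
  existing: { contentHash: string } | undefined
): 'new' | 'changed' | 'unchanged' {
  if (!existing) return 'new'
  return contentHash(incomingContent) === existing.contentHash ? 'unchanged' : 'changed'
}
```

Only 'new' and 'changed' documents would trigger a storage upload and re-chunking; 'unchanged' documents are skipped, which keeps periodic syncs cheap.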

Type of Change

  • New feature

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Feb 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Feb 17, 2026 7:45pm |


@greptile-apps
Contributor

greptile-apps bot commented Feb 17, 2026

Greptile Summary

This PR introduces a comprehensive knowledge base connector system — a sync engine that periodically pulls documents from external sources (Confluence, GitHub, Google Drive, Jira, Linear, Notion, Airtable) into knowledge bases via SHA-256 change detection. It also adds six new knowledge tools (list_documents, list_chunks, delete_document, delete_chunk, update_chunk, list_tags) and two new Airtable tools (list_bases, get_base_schema). The architecture is well thought out: the sync engine is connector-agnostic, uses an optimistic lock to prevent concurrent syncs, stores an audit trail in knowledge_connector_sync_log, and handles user-excluded documents correctly across syncs.
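
The optimistic lock described above can be illustrated with a minimal sketch. This is a hypothetical in-memory stand-in, not the PR's code: it assumes the real engine gates each sync on a conditional status transition (in SQL terms, an `UPDATE ... SET status = 'syncing' WHERE id = ? AND status <> 'syncing'` whose affected-row count decides whether the sync proceeds). The names `tryAcquireSyncLock` and `ConnectorStatus` are invented for the example.

```typescript
type ConnectorStatus = 'active' | 'syncing' | 'error'

// Single-threaded stand-in for the atomic conditional UPDATE: the sync
// proceeds only if this call flips the connector into 'syncing'.
function tryAcquireSyncLock(
  connectors: Map<string, { status: ConnectorStatus }>,
  connectorId: string
): boolean {
  const connector = connectors.get(connectorId)
  if (!connector || connector.status === 'syncing') return false
  connector.status = 'syncing'
  return true
}
```

A second dispatch for the same connector sees `status === 'syncing'` and bails out, which is what prevents overlapping syncs from the cron scheduler.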

Several previously reported issues have been resolved in this iteration:

  • Sync mode inversion is fixed (syncMode defaults to 'full' in schema)
  • excludedDocs query in the sync engine correctly filters deletedAt IS NULL
  • GitHub's getDocument properly encodes path segments individually
  • Confluence and Jira properly escape user values before injecting into CQL/JQL

Key issues that remain:

  • Silent error swallowing in destructive tool operations: delete_document, delete_chunk, and update_chunk tools all call response.json() unconditionally and return success: true regardless of HTTP status. Failed API calls (404, 403, 500) are reported as successful operations to any agent or workflow using these tools — this is particularly risky for delete operations.
  • offset: 0 dropped in list_documents and list_chunks: Truthiness check if (params.offset) skips sending offset=0 to the API. While the server default handles this case, it's a semantic inconsistency that could confuse LLM-driven pagination.
  • folderId injection in Google Drive connector: User-controlled folderId is embedded in the Drive API q parameter using single-quote syntax without escaping. Drive folder IDs are normally alphanumeric so this is low-risk in practice, but defensive escaping is recommended for correctness.
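
The `offset: 0` issue above comes down to JavaScript truthiness: `0` is falsy, so a truthiness guard silently drops it. A minimal sketch contrasting the flagged pattern with a null-safe one (both function names are hypothetical, written only to demonstrate the behavior):

```typescript
// Flagged pattern: `if (params.offset)` skips offset=0 entirely.
function buildQueryBuggy(params: { limit?: number; offset?: number }): string {
  const qs = new URLSearchParams()
  if (params.limit) qs.set('limit', String(params.limit))
  if (params.offset) qs.set('offset', String(params.offset))
  return qs.toString()
}

// Null-safe variant: 0 is sent, only undefined/null are skipped.
function buildQuery(params: { limit?: number; offset?: number }): string {
  const qs = new URLSearchParams()
  if (params.limit != null) qs.set('limit', String(params.limit))
  if (params.offset != null) qs.set('offset', String(params.offset))
  return qs.toString()
}
```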

Confidence Score: 3/5

  • Safe to merge with caution — core sync infrastructure is sound but the new knowledge tools have silent error-swallowing bugs that affect destructive operations used by agents.
  • The sync engine, connector implementations, API routes, and DB schema are well-constructed. Previously flagged critical issues (sync mode inversion, CQL/JQL injection, excluded docs soft-delete filter) have been addressed. The remaining issues are in the new knowledge tools layer: delete_document, delete_chunk, and update_chunk all return success: true unconditionally regardless of HTTP response status, which means agent workflows using these tools cannot detect failures. This is a logic bug in tool response handling, not a data-loss risk in the sync engine itself. The folderId issue in Google Drive is low risk in practice but warrants a fix.
  • Pay close attention to apps/sim/tools/knowledge/delete_document.ts, apps/sim/tools/knowledge/delete_chunk.ts, and apps/sim/tools/knowledge/update_chunk.ts — all three return success: true regardless of API response status.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `apps/sim/lib/knowledge/connectors/sync-engine.ts` | Core sync engine: correctly handles full vs incremental mode (full is default), soft-delete safety in excludedDocs query. However, `updateDocument` does not reset chunk data or remove old chunks before re-processing, which could leave stale chunks if content shrinks significantly. |
| `apps/sim/tools/knowledge/delete_document.ts` | `transformResponse` always returns `success: true` regardless of HTTP status code — errors from the API (404, 403, 500) are silently swallowed and reported as success to callers. |
| `apps/sim/tools/knowledge/delete_chunk.ts` | Same silent error issue: `transformResponse` hardcodes `success: true` without checking `response.ok` — failed deletes are reported as successful. |
| `apps/sim/tools/knowledge/list_documents.ts` | Pagination offset of 0 is silently dropped because `if (params.offset)` treats 0 as falsy — the first-page offset is never sent, but the server default matches, so this is a semantic inconsistency rather than a bug. Also lacks a `response.ok` check in `transformResponse`. |
| `apps/sim/tools/knowledge/list_chunks.ts` | Same offset=0 falsy issue as `list_documents.ts`; also lacks a `response.ok` check in `transformResponse`. |
| `apps/sim/tools/knowledge/update_chunk.ts` | `transformResponse` hardcodes `success: true` and calls `response.json()` unconditionally — API errors (404, 403, 422) are silently swallowed. |
| `apps/sim/connectors/google-drive/google-drive.ts` | `buildQuery` injects folderId into the Drive API `q` parameter using single-quote syntax without escaping — a folderId containing a single quote would break the query. Drive folder IDs are normally alphanumeric, but the input is user-controlled. |
| `apps/sim/connectors/airtable/airtable.ts` | Well-structured connector. Uses `encodeURIComponent` for `tableIdOrName` in URLs and caches field name lookups in `syncContext`. The `sourceUrl` construction for individual records uses the same encoding, which is correct. |
| `packages/db/schema.ts` | Schema additions are clean: new `knowledgeConnector` and `knowledgeConnectorSyncLog` tables, new columns on `document` (userExcluded, connectorId, externalId, contentHash, sourceUrl), partial unique index on (connectorId, externalId) WHERE deletedAt IS NULL. `syncMode` defaults to 'full', which is correct. |
| `apps/sim/app/api/knowledge/[id]/connectors/[connectorId]/documents/route.ts` | GET and PATCH endpoints are clean — both active and excluded document queries properly filter with `isNull(document.deletedAt)`, addressing the previously noted issue. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Cron as Cron Scheduler
    participant SyncRoute as /api/knowledge/connectors/sync
    participant Dispatch as dispatchSync()
    participant Trigger as Trigger.dev
    participant Engine as executeSync()
    participant Connector as ConnectorRegistry
    participant Source as External Source
    participant DB as Database
    participant Storage as StorageService

    Cron->>SyncRoute: GET (every 5 min)
    SyncRoute->>DB: Query active connectors WHERE nextSyncAt <= now
    SyncRoute->>Dispatch: dispatchSync(connectorId) per due connector

    alt Trigger.dev available
        Dispatch->>Trigger: knowledgeConnectorSync.trigger(payload)
        Trigger->>Engine: executeSync(connectorId)
    else Fallback
        Dispatch->>Engine: executeSync() fire-and-forget
    end

    Engine->>DB: Lock connector (status='syncing')
    Engine->>DB: Insert sync log (status='started')
    Engine->>DB: Fetch OAuth token via refreshAccessTokenIfNeeded
    Engine->>Connector: listDocuments(token, sourceConfig, cursor)
    Connector->>Source: Paginated API calls (up to 500 pages)
    Source-->>Connector: ExternalDocument[]
    Connector-->>Engine: ExternalDocumentList

    Engine->>DB: Load existing docs (connectorId, deletedAt IS NULL)
    Engine->>DB: Load excluded docs (userExcluded=true, deletedAt IS NULL)

    loop For each external document
        alt New document
            Engine->>Storage: uploadFile(content as .txt)
            Engine->>DB: INSERT document record
            Engine->>Engine: processDocumentAsync() fire-and-forget
        else Content hash changed
            Engine->>Storage: uploadFile(updated content)
            Engine->>DB: UPDATE document record in-place
            Engine->>Engine: processDocumentAsync() fire-and-forget
        else Unchanged
            Engine->>Engine: docsUnchanged++
        end
    end

    alt Full sync mode
        Engine->>DB: Soft-delete stale docs (deletedAt=now)
    end

    Engine->>DB: UPDATE sync log (status='completed', counts)
    Engine->>DB: UPDATE connector (status='active', nextSyncAt)
```

Last reviewed commit: 4453edb

@greptile-apps bot left a comment

88 files reviewed, 1 comment

@waleedlatif1
Collaborator Author

@greptile
@cursor review

@greptile-apps bot left a comment

89 files reviewed, 3 comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

bugbot review

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

bugbot review

@waleedlatif1
Collaborator Author

@greptile review

@greptile-apps bot left a comment

88 files reviewed, 5 comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@greptile-apps bot left a comment

88 files reviewed, 4 comments

@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

@greptile-apps bot left a comment

89 files reviewed, no comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@greptile-apps bot left a comment

98 files reviewed, 3 comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@greptile-apps bot left a comment

98 files reviewed, 4 comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@greptile-apps bot left a comment

98 files reviewed, 5 comments

Comment on lines +33 to +41

```ts
transformResponse: async (response, params): Promise<KnowledgeDeleteDocumentResponse> => {
  const result = await response.json()

  return {
    success: true,
    output: {
      documentId: params?.documentId ?? '',
      message: result.data?.message ?? 'Document deleted successfully',
    },
```

transformResponse swallows API errors silently

response.json() is called unconditionally and success: true is always returned regardless of whether the API responded with a 4xx or 5xx. A caller trying to delete a document that doesn't exist (404) or is unauthorized (403) will receive success: true with no indication of the failure. This is especially problematic for destructive operations.

Suggested change

```diff
 transformResponse: async (response, params): Promise<KnowledgeDeleteDocumentResponse> => {
+  if (!response.ok) {
+    const error = await response.text()
+    throw new Error(`Failed to delete document: ${response.status} ${error}`)
+  }
   const result = await response.json()
   return {
     success: true,
     output: {
       documentId: params?.documentId ?? '',
       message: result.data?.message ?? 'Document deleted successfully',
     },
   }
 },
```

Comment on lines +40 to +51

```ts
transformResponse: async (response, params): Promise<KnowledgeDeleteChunkResponse> => {
  const result = await response.json()

  return {
    success: true,
    output: {
      chunkId: params?.chunkId ?? '',
      documentId: params?.documentId ?? '',
      message: result.data?.message ?? 'Chunk deleted successfully',
    },
  }
},
```

transformResponse swallows API errors silently

Same issue as delete_document.ts: `success: true` is always returned regardless of HTTP status. A failed chunk delete (404, 403, 500) is reported as success.

Suggested change

```diff
 transformResponse: async (response, params): Promise<KnowledgeDeleteChunkResponse> => {
+  if (!response.ok) {
+    const error = await response.text()
+    throw new Error(`Failed to delete chunk: ${response.status} ${error}`)
+  }
   const result = await response.json()
   return {
     success: true,
     output: {
       chunkId: params?.chunkId ?? '',
       documentId: params?.documentId ?? '',
       message: result.data?.message ?? 'Chunk deleted successfully',
     },
   }
 },
```

Comment on lines +58 to +75

```ts
transformResponse: async (response, params): Promise<KnowledgeUpdateChunkResponse> => {
  const result = await response.json()
  const chunk = result.data || {}

  return {
    success: true,
    output: {
      documentId: params?.documentId ?? '',
      id: chunk.id ?? '',
      chunkIndex: chunk.chunkIndex ?? 0,
      content: chunk.content ?? '',
      contentLength: chunk.contentLength ?? 0,
      tokenCount: chunk.tokenCount ?? 0,
      enabled: chunk.enabled ?? true,
      updatedAt: chunk.updatedAt ?? null,
    },
  }
},
```

transformResponse swallows API errors silently

success: true is always returned regardless of HTTP status. If the chunk doesn't exist (404) or the request is malformed (422), the tool reports success and returns empty/default values for all fields. The caller has no way to detect the failure.

Suggested change

```diff
 transformResponse: async (response, params): Promise<KnowledgeUpdateChunkResponse> => {
+  if (!response.ok) {
+    const error = await response.text()
+    throw new Error(`Failed to update chunk: ${response.status} ${error}`)
+  }
   const result = await response.json()
   const chunk = result.data || {}
   return {
     success: true,
     output: {
       documentId: params?.documentId ?? '',
       id: chunk.id ?? '',
       chunkIndex: chunk.chunkIndex ?? 0,
       content: chunk.content ?? '',
       contentLength: chunk.contentLength ?? 0,
       tokenCount: chunk.tokenCount ?? 0,
       enabled: chunk.enabled ?? true,
       updatedAt: chunk.updatedAt ?? null,
     },
   }
 },
```

Comment on lines +48 to +49

```ts
if (params.limit) queryParams.set('limit', String(params.limit))
if (params.offset) queryParams.set('offset', String(params.offset))
```

offset: 0 is treated as falsy and silently dropped

if (params.limit) and if (params.offset) use truthiness checks, so a value of 0 for offset is never appended to the query string. While the server likely defaults to 0 anyway, this makes it impossible to explicitly request the first page when the LLM passes offset: 0, and creates a behavioral inconsistency.

Suggested change

```diff
-if (params.limit) queryParams.set('limit', String(params.limit))
-if (params.offset) queryParams.set('offset', String(params.offset))
+if (params.limit != null) queryParams.set('limit', String(params.limit))
+if (params.offset != null) queryParams.set('offset', String(params.offset))
```

Comment on lines +133 to +135

```ts
if (folderId?.trim()) {
  parts.push(`'${folderId.trim()}' in parents`)
}
```

folderId injected into Drive query without escaping single quotes

The Drive API q parameter uses single-quoted string literals. A folderId value containing a single quote (e.g. abc' OR '1'='1) would break out of the literal and potentially corrupt the query. Drive folder IDs are typically alphanumeric so exploitation is unlikely in practice, but since this is user-controlled input, defensive escaping is recommended:

Suggested change

```diff
 if (folderId?.trim()) {
-  parts.push(`'${folderId.trim()}' in parents`)
+  const safeFolderId = folderId.trim().replace(/\\/g, '\\\\').replace(/'/g, "\\'")
+  parts.push(`'${safeFolderId}' in parents`)
 }
```
