Skip to content

Fixing Json character offsets via wrapping the underlying character buffer"#2643

Draft
jurgenvinju wants to merge 18 commits intomainfrom
fix/json-offsets
Draft

Fixing Json character offsets via wrapping the underlying character buffer"#2643
jurgenvinju wants to merge 18 commits intomainfrom
fix/json-offsets

Conversation

@jurgenvinju
Copy link
Member

@jurgenvinju jurgenvinju commented Feb 11, 2026

This continues where #2638 left off.

  • stable fix and validation of offsets and lengths, via wrapping the underlying character stream
  • when called from Rascal all read and parse functions offer accurate error locations based on the above offset tracking, even when trackOrigins is off
  • the internal API which starts from a JsonReader never tracks origins and does not produce accurate offsets in error messages.
    • internal API now produces ParseError(locationTop or |unknown:///|, line, column, cause, path) instead of ParseError(locWithOffsetLineColumn, cause, path)
  • proper implementation and effect of trackOrigins=false for the public Rascal API.
    • for the internal API trackOrigins has no effect. It is always off.
  • setLenient was not implemented correctly, and needed to be updated to non-deprecated API
  • validation of line, column numbers (start is internal, end is external bound for columns)
  • test which also triggers parseJSON with originTracking enabled creates broken offset/length information if the input exceeds 1024 characters #2633 on unix systems
  • added several tests with sizes longer than 1024 to cover more corner cases with offset tracking.

TODO

  • for full test coverage of the OriginTrackingReader (nested class in JsonValueReader), we require a test which triggers non-complete-buffer fill without being at EOF. The read must be asked to fill the complete buffer, but it should succeed with a smaller amount of characters read. Then the next call to the OriginTrackingReader will have off!=0 which is the path we haven't tested yet.
    • non-complete buffer fill always happens (almost always) just before EOF, but after EOF we do not call read anymore, because we've already accepted the input. And so this case does not trigger off!=0

This API still streams pretty quickly. Could not measure a degredation in speed on smaller files like we have in the tests. On very larger files sometimes it's 1% slower and sometimes 1% faster, so that seems to be noise.

This is for another PR

  • fix offsets of high-low surrogate pairs (optional)

@jurgenvinju jurgenvinju self-assigned this Feb 11, 2026
@codecov
Copy link

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 77.50000% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 46%. Comparing base (227422a) to head (d8dec4d).
⚠️ Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
...pl/library/lang/json/internal/JsonValueReader.java 77% 7 Missing and 6 partials ⚠️
.../rascalmpl/exceptions/RuntimeExceptionFactory.java 25% 3 Missing ⚠️
src/org/rascalmpl/library/lang/json/IO.java 88% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##              main   #2643   +/-   ##
=======================================
  Coverage       46%     46%           
+ Complexity    6673    6672    -1     
=======================================
  Files          795     795           
  Lines        65878   65895   +17     
  Branches      9870    9872    +2     
=======================================
+ Hits         30682   30713   +31     
+ Misses       32825   32806   -19     
- Partials      2371    2376    +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines 1133 to 1134
// we've read until here before we were reset to 0.
offset += limit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not correct, we should do offset += limit - off and later also take limit and the off in the account.

I think it should be (if we want to use these 2 variables):

offset += limit - off;

limit = off + in.read(cbuf, off, len);

readCount++;

return limit - off;

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants