The skip_docs parameter is passed to index_stream:
https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/cli.py#L117-L119
The index_stream method will skip lines:
https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/taggerfactory.py#L74-L75
But by that point the dumpreader has already spent time parsing the JSON, even for the skipped lines:
https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/readers/dumpreader.py#L26-L33
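One possible shape for a fix is to discard the skipped lines before they ever reach the JSON parser. The snippet below is only a minimal sketch of that idea, not the actual dumpreader code: `iter_entities` and its `skip` argument are hypothetical names, and it assumes the usual dump layout of one entity per line, wrapped in `[` and `]` with trailing commas.

```python
import io
import json
from itertools import islice


def iter_entities(lines, skip=0):
    """Yield parsed entities, dropping the first `skip` entity lines unparsed.

    Hypothetical helper: assumes one JSON entity per line, a trailing comma
    after each entity, and `[` / `]` on their own lines.
    """
    # Normalise each line: strip whitespace and the trailing comma.
    stripped = (line.strip().rstrip(",") for line in lines)
    # Ignore blank lines and the brackets of the enclosing JSON array.
    candidates = (line for line in stripped if line not in ("", "[", "]"))
    # islice consumes the skipped lines without handing them to json.loads.
    for line in islice(candidates, skip, None):
        yield json.loads(line)


# The first entity line is deliberately not valid JSON: since it is skipped,
# it is never parsed and no error is raised.
sample = io.StringIO('[\nnot valid json,\n{"id": "Q2"},\n{"id": "Q3"}\n]\n')
ids = [entity["id"] for entity in iter_entities(sample, skip=1)]
print(ids)
```

Note that this counts entity lines rather than successfully parsed documents, so the offset may differ slightly from what skip_docs means today.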
While a fix for this is in the works, IMO a good workaround is to drop the already-processed lines with tail before piping the dump to the CLI, e.g.:
pbzip2 -c -d -p8 latest-all.json.bz2 | tail -n +879322
(Note that I am using pbzip2, a multi-threaded bzip2 implementation. tail -n +N starts the output at line N.)
Also, I suggest switching to orjson, since it is faster than the standard-library json module.
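For illustration, swapping the parser could look roughly like this. It is a sketch under assumptions, not a patch against the repository: the fallback to the standard library is my own addition so the snippet still runs where orjson is not installed, and the sample line is made up.

```python
import json

try:
    import orjson  # third-party package, installed separately

    loads = orjson.loads
except ImportError:
    # Fall back to the standard library so the snippet still runs.
    loads = json.loads

# Made-up dump line in the one-entity-per-line layout, with trailing comma.
line = '{"id": "Q42", "type": "item"},\n'
entity = loads(line.strip().rstrip(","))
print(entity["id"])
```

One difference to keep in mind is that orjson.dumps returns bytes rather than str, so code that serializes as well needs a small adjustment.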