index_dump skip parameter still loads the json lines previous to skip

The `skip_docs` parameter is passed to `index_stream`:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/cli.py#L117-L119

The `index_stream` method will skip lines:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/taggerfactory.py#L74-L75

But by that point the `dumpreader` has already spent time parsing the JSON, even for the skipped lines:

https://github.com/wetneb/opentapioca/blob/1a26df5328aeff18752e65a2d389e2cbd007c038/opentapioca/readers/dumpreader.py#L26-L33
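
One possible direction for a fix is to drop the raw lines before they reach the JSON parser. The sketch below is not the project's code: `iter_json_lines` and its `skip` argument are hypothetical names, and it assumes a dump with one JSON object per line, optionally wrapped in `[` / `]` lines with trailing commas.

```python
import io
import itertools
import json


def iter_json_lines(lines, skip=0):
    """Yield parsed JSON objects from an iterable of text lines.

    The first `skip` raw lines are discarded without being parsed
    (hypothetical helper, not part of opentapioca).
    """
    # islice consumes the skipped lines as plain strings; json.loads never sees them.
    for line in itertools.islice(lines, skip, None):
        line = line.strip().rstrip(",")
        # Ignore blank lines and the array brackets that may wrap the dump.
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)


# Tiny in-memory stand-in for a dump file.
sample = io.StringIO('[\n{"id": "Q1"},\n{"id": "Q2"},\n{"id": "Q3"}\n]\n')
docs = list(iter_json_lines(sample, skip=2))
```

Here `skip` counts raw lines, including a non-JSON line such as an opening `[`, so the offset may differ by one from a count of parsed documents.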

Until a fix for this lands, IMO a good workaround is to drop the leading lines with `tail` before piping them to the CLI, e.g.:

```
pbzip2 -c -d -p8 latest-all.json.bz2 | tail -n +879322
```

(Note that I am using `pbzip2` here, a multi-threaded bzip2 implementation.)

I also suggest switching to `orjson`, since it is [faster](https://pythonspeed.com/articles/faster-python-json-parsing/) than the standard-library `json`.
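
As a rough sketch of what the switch could look like, `orjson.loads` can stand in for `json.loads` when parsing a line. The fallback to `json` and the sample line are assumptions for illustration only; how the reader is actually wired up may differ.

```python
import json

try:
    import orjson  # third-party package: pip install orjson
    loads = orjson.loads
except ImportError:
    # Fall back to the standard library if orjson is not installed.
    loads = json.loads

# Made-up sample line, shaped loosely like a dump entry.
line = '{"id": "Q42", "labels": {"en": "Douglas Adams"}}'
doc = loads(line)
```

Both functions accept `str` or `bytes` input and return plain Python objects, so callers of `loads` should not need to change.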
