
Read upload files using read(CHUNK_SIZE) rather than iter(). #1948

Merged
lovelydinosaur merged 4 commits into master from cap-upload-chunk-sizes
Nov 22, 2021

Conversation

@lovelydinosaur
Contributor

Resolves #1911

When digging into this, it turns out that when sending an upload file we're using iter(file_obj), which iterates line by line and can yield extremely large chunks for binary files. That's slow because you don't really end up streaming the file to the network at all, but rather batching it all up in memory first.

The low-hanging fruit here is to cap the size of the chunks we send at a maximum of 64 KiB, which from a bit of prodding seems to be a fairly decent value.

It's possible that using .read() on a stream if it exists might be beneficial too, but I've not dug into that yet.
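To illustrate the problem and the fix being described: a binary blob with no newlines comes out of line iteration as one giant chunk, and a re-chunking wrapper caps what gets buffered per write. This is only a minimal sketch of the capping idea, not httpx's actual implementation; `rechunk` is a hypothetical helper name.

```python
import io

CHUNK_SIZE = 64 * 1024  # the 64 KiB cap discussed above


def rechunk(byte_iter, chunk_size=CHUNK_SIZE):
    """Yield chunks no larger than chunk_size from an arbitrary byte iterator."""
    buffer = b""
    for part in byte_iter:
        buffer += part
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer


# A 200 KiB binary blob with no newlines: line iteration yields one huge chunk...
blob = io.BytesIO(b"\x00" * (200 * 1024))
raw_chunks = list(iter(blob))

# ...whereas re-chunking caps every chunk at 64 KiB.
capped = list(rechunk(iter(io.BytesIO(b"\x00" * (200 * 1024)))))
```

Here `raw_chunks` is a single 200 KiB chunk, while `capped` contains three 64 KiB chunks plus an 8 KiB remainder.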

@lovelydinosaur lovelydinosaur added the perf Issues relating to performance label Nov 22, 2021
@lovelydinosaur
Contributor Author

Okay, this makes even more sense:

  • Use read(CHUNK_SIZE) directly when available.
  • Otherwise use the iterator interface.

@lovelydinosaur lovelydinosaur changed the title Cap upload chunk sizes Read upload files using read(CHUNK_SIZE) rather than iter(). Nov 22, 2021
@lovelydinosaur lovelydinosaur merged commit 6f5865f into master Nov 22, 2021
@lovelydinosaur lovelydinosaur deleted the cap-upload-chunk-sizes branch November 22, 2021 13:15
@lovelydinosaur lovelydinosaur mentioned this pull request Jan 5, 2022
@andrewshadura

andrewshadura commented Feb 23, 2022

@tomchristie, for some reason I’m still observing this behaviour when using content= with a file-like object with 0.22.0. I assumed this was only fixed for multipart uploads, but it seems this new code should also be used in this case?

@lovelydinosaur
Contributor Author

Can you show me where you mean?

@andrewshadura

andrewshadura commented Feb 23, 2022

I have this code:

from aiofiles.tempfile import NamedTemporaryFile

async with NamedTemporaryFile() as tmpfile:
    debug(f"Buffering into {tmpfile.name}")
    async for data in request.body:
        await tmpfile.write(data)

    await tmpfile.seek(0)
    debug(f"Uploading to {uri} from {tmpfile.name}")
    return await client.post(uri, content=tmpfile, …)

(this uses aiofiles.tempfile)
and even with 0.22.0 I can see data being uploaded in tiny chunks, except, as I understand it, when the disk cache kicks in:

…
uploading 37 bytes
uploading 44 bytes
uploading 2 bytes
uploading 8 bytes
uploading 13 bytes
uploading 21 bytes
uploading 1820 bytes
uploading 2056 bytes
uploading 5129 bytes
uploading 1012 bytes
uploading 7 bytes
uploading 17 bytes
uploading 65476 bytes
uploading 262144 bytes
uploading 213018 bytes

@andrewshadura
Copy link

andrewshadura commented Feb 23, 2022

Oh, I see, aiofiles has methods named differently from what your code expects.
Wait, aread is a method of AsyncByteStream; it needs only __aiter__ from the underlying file object, so it's supposed to work?
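The dispatch in question can be sketched as follows. This is a hedged illustration of the "prefer read(), else iterate" pattern in async form, not httpx's actual code; `astream` and `FakeAsyncFile` are hypothetical names, with `FakeAsyncFile` standing in for an aiofiles handle (which exposes an async `read` but whose method set differs from a plain file's):

```python
import asyncio

CHUNK_SIZE = 64 * 1024


async def astream(file_obj, chunk_size=CHUNK_SIZE):
    """Async variant: use read(chunk_size) if available, else async-iterate."""
    if hasattr(file_obj, "read"):
        while True:
            chunk = await file_obj.read(chunk_size)
            if not chunk:
                break
            yield chunk
    else:
        async for chunk in file_obj:
            yield chunk


class FakeAsyncFile:
    """Stand-in for an aiofiles handle: async read(), no custom iteration."""

    def __init__(self, data):
        self._buf = data

    async def read(self, n):
        chunk, self._buf = self._buf[:n], self._buf[n:]
        return chunk


async def main():
    return [c async for c in astream(FakeAsyncFile(b"a" * 100_000))]


chunks = asyncio.run(main())
```

Because `FakeAsyncFile` has a `read` attribute, the bounded-read path is taken and no chunk exceeds 64 KiB.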

@andrewshadura

@tomchristie, any ideas?

@lovelydinosaur
Contributor Author

Rather than dig into this myself, let me point you in the right direction towards figuring it out.
It might take a little longer, but I'm sure it'll be a valuable approach.

First up, you've given me a partial example. Can you give me a complete reproduction, made absolutely as simple as you can? Ideally I ought to be able to copy and paste your example and see the behaviour you're describing.

(Once we've got that we'll work through the next steps...)

