Game out a plan for a 1.1 format #198
It may actually be worth adding a stub 1.1 format now that has a trivial change as a hidden option, just to really test things out.
There is already a
If we wanted a stupid optional feature, we could have one that skips the 00-ff whiteouts in the image. That means it's only going to work well (i.e. the basedir would not be visible) with kernels that have data-only overlayfs layers, but for those it would be more efficient.
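For illustration, here is a minimal sketch of how the overlayfs lowerdir option differs between the two cases, assuming made-up paths and a kernel with data-only layer support:

```rust
// Hedged sketch, not the real composefs mount code: with data-only layers
// (the entries after a "::" separator), the object store's directory tree is
// never visible through the overlay, so the 00-ff whiteouts in the image
// become unnecessary.  Paths are illustrative only.
fn lowerdir_without_data_only(erofs_mnt: &str, objects: &str) -> String {
    // Object store is an ordinary lower layer: its top-level 00-ff
    // directories would show through unless the image whites them out.
    format!("lowerdir={erofs_mnt}:{objects}")
}

fn lowerdir_with_data_only(erofs_mnt: &str, objects: &str) -> String {
    // Object store is data-only: reachable only via redirects, so no
    // whiteouts are needed to hide it.
    format!("lowerdir={erofs_mnt}::{objects}")
}
```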
Other potential wishlist item for a trivial change to make things more efficient: a more aggressive list of xattr prefixes. We should really have "prefixes" covering the complete length of all of the overlayfs xattrs we output.
The use of custom prefixes would be nice, but it does bump up the kernel requirements to 6.4.
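As a rough illustration of what "prefixes for the complete length" could mean (the exact xattr set below is an assumption, not something the writer is committed to):

```rust
// Hedged sketch: candidate full-length xattr name prefixes for a 1.1 image
// using the kernel 6.4+ long xattr name prefix feature.  The list is
// illustrative; the real set would be whatever overlayfs xattrs the writer
// actually emits.
const LONG_XATTR_PREFIXES: &[&str] = &[
    "trusted.overlay.redirect",
    "trusted.overlay.metacopy",
    "trusted.overlay.verity",
];

/// With a full-length prefix registered, only the (usually empty) remainder
/// of the name needs to be stored per xattr entry, instead of everything
/// after the short built-in "trusted." prefix.
fn stored_name_suffix<'a>(name: &'a str, prefixes: &[&str]) -> &'a str {
    prefixes
        .iter()
        .find_map(|&p| name.strip_prefix(p))
        .unwrap_or(name)
}
```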
Having implemented a second erofs writer, this is something like my list of proposed changes for composefs erofs v1.1 file format:
Thanks, that's a good list!
Did you mean no extended inodes?
No. Compact inodes don't have an mtime field, which means we need extended inodes. If you write a compact inode then the mtime is equal to the mtime set in the superblock, which means that we basically get to write a single compact inode in the general case*, and the rest of them will be extended. It just seems like it's not worth the trouble.
@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (i.e. the superblock time becomes an epoch). That would let you capture a moderately-sized range of mtime values that are close together (which is likely to cover a lot of the cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.
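A minimal sketch of that idea (the field names and epoch handling here are assumptions, not the actual erofs on-disk encoding):

```rust
// Hedged sketch: treat the superblock build time as an epoch and check
// whether an inode's mtime can be expressed as a 32-bit offset from it,
// which is roughly what such a compact-inode extension would need.
fn mtime_as_compact_delta(sb_epoch_secs: i64, inode_mtime_secs: i64) -> Option<u32> {
    inode_mtime_secs
        .checked_sub(sb_epoch_secs)
        .and_then(|delta| u32::try_from(delta).ok())
}

// If the delta fits, the inode could in principle stay compact; otherwise it
// falls back to an extended inode with a full 64-bit mtime, exactly as today.
```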
Yes, currently the EROFS core on-disk format is still the same as the initial version. I'm considering gathering all the new ideas and requirements to refine a revised on-disk format in a completely compatible way (and there shouldn't be any major change). But I tend to land these on-disk changes in one single kernel version (in other words, avoiding changes scattered across several versions, which is bad for all mkfs implementations), so I think I will sort them out in 2025. I will invite everyone interested to review these changes, to get a nicer solution for all use cases.
It occurs to me that the current order used by libcomposefs is harder to implement but probably has performance benefits. Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the

Another proposal in terms of keeping inodes tightly packed, though (after some IRC conversation with @hsiangkao): it might be nice to substantially decrease the amount of inlining we do and then try our hardest to make sure that we always fit complete inodes into blocks. This means that

We might also try to take a more holistic approach to allocating inodes within a single directory so that they all fit into a single page. This is getting into substantially more complicated territory, though, so it might make sense to take a pass on it. As it is, the current ordering that libcomposefs employs is already pretty good.

We could also make inlining dependent on the alignment that we find ourselves in when we go to write the inode. For example: if we see that we could write a 2k inline section without inserting additional padding, just go ahead and do it (see the sketch below). If not, then write the inode "flat plain" and store the data in a block. We might come up with some sort of a more dynamic approach for "amount of padding we'd require" vs "amount of space we'd waste by shoving the data into a block", with a heavy preference for avoiding additional padding in the inode area, but this is again starting to sound a bit too complicated for my tastes. We might also say more static things like "we always inline things less than 128 (or 256) bytes, even if we have to insert padding", knowing that the amount of padding we'd have to insert will be small.

Another way we could keep inodes compact is to "share" large xattrs even if they're unique. And we could also make these decisions dynamically based on alignment and our ability to write the inode into a single block without padding. I suspect that there's again not too much benefit to be had here, though.
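A minimal sketch of the "inline only when it costs no padding" policy, assuming 4096-byte blocks (the real erofs alignment rules are more involved, so treat this as a model only):

```rust
// Hedged sketch of an alignment-driven inlining decision.  `pos` is the byte
// offset in the inode area where this inode's record would start, `inode_len`
// is the size of the record (inode + xattrs), and `data_len` is the tail
// data we are considering inlining.
const BLOCK_SIZE: u64 = 4096;

fn should_inline(pos: u64, inode_len: u64, data_len: u64) -> bool {
    let tail_start = (pos + inode_len) % BLOCK_SIZE;
    // Inline only if the tail data fits in the remainder of the block that
    // the inode record ends in, i.e. no extra padding and no block crossing
    // is needed just to accommodate the inline data.
    tail_start + data_len <= BLOCK_SIZE
}
```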
Yes, my recollection here was that Alex was specifically benchmarking
It might be better to inline directories of the top levels and symlinks anyway.
I think the origin of compact mtimes is that if you create new

I agree though that we will want to have general support for full
Is there any specific reason for this? It does make alignment a lot trickier.
Any particular reason?
This will mean that you cannot store a typical podman container store
This bumps the kernel erofs version requirements quite a lot.
Depth first will spread each directory's inodes around a lot. I think that will hurt performance.
I think if we keep the amount of data small (@cgwalters mentions that symlink targets longer than 1k are not portable anyway, plus we always store any file larger than 64 bytes externally) then we will sort of automatically get to a place where we don't need to inline large amounts of data. Sparse files don't get any actual blocks because they always use the special -1 "nul" chunk pointer, and the pointers themselves are always stored inline.
Because it's easier. The calculations that mkcomposefs performs here to find the lowest possible working value have no benefit, so it's easier to just avoid them and use a hard-coded value. I asked @hsiangkao about this and he confirmed that there is no benefit to a lower number.
Do we expect to have this as a use case? Shouldn't podman rather be mounting composefs images directly instead of creating an overlayfs where the layers in that overlayfs are themselves an overlayfs? This seems a bit "too indirect" to me, to be honest... Also: maybe there is some better way to accomplish this (i.e. to get 0/0 chardevs visible)? Do you know of any?
I talked with Colin and I think we decided that we want to do something like this:
So that's our "meet in the middle". As for the required kernel version, I don't expect to approach this conservatively: I'd be completely OK with a 6.12 dependency, for example (since there were some recent fixes there with respect to handling of inline symlink targets that would allow us to avoid/disable workarounds in mkcomposefs). If you opt in to the 1.1 format then you agree that you require the newer kernel version. composefs-rs already requires kernel 6.12 (since it never uses a loopback device).
As noted above, I agree with this. The current libcomposefs behaviour is better here and composefs-rs will need to change.
This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these. That said, we could maybe create one such area of chunk nul pointers that fits the largest file necessary, and then reference that for each sparse file. If reusing blocks works with erofs.
Not everything can use composefs. For example, non-root podman cannot ever use it. I think it would be a bad idea to break existing software because "escaping is tricky".
No, I did the work to implement the escaping in the kernel because that was the only possible approach.
This will drop RHEL9 support, although I guess we can use version 1 for that.
This is not my understanding. The chunk format is not the size of the chunk in bytes, but rather the size of the chunk in disk blocks. So a format of 31 means 2^31 blocks, which on a 4k-block fs is 4096 * 2^31 = ~8TB per chunk pointer, which "ought to be enough for anybody". Even if it's not enough, we can inline up to ~1000 of those pointers, which would get us up to ~8PB.
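To make the arithmetic concrete, here is a sketch of the claim above (not the actual writer code):

```rust
// Hedged sketch: with the chunk format expressed in blocks, the chunk size
// is block_size << chunk_bits, and a sparse file needs one 4-byte null chunk
// pointer per chunk.
fn null_chunk_pointers(st_size: u64, block_size: u64, chunk_bits: u32) -> u64 {
    let chunk_size = block_size << chunk_bits; // 4096 << 31 = 8 TiB
    st_size.div_ceil(chunk_size)
}

fn main() {
    // A sparse file of up to 8 TiB needs a single inline pointer...
    assert_eq!(null_chunk_pointers(8 << 40, 4096, 31), 1);
    // ...and ~1000 inline pointers (one 4k block's worth) cover ~8 PiB.
    assert_eq!(null_chunk_pointers(8 << 50, 4096, 31), 1024);
}
```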
I think the idea we had here was to create an API that non-root users can call to request mounting of trusted composefs images (trusted = we created them ourselves) with appropriate constraints (application of nosuid, and/or a suitable uidmap, etc). I also believe that we'll eventually (hopefully) be able to mount erofs images as non-root. @hsiangkao has already mentioned that they've enabled this flag on their internal kernel images... I agree though that just throwing up our hands and saying "we can't do this!" is kinda lame, particularly when we already have a working implementation. You have a point that this might be desired, so I don't intend to arbitrarily disable that. I think I might just deal with this by leaving the chardev/0/0 case as a
Not sure which is true, @hsiangkao would know. But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.
I mean, that is fine as long as it fails to produce an image in that case. Otherwise adding the feature will break backwards compat.
Currently almost all on-disk fields are represented in blocks, so @allisonkarlitskaya is correct here.
I don't get the point. How could we share these? (Chunk indexes themselves cannot be shared between inodes.)
I'm not even sure how we would do this "sharing" — as far as I know there's no difference between "inline chunk-based" and "block indirect chunk-based": there's just one "chunk based" format, and in that case the chunks are expected to be listed inline (with the number of inline indices determined by the st_size of the file divided by the effective chunk size, rounded up). If we got too many of them to store inline maybe it would switch to a block reference, but even then I don't understand how it would work because the

Also: we're talking about a very very small amount of inline data here: as mentioned, every file less than 8TiB gets encoded into a single 4-byte nul chunk pointer (-1)...
Ok, yeah, then we're ok with just always inlining one (or a few) chunk null pointers.
Yeah, sorry. I was thinking we could use non-inline blocks for the chunk indexes, but that is not possible.
...
Yeah, anyway, I guess currently one chunk index (8TiB) is large enough, but if it's a real concern, I could make

I will try my best to address it this year.
Also related to a 1.2 format is #288 - I have forgotten all the context behind that now, but my recollection is basically "eww".
Yes, but as I said, I will ship with one single kernel version to avoid fragmented on-disk changes too.
Let's assume 1.0 is released, and we discover something like a notable performance issue with the 1.0 format. Or maybe it's actually broken in an important corner case on big-endian (s390x) - something like that.
Say this is important enough to do a 1.1.
I think the way this would need to work is basically that we add support for e.g. `--format=1.1` to the CLI/API - and then we generate both digests. We need to think through and verify that a scenario like this would work:
Right?
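As a rough illustration (all names here are hypothetical, not the actual libcomposefs or composefs-rs API), the "generate both digests" step might look something like:

```rust
// Hedged sketch: a hypothetical writer entry point that is told which format
// was requested and always reports the 1.0 digest alongside the 1.1 one, so
// that older consumers can keep matching on the digest they already know.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Format {
    V1_0,
    V1_1,
}

struct ImageDigests {
    v1_0: String,         // digest of the image as a 1.0 writer would produce it
    v1_1: Option<String>, // only present when --format=1.1 was requested
}

fn compute_digests(requested: Format, write_image: impl Fn(Format) -> String) -> ImageDigests {
    ImageDigests {
        v1_0: write_image(Format::V1_0),
        v1_1: (requested == Format::V1_1).then(|| write_image(Format::V1_1)),
    }
}
```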