Game out a plan for a 1.1 format #198

Open
cgwalters opened this issue Sep 22, 2023 · 24 comments
@cgwalters
Contributor

Let's assume 1.0 is released, and we discover something like a notable performance issue with the 1.0 format. Or maybe it's actually broken in an important corner case on big-endian (s390x) - something like that.

Say this is important enough to do a 1.1.

I think the way this would need to work is basically that we add support for e.g. --format=1.1 to the CLI/API, and then generate both digests.

We need to think through and verify that a scenario like this would work:

  • Build server is updated with composefs 1.1 support
  • Client system (e.g. ostree/container tooling/RAUC/whatever) only has 1.0 support
  • Client fetches metadata (OCI image, whatever) that has both digests
  • Client ignores the 1.1 digest, synthesizes a 1.0 format, and successfully verifies it against the 1.0 digest
  • Client later updates to tooling (podman/ostree/RAUC) that supports 1.1
  • Client synthesizes a 1.1 EROFS, and verifies using that digest

Right?
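A minimal sketch of the client-side negotiation implied by that scenario; every name here is invented for illustration, not a real composefs API:

```c
/* Hypothetical sketch only: pick the newest format that both the image
 * metadata and the local tooling understand.  Older digests stay in
 * the metadata, so a 1.0-only client keeps verifying against 1.0. */
#include <stdbool.h>

enum cfs_format { CFS_FORMAT_1_0, CFS_FORMAT_1_1 };

static enum cfs_format pick_format(bool metadata_has_1_1, bool tool_has_1_1)
{
	return (metadata_has_1_1 && tool_has_1_1) ? CFS_FORMAT_1_1
						  : CFS_FORMAT_1_0;
}
```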

cgwalters added this to the 1.0 milestone Sep 22, 2023
@cgwalters
Contributor Author

It may actually be worth adding a stub 1.1 format now, with some trivial change, as a hidden option just to really test things out.

@alexlarsson
Collaborator

There is already a uint32_t version in struct lcfs_write_options_s for this particular reason. But you're right, we should plumb it to the CLI and actually test it.
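A minimal sketch of that plumbing, assuming the existing lcfs_write_to() entry point; the meaning of version = 1 here is hypothetical:

```c
#include <lcfs-writer.h>

/* Sketch: the version field already exists in the options struct;
 * wiring it up to a --format=1.1 CLI flag is the part being proposed. */
static int write_image(struct lcfs_node_s *root)
{
	struct lcfs_write_options_s options = {
		.format = LCFS_FORMAT_EROFS,
		.version = 1, /* hypothetical: 0 = 1.0 layout, 1 = 1.1 layout */
	};
	return lcfs_write_to(root, &options);
}
```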

@alexlarsson
Collaborator

If we wanted a trivial optional feature, we could have one that skips the 00-ff whiteouts in the image. That means it's only going to work well (i.e. the basedir would not be visible) with kernels that have data-only overlayfs layers, but for those it would be more efficient.

@allisonkarlitskaya
Collaborator

Another potential wishlist item for a trivial change to make things more efficient: a more aggressive list of xattr prefixes. We should really have "prefixes" covering the complete length of all of the overlayfs xattrs we output.

@alexlarsson
Collaborator

The use of custom prefixes would be nice, but it does bump up the kernel requirements to 6.4.
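For context, this is roughly what the long xattr name prefix support added in kernel 6.4 looks like on disk (abridged from fs/erofs/erofs_fs.h; quoted from memory, so treat the details as approximate):

```c
/* A long prefix is a short prefix index plus an arbitrary infix, so
 * e.g. "trusted.overlay.redirect" can be stored as one prefix index
 * plus the short remainder "redirect". */
struct erofs_xattr_long_prefix {
	__u8 base_index;	/* short xattr name prefix index, e.g. trusted */
	char infix[];		/* rest of the prefix, e.g. "overlay." */
};

/* The superblock locates the prefix table via:
 *	__u8  xattr_prefix_count;
 *	__le32 xattr_prefix_start;
 */
```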

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Dec 17, 2024

Having implemented a second erofs writer, this is something like my list of proposed changes for a composefs erofs 1.1 file format:

  • no compact inodes
    • store a 1970 superblock mtime; or
    • use the root dir mtime; or
    • use the newest file?
  • set the 32-bit compatible ino field equal to the 64-bit nid value
  • regular files and symlinks can only be inlined
    • limit these to 2k (or 1k)
    • we use blocks only for directories
  • consider dropping our inlining limit to only cover really small directories (to help keep inodes small and compacted together)
  • fix the size calculation for directories to report the actual size of the last block (when not inlining) instead of rounding up to the block size multiple
  • very clearly specify the inode alignment-adjustment algorithm for ensuring that inline data doesn't cross block boundaries (see the sketch after this list)
  • don't treat symlinks differently for alignment
  • use chunk format 31 for all sparse files
  • don't use a start offset for xattrs
  • don't try to calculate start offset for inodes (it's always 0)
  • don't store the whiteouts for 00-ff (requires data layer)
  • consider outright banning the presence of (0, 0) character devices
    • the escaping is extremely annoying (affecting all parent directories)
  • use custom xattr prefixes for all of our xattrs
    • metacopy
    • redirect
    • whiteout-related (still needed for c 0 0 files)
    • selinux
    • possibly also ones we see from the user?
  • change inode order:
    • directory and then all children, but depth first
    • current order is all children first and then descend
    • ideally we'd write directories after children but this is complicated by root_nid needing to be small. we could special-case it.
      • erofs is considering adding an extension for larger root nid
  • pin down sort order for shared xattrs
    • prefix index, then suffix, then value
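Sketching the alignment-adjustment rule mentioned above, since it's the item that most needs pinning down; the names and padding strategy here are illustrative only:

```c
/* Hypothetical sketch: choose a position for an inode so that its
 * inline tail never crosses a block boundary. */
#include <stdint.h>

#define BLOCK_SIZE 4096u
#define SLOT 32u	/* erofs addresses inodes in 32-byte slots */

static uint64_t place_inode(uint64_t pos, uint32_t isize, uint32_t tail)
{
	pos = (pos + SLOT - 1) & ~(uint64_t)(SLOT - 1);	/* slot-align */
	if (tail) {
		uint64_t tail_start = pos + isize;
		uint64_t tail_end = tail_start + tail - 1;
		/* If the tail would straddle a boundary, push the inode
		 * forward so it ends exactly at the next block start. */
		if (tail_start / BLOCK_SIZE != tail_end / BLOCK_SIZE)
			pos = (tail_start / BLOCK_SIZE + 1) * BLOCK_SIZE
			      - isize;
	}
	return pos;
}
```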

@cgwalters
Contributor Author

Thanks, that's a good list!

no compact inodes

Did you mean no extended inodes?

@allisonkarlitskaya
Collaborator

Did you mean no extended inodes?

No. Compact inodes don't have an mtime field, which means we need extended inodes. If you write a compact inode then the mtime is equal to the mtime set in the superblock, which means that we basically get to write a single compact inode in the general case*, and the rest of them will be extended. It just seems like it's not worth the trouble.

  • the general case being that all files have different mtimes. In case we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact; libcomposefs looks at the oldest file in the system and uses that.

@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.
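For reference, the on-disk difference driving the "no compact inodes" item (abridged from the kernel's fs/erofs/erofs_fs.h; quoted from memory, so treat field details as approximate):

```c
/* 32 bytes; no timestamp, mtime is inherited from the superblock. */
struct erofs_inode_compact {
	__le16 i_format;
	__le16 i_xattr_icount;
	__le16 i_mode;
	__le16 i_nlink;
	__le32 i_size;
	__le32 i_reserved;
	__le32 i_u;
	__le32 i_ino;
	__le16 i_uid;
	__le16 i_gid;
	__le32 i_reserved2;
};

/* 64 bytes; carries its own mtime, which is what forces us here. */
struct erofs_inode_extended {
	__le16 i_format;
	__le16 i_xattr_icount;
	__le16 i_mode;
	__le16 i_reserved;
	__le64 i_size;
	__le32 i_u;
	__le32 i_ino;
	__le32 i_uid;
	__le32 i_gid;
	__le64 i_mtime;
	__le32 i_mtime_nsec;
	__le32 i_nlink;
	__u8   i_reserved2[16];
};
```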

@hsiangkao
Contributor

hsiangkao commented Dec 18, 2024

  • the general case being that all files have different mtimes. In case we have a container image where a large chunk of the files have exactly the same mtime, all of those could be written as compact; libcomposefs looks at the oldest file in the system and uses that.

@hsiangkao is looking at adding a way to put mtime into compact inodes as a 32-bit relative offset to the value stored in the superblock (ie: the superblock time becomes an epoch). That would let you capture a moderately-sized range of values of mtimes that are close together (which is likely to cover a lot of cases we see in practice) instead of it being an all-or-nothing affair. I don't expect this feature to land in the kernel soon enough for us to be able to use it any time soon, though.

Yes, currently the EROFS core on-disk format is still the same as the initial version. I'm considering gathering all the new ideas and requirements to produce a revised on-disk format in a completely compatible way (and there shouldn't be any major change).

But I'd prefer to land these on-disk changes in one single kernel version (in other words, avoid changes scattered across several versions, which is bad for all mkfs implementations), so I think I will sort them out in 2025. I will invite everyone interested to review these changes so we get a nicer solution for all use cases.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Dec 18, 2024

It occurs to me that the current order used by libcomposefs is harder to implement but probably has performance benefits. Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the ls -l case. Doing a depth-first approach would result in substantially more scattering. I think I'd probably rescind this recommended change.

Another proposal in terms of keeping inodes tightly packed, though (after some IRC conversation with @hsiangkao): it might be nice to substantially decrease the amount of inlining we do and then try our hardest to make sure that we always fit complete inodes into blocks. This means that fstat() of an O_PATH file descriptor would always be a single-page operation. The only place we might get into trouble is with a large amount/size of xattrs, but we could treat that case as degenerate.

We might also try to take a more holistic approach to allocating inodes within a single directory so that they all fit into a single page. This is getting into substantially more complicated territory, though, so it might make sense to take a pass on it. As it is, the current ordering that libcomposefs employs is already pretty good.

We could also make inlining dependent on the alignment we find ourselves at when we go to write the inode. For example: if we see that we could write a 2k inline section without inserting additional padding, just go ahead and do it. If not, then write the inode "flat plain" and store the data in a block. We might come up with some sort of more dynamic approach weighing "amount of padding we'd require" against "amount of space we'd waste by shoving the data into a block", with a heavy preference for avoiding additional padding in the inode area, but this is again starting to sound a bit too complicated for my tastes. We might also say more static things like "we always inline things less than 128 (or 256) bytes, even if we have to insert padding", knowing that the amount of padding we'd have to insert will be small.
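A sketch of the static variant of that heuristic; BLOCK_SIZE and the 128-byte always-inline threshold are illustrative, not decided:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u

/* inode_end is the offset just past the inode (and xattrs); data_size
 * is the inline tail we are considering. */
static bool should_inline(uint64_t inode_end, uint32_t data_size)
{
	/* Always inline tiny tails: any padding needed to keep them in
	 * one block is small. */
	if (data_size <= 128)
		return true;
	/* Otherwise inline only if the tail fits in the current block
	 * without extra padding; else write the inode "flat plain" and
	 * put the data in a block. */
	return inode_end % BLOCK_SIZE + data_size <= BLOCK_SIZE;
}
```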

Another way we could keep inodes compact is to "share" large xattrs even if they're unique. And we could also make these decisions dynamically based on alignment and our ability to write the inode into a single block without padding. I suspect that there's again not too much benefit to be had here, though.

@cgwalters
Contributor Author

Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the ls -l case.

Yes my recollection here was that Alex was specifically benchmarking ls -lR because a targeted use case was booting as fast as possible, and for generic OSes we tend to have a lot of unused data.

@hsiangkao
Contributor

Having all of the inodes present in one directory always immediately adjacent to each other (and therefore likely sharing only one or a few blocks) is probably nice for the ls -l case.

Yes my recollection here was that Alex was specifically benchmarking ls -lR because a targeted use case was booting as fast as possible, and for generic OSes we tend to have a lot of unused data.

It might be better to inline top-level directories and symlinks anyway.
Also, ls -lR is a getdents() + stat() workload, but I guess typical booting workloads just open() + read() specific files: therefore inlining directory data is useful, since only directory data is needed for lookups.

@alexlarsson
Collaborator

  • no compact inodes

I think the origin of compact mtimes is that if you create new from-scratch images with mkfs.erofs they generally have the same mtime. It was also very useful for ostree images, where all files have mtime zero.

I agree though that we will want general support for full mtimes in the future, with OCI images as the base, so I agree here.

  • regular files and symlinks can only be inlined

Is there any specific reason for this? It does make alignment a lot trickier.
Also, we need blocks for the sparse files.

  • use chunk format 31 for all sparse files

Any particular reason?

  • consider outright banning the presence of (0, 0) character devices

This will mean that you cannot store a typical podman container store in the image. Yes, the escaping is painful, but it is there for a reason.

  • use custom xattr prefixes for all of our xattrs

This bumps the kernel erofs version requirements quite a lot.

  • change inode order:

Depth first will spread each directory's inodes around a lot. I think that will hurt performance.

@allisonkarlitskaya
Collaborator

allisonkarlitskaya commented Jan 10, 2025

* regular files and symlinks can only be inlined

Is there any specific reason for this? It does make alignment a lot trickier. Also, we need blocks for the sparse files.

I think if we keep the amount of data small (@cgwalters mentions that symlink targets longer than 1k are not portable anyway, plus we always store any file larger than 64 bytes externally) then we will sort of automatically get to a place where we don't need to inline large amounts of data.

Sparse files don't get any actual blocks because they always use the special -1 "nul" chunk pointer, and the pointers themselves are always stored inline.

* use chunk format 31 for all sparse files

Any particular reason?

Because it's easier. The calculations that mkcomposefs performs here to find the lowest possible working value have no benefit, so it's easier to just avoid them and use a hard-coded value. I asked @hsiangkao about this and he confirmed that there is no benefit to a lower number.
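For concreteness, the on-disk bits involved (abridged from the kernel's fs/erofs/erofs_fs.h; the hard-coded 31 is what's being proposed here):

```c
#define EROFS_CHUNK_FORMAT_BLKBITS_MASK	0x001f	/* log2 of blocks per chunk */
#define EROFS_CHUNK_FORMAT_INDEXES	0x0020
#define EROFS_NULL_ADDR			-1	/* chunk is a hole */

/* With chunk blkbits hard-coded to 31, a single inline 4-byte chunk
 * pointer set to EROFS_NULL_ADDR covers the entire sparse file. */
```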

* consider outright banning the presence of (0, 0) character devices

This will mean that you cannot store a typical podman container store in the image. Yes, the escaping is painful, but it is there for a reason.

Do we expect to have this as a usecase? Shouldn't podman rather be mounting composefs images directly instead of creating an overlayfs where the layers in that overlayfs are themselves an overlayfs? This seems a bit "too indirect" to me, to be honest...

Also: maybe there is some better way to accomplish this (ie: to get 0/0 chardevs visible)? Do you know of any?

* use custom xattr prefixes for all of our xattrs

This bumps the kernel erofs version requirements quite a lot.

I talked with Colin and I think we decided that we want to do something like this:

  • add a --format=1.1 flag (or similar) to mkcomposefs
  • enable the discussed changes when the 1.1 format is selected, with the understanding that a newer kernel will be required
  • move composefs-rs towards implementing the same 1.1 format

So that's our "meet in the middle". As for the required kernel version, I don't expect to approach this conservatively: I'd be completely OK with a 6.12 dependency, for example (since there were some recent fixes there with respect to handling of inline symlink targets that would allow us to avoid/disable workarounds in mkcomposefs). If you opt in to the 1.1 format then you agree that you require the newer kernel version. composefs-rs already requires kernel 6.12 (since it never uses a loopback device).

* change inode order:

Depth first will spread each directory's inodes around a lot. I think that will hurt performance.

As noted above, I agree with this. Currently libcomposefs behaviour is better here and composefs-rs will need to change.

@alexlarsson
Collaborator

alexlarsson commented Jan 13, 2025

* regular files and symlinks can only be inlined

Is there any specific reason for this? It does make alignment a lot trickier. Also, we need blocks for the sparse files.

I think if we keep the amount of data small (@cgwalters mentions that symlink targets longer than 1k are not portable anyway, plus we always store any file larger than 64 bytes externally) then we will sort of automatically get to a place where we don't need to inline large amounts of data.

Sparse files don't get any actual blocks because they always use the special -1 "nul" chunk pointer, and the pointers themselves are always stored inline.

This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these.

That said, we could maybe create one such area of nul chunk pointers that fits the largest file necessary, and then reference that for each sparse file, if reusing blocks works with erofs.

* consider outright banning the presence of (0, 0) character devices

This will mean that you cannot store a typical podman container store in the image. Yes, the escaping is painful, but it is there for a reason.

Do we expect to have this as a usecase? Shouldn't podman rather be mounting composefs images directly instead of creating an overlayfs where the layers in that overlayfs are themselves an overlayfs? This seems a bit "too indirect" to me, to be honest...

Not everything can use composefs. For example, non-root podman cannot ever use it. I think it would be a bad idea to break existing software because "escaping is tricky".

Also: maybe there is some better way to accomplish this (ie: to get 0/0 chardevs visible)? Do you know of any?

No, I did the work to implement the escaping in the kernel because that was the only possible approach.

* use custom xattr prefixes for all of our xattrs

This bumps the kernel erofs version requirements quite a lot.

I talked with Colin and I think we decided that we want to do something like this:

  • add a --format=1.1 flag (or similar) to mkcomposefs
  • enable the discussed changes when the 1.1 format is selected, with the understanding that a newer kernel will be required
  • move composefs-rs towards implementing the same 1.1 format

So that's our "meet in the middle". As for the required kernel version, I don't expect to approach this conservatively: I'd be completely OK with a 6.12 dependency, for example (since there were some recent fixes there with respect to handling of inline symlink targets that would allow us to avoid/disable workarounds in mkcomposefs). If you opt in to the 1.1 format then you agree that you require the newer kernel version. composefs-rs already requires kernel 6.12 (since it never uses a loopback device).

This will drop RHEL9 support, although I guess we can use version 1 for that.

@allisonkarlitskaya
Collaborator

This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these.

This is not my understanding. The chunk format is not the size of the chunk in bytes, but rather the size of the chunk in disk blocks. So a format of 31 means 2^31 blocks; on a 4k-block fs that's 4096 * 2^31 = ~8TiB per chunk pointer, which "ought to be enough for anybody". Even if it's not enough, we can inline up to ~1000 of those pointers, which would get us up to ~8PB.
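Spelling out that arithmetic (4k blocks, chunk format 31):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* chunk size = block size << chunk blkbits */
	uint64_t chunk = 4096ULL << 31;
	printf("%" PRIu64 "\n", chunk);	/* 8796093022208 bytes = 8 TiB */
	return 0;
}
```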

Not everything can use composefs. For example, non-root podman cannot ever use it. I think it would be a bad idea to break existing software because "escaping is tricky".

I think the idea we had here was to create an API that non-root users can call to request mounting of trusted composefs images (trusted = we created them ourselves) with appropriate constraints (application of nosuid, and/or a suitable uidmap, etc).

I also believe that we'll eventually (hopefully) be able to mount erofs images as non-root. @hsiangkao has already mentioned that they've enabled this flag on their internal kernel images...

I agree though that just throwing up our hands and saying "we can't do this!" is kinda lame, particularly when we already have a working implementation. You have a point that this might be desired, so I don't intend to arbitrarily disable that. I think I might just deal with this by leaving the chardev/0/0 case as a todo!() in composefs-rs and we can implement it if it becomes necessary...

@alexlarsson
Collaborator

This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these.

This is not my understanding. The chunk format is not the size of the chunk in bytes, but rather the size of the chunk in disk blocks. So a format of 31 means 2^31 blocks; on a 4k-block fs that's 4096 * 2^31 = ~8TiB per chunk pointer, which "ought to be enough for anybody". Even if it's not enough, we can inline up to ~1000 of those pointers, which would get us up to ~8PB.

Not sure which is true, @hsiangkao would know. But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.

Not everything can use composefs. For example, non-root podman cannot ever use it. I think it would be a bad idea to break existing software because "escaping is tricky".

I think the idea we had here was to create an API that non-root users can call to request mounting of trusted composefs images (trusted = we created them ourselves) with appropriate constraints (application of nosuid, and/or a suitable uidmap, etc).

I also believe that we'll eventually (hopefully) be able to mount erofs images as non-root. @hsiangkao has already mentioned that they've enabled this flag on their internal kernel images...

I agree though that just throwing up our hands and saying "we can't do this!" is kinda lame, particularly when we already have a working implementation. You have a point that this might be desired, so I don't intend to arbitrarily disable that. I think I might just deal with this by leaving the chardev/0/0 case as a todo!() in composefs-rs and we can implement it if it becomes necessary...

I mean, that is fine as long as it fails to produce an image in that case. Otherwise adding the feature will break backwards compat.

@hsiangkao
Contributor

This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these.

This is not my understanding. The chunk format is not the size of the chunk in bytes, but rather the size of the chunk in disk blocks. So a format of 31 means 2^31 blocks; on a 4k-block fs that's 4096 * 2^31 = ~8TiB per chunk pointer, which "ought to be enough for anybody". Even if it's not enough, we can inline up to ~1000 of those pointers, which would get us up to ~8PB.

Not sure which is true, @hsiangkao would know.

Currently almost all on-disk fields are represented in blocks, so @allisonkarlitskaya is correct here. I agree that for composefs sparse inodes it makes no difference to just use 2^31-block chunks for all files.

But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.

I don't get the point; how could we share these? (Chunk indexes themselves cannot be shared between inodes.) If one sparse chunk means 8TiB, I guess that's large enough; is it still necessary to consider index sharing?

@allisonkarlitskaya
Collaborator

Not sure which is true, @hsiangkao would know. But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.

I'm not even sure how we would do this "sharing"; as far as I know there's no difference between "inline chunk-based" and "block indirect chunk-based": there's just one "chunk based" format, and in that case the chunks are expected to be listed inline (with the number of inline indices determined by the st_size of the file divided by the effective chunk size, rounded up). If we got too many of them to store inline maybe it would switch to a block reference, but even then I don't understand how that would work, because the i_u field in the inode, which would normally hold the reference to a block, is already occupied by the chunk format.

Also: we're talking about a very very small amount of inline data here: as mentioned, every file less than 8TiB gets encoded into a single 4-byte nul chunk pointer (-1)...

@alexlarsson
Collaborator

This isn't quite right. If you pick chunk format 31 (as you proposed) for all files, then each special -1 chunk pointer will reference an empty chunk of 2^31 == 2GB of empty space. A file that is 2048 GB will then need 1024 such chunk pointers, and with each being 4 bytes that will fill an entire page of such chunk pointers. We can hardly inline these.

This is not my understanding. The chunk format is not the size of the chunk in bytes, but rather the size of the chunk in disk blocks. So a format of 31 means 2^31 blocks; on a 4k-block fs that's 4096 * 2^31 = ~8TiB per chunk pointer, which "ought to be enough for anybody". Even if it's not enough, we can inline up to ~1000 of those pointers, which would get us up to ~8PB.

Not sure which is true, @hsiangkao would know.

Currently almost all on-disk fields are represented in blocks, so @allisonkarlitskaya is correct here. I agree that for composefs sparse inodes it makes no difference to just use 2^31-block chunks for all files.

Ok, yeah, then we're ok with just always inlining one (or a few) chunk null pointers.

But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.

I don't get the point; how could we share these? (Chunk indexes themselves cannot be shared between inodes.) If one sparse chunk means 8TiB, I guess that's large enough; is it still necessary to consider index sharing?

Yeah, sorry. I was thinking we could use non-inline blocks for the chunk indexes, but that is not possible.

@hsiangkao
Contributor

hsiangkao commented Jan 14, 2025

...

But still, even a few of these seem unnecessary if we can share a single non-inlined one for all sparse files.

I don't get the point; how could we share these? (Chunk indexes themselves cannot be shared between inodes.) If one sparse chunk means 8TiB, I guess that's large enough; is it still necessary to consider index sharing?

Yeah, sorry. I was thinking we could use non-inline blocks for the chunk indexes, but that is not possible.

Yeah, anyway, I guess currently one chunk index (8TiB) is large enough, but if it's a real concern, I could make the FLAT_PLAIN format take a special NULL_ADDR (-1) to indicate the whole file is sparse too. But I'd prefer to land it upstream together with the other on-disk changes (e.g. timestamps for 32-bit on-disk inodes, 64-bit root_nid, 64-bit blkaddrs, etc.) in one kernel version to avoid fragmented on-disk updates.

I will try my best to address it this year.

@allisonkarlitskaya
Collaborator

I could make the FLAT_PLAIN format take a special NULL_ADDR (-1) to indicate the whole file is sparse too

NULL_ADDR for the i_u field, I guess you mean? That would indeed be a very cool thing to have.

@cgwalters
Contributor Author

Also related to a 1.2 format is #288 - I have forgotten all context behind that now but my recollection is basically "eww"

@hsiangkao
Contributor

I could make the FLAT_PLAIN format take a special NULL_ADDR (-1) to indicate the whole file is sparse too

NULL_ADDR for the i_u field, I guess you mean? That would indeed be a very cool thing to have.

Yes, but as I said, I will ship it in one single kernel version to avoid fragmented on-disk changes too.
