LV node migration #314

Open
pschichtel opened this issue Jun 14, 2024 · 15 comments
Assignees
Labels
Backlog enhancement New feature or request to-be-scoped Need scoping

Comments

@pschichtel

pschichtel commented Jun 14, 2024

Describe the problem/challenge you have

I'm hosting various clustered and stateful applications in kubernetes. Some of these applications, like databases and message queues, require low-latency IO to perform well; that's why I use local PVs for them, which works great. This way I can put very fast SSDs into these servers and use them without network overhead.

My only pain point with this setup is (unsurprisingly) that the pods, once scheduled, are pinned to their node forever. The only way to move the pod is to delete both the PVC and the pod and hope that the scheduler doesn't decide to put it back onto the same node (sure, this can be helped with node selectors, affinities, anti-affinities and taints, but that's even more complexity). An additional problem, possibly more serious depending on the application, is that node failures can't be recovered from automatically. Even if the application is able to restore its state from the remaining peers in its cluster, kubernetes won't run the pod because it's pinned to a node that's unavailable.

Describe the solution you'd like

Currently, at least as I understand it, kubernetes schedules the pod like this (simplified):

  • if volumeBindingMode is WaitForFirstConsumer, then k8s places the pod and then requests a PV
  • if volumeBindingMode is Immediate, then k8s places the pod on a node that can access the PV

The former means that lvm-localpv will create an LV on the node that's selected for the pod; the latter means k8s places the pod on the single node that carries the LV, which has been eagerly created. Either way, it ends with a pod pinned to a node.

What I would love to see is to make an LV available to all nodes in the cluster independent of where it is physically placed. If the LV is already allocated on a node and kubernetes happens to pick a different node, then just create a new LV on the new node, transfer the LV content over the network and delete the old LV. If the LV does not exist already, then it can simply be created on the node that was picked.

That would obviously delay pod startup significantly, depending on the size of the volume, and it might require a dedicated high-bandwidth network for the transfer so as not to interrupt other communication in the kubernetes cluster. But for application clusters that are highly redundant and can cover a failed replica for a prolonged period, this could be perfectly fine.
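
To get a feel for that startup delay, a rough estimate is the volume size divided by the usable network throughput. The sketch below is only a back-of-the-envelope model: it assumes the network link (not disk throughput) is the bottleneck, and the function name, the 80% efficiency factor and the example sizes are illustrative assumptions, not measurements.

```python
def estimate_transfer_seconds(volume_bytes: int, link_gbit_per_s: float,
                              efficiency: float = 0.8) -> float:
    """Rough lower bound on the time to copy a volume over the network.

    Assumes the network link is the bottleneck (not disk throughput) and
    that only `efficiency` of the nominal bandwidth carries payload.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return volume_bytes * 8 / usable_bits_per_s


# Illustrative volume sizes on a 10 Gbit/s link (made-up example values).
for size_gb in (100, 1000, 10000):
    seconds = estimate_transfer_seconds(size_gb * 10**9, link_gbit_per_s=10)
    print(f"{size_gb:>6} GB -> {seconds / 60:.1f} min")
```

Under these assumptions, a 1 TB volume on a 10 Gbit/s link takes about 1000 seconds (roughly 17 minutes), which is long for a pod start but tolerable for a cluster that can run with one replica missing for a while.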

And actually this could go one step further: Assuming that the application can restore its state from peers in its cluster, a feasible LV migration strategy would be to create a new empty LV without transferring data and let the application do the "transfer".

I could imagine this as a StorageClass option like dataMigrationMode with values:

  • Disabled (default): current behavior: pin the application to the node with the LV
  • Application: Just delete the LV on the old node and create a new one on the new node and let the application handle the migration
  • VolumeTransfer: Create a new LV and transfer data to it before mounting it.
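
The three modes above can be sketched as a small decision function. This is only an illustration of the proposal: `DataMigrationMode`, `plan_steps` and the step strings are hypothetical names, not an existing lvm-localpv option or API.

```python
from enum import Enum
from typing import List, Optional


class DataMigrationMode(Enum):
    """Values proposed in this issue; not an existing lvm-localpv option."""
    DISABLED = "Disabled"
    APPLICATION = "Application"
    VOLUME_TRANSFER = "VolumeTransfer"


def plan_steps(mode: DataMigrationMode, lv_node: Optional[str],
               pod_node: str) -> List[str]:
    """Return the ordered steps a provisioner could take (hypothetical)."""
    if lv_node is None:
        # No LV yet: simply create it where the pod was scheduled.
        return [f"create LV on {pod_node}"]
    if lv_node == pod_node:
        # The LV already sits on the chosen node: nothing to migrate.
        return []
    if mode is DataMigrationMode.DISABLED:
        # Current behavior: the pod may only run where the LV lives.
        raise ValueError(f"pod is pinned to {lv_node}")
    if mode is DataMigrationMode.APPLICATION:
        # The application re-syncs its own state from its peers.
        return [f"delete LV on {lv_node}", f"create empty LV on {pod_node}"]
    # VOLUME_TRANSFER: copy the data before the volume is mounted.
    return [
        f"create LV on {pod_node}",
        f"transfer data from {lv_node} to {pod_node}",
        f"delete LV on {lv_node}",
    ]


print(plan_steps(DataMigrationMode.VOLUME_TRANSFER, "node-1", "node-2"))
```

Note that in the VolumeTransfer branch the old LV is deleted last, so a failed transfer never leaves the volume without any copy of the data.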

Anything else you would like to add:

While the VolumeTransfer option would be awesome, it is also probably quite involved. So being able to just get a new LV on a new node would probably be easier. I guess this also requires applications to be well behaved and deployments to be well configured, so as not to accidentally delete all the data during a rolling upgrade.

@avishnu
Member

avishnu commented Jun 14, 2024

Thanks for detailing this very clearly.
Have you considered using replicated storage like OpenEBS Mayastor for the use case you have described above? You could set the replica count to 2, which means the volume target (NVMe-based) will write synchronously to 2 replica endpoints. If one of the replica nodes goes down or becomes unavailable/unreachable, Mayastor will reconcile by spinning up a new replica automatically and starting a rebuild. This is transparent to the stateful application pod.

@pschichtel
Author

pschichtel commented Jun 14, 2024

I did consider it, yes. What wasn't clear to me there: Do I get the guarantee that either pods will only be placed on nodes with existing replicas, or replicas will automatically be moved to the pod? So, can I be sure that IO is always local? Because in some situations it would be worse to have an application member with significantly worse IO latency in the cluster than to just not have the member available.

@pschichtel
Author

Also: I assume mayastor guarantees consistency between replicas, which forces some write overhead, because the write must be replicated to at least one other replica. Not sure if async/eventually consistent replication is supported.

@pschichtel
Author

pschichtel commented Jun 14, 2024

So my priority for these deployments is good write latency, which makes synchronously replicated storage basically a no-go. Async replication would be viable to speed up application recovery, as the application may not need to start its recovery from scratch.

I see these types of applications:

  1. Applications that cannot restore member state at all (I don't have an example here)
  2. Applications that can restore some state (qdrant can restore its data, but it must retain its cluster membership state)
  3. Applications that can restore their entire state (postgres replicas, hashicorp vault, MinIO, ....)

Applications of type 1 would need a full copy of the old volume to restore a member. In case of a node failure that would not be possible. So these would need synchronous replication anyway, but I don't know of an example of an application like this and I'm not convinced there exists one.

Applications of type 2 would need some parts of the state to be able to restore the rest. I guess this will usually be some form of cluster membership/peer information, similar to what qdrant does. These applications need some form of replication to recover from a node failure, though since these parts of the state don't see as much IO, it might be fine to use async replication. I could also imagine splitting up the volume into a part that uses sync replication with Mayastor and a part that uses localpv for the best latency. That would effectively turn part of the application into type 3 and part of it into type 1.

Applications of type 3 can restore cluster membership without any state, so these would be fine with simply deleting the state entirely. They might benefit from async replication to be able to start migrated pods quicker, but they don't need it. Imagine a MinIO node with terabytes of data.

So considering node failure, only type 3 applications would be able to work without some form of replication and these are the applications that are also fine with just deleting the PV.

I think with this in mind this feature request could be reduced to a simple option to disable the LV <--> node pinning that's currently happening. Mayastor or some other replicated storage system would be required for the other application types anyway.
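
The classification above can be condensed into a small lookup. The sketch below only restates the reasoning of this comment; the enum and function names are hypothetical, and the returned strings are summaries rather than actual configuration values.

```python
from enum import Enum


class RestoreCapability(Enum):
    """The three application types from the list above."""
    NONE = 1     # type 1: cannot restore member state at all
    PARTIAL = 2  # type 2: can restore data but needs e.g. membership state
    FULL = 3     # type 3: can restore its entire state from peers


def node_failure_requirement(capability: RestoreCapability) -> str:
    """Summarize what storage must provide to survive a node failure."""
    if capability is RestoreCapability.NONE:
        return "synchronous replication of the whole volume"
    if capability is RestoreCapability.PARTIAL:
        return "replication of the membership state (async may suffice)"
    # Type 3 can start from an empty volume and re-sync from its peers.
    return "no replication needed; a fresh empty LV is enough"


for capability in RestoreCapability:
    print(capability.name, "->", node_failure_requirement(capability))
```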

@avishnu
Member

avishnu commented Jun 18, 2024

> Also: I assume mayastor guarantees consistency between replicas, which forces some write overhead, because the write must be replicated to at least one other replica. Not sure if async/eventually consistent replication is supported.

Yes, Mayastor, being a block storage solution, needs to maintain strict consistency between all replicas of a volume, which calls for synchronous replication.
If a replica lags behind due to a temporary or permanent fault, a rebuild process is triggered alongside the current write ops. Once the rebuild completes and an eventually consistent state is reached, the rebuilt replica falls back to synchronous replication.
Another thing to note is that the replication is parallel, not sequential: write ops are written to all replicas simultaneously, so the overhead is limited by the network bandwidth.
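
The difference between parallel and sequential replication can be illustrated with a simplified latency model. This is not Mayastor's implementation: it ignores bandwidth contention and rebuild traffic, and the per-replica latencies are made-up example values.

```python
from typing import Sequence


def sequential_write_latency(replica_latencies_ms: Sequence[float]) -> float:
    """Replicas are written one after another: the latencies add up."""
    return sum(replica_latencies_ms)


def parallel_write_latency(replica_latencies_ms: Sequence[float]) -> float:
    """Replicas are written at the same time: the slowest replica decides."""
    return max(replica_latencies_ms)


# Hypothetical numbers: one local replica and one replica across the network.
latencies_ms = [0.1, 0.5]
print("sequential:", sequential_write_latency(latencies_ms), "ms")
print("parallel:  ", parallel_write_latency(latencies_ms), "ms")
```

Even in the parallel case, a synchronous write is acknowledged only once the slowest replica has it, so a remote replica still sets the floor for write latency.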

@niladrih
Member

Related issue: openebs/dynamic-localpv-provisioner#87

@Alex130469

When I was reading through the initial requirements, I wondered if Mayastor wouldn’t be able to do the trick.

Let’s say a pod is running on node 1 with a locally attached PV and you want to transition to node 2, where no PV is currently present locally. Then with Mayastor you should be able to start the pod on node 2 with an NVMe/TCP connection back to node 1, add a replica on node 2, and, once it is in sync with the replica on node 1, retire the replica on node 1 so that the pod accesses the data on node 2 locally again.

I think we can add/remove replicas to a Mayastor PV at any time and so should give you the mobility you are looking for.
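
The sequence described above can be sketched as a toy state model. `ToyVolume` and its methods are hypothetical stand-ins used only to show the order of operations; they are not a Mayastor API, and replica synchronization is treated as instantaneous.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyVolume:
    """Hypothetical stand-in for a replicated volume; not a Mayastor API."""
    replica_nodes: Set[str]
    pod_node: str
    log: List[str] = field(default_factory=list)

    def replica_is_colocated(self) -> bool:
        # Data is on the pod's node only if a replica lives there.
        return self.pod_node in self.replica_nodes

    def move_pod(self, node: str) -> None:
        self.pod_node = node
        self.log.append(f"pod on {node}, colocated: {self.replica_is_colocated()}")

    def add_replica(self, node: str) -> None:
        # A real system would rebuild here; the toy model syncs instantly.
        self.replica_nodes.add(node)
        self.log.append(f"replica added and synced on {node}")

    def remove_replica(self, node: str) -> None:
        if len(self.replica_nodes) == 1:
            raise ValueError("refusing to remove the last replica")
        self.replica_nodes.remove(node)
        self.log.append(f"replica removed from {node}")


vol = ToyVolume(replica_nodes={"node-1"}, pod_node="node-1")
vol.move_pod("node-2")        # pod reads and writes over the network
vol.add_replica("node-2")     # second replica catches up
vol.remove_replica("node-1")  # data is next to the pod again
print("\n".join(vol.log))
```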

@cbcoutinho

I can confirm this works - using Mayastor to replicate/migrate data within a cluster and remove disks from a pool without any downtime.

I decommissioned the node by first increasing the replication factor of the PVs to 2, forcing another node to replicate the data, and finally draining the original node. Unfortunately, "deleting" the DiskPool in mayastor is not technically straightforward, as it requires some etcd hackery as described in this comment: openebs/mayastor#1656 (comment)

@pschichtel
Author

So running mayastor with replication factor 1 guarantees local/non-network IO ?

@avishnu
Member

avishnu commented Sep 23, 2024

> So running mayastor with replication factor 1 guarantees local/non-network IO ?

There are a few things one needs to be aware of:

  • A Mayastor volume comprises an NVMe-oF over TCP target and one or more replicas, as specified by the replication factor. The target, or nexus, is responsible for routing the application read/write I/Os to the replicas and ensuring consistency.
  • With replication factor 1, Mayastor will make a 'best effort' to place the replica and target nexus local to the node where the application pod is scheduled. However, this is subject to the node fulfilling all the scheduling constraints like topology, affinity and available capacity.
  • Mayastor by design provides SAN block storage. The target nexus is exposed as an NVMe-oF over TCP/IP endpoint listening for connections on the host IPv4 address. The connection between the initiator (the node where the application is running) and the nexus goes over the host IPv4 network, even if they are co-located on the same node. So the I/Os will be network local, but not non-network.
  • There is currently no workflow for automatic migration as mentioned in the above comment LV node migration #314 (comment)

@pschichtel
Author

Thanks for the detailed explanation. In that case mayastor wouldn't serve my use-case.

@avishnu
Member

avishnu commented Sep 23, 2024

> Thanks for the detailed explanation. In that case mayastor wouldn't serve my use-case.

May I ask the reason?

@pschichtel
Author

@avishnu as I wrote in the first section of the issue description: I'd like to avoid network IO for applications that require high IO performance and replicate state themselves. I would also prefer a node of e.g. a database cluster not to run rather than having it run with bad performance. So all the features that probably make mayastor great as a replicated block store don't help here.

@tiagolobocastro
Contributor

In case the application is scheduled on storage nodes and with a replication count of 1, then the volume stack is all placed on the same node.
So far so good, but now the application also needs to connect to the volume.
This is where we use nvme-tcp to connect the Linux kernel initiator to our volume target. The target is on the same node, so although it's still a tcp connection, it's not going out over the wire, so to speak.
I would suggest giving it a whirl and seeing how it works out. Feedback, good or bad, would be welcome :)

We also have UBLK on the roadmap, which would allow us to connect to the volume in a more efficient way and without requiring a tcp connection, in fact no network at all.

@tiagolobocastro
Contributor

> I can confirm this works - using Mayastor to replicate/migrate data within a cluster and remove disks from a pool without any downtime.
>
> I decommissioned the node by first increasing the replication factor of the PVs to 2, forcing another node to replicate the data, and finally draining the original node. Unfortunately, "deleting" the DiskPool in mayastor is not technically straightforward, as it requires some etcd hackery as described in this comment: openebs/mayastor#1656 (comment)

Awesome, thanks for the feedback @cbcoutinho!
We're currently going through all bugs/feature requests, and I'm definitely pushing for DiskPool deletion and move :)


6 participants