MergeTree is not a complete DDS by itself, but provides a reusable data structure for DDSes that must maintain a sequence of collaboratively edited items. MergeTree is used in both SharedSequence and SharedMatrix.
See GitHub for more details on the Fluid Framework and packages within.
The three basic operations provided by MergeTree are:
insert(start, segment)
remove(start, end)
annotate(start, end, propertySet)
MergeTrees represent a sequence as an ordered list of segments. Each segment contains one or more consecutive values in the sequence. For example, a SharedString contains segments of characters:
["The cat"], [" sat on the mat."]
Traversing all segments in order produces the current sequence as understood by the local client.
"The cat sat on the mat."
(Note that how the items contained in the MergeTree are grouped into segments is a MergeTree implementation detail and changes over time.)
To process operations like insertion and removal, the MergeTree maps positions in the sequence to the containing segment and offset of the position within the segment. While the MergeTree implementation uses a B+Tree to accelerate this mapping, to understand the semantics of the MergeTree it is easier to consider a naïve implementation that searches for the containing (segment, offset) by walking all segments in order. This naïve search subtracts the length of each segment from the desired position until it reaches the segment that contains the remaining offset.
position 10 -> { segment: [" sat on the mat."], offset: 2 }
Initially considering only local edit operations, insertion and deletion work by inserting new segments or tombstoning removed segments. Tombstoned segments retain their position in the sequence, but have a length of zero when traversing the tree.
When an insertion/deletion occurs at a position contained within an existing segment the original segment is "split". In the case of insertion, the newly inserted segment is inserted between the two halves of the original. In the case of removal, the removed part of the subdivided segment is tombstoned.
insert(12, "quietly") -> ["The cat"], [" sat "], ["quietly "], ["on the mat."]
remove(19, 30) -> ["The cat"], [" sat "], ["quietly"], [del: " "], [del: "on the mat"], ["."]
To support merging edit operations from remote clients, we need to extend our original search function
(position) -> (segment, offset)
to account for the state of a remote client's MergeTree at the time the
remote client performed the operation on its MergeTree.
Conceptually, this is done by adjusting our naive linear search for the (segment, offset) in the following way:
- Segments inserted "after" the remote client's operation are skipped (i.e., have length 0)
- Segments tombstoned "after" the remote client's operation, but were inserted "prior" are included (i.e., have their original length prior to tombstoning.)
...where "after" means the remote client's MergeTree had not yet applied the operation that inserted and/or tombstoned the segment.
For clients to be able to reason about which segment insertions/removals other clients have processed the MergeTree we do two things:
- The MergeTree tracks which client inserted/removed each segment and the sequence number (abbreviated "seq") assigned by the Fluid service to the insertion/removal operation.
- When sending a MergeTree op, the client includes the last seq# it has processed from the Fluid service. This number is known as an op's "reference sequence number" or "refSeq#"
The 'client' and 'refSeq' become new arguments to our search function:
(client, refSeq, position) -> (segment, offset)
A segment was inserted and/or removed on the remote client at the time client sent the operation if either:
- The referenced sequence number is greater than or equal the server-assigned sequence number of the operation that inserted/removed the segment.
- The client sent the operation that resulted in insertion/removal. (In which case, the client hadn't yet received their sequenced op from the server but was aware of the insertion/removal because the client produced it locally.)
If both above conditions are false, then the insertion/removal happened "after" the remote operation, and consequently should be ignored during the search.
Note that any locally applied operations that are still pending sequencing by the Fluid service are unknown to remote clients and should be ignored when processing remote ops.