Is gesture forwarding tied to capture controller or to MediaStreamTrack or to DOM objects? #45

Open
youennf opened this issue Oct 25, 2024 · 23 comments

Comments

@youennf

youennf commented Oct 25, 2024

CaptureController lives where capture was initiated.
MediaStreamTrack on the other hand can be transferred and lives where it is being rendered.
This makes it potentially possible for CaptureController and the getDisplayMedia track to live in different contexts.

Given gesture forwarding is tied to the track's preview, it seems it is more tied to MediaStreamTrack/HTMLVideoElement than CaptureController.

The question is then whether API should be tied to CaptureController or to HTML elements/MediaStreamTrack.
I would tend to favour the latter.

@eladalon1983
Member

I think the right mental model is that we are controlling the captured surface, and CaptureController is the proxy for that concept (in all APIs we introduce), whereas MediaStreamTrack is just a handle to get frames (similarly). Those frames might not even be coming directly from the captured surface; they might be going through some transformation first, such as getting annotated, cropped or adjusted for better contrast.

Is there genuine Web developer interest in displaying the video element somewhere other than in the document that first called getDisplayMedia()? I am not aware of such a need, so I'd rather not design for it. (Unless the current design actively prevented such later extensions, of course. I don't think this is the case, though.)

@jan-ivar
Member

jan-ivar commented Nov 5, 2024

My mental model is this is about enabling user controls, not app controls. This suggests it might be logical to put the API on the DOM objects the user interacts with.

This is why I find @youennf's API in #49 appealing. E.g.:

videoElement.enableGestureForwarding = true;
div.enableGestureForwarding = true; // if on top
canvas.enableGestureForwarding = true; // recently drawn to with video

MediaStreamTracks can be cloned and transferred to workers, where gesture forwarding doesn't make sense, so I don't think that's the right place.

OTOH, CaptureController.forwardWheel(x) only supports one x, and x = null is how to stop forwarding (a bit surprising, that).

It may be uncommon to have two preview elements, but if a website wants it, as a user I'd expect to be able to scroll both.

This to me suggests an element API.
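
To make the contrast concrete, here is a toy model of the two shapes in plain JavaScript. `SingleSlotController` and `PreviewElement` are hypothetical stand-ins, not browser classes. The only behaviour modelled is what this thread states: `forwardWheel(x)` holds one `x`, a later call replaces it, and `null` stops forwarding.

```javascript
// Hypothetical stand-in for a controller that forwards from at most one element.
class SingleSlotController {
  constructor() {
    this.source = null;
  }
  // A new element replaces the previous one; null stops forwarding.
  forwardWheel(element) {
    this.source = element;
  }
  forwardsFrom(element) {
    return this.source !== null && this.source === element;
  }
}

// Hypothetical stand-in for an element that carries its own opt-in flag.
class PreviewElement {
  constructor(name) {
    this.name = name;
    this.enableGestureForwarding = false;
  }
}

const previews = [new PreviewElement("large"), new PreviewElement("thumbnail")];

// Controller shape: the second call overrides the first.
const controller = new SingleSlotController();
previews.forEach((preview) => controller.forwardWheel(preview));
const viaController = previews
  .filter((preview) => controller.forwardsFrom(preview))
  .map((preview) => preview.name);

// Element shape: each preview opts in independently.
previews.forEach((preview) => {
  preview.enableGestureForwarding = true;
});
const viaElements = previews
  .filter((preview) => preview.enableGestureForwarding)
  .map((preview) => preview.name);

console.log(viaController); // only "thumbnail" remains a source
console.log(viaElements); // both "large" and "thumbnail" are sources
```

In this model the controller ends up forwarding from the last element only, while the per-element flag lets both previews forward.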

@eladalon1983
Member

div.enableGestureForwarding = true; // if on top

The reasons to use an async API have been previously presented here, and brought up in multiple other threads (example).

Putting the async question aside for a moment - assume for the sake of argument that we reshape this proposal to be div.setGestureForwarding() and it returns a promise - I'd still oppose this proposal, because it makes unnecessary and unhelpful assumptions about the target element:

  1. It assumes a single video element with a single capture. (If not, which one is being forwarded?)
  2. It assumes that the element is the owner of the video element. (At first glance, appears to provide security guarantees; in practice, does not.)

I also think it's a poor choice of API to expose something so capture-specific on HTMLElement, or on anything similarly high-level. I don't think this is good API design.

OTOH, CaptureController.forwardWheel(x) only supports one x,

We have discussed the possibility of CaptureController.forwardGestures(element, gesturesDict), which would have allowed forwarding from multiple elements. But thinking of this some more, IMHO, it is preferable to only allow forwarding from a single element, unless Web developers indicate a clear use that benefits users. The cost to implementers, the (slight) increase in API complexity for developers, and the (very slight) risk of abuse, all require something to counterbalance them; but so far, such a need has not been articulated.

It may be uncommon to have two preview elements, but if a website wants it, as a user I'd expect to be able to scroll both. [Emphasis mine - Elad]

If.

and x = null is how to stop forwarding (a bit surprising, that).

Why is that surprising?
To name one precedent - mst.cropTo(null).

@jan-ivar
Member

jan-ivar commented Nov 6, 2024

div.setGestureForwarding() ...

I agree promise can be discussed separately.

  1. It assumes a single video element with a single capture. (If not, which one is being forwarded?)

No, the idea is the developer declares where they expect forwarding, and the user agent makes it happen where it can happen:

  • videoElement.setGestureForwarding() and canvas.setGestureForwarding() forward to what they're playing/rendering
  • div.setGestureForwarding() would pass-through to videoElements or canvases underneath the div, without extending hit areas.

This would be a declarative API — putting the user agent in charge of where forwarding happens instead of the app — so reusing the same API name seemed simple, but if it's confusing I'm open to other ideas. Other ways to solve pass-through might be with CSS, something similar to .overlay { pointer-events: none; } maybe? The problem space here seems a bit outside WebRTC, so we might want to get input from DOM folks.

@jan-ivar
Member

jan-ivar commented Nov 6, 2024

To name one precedent - mst.cropTo(null).

Can you point to a precedent you didn't invent?

@jan-ivar
Member

jan-ivar commented Nov 6, 2024

Also, crop rectangles are inherently mutually exclusive with each other and the uncropped state, so that seems fine.

In contrast, an infinite number of video elements can play back a single capture, which suggests a different model.

@eladalon1983
Member

eladalon1983 commented Nov 6, 2024

Assume:

<div id="div">
  <video id="vid1"></video>
  <video id="vid2"></video>
</div>
<script type="module">
  // type="module" so that top-level await is allowed.
  const controller1 = new CaptureController();
  const controller2 = new CaptureController();
  vid1.srcObject =
    await navigator.mediaDevices.getDisplayMedia({controller: controller1});
  vid2.srcObject =
    await navigator.mediaDevices.getDisplayMedia({controller: controller2});
</script>

Contrast:

// Explicitly affects multiple capture sessions.
controller1.forwardGestures(div);
controller2.forwardGestures(div);

With:

// Implicitly affects an unknown number of capture sessions.
div.enableGestureForwarding = true;

The former is much clearer and less error-prone.
The reader immediately understands everything.

Note that this is a silly case, though, as you'd not normally forward from one surface to multiple surfaces. And that's partially my point - which code snippet makes it clearer that something unreasonable is taking place? Which unreasonable code is more patently a developer mistake? The API which more readily exposes the developer's error is better.

I also have to repeat my own message:

I also think it's a poor choice of API to expose something so capture-specific on HTMLElement, or on anything similarly high-level. I don't think this is good API design.

We want APIs that are clear.

  • captureController.forwardWheel(element) is completely clear. It affects screen-capture and forwards wheel gestures from element to whichever surface captureController is capturing.
  • element.enableForwardGestures = true is inscrutable. What are gestures? How can they be forwarded? Where to? How are these relevant for HTML elements in general?

To name one precedent - mst.cropTo(null).

Can you point to a precedent you didn't invent?

The precedent cited was of an API standardized in this very Working Group, so I expect it to be convincing enough. But as per the request, I can point to other precedents, also standardized by this group, but not proposed by me - replaceTrack().
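
For reference, the two cited precedents and the proposed shape can be put side by side. This is a sketch only: `stopSending`, `stopCropping` and `stopForwarding` are hypothetical helper names, and `forwardWheel(null)` is the shape under discussion in this thread rather than a settled API.

```javascript
// RTCRtpSender.replaceTrack(null): stop sending without renegotiation.
function stopSending(sender) {
  return sender.replaceTrack(null);
}

// cropTo(null) on a cropped tab-capture track: revert to the uncropped state.
function stopCropping(track) {
  return track.cropTo(null);
}

// Proposed in this thread: null stops wheel forwarding.
function stopForwarding(controller) {
  return controller.forwardWheel(null);
}
```

In all three cases the method that sets a target also clears it when given `null`, so no separate "stop" method is needed.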

@eladalon1983
Member

eladalon1983 commented Nov 6, 2024

P.S: Possibly captureController.forwardGesturesFrom(element) would clarify things even further.

@jan-ivar
Member

jan-ivar commented Nov 6, 2024

// Implicitly affects an unknown number of capture sessions.
div.enableGestureForwarding = true;

Sorry no, that's an incorrect understanding of what I'm proposing. To turn on forwarding would require

vid1.enableGestureForwarding = vid2.enableGestureForwarding = true;

With the div API, I'm trying to be generous and address the emoji/overlay use case. The idea would be a way to let the app mark which overlapping elements (in a div) if any should not block scroll inputs to a video element.
But if it's confusing things let's keep things simple (your example did not include any overlays).

I also think it's a poor choice of API to expose something so capture-specific on HTMLElement, or on anything similarly high-level. I don't think this is good API design.

I disagree obviously. There's plenty of precedent:

@jan-ivar
Member

jan-ivar commented Nov 6, 2024

In contrast, putting the API on the controller lets apps express things we don't want to support, like:

controller1.forwardGestures(vid2);
controller2.forwardGestures(vid1);

@eladalon1983
Member

With the div API, I'm trying to be generous and address the emoji/overlay use case. The idea would be a way to let the app mark which overlapping elements (in a div) if any should not block scroll inputs to a video element.
But if it's confusing things let's keep things simple (your example did not include any overlays).

It is only confusing because a suboptimal API is proposed.
But with the proposal of CaptureController.forwardWheelFrom(element), it is neither confusing, nor do we need to compromise on functionality. So I don't see a reason to change away from that proposed API.

I disagree obviously. There's plenty of precedent:

All of these precedents are on HTMLMediaElement, not on HTMLElement. They are somewhat natural for a media-element, but would be quite confusing for a non-media-element.

In contrast, putting the API on the controller lets apps express things we don't want to support, like:

It's a small price to pay in order to allow forwarding from the application's choice of overlaid surface. It's a common cost for flexibility.

@eladalon1983 eladalon1983 transferred this issue from w3c/mediacapture-screen-share-extensions Nov 13, 2024
@eladalon1983 eladalon1983 changed the title [Capture control] Is gesture forwarding tied to capture controller or to MediaStreamTrack Is gesture forwarding tied to capture controller or to MediaStreamTrack Nov 13, 2024
@eladalon1983
Member

Issue transferred; heads up to those discussion participants who might otherwise be looking for it elsewhere: @jan-ivar, @youennf

@jan-ivar
Member

But with the proposal of CaptureController.forwardWheelFrom(element), it is neither confusing, nor do we need to compromise on functionality. So I don't see a reason to change away from that proposed API.

CaptureController.forwardWheelFrom(element) is confusing and limiting, because it's not possible to forward gestures from two different elements. E.g. the following doesn't do that (second call overrides the first):

captureController.forwardWheelFrom(videoElement1);
captureController.forwardWheelFrom(videoElement2);

A pivot to putting the API on the element solves this:

All of these precedents are on HTMLMediaElement, not on HTMLElement. They are somewhat natural for a media-element, but would be quite confusing for a non-media-element.

I'm not proposing putting anything on HTMLElement.

I propose putting the same API on both HTMLVideoElement and Canvas. The precedent is captureStream which exists on both HTMLMediaElement and Canvas.

Assuming I compromise on permission and promise, this might look like:

await videoElement1.forwardGestures(true);
await canvas2.forwardGestures(true);

This shape allows UAs to work on click-jacking mitigations as needed, and communicate needed restrictions to web developers, such as: UAs MAY impose requirements for forwarding to work, e.g.

  • videoElement1 must be playing back (a recent approximation of) the capture, in a manner deemed visible to users
  • canvas2 must have had a recent frame drawn to it from that capture, in a manner deemed visible to users

The exact restrictions can remain vague to allow UAs to experiment. It's in their interest not to break legitimate usage.
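
As an illustration only, a page-side approximation of the first requirement might look like the sketch below. `looksEligibleForForwarding` is a hypothetical name, and the checks are assumptions made for this example; the actual restrictions would be up to the user agent. Visibility is deliberately left out, since the page cannot judge it reliably.

```javascript
// Hypothetical approximation of "the video is playing back the capture".
// `video` is an HTMLVideoElement; `captureTrack` is a track from getDisplayMedia().
function looksEligibleForForwarding(video, captureTrack) {
  const stream = video.srcObject;
  if (!stream || typeof stream.getVideoTracks !== "function") {
    return false;
  }
  // Assumption: match by track id. A cloned track has a different id,
  // so clones would need separate handling.
  const playsCapture = stream
    .getVideoTracks()
    .some((track) => track.id === captureTrack.id && track.readyState === "live");
  const isPlaying = !video.paused && !video.ended;
  return playsCapture && isPlaying;
}
```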

I see the emoji overlay as a separate (CSS) problem. Critically, I see no reason to expand the hit zones for wheel input:
[image]

@jan-ivar jan-ivar changed the title Is gesture forwarding tied to capture controller or to MediaStreamTrack Is gesture forwarding tied to capture controller or to MediaStreamTrack or to DOM objects? Dec 18, 2024
@jan-ivar
Member

jan-ivar commented Jan 9, 2025

To simplify, please discount the canvas part of the proposal above for now (I was assuming some video conferencing sites might use canvas instead of a video element for MST presentation, but my evidence for this is poor, and we can always add this back later). — This should avoid confusion in use cases where a canvas is being used as an overlay.

I also got feedback that use cases might need this overlay to be clickable, so this might require some new CSS feature like maybe .overlay { wheel-events: none }. I'll try to reach out to some CSS folks for comment.

@eladalon1983
Member

eladalon1983 commented Jan 10, 2025

I also got feedback that use cases might need this overlay to be clickable, so this might require some new CSS feature like maybe .overlay { wheel-events: none }. I'll try to reach out to some CSS folks for comment.

Assume a sample Web app that has a mostly-transparent canvas element overlaid on top of a video element. Scroll events should lead to the captured surface being scrolled, and click events should manipulate something on the canvas (such as annotations).

There exists at least one such application - Google Meet - which proves that this is an interesting pattern that Web developers are actually likely to employ. So this is not a purely academic discussion.

Let's examine whether limiting wheel-forwarding to video elements is at odds with this use case. Theoretically speaking, Web applications can make use of pointer-events: none to forward both scrolls and clicks from the canvas to the video element, then:

  • Scroll events are consumed by forwardWheel(), which makes the user agent forward these events from the video element to the captured tab.
  • Click events trigger an event handler on the video, which then "manually" computes the offset and reproduces a synthetic click event at the relevant offset back on the canvas. That is, if the canvas previously invoked foo(x, y), the video element can now do so.

Possible? Seems like it. (Modulo limitations we might hear from Web developers.)
Ergonomic? No.
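
A minimal sketch of the click half of this workaround, assuming the canvas sets pointer-events: none so that clicks land on the video. `toCanvasPoint` and `installBackwarding` are hypothetical names, and `foo` stands for whatever the canvas's own click handler used to call.

```javascript
// Convert a click's viewport coordinates into canvas drawing-buffer coordinates.
// canvasRect is the canvas's getBoundingClientRect(); canvasSize is its width/height.
function toCanvasPoint(click, canvasRect, canvasSize) {
  const scaleX = canvasSize.width / canvasRect.width;
  const scaleY = canvasSize.height / canvasRect.height;
  return {
    x: (click.clientX - canvasRect.left) * scaleX,
    y: (click.clientY - canvasRect.top) * scaleY,
  };
}

// Route clicks received by the video back to the canvas logic.
function installBackwarding(video, canvas, foo) {
  video.addEventListener("click", (event) => {
    const point = toCanvasPoint(event, canvas.getBoundingClientRect(), canvas);
    foo(point.x, point.y);
  });
}
```

This only works when the application itself owns `foo`. It also adds coordinate bookkeeping that the canvas handler previously got for free.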

We should weigh the hardship this places on Web developers against the security benefits conferred by limiting the API to video elements. Those benefits have not yet been articulated.

The exact restrictions can remain vague to allow UAs to experiment.

Agreed on this point - if we discover Web developers can still use the API if limited to video elements, and if we make this pivot, then we should still leave it to UAs to experiment with heuristics, and revisit specifying additional limitations at a later time.

await videoElement1.forwardGestures(true);

(The following is feedback on specific issues with the above proposal, and should not be misunderstood as endorsement of the general thrust of that proposal.)

There is nothing in this shape to tie the video element to the captured surface, whereas the current API shape does. Recall the present shape of the API:

partial interface CaptureController {
  Promise<undefined> forwardWheel(HTMLElement element);
};
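
For illustration, calling this shape defensively from application code might look like the sketch below. `tryForwardWheel` is a hypothetical helper name; why the promise might reject is not specified here, so any rejection is simply treated as "not forwarding".

```javascript
// Returns true if forwarding was set up, false otherwise.
async function tryForwardWheel(controller, element) {
  // Feature-detect: not every user agent exposes forwardWheel().
  if (typeof controller.forwardWheel !== "function") {
    return false;
  }
  try {
    await controller.forwardWheel(element);
    return true;
  } catch (error) {
    // Rejection reasons are not assumed here; report and carry on.
    console.warn("forwardWheel() rejected:", error);
    return false;
  }
}
```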

It'd be better to just s/HTMLElement/HTMLVideoElement in the original shape, than to make this change to HTMLVideoElement.forwardGestures().

it's not possible to forward gestures from two different elements

No Web developer has articulated this requirement, and I can't imagine a realistic use case where it would be necessary. If we were ever to determine it as necessary, then HTMLVideoElement.forwardWheel(controller, isOn) might be more reasonable. (The same caveat of not misreading this message as endorsement still applies.)

@jan-ivar
Member

jan-ivar commented Jan 13, 2025

I agree with this use case. Thanks for the workaround idea. It's a clever reversal of typical manual forwarding one might expect a partially-interactive overlay to have to do using synthetic events: forwarding back to the overlay (backwarding?) to give the isTrusted property to the video element instead of the overlay.

The CSS folks I spoke with pushed back on a general .overlay { wheel-events: none } suggesting it may be a heavy lift. There's also #54. They suggested something like videoElement.forwardEvent(event) but there's likely a good reason that doesn't already exist in the platform (once an event goes into JavaScript, the browser has limited visibility into exactly what the page might do with it).

If I were to rank these ideas I'd probably put the backwarding idea first.

But the simplest way might be to just declare that this API gets first dibs on input even when underneath an overlay.

How does Chrome's current API work here? Are trusted events emitted/can JS preventDefault() the behavior? Given the security principles we want for this API, it might actually be advantageous to bypass JS here.

There is nothing in this shape to tie the video element to the captured surface, whereas the current API shape does.

That's a feature. The link is already implicit through video.srcObject. Creating competing relationships is confusing.

It'd be better to just s/HTMLElement/HTMLVideoElement in the original shape, than to make this change to HTMLVideoElement.forwardGestures().

Great, but that goes both ways: if we can do captureController.forwardWheel(HTMLVideoElement element), then that seems indistinguishable from htmlVideoElement.forwardWheel(), except for this (in my view undesirable) competing association.

@jan-ivar
Member

jan-ivar commented Jan 13, 2025

it's not possible to forward gestures from two different elements

No Web developer has articulated this requirement, and I can't imagine a realistic use case where it would be necessary.

A developer could have multiple <video> elements representing the same capture but with different video effects applied, or styled or cropped differently for different parts of the UI. They want the user to be able to scroll from whichever preview the user interacts with.

Having the API on the video element is:

  • Cleaner: The video is already the place where the captured stream is displayed.
  • Less error-prone: Harder to accidentally forward gestures to the wrong capture.
  • More scalable: Straightforward for multiple videos, each controlling its own forwarding.
  • Idiomatic: Consistent with how other web APIs augment <video> functionality over time.

@jan-ivar
Member

jan-ivar commented Jan 16, 2025

But the simplest way might be to just declare that this API gets first dibs on input even when underneath an overlay.

After some consideration, "first dibs" might be problematic and make it harder to extend support to #54 in the future.

My front-runner API is therefore await videoElement1.forwardGestures(true) forwarding all (e.g. wheel and touch) by default, with overlay solved by web developers through preventDefault() and "backwarding" to the overlay with synthetic events.

@eladalon1983
Member

I have now had time to speak to Web developers and get their feedback, and they are extremely opposed to limiting to video elements. (My apologies if I gave the wrong impression during the editors’ meeting, btw. The meeting with Web developers occurred thereafter.)

The Web developers I am in contact with implemented annotations using a library.

Without libraries, “backwarding” consists merely of invoking myDrawOnClick() from video.onclick instead of from canvas.onclick – a possible change.

But with libraries, the application can’t rewire the handlers-to-functions mapping, because that mapping is unknown, abstracted away by the library. One could potentially redispatch clones of events, but use of pointer-events: ‘none’ means the handlers won’t be invoked.

Not to mention how non-ergonomic and error-prone it is to redispatch event-clones, and how it loses the isTrusted bit.
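
The isTrusted point can be seen in a few lines. This is a minimal sketch using a bare EventTarget in place of the overlay element. `overlay` and `trustedBits` are names made up for this example.

```javascript
// EventTarget and Event are available in browsers and in recent Node.js.
const overlay = new EventTarget();
const trustedBits = [];
overlay.addEventListener("click", (event) => trustedBits.push(event.isTrusted));

// A redispatched clone is created by script, so its isTrusted is false.
const clone = new Event("click", { bubbles: true, cancelable: true });
overlay.dispatchEvent(clone);

console.log(trustedBits); // the single recorded value is false
```

Any handler that checks isTrusted, for example inside a third-party library, would therefore treat the clone differently from the user's original click.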


Let’s re-examine, again, what this limitation to video elements is even for. It was motivated by the expectation that the user agent could then ensure a faithful representation of the controlled surface to the user, suppressing gesture-forwarding when the representation is faithless. But this is demonstrably impossible with annotations, because by design they imply overlaying the video with arbitrary pixels. No matter how annotations are implemented, the user sees a faithless representation of the captured surface, by definition.


A developer could have multiple <video> elements representing the same capture but with different video effects applied

First, I am not aware of any Web applications that employ this theoretical pattern of (i) multiple preview tiles with (ii) effects on top. This hypothetical example does not appear realistic to me.

Second, the hypothetical use-case and the security rationale are in contradiction. Applying effects on top of videos renders the representation inherently faithless.


To summarize:

  • Limiting to video elements…
    • …does not gain any privacy/security benefits. (As argued on multiple threads.)
    • …does not facilitate any UA-level heuristics that could increase privacy/security. (As argued by this comment - annotations render all video elements faithless.)
    • …breaks some use cases (with libraries), and requires anti-patterns and non-idiomatic code structures (without libraries) to try to bypass the limitation.
  • Forwarding from multiple elements is not a requirement; no realistic use case for it has been shown.

@dontcallmedom-bot

@jan-ivar
Member

Who are these web developers and what library do they use? With more info we might be able to help.

The annotation use case starts out transparent and is therefore not faithless. So I'm not swayed by appeal to futility. The current API seems overly permissive.

I'm also concerned we're committing a category error by baking the overlay problem into the initial design of this API, when it seems a fundamentally separate problem.

If we ignore the overlay problem for a second, it seems obvious this API belongs on video element, which probably means that's the right place.

I haven't yet seen this library, but it seems reasonable that any interactive overlay that needs to selectively forward events to what's underneath needs to create synthetic events and have some awareness of what's underneath. "Backwarding" might present additional challenges to that, but doesn't seem that different either.

The overlay issue doesn't seem inherently unique to this api.

Lastly, I think @guidou made the point in the meeting that the same limitations could be applied regardless of where the API lives. I think that works in the other direction too: I didn't get a clear answer how Chrome intercepts input today, but if the limitations (e.g. hit zone no larger than the video element underneath the given div) are implementable regardless, why can't we move the API to the video element and solve the overlay the way it's done today?

@eladalon1983
Member

Who are these web developers and what library do they use? With more info we might be able to help.

During the interim meeting, I pointed out Google Meet as one such example, and there are more.

The exact library is irrelevant; the important point is that the libraries are not modifiable by the app developer. Enough information was provided to discuss this from first principles.

The annotation use case starts out transparent and is therefore not faithless. So I'm not swayed by appeal to futility.

If I understand you correctly, you mean that the gradual transition from zero occlusions to increasingly many occlusions leaves room for the UA to heuristically detect abuse. Did I understand you correctly? Why is restricting wheel-forwarding to occur from the video element required for that?

Note the priority of constituencies implies that if you can employ the same limitations either way, you should not put restrictions on Web developers.

I haven't yet seen this library, but it seems reasonable that any interactive overlay that needs to selectively forward events to what's underneath needs to create synthetic events and have some awareness of what's underneath.

This is both irrelevant and incorrect, given (1) that libraries are not necessarily modifiable by the Web developer, and that (2) existing libraries might not allow selective forwarding of events.

You assume that existing libraries routinely support selective forwarding of events. But why would this be the case? Why would existing libraries be designed to accommodate a future hypothetical limitation of some future API?

During the interim meeting, you took an action item to demonstrate how the use case could be tackled without support from the library. I look forward to your demonstration.

The overlay issue doesn't seem inherently unique to this api.

If it is not unique, then you will have precedents to draw on in your demonstration of a solution, as per your action item from the interim meeting.

Lastly, I think @guidou made the point in the meeting that the same limitations could be applied regardless of where the API lives. I think that works in the other direction too: I didn't get a clear answer how Chrome intercepts input today, but if the limitations (e.g. hit zone no larger than the video element underneath the given div) are implementable regardless, why can't we move the API to the video element and solve the overlay the way it's done today?

Guido said that it does not matter (1) to the user agent (2) where the API is exposed (HTMLVideoElement or CaptureController), for the purpose of implementing additional limitations and heuristics. It cannot be inferred from his argument that it also does not matter to the (1) Web application where (2) wheel events are forwarded from.

You say “solve the overlay the way it's done today.” The current solution is to forward from an element other than the video element, such as a div. Since you propose to block that solution, it is incumbent on you to show how the use case could be tackled instead. During the interim two days ago, you took an action item to do as much. I am looking forward to that.

@jan-ivar
Member

jan-ivar commented Jan 24, 2025

Risks to end-users of wheel-jacking come ahead of inconveniencing web developers in § 1.1. Put user needs first (Priority of Constituencies) (consent doesn't really change this).

Without libraries, “backwarding” consists merely of invoking myDrawOnClick() from video.onclick instead of from canvas.onclick – a possible change.

No, what I meant by forwarding/backwarding was dealing with an underlying interactive element. It looks like I wrongly assumed this library was written to overlap another interactive element; it sounds like it isn't even designed for that and is stealing all input for itself, is that right? If so, that's not a very compelling limitation, for the following reason:

From a logical and conceptual perspective, wheel forwarding to a video preview of a live tab-capture is clearly interactive, and clearly a feature of the video element presenting it. This seems obvious down to the details of:

  1. Coordinates, which need to align with the offset and scaling of the video presentation.
  2. Resizing or re-layout of the video element, which will affect the necessary transformation of coordinates to land in the right place in the captured tab.
  3. Proposed restrictions, like video.srcObject matching the right capture, and forwarding only working while it is playing (meaning end-of-forwarding should probably be tied to end of playback, or end-of-track at least, not end of capture which is different due to track cloning).
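
The coordinate points above can be sketched as follows. This assumes the video is letterboxed with object-fit: contain and that video frames map one-to-one onto the captured surface, which would not hold if frames were cropped or otherwise transformed. `elementPointToSurface` is a hypothetical name.

```javascript
// Map a point inside the video element's content box to captured-surface pixels.
// elementBox: rendered size of the element; videoSize: intrinsic frame size
// (videoWidth/videoHeight). Returns null when the point falls on a letterbox bar.
function elementPointToSurface(point, elementBox, videoSize) {
  // object-fit: contain scales uniformly by the smaller of the two ratios.
  const scale = Math.min(
    elementBox.width / videoSize.width,
    elementBox.height / videoSize.height
  );
  // The scaled frame is centred, leaving equal bars on opposite sides.
  const offsetX = (elementBox.width - videoSize.width * scale) / 2;
  const offsetY = (elementBox.height - videoSize.height * scale) / 2;
  const x = (point.x - offsetX) / scale;
  const y = (point.y - offsetY) / scale;
  if (x < 0 || y < 0 || x > videoSize.width || y > videoSize.height) {
    return null;
  }
  return { x, y };
}
```

Because the result depends on the element's current size, it has to be recomputed after every resize or re-layout. That corresponds to the second point in the list.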

If the only reason to put it elsewhere is a library not designed to overlap an interactive element, then that is not a good reason.

I'm not opposed to solving the annotation use case, but it's secondary to the primary design of this API in my view.

To illustrate: worst case, the overlay problem seems solvable by something like this:

await videoElement1.forwardGestures(true, {overlay: element});

If we can move the window of discussion there, then we can probably discuss whether the {overlay: div} argument is truly needed or if the user agent can figure out the overlapping element(s) itself.
