Is gesture forwarding tied to capture controller or to MediaStreamTrack or to DOM objects? #45
I think the right mental model is that we are controlling the captured surface, and CaptureController is the proxy for that concept (in all APIs we introduce), whereas MediaStreamTrack is just a handle to get frames (similarly). Those frames might not even be coming directly from the captured surface; they might be going through some transformation first, such as being annotated, cropped, or adjusted for better contrast. Is there genuine Web developer interest in displaying the video element somewhere other than in the document that first called getDisplayMedia()?
My mental model is that this is about enabling user controls, not app controls. This suggests it might be logical to put the API on the DOM objects the user interacts with. This is why I find @youennf's API in #49 appealing. E.g.:

```js
videoElement.enableGestureForwarding = true;
div.enableGestureForwarding = true;    // if on top
canvas.enableGestureForwarding = true; // recently drawn to with video
```

MediaStreamTracks can be cloned and transferred to workers, where gesture forwarding doesn't make sense, so I don't think that's the right place. OTOH, CaptureController.forwardWheel(x) only supports one x, and x = null is how to stop forwarding (which is a bit surprising). It may be uncommon to have two preview elements, but if a website wants them, as a user I'd expect to be able to scroll both. This to me suggests an element API.
The reasons to use an async API have been previously presented here, and brought up in multiple other threads (example). Putting the async question aside for a moment - assume for the sake of argument that we reshape this proposal to be
I also think it's a poor choice to expose anything so capture-specific on HTMLElement or anything similarly high-level. I don't think this is good API design.
We have discussed the possibility of
Why is that surprising?
I agree promise can be discussed separately.
No, the idea is the developer declares where they expect forwarding, and the user agent makes it happen where it can happen:
This would be a declarative API — putting the user agent in charge of where forwarding happens instead of the app — so reusing the same API name seemed simple, but if it's confusing I'm open to other ideas. Other ways to solve pass-through might be with CSS, something similar to
Can you point to a precedent you didn't invent?
Also, crop rectangles are inherently mutually exclusive with each other and with the uncropped state, so that seems fine. In contrast, an infinite number of video elements can play back a single capture, which suggests a different model.
Assume:

```html
<div id="div">
  <video id="vid1"></video>
  <video id="vid2"></video>
</div>
<script>
  const controller1 = new CaptureController();
  const controller2 = new CaptureController();
  vid1.srcObject =
      await navigator.mediaDevices.getDisplayMedia({controller: controller1});
  vid2.srcObject =
      await navigator.mediaDevices.getDisplayMedia({controller: controller2});
</script>
```

Contrast:

```js
// Explicitly affects multiple capture sessions.
controller1.forwardGestures(div);
controller2.forwardGestures(div);
```

With:

```js
// Implicitly affects an unknown number of capture sessions.
div.enableGestureForwarding = true;
```

The former is much clearer and less error-prone. Note that this is a silly case, though, as you'd not normally forward from one surface to multiple surfaces. And that's partially my point: which code snippet makes it clearer that something unreasonable is taking place? Which unreasonable code is more patently a developer mistake? The API which more readily exposes the developer's error is better. I also have to repeat my own message:
We want APIs that are clear.
The precedent cited was of an API standardized in this very Working Group, so I expect it to be convincing enough. But as per the request, I can point to other precedents, also standardized by this group, but not proposed by me - replaceTrack().
P.S.: Possibly captureController.forwardGesturesFrom(element) would clarify things even further.
Sorry, no, that's an incorrect understanding of what I'm proposing. Turning on forwarding would require:

```js
vid1.enableGestureForwarding = vid2.enableGestureForwarding = true;
```

With the div API, I'm trying to be generous and address the emoji/overlay use case. The idea would be a way to let the app mark which overlapping elements (in a div), if any, should not block scroll inputs to a video element.
I disagree, obviously. There's plenty of precedent:
In contrast, putting the API on the controller lets apps express things we don't want to support, like:

```js
controller1.forwardGestures(vid2);
controller2.forwardGestures(vid1);
```
It is only confusing because a suboptimal API is proposed.
All of these precedents are on
It's a small price to pay in order to allow forwarding from the application's choice of overlaid surface. It's a common cost for flexibility.
CaptureController.forwardWheelFrom(element) is confusing and limiting, because it's not possible to forward gestures from two different elements. E.g. the following doesn't do that (the second call overrides the first):

```js
captureController.forwardWheelFrom(videoElement1);
captureController.forwardWheelFrom(videoElement2);
```

A pivot to putting the API on the element solves this:
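The last-call-wins semantics being criticized here can be illustrated with a plain mock. This is a hypothetical sketch: `MockCaptureController` and its single source slot stand in for the real browser objects, which cannot run outside a browser.

```javascript
// Hypothetical mock of the single-slot semantics under discussion:
// forwardWheelFrom(x) can track only one element at a time, so the
// second call silently replaces the first.
class MockCaptureController {
  #source = null;
  forwardWheelFrom(element) {
    this.#source = element; // last call wins; the previous element stops forwarding
  }
  get source() {
    return this.#source;
  }
}

const videoElement1 = { id: "vid1" };
const videoElement2 = { id: "vid2" };

const controller = new MockCaptureController();
controller.forwardWheelFrom(videoElement1);
controller.forwardWheelFrom(videoElement2);
console.log(controller.source.id); // "vid2"; vid1 no longer forwards
```

Per-element boolean flags, by contrast, have no such slot: each element carries its own state, so two previews can forward simultaneously.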
I'm not proposing putting anything on HTMLElement. I propose putting the same API on both HTMLVideoElement and HTMLCanvasElement. Assuming I compromise on permission and promise, this might look like:

```js
await videoElement1.forwardGestures(true);
await canvas2.forwardGestures(true);
```

This shape allows UAs to work on click-jacking mitigations as needed, and to communicate needed restrictions to web developers, such as: UAs MAY impose requirements for forwarding to work, e.g.
The exact restrictions can remain vague to allow UAs to experiment. It's in their interest not to break legitimate usage. I see the emoji overlay as a separate (CSS) problem. Critically, I see no reason to expand the hit zones for wheel input:
To simplify, please discount the canvas part of the proposal above for now (I was assuming some video conferencing sites might use canvas instead of a video element for MST presentation, but my evidence for this is poor, and we can always add this back later). This should avoid confusion in use cases where a canvas is being used as an overlay. I also got feedback that use cases might need this overlay to be clickable, so this might require some new CSS feature.
Assume a sample Web app that has a mostly-transparent canvas element overlaid on top of a video element. Scroll events should lead to the captured surface being scrolled, and click events should manipulate something on the canvas (such as annotations). There exists at least one such application, Google Meet, which proves that this is an interesting pattern that Web developers are actually likely to employ. So this is not a purely academic discussion.

Let's examine whether limiting wheel-forwarding to video elements is at odds with this use case. Theoretically speaking, Web applications can make use of
Possible? Seems like it. (Modulo limitations we might hear from Web developers.) We should weigh the hardship this places on Web developers against the security benefits conferred by limiting the API to video elements. Those benefits have not yet been articulated.
Agreed on this point - if we discover Web developers can still use the API if limited to video elements, and if we make this pivot, then we should still leave it to UAs to experiment with heuristics, and revisit specifying additional limitations at a later time.
(The following is feedback on specific issues with the above proposal, and should not be misunderstood as endorsement of the general thrust of that proposal.) There is nothing in this shape to tie the video element to the captured surface, whereas the current API shape does. Recall the present shape of the API:

```webidl
partial interface CaptureController {
  Promise<undefined> forwardWheel(HTMLElement element);
};
```

It'd be better to just s/
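To make the quoted shape concrete, here is a sketch modeled on Node's EventTarget so it runs outside a browser. `MockCaptureController` is an illustrative stand-in for the real CaptureController, and "scrolling the captured surface" is reduced to a counter; the null-to-stop behavior follows the semantics described earlier in this thread.

```javascript
// Sketch of the quoted WebIDL shape against Node's built-in EventTarget.
// MockCaptureController is a stand-in, not the real browser object.
class MockCaptureController {
  #element = null;
  #onWheel = () => { this.surfaceScrolls++; };
  surfaceScrolls = 0;

  // Mirrors: Promise<undefined> forwardWheel(HTMLElement element);
  // Per the earlier comment, passing null stops forwarding.
  async forwardWheel(element) {
    if (this.#element) this.#element.removeEventListener("wheel", this.#onWheel);
    this.#element = element;
    if (element) element.addEventListener("wheel", this.#onWheel);
  }
}

const vid = new EventTarget(); // stand-in for a <video> element
const controller = new MockCaptureController();
controller.forwardWheel(vid);          // the body runs synchronously here
vid.dispatchEvent(new Event("wheel")); // forwarded: counted
controller.forwardWheel(null);         // stop forwarding
vid.dispatchEvent(new Event("wheel")); // ignored
console.log(controller.surfaceScrolls); // 1
```

Note how the mock makes the single-slot nature explicit: attaching a second element would first detach the previous one.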
No Web developer has articulated this requirement, and I can't imagine a realistic use case where it would be necessary. If we were ever to determine it as necessary, then
I agree with this use case. Thanks for the workaround idea. It's a clever reversal of the typical manual forwarding one might expect a partially-interactive overlay to have to do using synthetic events: forwarding back to the overlay ("backwarding"?), so that the trusted (isTrusted) events go to the video element instead of the overlay.

The CSS folks I spoke with pushed back on a general

If I were to rank these ideas, I'd probably put the backwarding idea first. But the simplest way might be to just declare that this API gets first dibs on input even when underneath an overlay. How does Chrome's current API work here? Are trusted events emitted, and can JS preventDefault() the behavior? Given the security principles we want for this API, it might actually be advantageous to bypass JS here.
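A minimal sketch of the backwarding idea, using Node's EventTarget in place of real DOM elements. The element names and the split between wheel and click handling are illustrative assumptions, not a prescribed design.

```javascript
// Backwarding sketch: the video "element" sits on top and receives all
// input. Wheel events are consumed for forwarding to the captured
// surface; clicks are re-dispatched synthetically to the overlay, whose
// annotation logic never needs to know it wasn't hit directly.
// Plain EventTargets stand in for DOM elements here.
const video = new EventTarget();
const overlay = new EventTarget();

let forwardedWheels = 0; // would go to the captured surface
let overlayClicks = 0;   // annotation logic on the overlay

overlay.addEventListener("click", () => { overlayClicks++; });

video.addEventListener("wheel", () => { forwardedWheels++; });
video.addEventListener("click", (e) => {
  overlay.dispatchEvent(new Event(e.type)); // the "backwarded" synthetic event
});

video.dispatchEvent(new Event("wheel"));
video.dispatchEvent(new Event("click"));
console.log(forwardedWheels, overlayClicks); // 1 1
```

The reversal is that the interactive element on top is the video, and the synthetic events flow downward to the overlay, rather than the overlay forwarding to the video.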
That's a feature. The link is already implicit through
Great, but that goes both ways: if we can do
A developer could have multiple <video> elements representing the same capture but with different video effects applied, or styled or cropped differently for different parts of the UI. They want the user to be able to scroll from whichever preview the user interacts with. Having the API on the video element is:
After some consideration, "first dibs" might be problematic and make it harder to extend support to #54 in the future. My front-runner API is therefore
I have now had time to speak to Web developers and get their feedback, and they are extremely opposed to limiting to video elements. (My apologies if I gave the wrong impression during the editors' meeting, btw. The meeting with Web developers occurred thereafter.)

The Web developers I am in contact with implemented annotations using a library. Without libraries, "backwarding" consists merely of invoking

But with libraries, the application can't rewire the handlers-to-functions mapping, because that mapping is unknown, abstracted away by the library. One could potentially redispatch clones of events, but use of

Not to mention how non-ergonomic and error-prone it is to redispatch event-clones, and how it loses the

Let's re-examine, again, what this limitation to video elements is even for. It was motivated by the expectation that the user agent could then ensure a faithful representation of the controlled surface to the user, suppressing gesture-forwarding when the representation is faithless. But this is demonstrably impossible with annotations, because by design they imply overlaying the video with arbitrary pixels. No matter how annotations are implemented, the user sees a faithless representation of the captured surface, by definition.
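The isTrusted concern raised above can be shown concretely. A small sketch, runnable in Node, whose WHATWG-style Event is never trusted, exactly like script-constructed events in browsers:

```javascript
// Re-dispatched event clones are never trusted: in browsers, only
// user-agent-generated events carry isTrusted === true, and any event
// constructed by script (as below) reports isTrusted === false. A library
// that gates its behavior on isTrusted therefore ignores such clones.
const target = new EventTarget();
let trustedSeen = null;

target.addEventListener("wheel", (e) => { trustedSeen = e.isTrusted; });

const original = new Event("wheel", { bubbles: true });
const clone = new Event(original.type, { bubbles: original.bubbles });
target.dispatchEvent(clone);
console.log(trustedSeen); // false
```

This is what makes clone-redispatch lossy: the clone copies the fields you enumerate, but the trust bit cannot be copied.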
First, I am not aware of any Web applications that employ this theoretical pattern of (i) multiple preview tiles with (ii) effects on top. This hypothetical example does not appear realistic to me.

Second, the hypothetical use case and the security rationale are in contradiction. Applying effects on top of videos renders the representation inherently faithless.

To summarize:
This issue was discussed in the WebRTC January 2025 meeting – 21 January 2025 (Consideration of limitation to HTMLVideoElement)
Who are these web developers, and what library do they use? With more info we might be able to help.

The annotation use case starts out transparent and is therefore not faithless, so I'm not swayed by the appeal to futility. The current API seems overly permissive. I'm also concerned we're committing a category error by baking the overlay problem into the initial design of this API, when it seems a fundamentally separate problem. If we ignore the overlay problem for a second, it seems obvious this API belongs on the video element, which probably means that's the right place.

I haven't yet seen this library, but it seems reasonable that any interactive overlay that needs to selectively forward events to what's underneath has to create synthetic events and have some awareness of what's underneath. "Backwarding" might present additional challenges, but doesn't seem that different either. The overlay issue doesn't seem inherently unique to this API.

Lastly, I think @guidou made the point in the meeting that the same limitations could be applied regardless of where the API lives. I think that works in the other direction too: I didn't get a clear answer on how Chrome intercepts input today, but if the limitations (e.g. a hit zone no larger than the video element underneath the given div) are implementable regardless, why can't we move the API to the video element and solve the overlay the way it's done today?
During the interim meeting, I pointed out Google Meet as one such example, and there are more. The exact library is irrelevant; the important point is that the libraries are not modifiable by the app developer. Enough information was provided to discuss this from first principles.
If I understand you correctly, you mean that the gradual transition from zero occlusions to increasingly many occlusions leaves room for the UA to heuristically detect abuse. Did I understand you correctly? Why is restricting wheel-forwarding to occur from the video element required for that? Note the priority of constituencies implies that if you can employ the same limitations either way, you should not put restrictions on Web developers.
This is both irrelevant and incorrect, given (1) that libraries are not necessarily modifiable by the Web developer, and (2) that existing libraries might not allow selective forwarding of events. You assume that existing libraries routinely support selective forwarding of events. But why would this be the case? Why would existing libraries be designed to accommodate a future hypothetical limitation of some future API? During the interim meeting, you took an action item to demonstrate how the use case could be tackled without support from the library. I look forward to your demonstration.
If it is not unique, then you will have precedents to draw on in your demonstration of a solution, as per your action item from the interim meeting.
Guido said that it does not matter to the user agent where the API is exposed (HTMLVideoElement or CaptureController) for the purpose of implementing additional limitations and heuristics. It cannot be inferred from his argument that it also does not matter to the Web application where wheel events are forwarded from.

You say "solve the overlay the way it's done today." The current solution is to forward from an element other than the video element, such as a div. Since you propose to block that solution, it is incumbent on you to show how the use case could be tackled instead. During the interim two days ago, you took an action item to do as much. I am looking forward to that.
Risks to end-users of wheel-jacking come ahead of inconveniencing web developers per § 1.1 Put user needs first (Priority of Constituencies); consent doesn't really change this.
No, what I meant with forwarding/backwarding was dealing with an underlying interactive element. But it looks like I wrongly assumed this library was written to overlap another interactive element; it sounds like it's not even designed for that, and is stealing all input for itself. Is that right? If so, that's not a very compelling limitation, for the following reason: from a logical and conceptual perspective, wheel forwarding to a video preview of a live tab-capture is clearly interactive, and clearly a feature of the video element presenting it. This seems obvious down to the details of:
If the only reason to put it elsewhere is a library not designed to overlap an interactive element, then that is not a good reason. I'm not opposed to solving the annotation use case, but it's secondary to the primary design of this API in my view. To illustrate: worst case, the overlay problem seems solvable by something like this:

```js
await videoElement1.forwardGestures(true, {overlay: element});
```

If we can move the window of discussion there, then we can probably discuss whether the
CaptureController lives where capture was initiated.
MediaStreamTrack on the other hand can be transferred and lives where it is being rendered.
This makes it potentially possible for CaptureController and the getDisplayMedia track to live in different contexts.
Given gesture forwarding is tied to the track's preview, it seems it is more tied to MediaStreamTrack/HTMLVideoElement than CaptureController.
The question is then whether the API should be tied to CaptureController or to HTML elements/MediaStreamTrack.
I would then tend to favour the latter.