Streaming AskAsync response #625
8 comments · 7 replies
-
+1, that's on the list. I agree it's highly needed for a responsive UI. We don't have a timeline yet, but it's high on the priority list.
-
Hear, hear. Since virtually all of the LLMs stream by default, it would be nice to just add AskStreamingAsync and return IAsyncEnumerable. I think only one method needs to be added to ITextGenerator, and then the included implementations updated.
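For illustration, a minimal sketch of what that could look like; AskStreamingAsync, GenerateTextStreamingAsync, and the exact interface shapes here are assumptions, not the library's current API:

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch, not Kernel Memory's actual contracts.
public interface ITextGenerator
{
    // The one addition suggested above: token-by-token generation that
    // each included connector (OpenAI, Azure OpenAI, ...) would implement
    // by forwarding the provider's native streaming.
    IAsyncEnumerable<string> GenerateTextStreamingAsync(
        string prompt,
        CancellationToken cancellationToken = default);
}

// Surfaced on the client as the proposed AskStreamingAsync.
public interface IStreamingMemoryClient
{
    IAsyncEnumerable<string> AskStreamingAsync(
        string question,
        CancellationToken cancellationToken = default);
}
```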
-
@dluc We also need this. GPT-4 takes 10+ seconds, which is not a good user experience; people expect streaming results. I may be able to assist with the implementation. I've analysed the solution and I think it's a bit more than just adding a method to ITextGenerator as @JohnGalt1717 suggested, but I may be wrong. The pieces that would need changes:
- SearchClient
- The API endpoint
- The WebClient

Lots of moving parts, but it would be awesome to have that feature! See the endpoint sketch below.
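To make the API-endpoint piece concrete, here is a rough sketch using ASP.NET Core minimal APIs and server-sent events; the route, the AskRequest type, and AskStreamingAsync are all assumptions carried over from the interface sketch above:

```csharp
var builder = WebApplication.CreateBuilder(args);
// Assumes an IStreamingMemoryClient implementation is registered in DI.
var app = builder.Build();

app.MapPost("/ask/stream", async (HttpContext ctx, IStreamingMemoryClient memory, AskRequest req) =>
{
    ctx.Response.ContentType = "text/event-stream";
    await foreach (string token in memory.AskStreamingAsync(req.Question, ctx.RequestAborted))
    {
        // One SSE "data" frame per token; flush so the client sees it immediately.
        await ctx.Response.WriteAsync($"data: {token}\n\n", ctx.RequestAborted);
        await ctx.Response.Body.FlushAsync(ctx.RequestAborted);
    }
});

app.Run();

public record AskRequest(string Question);
```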
-
@dluc Any update on this feature? It seems that SearchClient.cs:AskAsync() already consumes an IAsyncEnumerable<string>, but buffers the text to include citations etc. As @roldengarm mentioned above, it might not be the best solution to send the relevant sources with each part of the stream. Maybe a good solution would be to just return the answer (IAsyncEnumerable<string>), since the relevant sources can be fetched beforehand by the calling code using SearchClient.cs:SearchAsync().
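In code, that calling pattern would be something like the following; SearchAsync and SearchResult exist today, while AskStreamingAsync is still the hypothetical method from the sketch above and the citation properties may differ:

```csharp
// 1) Fetch the relevant sources up front with the existing search API.
SearchResult sources = await memory.SearchAsync("How do I configure pipelines?");
foreach (var citation in sources.Results)
{
    Console.WriteLine($"Source: {citation.SourceName}");
}

// 2) Then stream only the answer text (hypothetical API).
await foreach (string token in memory.AskStreamingAsync("How do I configure pipelines?"))
{
    Console.Write(token);
}
```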
-
@JonathanVelkeneers I should be able to look at this next week - been busy with the new file download feature (just merged)
-
To be honest, I'm not sure it makes sense to keep the idea of returning an object at all. We cannot have a value for NoResult, for example, since we just don't know whether there will be one. So I'm pretty much with @JonathanVelkeneers on that: if you need the sources (and cannot tell the LLM to include them directly in the response), you might need to fetch them separately.
-
I now use the kernel directly: I search first, then splice the results into a prompt and call InvokeStreamingAsync. It would be better if Ask could support streaming.
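That workaround might look roughly like this, assuming Semantic Kernel's InvokePromptStreamingAsync as the streaming call and Kernel Memory's SearchAsync for retrieval; the prompt splicing is illustrative:

```csharp
using System.Linq;

// Retrieve grounding text with Kernel Memory...
SearchResult search = await memory.SearchAsync(question);
string facts = string.Join("\n", search.Results
    .SelectMany(citation => citation.Partitions)
    .Select(partition => partition.Text));

// ...splice it into a prompt...
string prompt = $"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:";

// ...and stream the completion through Semantic Kernel.
await foreach (var chunk in kernel.InvokePromptStreamingAsync(prompt))
{
    Console.Write(chunk); // each chunk renders as partial answer text
}
```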
-
Hi all, FYI response streaming is now available in
-
Hi,
We're using your library in our project. It got us up to speed pretty fast on these new AI topics, so thank you for that :).
I was wondering if there is an option to receive a streaming response from our semantic memory service. This could improve the user experience: receiving the first tokens of the response instead of waiting for the whole thing. OpenAI's ChatGPT works this way.
BR,
Dawid