Updated Specification and Documentation to support Audio Modality. #93

Open · wants to merge 4 commits into main
Conversation

evalstate

This change supports discussion #88 and includes Audio Modality in the specification.

Motivation and Context

This would enable integration with models that accept and produce audio in context, such as gpt-4o-audio-preview:
https://platform.openai.com/docs/guides/audio

How Has This Been Tested?

This has been tested using the Inspector tool, with local type extensions:

import { z } from "zod";
// Existing content schemas from the TypeScript SDK (import path assumes the published package layout).
import {
  ResultSchema,
  TextContentSchema,
  ImageContentSchema,
  EmbeddedResourceSchema,
} from "@modelcontextprotocol/sdk/types.js";

// Define the AudioContent schema
export const AudioContentSchema = z.object({
  type: z.literal("audio"),
  data: z.string().base64(),
  mimeType: z.string(),
}).passthrough();

// Extend the CallToolResult schema to include audio content
export const ExtendedCallToolResultSchema = ResultSchema.extend({
  content: z.array(
    z.discriminatedUnion("type", [
      TextContentSchema,
      ImageContentSchema,
      AudioContentSchema,
      EmbeddedResourceSchema,
    ])
  ),
  isError: z.boolean().default(false).optional(),
});

// Export the types
export type AudioContent = z.infer<typeof AudioContentSchema>;
export type ExtendedCallToolResult = z.infer<typeof ExtendedCallToolResultSchema>;
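
For illustration, here is a minimal sketch of validating a tool result against the extended schema on the client side; the payload is a made-up example (stub base64 data), not output from a real server:

// Hypothetical tool result containing an audio item.
const rawResult = {
  content: [
    {
      type: "audio",
      mimeType: "audio/wav",
      data: "UklGRiQAAABXQVZFZm10IA==", // stub base64, not a playable WAV
    },
  ],
};

// parse() throws if the payload does not match the extended content union.
const result = ExtendedCallToolResultSchema.parse(rawResult);

for (const item of result.content) {
  if (item.type === "audio") {
    console.log(`Received ${item.mimeType} audio (${item.data.length} base64 chars)`);
  }
}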

The Inspector application was updated to render an audio player for this content type:

              {item.type === "image" && (
                <img
                  src={`data:${item.mimeType};base64,${item.data}`}
                  ...
              {item.type === "audio" && (
                  <audio
                  controls
                  src={`data:${item.mimeType};base64,${item.data}`}

[Screenshot: the Inspector rendering an audio player for the returned audio content]

The server produced the result content as follows:

const audioBase64 = await generateSpeech(text);
return {
  content: [{
    type: "audio",
    mimeType: "audio/wav",
    data: audioBase64
  }]
};
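
For context, a hedged sketch of how that fragment might sit inside a tool handler registered with the TypeScript SDK. The tool name "speak" and the generateSpeech helper are assumptions taken from the snippet, and with the stock SDK types the "audio" item relies on the extended content union above (or a cast) until the schema includes it:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// Assumed helper from the snippet above: returns base64-encoded WAV audio for the given text.
declare function generateSpeech(text: string): Promise<string>;

const server = new Server(
  { name: "speech-server", version: "0.1.0" }, // hypothetical server metadata
  { capabilities: { tools: {} } }
);

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "speak") { // hypothetical tool name
    const text = String(request.params.arguments?.text ?? "");
    const audioBase64 = await generateSpeech(text);
    return {
      content: [{
        type: "audio", // requires the extended content union
        mimeType: "audio/wav",
        data: audioBase64
      }]
    };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});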

I was unable to find the process for building the TypeScript SDK from the schema, hence the approach of extending the types locally.

Ultimately I would like to integrate this into my chat application, supporting gpt-4o (and potential new models) with audio support.

Breaking Changes

No. However:

  1. The Client Reference Implementation (Claude Desktop) does not support audio.
  2. The SDKs will require updating to include the extended type.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

I believe that adding the "audio" content type is appropriate, as it is congruent with the way text/image modalities are typically handled by model APIs, for example:

modalities: ["text", "audio"],
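
For example, audio content returned by an MCP tool could be forwarded to gpt-4o-audio-preview as an input_audio part. This is a minimal sketch following the OpenAI audio guide linked above; the prompt, voice, and variable names are illustrative assumptions, and note that the model takes a format name ("wav") rather than the MIME type ("audio/wav"):

import OpenAI from "openai";

// Base64 WAV taken from the MCP tool result shown earlier (assumed to be available here).
declare const audioBase64: string;

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Please summarize this clip." },
        { type: "input_audio", input_audio: { data: audioBase64, format: "wav" } },
      ],
    },
  ],
});

console.log(completion.choices[0].message);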

@jspahrsummers
Member

jspahrsummers previously approved these changes Dec 2, 2024

Thank you! This makes sense to me, and seems like a clean extension to the protocol.

@dsp-ant Any thoughts?

(Outdated review comment on docs/specification/client/sampling.md, resolved)
@jspahrsummers
Member

jspahrsummers commented Dec 2, 2024

We probably want to rev the protocol version, since this would be a new type that receivers may not be expecting.

@dsp-ant
Member

dsp-ant commented Dec 2, 2024

This seems reasonable to me. I am not accepting mostly so we don't accidentally merge this. We need to first rev the protocol and add ways to handle revisions in the current protocol.

@evalstate
Author

> This seems reasonable to me. I am not accepting mostly so we don't accidentally merge this. We need to first rev the protocol and add ways to handle revisions in the current protocol.

@dsp-ant - I couldn't find the tool/script to generate new versions of the SDKs from the spec for testing - are they available?

@jspahrsummers
Member

We've now created a separate place for the draft version of the spec. Can you please move this there?

> I couldn't find the tool/script to generate new versions of the SDKs from the spec for testing - are they available?

We use Claude to update the SDKs in response to spec changes—e.g., by giving it the current SDK interfaces and a diff of what changed in the schema.
