Updated Specification and Documentation to support Audio Modality. #93

Open · wants to merge 4 commits into main
Conversation

evalstate

This change supports discussion #88 and includes Audio Modality in the specification.

Motivation and Context

This would enable integration with models that accept and produce audio in context, such as gpt-4o-audio-preview:
https://platform.openai.com/docs/guides/audio

How Has This Been Tested?

This has been tested using the Inspector tool, with local type extensions:

import { z } from "zod";
// Existing content schemas from the TypeScript SDK (import path assumes the published package layout).
import {
  ResultSchema,
  TextContentSchema,
  ImageContentSchema,
  EmbeddedResourceSchema,
} from "@modelcontextprotocol/sdk/types.js";

// Define the AudioContent schema
export const AudioContentSchema = z.object({
  type: z.literal("audio"),
  data: z.string().base64(),
  mimeType: z.string(),
}).passthrough();

// Extend the CallToolResult schema to include audio content
export const ExtendedCallToolResultSchema = ResultSchema.extend({
  content: z.array(
    z.discriminatedUnion("type", [
      TextContentSchema,
      ImageContentSchema,
      AudioContentSchema,
      EmbeddedResourceSchema,
    ])
  ),
  isError: z.boolean().default(false).optional(),
});

// Export the types
export type AudioContent = z.infer<typeof AudioContentSchema>;
export type ExtendedCallToolResult = z.infer<typeof ExtendedCallToolResultSchema>;
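
For illustration, here is a minimal sketch of validating a tool result against the extended schema on the client side; the payload is a made-up example (stub base64 data), not output from a real server:

// Hypothetical tool result containing an audio item.
const rawResult = {
  content: [
    {
      type: "audio",
      mimeType: "audio/wav",
      data: "UklGRiQAAABXQVZFZm10IA==", // stub base64, not a playable WAV
    },
  ],
};

// parse() throws if the payload does not match the extended content union.
const result = ExtendedCallToolResultSchema.parse(rawResult);

for (const item of result.content) {
  if (item.type === "audio") {
    console.log(`Received ${item.mimeType} audio (${item.data.length} base64 chars)`);
  }
}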

The Inspector application was updated to render an audio player for this content type:

              {item.type === "image" && (
                <img
                  src={`data:${item.mimeType};base64,${item.data}`}
                  ...
              {item.type === "audio" && (
                  <audio
                  controls
                  src={`data:${item.mimeType};base64,${item.data}`}

[Screenshot: the Inspector rendering an audio player for the returned audio content]

The server produced the result content as follows:

const audioBase64 = await generateSpeech(text);
return {
  content: [{
    type: "audio",
    mimeType: "audio/wav",
    data: audioBase64
  }]
};
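
For context, a hedged sketch of how that fragment might sit inside a tool handler registered with the TypeScript SDK. The tool name "speak" and the generateSpeech helper are assumptions taken from the snippet, and with the stock SDK types the "audio" item relies on the extended content union above (or a cast) until the schema includes it:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// Assumed helper from the snippet above: returns base64-encoded WAV audio for the given text.
declare function generateSpeech(text: string): Promise<string>;

const server = new Server(
  { name: "speech-server", version: "0.1.0" }, // hypothetical server metadata
  { capabilities: { tools: {} } }
);

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "speak") { // hypothetical tool name
    const text = String(request.params.arguments?.text ?? "");
    const audioBase64 = await generateSpeech(text);
    return {
      content: [{
        type: "audio", // requires the extended content union
        mimeType: "audio/wav",
        data: audioBase64
      }]
    };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});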

I was unable to find the process for building the TypeScript SDK from the schema, hence the approach of extending the types locally.

Ultimately I would like to integrate this into my chat application, supporting gpt-4o (and potential new models) with audio support.

Breaking Changes

No. However:

  1. The Client Reference Implementation (Claude Desktop) does not support audio.
  2. The SDKs will require updating to include the extended type.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

I believe that adding the "audio" content type is appropriate, as it is congruent with the way text/image modalities are typically handled by model APIs, for example:

modalities: ["text", "audio"],
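
For example, audio content returned by an MCP tool could be forwarded to gpt-4o-audio-preview as an input_audio part. This is a minimal sketch following the OpenAI audio guide linked above; the prompt, voice, and variable names are illustrative assumptions, and note that the model takes a format name ("wav") rather than the MIME type ("audio/wav"):

import OpenAI from "openai";

// Base64 WAV taken from the MCP tool result shown earlier (assumed to be available here).
declare const audioBase64: string;

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Please summarize this clip." },
        { type: "input_audio", input_audio: { data: audioBase64, format: "wav" } },
      ],
    },
  ],
});

console.log(completion.choices[0].message);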

@jspahrsummers
Member

jspahrsummers previously approved these changes Dec 2, 2024

Thank you! This makes sense to me, and seems like a clean extension to the protocol.

@dsp-ant Any thoughts?

(Outdated review comment on docs/specification/client/sampling.md, resolved)
@jspahrsummers
Member

jspahrsummers commented Dec 2, 2024

We probably want to rev the protocol version, since this would be a new type that receivers may not be expecting.

@dsp-ant
Member

dsp-ant commented Dec 2, 2024

This seems reasonable to me. I am not accepting mostly so we don't accidentally merge this. We need to first rev the protocol and add ways to handle revisions in the current protocol.

@evalstate
Author

> This seems reasonable to me. I am not accepting mostly so we don't accidentally merge this. We need to first rev the protocol and add ways to handle revisions in the current protocol.

@dsp-ant - I couldn't find the tool/script to generate new versions of the SDKs from the spec for testing - are they available?

@jspahrsummers
Member

We've now created a separate place for the draft version of the spec. Can you please move this there?

> I couldn't find the tool/script to generate new versions of the SDKs from the spec for testing - are they available?

We use Claude to update the SDKs in response to spec changes—e.g., by giving it the current SDK interfaces and a diff of what changed in the schema.
