Sonnet 3.5 #11
I sort of achieved this by:

```julia
using Anthropic

const system_prompt = """You are a Julia programmer. You should answer with julia code in codeblocks like
```julia
code
```
Don't provide examples!
Don't annotate the function arguments with types.
After you created the code, you can stop.
"""

function gen_reply(model, task::HumanEvalTask; chain_of_thought=false, kw...)
    conversation = [
        Dict("role" => "system", "content" => system_prompt),
        Dict("role" => "user", "content" => task.prompt),
    ]
    r = ai_ask_safe(conversation; model)
    [r.content]
end
```
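Since the system prompt asks for the answer inside ```julia fences, the reply usually still needs to be unwrapped before evaluation. A minimal helper for that could look like this (a hypothetical sketch, not part of the repo; `extract_julia_block` is an assumed name, and it assumes the reply content is a plain string):

```julia
# Hypothetical helper (not in the repo): extract the first ```julia fenced block
# from a model reply; fall back to the raw reply if no fence is found.
function extract_julia_block(reply::AbstractString)
    m = match(r"```julia\s*\n(.*?)```"s, reply)
    m === nothing ? strip(reply) : strip(m.captures[1])
end
```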
Then I ran:

```julia
evaluate("claude-3-5-sonnet-20240620")
```
Still, it's strange to me. This is the result I got with the prompt you provided above: claude-3-5-sonnet-20240620.zip. The result is close to the one I provided in the README with the default system prompt (I updated the leaderboard earlier this week).
I don't understand why any test fails for you. :( Have you tried with the system prompt I provided?
Yes, I already added the result file in my last comment: #11 (comment)
I will attach mine tomorrow, hopefully.
Thanks!
This is also what I want to figure out ;)
claude-3-5-sonnet-20240620.zip

What is your guess?
OK, I unzipped that file, put it in place, and ran:

```julia
julia> include("src/evaluation.jl")

julia> evaluate("claude-3-5-sonnet-20240620-cvikli")
```

And the result is:

And the final score is:

So I guess the reason might be that the
Pff. OK, sorry for the mistake! I'm not sure what I did wrong then.
Here it is: results.zip
Hey,
I just wanted to note that Sonnet 3.5 probably knows the whole HumanEval set... or it is really great at coding.
So I ran the test using https://github.com/svilupp/PromptingTools.jl
The results:
I know this isn't an issue, but it works well and could be pointed out.
Also, just for comparison: it isn't exactly giving the reference solutions back. It actually solves literally every problem differently, see:
Problem 002:
The reference solution:
The Claude version:
For problem 13:
Also Claude:
Test 76:
Reference:
Claude Sonnet 3.5:
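As an aside, one crude way to put a number on "solves every problem differently" (an illustrative sketch only, not something used in this evaluation) is a token-level Jaccard similarity between a reference solution and Claude's version:

```julia
# Illustrative sketch (not part of the benchmark): crude token-overlap (Jaccard)
# similarity between two code strings. Identical token sets score 1.0, disjoint 0.0.
tokens(code::AbstractString) = Set(split(code, r"\W+"; keepempty=false))

function jaccard(a::AbstractString, b::AbstractString)
    ta, tb = tokens(a), tokens(b)
    u = union(ta, tb)
    isempty(u) && return 1.0
    length(intersect(ta, tb)) / length(u)
end
```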