Sonnet 3.5 #11

Open
Cvikli opened this issue Aug 13, 2024 · 10 comments

Comments

@Cvikli

Cvikli commented Aug 13, 2024

Hey,
I just wanted to note that Sonnet 3.5 probably knows the whole HumanEval... Or it is really great at coding.
So I ran the test using https://github.com/svilupp/PromptingTools.jl

The results:
[three images of the results]

I know this isn't an issue, but it works well and that could be pointed out.

Also, for comparison: it isn't simply giving the reference solutions back. It actually solves literally every problem differently, see:

Problem 002:
The reference solution:

function truncate_number(number::Float64)::Float64
    return number % 1.0
end

The Claude version:

function truncate_number(number)
    return number - floor(number)
end
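
To make the difference between the two versions concrete, here is a small sketch that runs both side by side. The names `truncate_ref` and `truncate_claude` are made up for this comparison, and the negative input is only there to show where the two diverge; it assumes the task itself only uses positive numbers.

```julia
# Reference approach: remainder of division by 1.0 (keeps the sign of the input).
truncate_ref(number::Float64)::Float64 = number % 1.0

# Claude's approach: subtract the floor (always lands in [0, 1)).
truncate_claude(number) = number - floor(number)

truncate_ref(3.5)      # 0.5
truncate_claude(3.5)   # 0.5

# The two only disagree for negative inputs:
truncate_ref(-3.5)     # -0.5
truncate_claude(-3.5)  # 0.5
```

So on positive inputs the two are interchangeable, which is consistent with both passing the same tests.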

Problem 13:
The reference solution:

function greatest_common_divisor(a::Int, b::Int)::Int
    while b != 0
        a, b = b, a % b
    end

    return a
end

Also Claude:

function greatest_common_divisor(a, b)
    while b != 0
        a, b = b, a % b
    end
    return abs(a)
end
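
The only difference here is the `abs` at the end. A quick sketch of when that matters, with the two versions renamed to `gcd_ref` and `gcd_claude` (names chosen just for this comparison) and Julia's built-in `gcd` as a cross-check:

```julia
# Reference solution from above, renamed.
function gcd_ref(a::Int, b::Int)::Int
    while b != 0
        a, b = b, a % b
    end
    return a
end

# Claude's solution from above, renamed.
function gcd_claude(a, b)
    while b != 0
        a, b = b, a % b
    end
    return abs(a)
end

gcd_ref(12, 18)     # 6
gcd_claude(12, 18)  # 6

# With a negative argument the sign of the result can differ,
# because `%` keeps the sign of the dividend:
gcd_ref(4, -6)      # -2
gcd_claude(4, -6)   # 2
gcd(4, -6)          # 2 (built-in)
```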

Test 76:
Reference:

function is_simple_power(x::Number, n::Number)::Bool
    n == 1 && return x == 1

    x, n = BigFloat(x), BigFloat(n)
    power = 1
    while power < x
        power = power * n
    end

    isapprox(power, x)
end

Claude Sonnet 3.5:

function is_simple_power(x, n)
    if x == 1
        return true
    elseif n == 1 || x < n
        return false
    end

    while x % n == 0
        x = x ÷ n
    end

    return x == 1
end
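
These two take genuinely different routes: the reference multiplies up in `BigFloat` and compares approximately, while the Claude version divides down with exact remainders. A rough sketch of where that shows, using the renamed copies `is_simple_power_ref` and `is_simple_power_claude` (names made up here); the non-integer input is an assumption of mine and may not appear in the actual tests.

```julia
# Reference solution from above, renamed.
function is_simple_power_ref(x::Number, n::Number)::Bool
    n == 1 && return x == 1
    x, n = BigFloat(x), BigFloat(n)
    power = 1
    while power < x
        power = power * n
    end
    isapprox(power, x)
end

# Claude's solution from above, renamed.
function is_simple_power_claude(x, n)
    if x == 1
        return true
    elseif n == 1 || x < n
        return false
    end
    while x % n == 0
        x = x ÷ n
    end
    return x == 1
end

is_simple_power_ref(8, 2)            # true
is_simple_power_claude(8, 2)         # true
is_simple_power_ref(3, 2)            # false
is_simple_power_claude(3, 2)         # false

# 2.25 == 1.5^2, but only the multiply-up version sees it:
is_simple_power_ref(2.25, 1.5)       # true
is_simple_power_claude(2.25, 1.5)    # false
```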
@Cvikli
Author

Cvikli commented Aug 13, 2024

I sort of achieved this with the code below.
Because the code contains backticks (`), I cannot show it accurately here:

using Anthropic

const system_prompt = """You are a Julia programmer. You should answer with julia code in codeblocks like 
```julia
code
```
Don't provide examples! 
Don't annotate the function arguments with types. 
After you created the code, you can stop. 
"""
function gen_reply(model, task::HumanEvalTask; chain_of_thought=false, kw...)
    conversation = [Dict("role"=>"system" , "content"=>system_prompt), Dict("role"=>"user" , "content"=>task.prompt)]
    r = ai_ask_safe(conversation; model)
    [r.content]
end

evaluate("claude-3-5-sonnet-20240620")
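
Since the system prompt asks for the answer inside a julia code block, whatever consumes the reply has to pull the code out of the fence before running the tests. Below is a minimal sketch of that step. `extract_julia_block` is a hypothetical helper written for illustration, not a function from this repository or from the Anthropic package, and the real pipeline may handle replies differently.

```julia
# Hypothetical helper: return the body of the first julia code block in `reply`,
# or `nothing` if the reply contains no such block.
function extract_julia_block(reply::AbstractString)
    fence = "`"^3  # three backticks, built this way so the example can be shown
    # "s" flag lets `.` match newlines; `.*?` stops at the first closing fence.
    pattern = Regex(fence * "julia\\s*\\n(.*?)" * fence, "s")
    m = match(pattern, reply)
    return m === nothing ? nothing : String(strip(m.captures[1]))
end

fence = "`"^3
reply = "Here it is:\n" * fence * "julia\nf(x) = x + 1\n" * fence * "\nDone."
extract_julia_block(reply)           # "f(x) = x + 1"
extract_julia_block("no code here")  # nothing
```

If a reply has no fenced block, or the extraction goes wrong, the generation file ends up with something other than the intended function, which is one way two runs with the same model and prompt could score differently.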

@findmyway
Contributor

Still it's strange to me that claude-3-5-sonnet-20240620 passed all the tests.

This is the result I got with the prompt you provided above: claude-3-5-sonnet-20240620.zip

The result is close to the one I provided in README with the default system prompt (I updated the leaderboard earlier this week).

@Cvikli
Author

Cvikli commented Aug 17, 2024

I don't understand why any test fails for you. :(

Have you tried with the system prompt I provided?

@findmyway
Contributor

Have you tried with the system prompt I provided?

Yes, I already added the result file in my last comment #11 (comment)

@Cvikli
Author

Cvikli commented Aug 18, 2024

I will attach mine tomorrow hopefully.
I can't figure out what differs between your run and mine. Why did I get 100% if we use the same LLM with the same system prompt?

@findmyway
Contributor

I will attach mine tomorrow hopefully.

Thanks!

I can't figure out what differs between your run and mine. Why did I get 100% if we use the same LLM with the same system prompt?

This is also what I want to figure out;)

@Cvikli
Author

Cvikli commented Aug 19, 2024

claude-3-5-sonnet-20240620.zip
I am attaching it.

What is your guess?

@findmyway
Contributor

findmyway commented Aug 20, 2024

OK, I unzipped that file and put it under the generations folder. Then I removed the results folder so that the results would be reevaluated.

julia> include("src/evaluation.jl")

julia> evaluate("claude-3-5-sonnet-20240620-cvikli")

And the result is

Test Summary:                                                                       |   Pass  Fail  Error   Total     Time
HumanEval                                                                           | 112176  5366   3206  120748  1m31.7s
  test                                                                              | 112176  5366   3206  120748         
...
    test/evalplus                                                                   | 110794  5302   3178  119274         
...

And the final score is:

model                              temperature  evalplus  basic
claude-3-5-sonnet-20240620-cvikli  0.0          0.732     0.799
claude-3-5-sonnet-20240620-tj      0.0          0.707     0.793
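
As a side note on reading these numbers: the Test Summary counts individual assertions, while the final score is a much lower fraction. A small arithmetic sketch using only the counts from the `HumanEval` row above; the variable names are made up here.

```julia
# Counts copied from the "HumanEval" row of the Test Summary.
pass, fail, err, total = 112176, 5366, 3206, 120748

# The three columns add up to the total.
@assert pass + fail + err == total

# Share of individual assertions that passed.
assertion_pass_rate = round(pass / total; digits=3)  # 0.929
```

About 93% of assertions pass, yet the scores are 0.732 and 0.799. One plausible reading, which is an assumption on my part and not confirmed in this thread, is that the score is counted per problem, so a single failing assertion marks the whole problem as failed.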

So I guess the reason might be that the results were incorrectly generated.

@Cvikli
Author

Cvikli commented Aug 20, 2024

Pff. Ok, sorry for the mistake then!

I am not sure what I did wrong, then.
Could you attach the results as well, just so I can see which tests went wrong?

@findmyway
Contributor

Here it is: results.zip
