Sonnet 3.5 #11

Cvikli · 2024-08-13T11:58:03Z

Hey,
I just wanted to note that Sonnet 3.5 probably knows the whole human eval... Or it is really great at coding.
So I ran the test using https://github.com/svilupp/PromptingTools.jl

The results:

I know this isn't an issue. But it works well and could be pointed out.

Also just to see comparisons, it isn't exactly giving the reference solutions back. It actually solve literally every problem differently see:

Problem 002:
The reference solution:

function truncate_number(number::Float64)::Float64
    return number % 1.0
end

The Claude version:

function truncate_number(number)
    return number - floor(number)
end

For problem 13:

function greatest_common_divisor(a::Int, b::Int)::Int
    while b != 0
        a, b = b, a % b
    end

    return a
end

Also Claude:

function greatest_common_divisor(a, b)
    while b != 0
        a, b = b, a % b
    end
    return abs(a)
end

Test 76:
Reference:

function is_simple_power(x::Number, n::Number)::Bool
    n == 1 && return x == 1

    x, n = BigFloat(x), BigFloat(n)
    power = 1
    while power < x
        power = power * n
    end

    isapprox(power, x)
end

Claude Sonnet 3.5:

function is_simple_power(x, n)
    if x == 1
        return true
    elseif n == 1 || x < n
        return false
    end

    while x % n == 0
        x = x ÷ n
    end

    return x == 1
end

The text was updated successfully, but these errors were encountered:

Cvikli · 2024-08-13T12:57:13Z

I sort of achieved this by:
~~Due to the code contains ` I cannot show you accurately....~~

using Anthropic

const system_prompt = """You are a Julia programmer. You should answer with julia code in codeblocks like 
```julia
code
```
Don't provide examples! 
Don't annotate the function arguments with types. 
After you created the code, you can stop. 
"""
function gen_reply(model, task::HumanEvalTask; chain_of_thought=false, kw...)
    conversation = [Dict("role"=>"system" , "content"=>system_prompt), Dict("role"=>"user" , "content"=>task.prompt)]
    r = ai_ask_safe(conversation; model)
    [r.content]
end

evaluate("claude-3-5-sonnet-20240620")

findmyway · 2024-08-17T12:51:19Z

Still it's strange to me that claude-3-5-sonnet-20240620 passed all the tests.

This is the result I got with the prompt you provided above claude-3-5-sonnet-20240620.zip

The result is close to the one I provided in README with the default system prompt (I updated the leaderboard earlier this week).

Cvikli · 2024-08-17T15:13:46Z

I don't understand why does any test fails for you. :(

Have you tried with the system prompt I provided?

findmyway · 2024-08-18T03:44:17Z

Have you tried with the system prompt I provided?

Yes, I already added the result file in my last comment #11 (comment)

Cvikli · 2024-08-18T20:03:20Z

I will attach mine tomorrow hopefully.
I cannot get what is different in your solution and mine. Why I got 100% then if we use the same LLM with the same system prompt?

findmyway · 2024-08-19T00:38:01Z

I will attach mine tomorrow hopefully.

Thanks!

I cannot get what is different in your solution and mine. Why I got 100% then if we use the same LLM with the same system prompt?

This is also what I want to figure out;)

Cvikli · 2024-08-19T10:53:14Z

claude-3-5-sonnet-20240620.zip
I am attaching it.

What is your guess?

findmyway · 2024-08-20T03:42:47Z

OK, I unzip that file and put it under the generations folder. Then I removed the results folder to reevaluate the results.

julia> include("src/evaluation.jl")

julia> evaluate("claude-3-5-sonnet-20240620-cvikli")

And the result is

Test Summary:                                                                       |   Pass  Fail  Error   Total     Time
HumanEval                                                                           | 112176  5366   3206  120748  1m31.7s
  test                                                                              | 112176  5366   3206  120748         
...
    test/evalplus                                                                   | 110794  5302   3178  119274         
...

And the final score is:

model	temperature	evalplus	basic
claude-3-5-sonnet-20240620-cvikli	0.0	0.732	0.799
claude-3-5-sonnet-20240620-tj	0.0	0.707	0.793

So I guess the reason might be that the results were incorrectly generated.

Cvikli · 2024-08-20T07:18:48Z

Pff. Ok, sorry for the mistake then!

I am not sure what I did wrong then.
could you attach the results yet just for me to see which one was wrong in the tests?

findmyway · 2024-08-20T11:34:20Z

Here it is results.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sonnet 3.5 #11

Sonnet 3.5 #11

Cvikli commented Aug 13, 2024 •

edited

Loading

Cvikli commented Aug 13, 2024 •

edited by findmyway

Loading

findmyway commented Aug 17, 2024

Cvikli commented Aug 17, 2024

findmyway commented Aug 18, 2024

Cvikli commented Aug 18, 2024

findmyway commented Aug 19, 2024

Cvikli commented Aug 19, 2024

findmyway commented Aug 20, 2024 •

edited

Loading

Cvikli commented Aug 20, 2024

findmyway commented Aug 20, 2024

Sonnet 3.5 #11

Sonnet 3.5 #11

Comments

Cvikli commented Aug 13, 2024 • edited Loading

Cvikli commented Aug 13, 2024 • edited by findmyway Loading

findmyway commented Aug 17, 2024

Cvikli commented Aug 17, 2024

findmyway commented Aug 18, 2024

Cvikli commented Aug 18, 2024

findmyway commented Aug 19, 2024

Cvikli commented Aug 19, 2024

findmyway commented Aug 20, 2024 • edited Loading

Cvikli commented Aug 20, 2024

findmyway commented Aug 20, 2024

Cvikli commented Aug 13, 2024 •

edited

Loading

Cvikli commented Aug 13, 2024 •

edited by findmyway

Loading

findmyway commented Aug 20, 2024 •

edited

Loading