Enter Claude 2

As of the time of writing (14th July), the large language model (LLM) space is still dominated by ChatGPT, both in terms of press coverage and in terms of use cases.

Anthropic has just released what could be considered the first real contender to ChatGPT 4.

When standing the two up against each other, it is important to remember that ChatGPT 4 has a usage cap of 25 messages per 3 hours, which is a serious limitation for some of its uses.

I thought it would be interesting to use one of the student prompts I have recently been teaching: it is of moderate complexity and is a day-to-day productivity type prompt. I then asked ChatGPT 4 + Wolfram to score each response against the prompt, using a second scoring prompt.

Before we dive deeper into that, there is probably a flaw in using ChatGPT to score: it seems to be a much more generous marker than I was as a human baseline. Perhaps I am curmudgeonly!

Despite my insisting on a criterion that, given a certain condition, a zero should be scored, this wasn't achieved reliably.

The prompt concepts for the test

The basic idea of the prompt was to produce a well-formatted output that solved a household productivity problem: in this case, building out a recipe list for 5 days plus a combined shopping list.

It specified some dietary constraints, some output format requirements, and the level of cooking complexity.

The output of the prompt was then fed (copy pasta!!) into a ChatGPT 4 + Wolfram plugin session; the same scoring prompt and session were used for each test. This scoring prompt asked ChatGPT to review the output of the previous execution against the prompt's requirements, rated against six criteria:

Adherence, Completeness, Recipe Quality, Clarity, Summary of Ingredients, and Accuracy of the response.

I gave examples for each criterion to help lead and scale the scoring. This worked to some extent, but not amazingly well. A topic for future work?
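To make that concrete, the scoring prompt was structured roughly like the sketch below. This is an illustrative reconstruction, not the verbatim prompt: the six criteria names are the real ones, but the rubric wording and the helper function are mine.

```python
# Illustrative reconstruction of the scoring prompt -- not the verbatim
# prompt used in these tests. The criteria are real; the rubric wording
# and this helper are made up for the sketch.
CRITERIA = [
    "Adherence",
    "Completeness",
    "Recipe Quality",
    "Clarity",
    "Summary of Ingredients",
    "Accuracy",
]

def build_scoring_prompt(original_prompt: str, candidate_output: str) -> str:
    """Assemble a judge prompt to paste into a ChatGPT 4 + Wolfram session."""
    # In the real prompt, each criterion also carried worked examples of a
    # low, middling and high score, to lead and scale the marking.
    rubric = "\n".join(
        f"- {name}: score 0-10; score 0 if the hard requirement is violated."
        for name in CRITERIA
    )
    return (
        "Review the RESPONSE below against the requirements of the PROMPT.\n"
        f"Rate it on six criteria:\n{rubric}\n\n"
        f"PROMPT:\n{original_prompt}\n\n"
        f"RESPONSE:\n{candidate_output}\n\n"
        "Finish with a short closing comment."
    )
```

Pasting the assembled text into the same scoring session for every model is what keeps the judge consistent across tests.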

ChatGPT 3.5 Turbo

Whilst everyone talks about ChatGPT 4, this is the free version, so I thought it would be interesting to baseline here.

The scoring was pretty good, and it also provided a closing comment:

Overall, the response is of high quality and adheres to the prompt well. However, it could be improved by providing the total quantity of each ingredient needed for all the recipes in the shopping list.

The human view (mine!): I liked the recipes. They avoided the prohibited ingredients, and the response was fast. It did a weaker job on the combined shopping list, failing, for example, to count up the total amount of chicken. Some of the subtle requirements were lost.
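To be clear about what "counting up" means here: a combined shopping list should total each ingredient across the week's recipes, along these lines. The ingredients and quantities below are invented for illustration, not taken from the actual output.

```python
from collections import defaultdict

# Invented example data -- the real recipes differed; this just shows the
# aggregation step the model skipped.
recipes = {
    "Mon: chicken stir fry": {"chicken breast (g)": 400, "broccoli (g)": 200},
    "Wed: chicken curry": {"chicken breast (g)": 500, "onion (g)": 150},
}

shopping_list = defaultdict(int)
for ingredients in recipes.values():
    for item, qty in ingredients.items():
        shopping_list[item] += qty  # total each ingredient across all recipes

for item, qty in sorted(shopping_list.items()):
    print(f"{item}: {qty}")
# chicken breast should appear once as 900 g, not as two separate entries
```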

But all in all, a solid, productive response which would certainly have saved me time.

ChatGPT 4 + Wolfram plugin

I would (as the human in the loop here) tend to agree. There may be some unintentional bias, as the prompt's writing style was probably subtly optimised to suit ChatGPT 4's foibles.

Interestingly enough, Wolfram was never called to check the maths, even though that is why I added it in; it has been used in previous prompting sessions. Here is ChatGPT 4's summary of itself, which is a bit meta:

Overall, the response is excellent and fulfills all the requirements of the prompt. It scores a perfect 10 in all categories.

Claude 2

The new contender. The key difference is its 100k token context (3x GPT-4's), so it should have no trouble with a task of this size.

I actually (as the human) think the scoring was a little generous. The clarity of layout and the subtle formatting were weaker. The prompt asked for a 'layout as a typical recipe card', which is vague, but ChatGPT produced a much better layout from it. The prose in the recipes was also a little more 'punchy and to the point', as opposed to leading and instructional.

Still usable, and it still worked. The closing comment from the scoring session:

Overall, the response is of high quality and mostly adheres to the prompt. The recipes are well-detailed and easy to follow, and the shopping list is comprehensive. However, the response could be improved by providing the total quantity of each ingredient needed for all the recipes and explaining some cooking terms.

Conclusions

Claude is a pretty decent competitor. I really need to explore the impact of the 100k context; perhaps a multi-study, data-driven approach is the way to do it. For me, though, ChatGPT 4 still has the overall edge in a practical use case, provided you can stay within its usage constraints.

It's nice to have a contender.

I should probably run this test with an open-source LLM, but for now the weekend and the supermarket are calling.
