Unexpected space character #2346

eldarkurtic · 2024-09-25T07:05:50Z

Hi,
While running leaderboard_mmlu_pro evals I've noticed an unexpected space character. Here is an example request:

2024-09-25:06:46:53,199 INFO     [evaluator_utils.py:200] Request: Instance(request_type='loglikelihood', doc={'question_id': 70, 'question': 'Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.', 'options': ['Safe practices, Fear, Jealousy, Trivial', 'Unsafe practices, Distress, Joy, Trivial', 'Safe practices, Wants, Jealousy, Trivial', 'Safe practices, Distress, Fear, Trivial', 'Unsafe practices, Wants, Jealousy, Serious', 'Safe practices, Distress, Jealousy, Serious', 'Safe practices, Wants, Fear, Serious', 'Unsafe practices, Wants, Fear, Trivial', 'Unsafe practices, Distress, Fear, Serious'], 'answer': 'I', 'answer_index': 8, 'cot_content': '', 'category': 'business', 'src': 'ori_mmlu-business_ethics'}, arguments=("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe symmetric group $S_n$ has $\n\\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.\nFind the characteristic of the ring 2Z.\nA. 0\nB. 30\nC. 3\nD. 10\nE. 12\nF. 50\nG. 2\nH. 100\nI. 20\nJ. 5\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nLet V be the set of all real polynomials p(x). Let transformations T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret (ST)(p(x)) as S(T(p(x))). Which of the following is true?\nA. ST + TS is the identity map of V onto itself.\nB. TS = 0\nC. ST = 1\nD. ST - TS = 0\nE. ST = T\nF. ST = 0\nG. ST = TS\nH. ST - TS is the identity map of V onto itself.\nI. TS = T\nJ. ST = S\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nH<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nLet A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) \\in A}?\nA. -5\nB. 0\nC. -3\nD. -7\nE. -4\nF. -6\nG. -1\nH. -2\nI. -9\nJ. N/A\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nE<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nA tank initially contains a salt solution of 3 grams of salt dissolved in 100 liters of water. A salt solution containing 0.02 grams of salt per liter of water is sprayed into the tank at a rate of 4 liters per minute. The sprayed solution is continually mixed with the salt solution in the tank, and the mixture flows out of the tank at a rate of 4 liters per minute. If the mixing is instantaneous, how many grams of salt are in the tank after 100 minutes have elapsed?\nA. 3 + e^-2\nB. 2 - e^-4\nC. 2 - e^-2\nD. 3 + e^-4\nE. 2 + e^-3\nF. 2 - e^-3\nG. 3 - e^-2\nH. 2 + e^-2\nI. 2 + e^-4\nJ. 2\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nA total of 30 players will play basketball at a park. There will be exactly 5 players on each team. Which statement correctly explains how to find the number of teams needed?\nA. Multiply 5 by 5 to find 25 teams.\nB. Divide 30 by 5 to find 6 teams.\nC. Add 5 to 30 to find 35 teams.\nD. Subtract 30 from 5 to find -25 teams.\nE. Divide 5 by 30 to find 0.1667 teams.\nF. Add 5 to 30 then divide by 2 to find 17.5 teams.\nG. N/A\nH. N/A\nI. N/A\nJ. N/A\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nB<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTypical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.\nA. Safe practices, Fear, Jealousy, Trivial\nB. Unsafe practices, Distress, Joy, Trivial\nC. Safe practices, Wants, Jealousy, Trivial\nD. Safe practices, Distress, Fear, Trivial\nE. Unsafe practices, Wants, Jealousy, Serious\nF. Safe practices, Distress, Jealousy, Serious\nG. Safe practices, Wants, Fear, Serious\nH. Unsafe practices, Wants, Fear, Trivial\nI. Unsafe practices, Distress, Fear, Serious\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", ' I'), idx=8, metadata=('leaderboard_mmlu_pro', 0, 1), resps=[], filtered_resps={}, task_name='leaderboard_mmlu_pro', doc_id=0, repeats=1)

This is a 5-shot example, so looking at the first shot in arguments, the correct answer is formatted as:

The symmetric group $S_n$ has $\n\\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.\nFind the characteristic of the ring 2Z.\nA. 0\nB. 30\nC. 3\nD. 10\nE. 12\nF. 50\nG. 2\nH. 100\nI. 20\nJ. 5\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA<|eot_id|>

More specifically, notice that the correct answer is presented as: <|end_header_id|>\n\nA<|eot_id|> (no space before A).

Unfortunately, contrary to few-shot examples, the answer of the actual question has a space character:
...<|end_header_id|>\n\n", ' I').

Before trying to do down the rabbit hole to find where this diff is coming from, I wanted to reach out here in case you are already familiar with this?
My guess is that this is probably coming from the infamous add_prefix_space "feature" of HF-tokenizers and the fact that answers from few-shot samples are tokenized as part of a larger sequence, whereas the answer of the actual question is tokenized on its own as a single character.

The text was updated successfully, but these errors were encountered:

baberabb · 2024-09-25T11:17:10Z

Hi! This is because we default to target_delimiter=" ". That was a natural choice for base models, but we should think about the best way to handle this when the chat template takes care of the formatting.

cc: @NathanHB @clefourrier @haileyschoelkopf

eldarkurtic · 2024-09-25T14:49:20Z

Since at the moment I am mostly running leaderboard tasks, I have measured what the impact is from this subtle change with " " in front of the target answer. Here are results:

Without the space, the scores now perfectly match with HF Leaderboard scores (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Notice that with space, 70B model is almost as bad as the 8B one, which definitely seems unexpected.

The only change I made was to add target_delimiter: "" into yaml configs for leaderboard tasks. In case this is an acceptable fix, let me know and I can open a PR with changes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected space character #2346

Unexpected space character #2346

eldarkurtic commented Sep 25, 2024

baberabb commented Sep 25, 2024

eldarkurtic commented Sep 25, 2024 •

edited

Loading

Unexpected space character #2346

Unexpected space character #2346

Comments

eldarkurtic commented Sep 25, 2024

baberabb commented Sep 25, 2024

eldarkurtic commented Sep 25, 2024 • edited Loading

eldarkurtic commented Sep 25, 2024 •

edited

Loading