Unexpected space character #2346

Open
eldarkurtic opened this issue Sep 25, 2024 · 2 comments

@eldarkurtic
Contributor

Hi,
While running leaderboard_mmlu_pro evals I've noticed an unexpected space character. Here is an example request:

2024-09-25:06:46:53,199 INFO     [evaluator_utils.py:200] Request: Instance(request_type='loglikelihood', doc={'question_id': 70, 'question': 'Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.', 'options': ['Safe practices, Fear, Jealousy, Trivial', 'Unsafe practices, Distress, Joy, Trivial', 'Safe practices, Wants, Jealousy, Trivial', 'Safe practices, Distress, Fear, Trivial', 'Unsafe practices, Wants, Jealousy, Serious', 'Safe practices, Distress, Jealousy, Serious', 'Safe practices, Wants, Fear, Serious', 'Unsafe practices, Wants, Fear, Trivial', 'Unsafe practices, Distress, Fear, Serious'], 'answer': 'I', 'answer_index': 8, 'cot_content': '', 'category': 'business', 'src': 'ori_mmlu-business_ethics'}, arguments=("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThe symmetric group $S_n$ has $\n\\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.\nFind the characteristic of the ring 2Z.\nA. 0\nB. 30\nC. 3\nD. 10\nE. 12\nF. 50\nG. 2\nH. 100\nI. 20\nJ. 5\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nLet V be the set of all real polynomials p(x). Let transformations T, S be defined on V by T:p(x) -> xp(x) and S:p(x) -> p'(x) = d/dx p(x), and interpret (ST)(p(x)) as S(T(p(x))). Which of the following is true?\nA. ST + TS is the identity map of V onto itself.\nB. TS = 0\nC. ST = 1\nD. ST - TS = 0\nE. ST = T\nF. ST = 0\nG. ST = TS\nH. ST - TS is the identity map of V onto itself.\nI. TS = T\nJ. ST = S\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nH<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nLet A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. 
What is the greatest negative number in the set B = {m + n : (m, n) \\in A}?\nA. -5\nB. 0\nC. -3\nD. -7\nE. -4\nF. -6\nG. -1\nH. -2\nI. -9\nJ. N/A\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nE<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nA tank initially contains a salt solution of 3 grams of salt dissolved in 100 liters of water. A salt solution containing 0.02 grams of salt per liter of water is sprayed into the tank at a rate of 4 liters per minute. The sprayed solution is continually mixed with the salt solution in the tank, and the mixture flows out of the tank at a rate of 4 liters per minute. If the mixing is instantaneous, how many grams of salt are in the tank after 100 minutes have elapsed?\nA. 3 + e^-2\nB. 2 - e^-4\nC. 2 - e^-2\nD. 3 + e^-4\nE. 2 + e^-3\nF. 2 - e^-3\nG. 3 - e^-2\nH. 2 + e^-2\nI. 2 + e^-4\nJ. 2\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nA total of 30 players will play basketball at a park. There will be exactly 5 players on each team. Which statement correctly explains how to find the number of teams needed?\nA. Multiply 5 by 5 to find 25 teams.\nB. Divide 30 by 5 to find 6 teams.\nC. Add 5 to 30 to find 35 teams.\nD. Subtract 30 from 5 to find -25 teams.\nE. Divide 5 by 30 to find 0.1667 teams.\nF. Add 5 to 30 then divide by 2 to find 17.5 teams.\nG. N/A\nH. N/A\nI. N/A\nJ. N/A\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nB<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTypical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.\nA. Safe practices, Fear, Jealousy, Trivial\nB. Unsafe practices, Distress, Joy, Trivial\nC. Safe practices, Wants, Jealousy, Trivial\nD. Safe practices, Distress, Fear, Trivial\nE. Unsafe practices, Wants, Jealousy, Serious\nF. 
Safe practices, Distress, Jealousy, Serious\nG. Safe practices, Wants, Fear, Serious\nH. Unsafe practices, Wants, Fear, Trivial\nI. Unsafe practices, Distress, Fear, Serious\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", ' I'), idx=8, metadata=('leaderboard_mmlu_pro', 0, 1), resps=[], filtered_resps={}, task_name='leaderboard_mmlu_pro', doc_id=0, repeats=1)

This is a 5-shot example, so looking at the first shot in arguments, the correct answer is formatted as:

The symmetric group $S_n$ has $\n\\factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements.\nFind the characteristic of the ring 2Z.\nA. 0\nB. 30\nC. 3\nD. 10\nE. 12\nF. 50\nG. 2\nH. 100\nI. 20\nJ. 5\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nA<|eot_id|>

More specifically, notice that the correct answer is presented as: <|end_header_id|>\n\nA<|eot_id|> (no space before A).

Unfortunately, unlike in the few-shot examples, the answer to the actual question has a leading space character:
...<|end_header_id|>\n\n", ' I').

Before going down the rabbit hole to find where this difference comes from, I wanted to reach out here in case you are already familiar with it.
My guess is that this is probably coming from the infamous add_prefix_space "feature" of HF-tokenizers and the fact that answers from few-shot samples are tokenized as part of a larger sequence, whereas the answer of the actual question is tokenized on its own as a single character.
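Regardless of where the space comes from, the mismatch itself can be checked mechanically on a logged request. Below is a minimal sketch, assuming the Llama 3 style chat markers seen in the log above; the function name `leading_space_mismatch` and the regex are hypothetical helpers for illustration, not part of the harness.

```python
import re

# Matches a completed assistant turn in the chat format shown in the log:
# header, blank line, answer text, then the end-of-turn marker.
ASSISTANT_TURN = re.compile(
    r"<\|start_header_id\|>assistant<\|end_header_id\|>\n\n(.*?)<\|eot_id\|>",
    re.S,
)


def leading_space_mismatch(context: str, continuation: str) -> bool:
    """Return True if the few-shot answers and the scored continuation
    disagree on whether the answer starts with a space."""
    shots = ASSISTANT_TURN.findall(context)
    shots_have_space = any(s.startswith(" ") for s in shots)
    return shots_have_space != continuation.startswith(" ")


if __name__ == "__main__":
    # Shortened stand-in for the `arguments` tuple in the log above.
    context = (
        "<|start_header_id|>user<|end_header_id|>\n\nQ1\nAnswer:<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\nA<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\nQ2\nAnswer:<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    print(leading_space_mismatch(context, " I"))
    print(leading_space_mismatch(context, "I"))
```

The final assistant header in the context has no closing `<|eot_id|>`, so only the few-shot answers are matched and compared against the continuation.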

@baberabb
Contributor

Hi! This is because we default to target_delimiter=" ". That was a natural choice for base models, but we should think about the best way to handle this when the chat template takes care of the formatting.

cc: @NathanHB @clefourrier @haileyschoelkopf
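To make the effect of the default concrete, here is a simplified sketch of how a delimiter ends up in front of each candidate answer. The function `build_loglikelihood_args` is a hypothetical stand-in written for this thread, not the harness's actual code; only the `target_delimiter` name and its `" "` default come from the comment above.

```python
def build_loglikelihood_args(
    context: str, choices: list[str], target_delimiter: str = " "
) -> list[tuple[str, str]]:
    """Pair the context with each candidate continuation,
    prefixing every continuation with the delimiter."""
    return [(context, f"{target_delimiter}{choice}") for choice in choices]


if __name__ == "__main__":
    # Shortened stand-in for the chat-templated prompt from the log.
    ctx = "...Answer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    # Default delimiter: continuations carry a leading space.
    print(build_loglikelihood_args(ctx, ["A", "I"]))
    # Empty delimiter: continuations match the few-shot formatting.
    print(build_loglikelihood_args(ctx, ["A", "I"], target_delimiter=""))
```

With a base model the context typically ends in `Answer:`, so the space separates it from the answer; with a chat template the context already ends in `\n\n`, and the space becomes an extra character the few-shot turns never contain.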

@eldarkurtic
Contributor Author

eldarkurtic commented Sep 25, 2024

Since I am mostly running leaderboard tasks at the moment, I measured the impact of the " " in front of the target answer. Here are the results:
[image: scores with and without the leading space]

Without the space, the scores now match the HF Leaderboard scores perfectly (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Notice that with the space, the 70B model is almost as bad as the 8B one, which is definitely unexpected.

The only change I made was to add target_delimiter: "" to the YAML configs for the leaderboard tasks. If this is an acceptable fix, let me know and I can open a PR with the changes.
