feat: add SWE-bench fullset support #3477
QQ: does the evaluation need so many images? 🙀
One image per instance, so yes 😢 - that's why we need good infra to run this at scale
Fine, that is crazy.
Thanks @xingyaoww. This is exactly what I was looking for...
if args.set == 'full-test':
    dataset = load_dataset('princeton-nlp/SWE-bench', split='test')
elif args.set == 'lite-test':
    dataset = load_dataset('princeton-nlp/SWE-bench_Lite', split='test')
It would be awesome if you could add ('princeton-nlp/SWE-bench', split='dev') and ('princeton-nlp/SWE-bench_Lite', split='dev') as well :)
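For illustration, the request above could be sketched as a lookup table in place of the if/elif chain. This is only a sketch, not the code in this PR: the set names 'full-dev' and 'lite-dev' and the helpers `resolve_dataset` and `load_swe_bench` are hypothetical, and it assumes the Hugging Face `datasets` package is installed.

```python
# Hypothetical mapping from a --set value to (dataset path, split).
# 'full-test' and 'lite-test' come from the snippet above;
# 'full-dev' and 'lite-dev' are assumed names for the requested dev splits.
DATASETS = {
    'full-test': ('princeton-nlp/SWE-bench', 'test'),
    'lite-test': ('princeton-nlp/SWE-bench_Lite', 'test'),
    'full-dev': ('princeton-nlp/SWE-bench', 'dev'),
    'lite-dev': ('princeton-nlp/SWE-bench_Lite', 'dev'),
}


def resolve_dataset(set_name):
    """Return the (path, split) pair for a set name, or raise ValueError."""
    try:
        return DATASETS[set_name]
    except KeyError:
        valid = ', '.join(sorted(DATASETS))
        raise ValueError(f'unknown set {set_name!r}; expected one of: {valid}') from None


def load_swe_bench(set_name):
    """Load the selected split; imports `datasets` lazily so lookup stays cheap."""
    from datasets import load_dataset  # third-party: Hugging Face datasets

    path, split = resolve_dataset(set_name)
    return load_dataset(path, split=split)
```

With this shape, supporting another split would be a one-line addition to the table rather than a new branch.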
Hey @jatinganhotra, I tried to look into this by adding dev set -- but was blocked by princeton-nlp/SWE-bench#199. Will try to add support for dev again once that issue is resolved.
Being addressed by #3478
Looks good, thanks!
What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Give a summary of what the PR does, explaining any non-trivial design decisions
Running run_infer.sh without first pulling the instance docker images -> run_infer.sh should be able to automatically pull them.
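The pull-on-demand behavior described above could look roughly like the following. This is a minimal sketch, not the logic in run_infer.sh: the helper name `ensure_image` and the injectable `run` parameter are hypothetical, and it assumes the `docker` CLI is on the PATH.

```python
import subprocess


def ensure_image(image, run=subprocess.run):
    """Pull `image` only if it is missing locally.

    Returns True if a pull was performed, False if the image was already there.
    `run` is injectable (an assumption of this sketch) so the logic can be
    exercised without a Docker daemon.
    """
    # `docker image inspect` exits non-zero when the image is not present locally.
    inspect = run(
        ['docker', 'image', 'inspect', image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if inspect.returncode == 0:
        return False
    # check=True raises CalledProcessError if the pull itself fails.
    run(['docker', 'pull', image], check=True)
    return True
```

Because there is one image per instance, checking before pulling avoids re-downloading images that are already cached on the machine.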