[SWE-bench] Util: Compare files modified between gold patches and OpenDevin patches #2934
https://aider.chat/2024/05/22/swe-bench-lite.html reports that Aider finds the correct files to edit in 70.3% of benchmark tests. This PR adds a utility script that computes the same number for OpenDevin.
Usage Example:
claude-3-5-sonnet@20240620_maxiter_30_N_v1.8-no-hint
The result says OpenDevin found the correct files to edit in 64% of benchmark tests. The real number might be even lower, because I count an instance as a success as long as OpenDevin modifies the correct file, even if it also modified other files erroneously and caused the test to fail.
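To make the success criterion concrete, here is a minimal sketch of how such a comparison could be computed. It is not the script added in this PR: the function names (`files_in_patch`, `found_correct_files`, `correct_file_rate`) are hypothetical, patches are assumed to be git-style unified diffs, and "correct" is assumed to mean that every file touched by the gold patch is also touched by the OpenDevin patch, with extra files ignored.

```python
from __future__ import annotations

import re
from typing import Iterable

# Matches the header that git emits once per changed file,
# e.g. "diff --git a/src/foo.py b/src/foo.py".
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def files_in_patch(patch: str) -> set[str]:
    """Return the set of file paths modified by a git-style patch."""
    return {match.group(2) for match in DIFF_HEADER.finditer(patch)}


def found_correct_files(gold_patch: str, model_patch: str) -> bool:
    """Assumed criterion: all gold files appear in the model patch.

    Extra files in the model patch do not count against it, which is
    why the resulting rate is an upper bound on real correctness.
    """
    gold_files = files_in_patch(gold_patch)
    return bool(gold_files) and gold_files <= files_in_patch(model_patch)


def correct_file_rate(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (gold_patch, model_patch) pairs that pass the check."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(found_correct_files(gold, model) for gold, model in pairs)
    return hits / len(pairs)
```

Under these assumptions, an instance whose patch edits the gold file plus an unrelated file still counts as a hit, which matches the caveat above that the reported percentage may overstate the real one.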