feat: add SWE-bench fullset support (#3477)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* add push script

* handle the case when ret push is a generator

* update pbar
xingyaoww committed Sep 3, 2024
1 parent 57ad058 commit d283420
Showing 6 changed files with 2,515 additions and 23 deletions.
45 changes: 29 additions & 16 deletions evaluation/swe_bench/README.md
@@ -19,27 +19,16 @@ Please follow instruction [here](../README.md#setup) to setup your local develop
OpenHands now supports using the [official evaluation docker](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) for both **[inference](#run-inference-on-swe-bench-instances) and [evaluation](#evaluate-generated-patches)**.
This is now the default behavior.

### Download Docker Images

**(Recommended for reproducibility)** If you have extra local space (e.g., 100GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared by running:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```

If you want to save some disk space (e.g., you have only ~50GB free) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```

## Run Inference on SWE-Bench Instances

Make sure your Docker daemon is running, and you have pulled the [instance-level docker image](#openhands-swe-bench-instance-level-docker-support).
Make sure your Docker daemon is running, and that you have ample disk space (at least 200-500GB, depending on which SWE-Bench set you are running) for the [instance-level docker image](#openhands-swe-bench-instance-level-docker-support).

When the `run_infer.sh` script is started, it will automatically pull the relevant SWE-Bench images. For example, for instance ID `django__django-11011`, it will try to pull our pre-built docker image `sweb.eval.x86_64.django_s_django-11011` from DockerHub. This image will be used to create an OpenHands runtime image in which the agent will operate.
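
Note how the `__` separator in the SWE-bench instance ID becomes `_s_` in the image name. The sketch below illustrates that naming scheme as inferred from the example above; the authoritative mapping lives in `run_infer.py` (`get_instance_docker_image`) and may differ in details such as prefix or case handling:

```python
# Inferred instance-ID -> image-name mapping; an illustration of the docs
# above, not the shipped implementation.
DOCKER_IMAGE_PREFIX = 'sweb.eval.x86_64.'

def instance_docker_image(instance_id: str) -> str:
    # '__' in SWE-bench instance IDs appears as '_s_' in image names.
    return DOCKER_IMAGE_PREFIX + instance_id.replace('__', '_s_')

assert instance_docker_image('django__django-11011') == 'sweb.eval.x86_64.django_s_django-11011'
```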

```bash
./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 300
./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 300 30 1 princeton-nlp/SWE-bench_Lite test
```

where `model_config` is mandatory, and the rest are optional.
@@ -57,6 +46,8 @@ in order to use `eval_limit`, you must also set `agent`.
default, it is set to 30.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
default, it is set to 1.
- `dataset`, a Hugging Face dataset name, e.g. `princeton-nlp/SWE-bench` or `princeton-nlp/SWE-bench_Lite`, specifies which dataset to evaluate on.
- `dataset_split`, the split of the Hugging Face dataset to use, e.g. `test` or `dev`. Defaults to `test`; see the loading sketch below.
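
Both values are passed straight to Hugging Face `datasets` by `run_infer.py` (see the diff further down). A minimal sketch of just the loading step, assuming the `datasets` package is installed:

```python
from datasets import load_dataset

# The dataset name and split correspond to the CLI arguments above.
dataset = load_dataset('princeton-nlp/SWE-bench_Lite', split='test')
print(len(dataset), dataset[0]['instance_id'])  # size and first instance ID
```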

There are also two optional environment variables you can set.
@@ -95,6 +86,28 @@ After running the inference, you will obtain an `output.jsonl` (by default it will

## Evaluate Generated Patches

### Download Docker Images

**(Recommended for reproducibility)** If you have extra local space (e.g., 200GB), you can try pulling the [instance-level docker images](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level) we've prepared by running:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance
```

If you want to save some disk space (e.g., you have only ~50GB free) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env
```

If you want to evaluate on the full SWE-Bench test set:

```bash
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance full
```

### Run Evaluation

With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
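
Before kicking off the full evaluation, it can help to sanity-check the inference output. Below is a minimal sketch for peeking at `output.jsonl`; the field names used here (`instance_id`, `git_patch`) are assumptions, so verify them against your own file:

```python
import json

# Spot-check that each record has an instance ID and a non-empty patch.
# Field names are assumptions; adjust to the actual output.jsonl schema.
with open('output.jsonl') as f:
    for line in f:
        record = json.loads(line)
        patch = record.get('git_patch') or ''
        print(record.get('instance_id'), 'patch bytes:', len(patch))
```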
27 changes: 23 additions & 4 deletions evaluation/swe_bench/run_infer.py
@@ -25,8 +25,8 @@
AppConfig,
SandboxConfig,
get_llm_config_arg,
get_parser,
load_from_env,
parse_arguments,
)
from openhands.core.logger import openhands_logger as logger
from openhands.core.main import create_runtime, run_controller
@@ -109,6 +111,11 @@ def get_config(
if USE_INSTANCE_IMAGE:
# We use a different instance image for the each instance of swe-bench eval
base_container_image = get_instance_docker_image(instance['instance_id'])
logger.info(
f'Using instance container image: {base_container_image}. '
f'Please make sure this image exists. '
f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
)
else:
base_container_image = SWE_BENCH_CONTAINER_IMAGE
logger.info(f'Using swe-bench container image: {base_container_image}')
@@ -411,12 +416,26 @@ def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:


if __name__ == '__main__':
args = parse_arguments()
parser = get_parser()
parser.add_argument(
'--dataset',
type=str,
default='princeton-nlp/SWE-bench',
help='data set to evaluate on, either full-test or lite-test',
)
parser.add_argument(
'--split',
type=str,
default='test',
help='split to evaluate on',
)
args, _ = parser.parse_known_args()

# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenHands's repo
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
swe_bench_tests = filter_dataset(dataset['test'].to_pandas(), 'instance_id')
dataset = load_dataset(args.dataset, split=args.split)
logger.info(f'Loaded dataset {args.dataset} with split {args.split}')
swe_bench_tests = filter_dataset(dataset.to_pandas(), 'instance_id')

llm_config = None
if args.llm_config:
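
A note on the argument-handling pattern above: the script extends the shared parser returned by `get_parser()` and then calls `parse_known_args()`, which tolerates flags it does not define (useful when a wrapper script such as `run_infer.sh` passes extra options). A self-contained sketch of that behavior, using a plain `argparse.ArgumentParser` as a stand-in for OpenHands's `get_parser()`:

```python
import argparse

# Stand-in for get_parser(); only the parse_known_args pattern matters here.
parser = argparse.ArgumentParser()
parser.add_argument('--dataset', type=str, default='princeton-nlp/SWE-bench')
parser.add_argument('--split', type=str, default='test')

# parse_known_args returns (namespace, leftovers) instead of erroring
# on unrecognized flags.
args, unknown = parser.parse_known_args(
    ['--dataset', 'princeton-nlp/SWE-bench_Lite', '--other-flag', '1']
)
print(args.dataset, args.split)  # princeton-nlp/SWE-bench_Lite test
print(unknown)                   # ['--other-flag', '1']
```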