Use LLM to analyze ML-Bench failure cases #2399

super-dainiu · 2024-06-11T17:53:58Z

This PR gives an LLM context from OpenDevin interaction histories and has it classify each failure case into one of the error classes defined in the ML-Bench paper.
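As a rough illustration of that flow (not the code in this PR), the sketch below formats an interaction history into a prompt, asks a model for a label, and maps the reply back onto a known class. The function names, the `role`/`content` history layout, and the example labels are all hypothetical stand-ins; the real error classes are the ones defined in the ML-Bench paper, and `llm` is any callable that takes a prompt string and returns text.

```python
from typing import Callable, Optional, Sequence


def build_prompt(history: Sequence[dict], categories: Sequence[str]) -> str:
    """Render an interaction history and the candidate classes as one prompt."""
    turns = "\n".join(f"[{t['role']}] {t['content']}" for t in history)
    options = "\n".join(f"- {c}" for c in categories)
    return (
        "The following agent run failed. Pick the single best-matching "
        "error class and answer with its name only.\n\n"
        f"Error classes:\n{options}\n\n"
        f"Interaction history:\n{turns}\n"
    )


def parse_label(response: str, categories: Sequence[str]) -> Optional[str]:
    """Return the first category mentioned in the reply, or None if none match."""
    lowered = response.lower()
    for category in categories:
        if category.lower() in lowered:
            return category
    return None


def classify_failure(
    history: Sequence[dict],
    categories: Sequence[str],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Ask the model for a label and map it back onto a known category."""
    return parse_label(llm(build_prompt(history, categories)), categories)


# Hypothetical usage with a stub in place of a real model call.
example_categories = ["syntax error", "missing knowledge"]  # placeholder labels
example_history = [
    {"role": "agent", "content": "python run.py"},
    {"role": "env", "content": "SyntaxError: invalid syntax"},
]
label = classify_failure(example_history, example_categories, lambda p: "Syntax Error")
```

Substring matching is deliberately forgiving about how the model phrases its answer, but it would need tightening if one class name were contained in another.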

tangxiangru · 2024-06-12T07:20:29Z

@xingyaoww @li-boxuan please review and merge if it looks good to you.

xingyaoww

LGTM! It would be nice if we could build a generalized version of this error analyzer so that we can easily analyze errors for other tasks as well.
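One possible shape for such a generalized analyzer, sketched under assumptions: the task supplies its own category list and a `render` function that turns a failed instance into text, and everything else is shared. `ErrorAnalyzer`, its fields, and the `unclassified` fallback are hypothetical names used for illustration, not anything in the repository.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class ErrorAnalyzer:
    """Task-agnostic failure classifier (illustrative sketch only)."""

    categories: Sequence[str]
    render: Callable[[dict], str]  # task-specific: failed instance -> text
    llm: Callable[[str], str]      # any prompt -> completion callable
    fallback: str = "unclassified"

    def classify(self, instance: dict) -> str:
        prompt = (
            "Classify this failed run into exactly one of: "
            + ", ".join(self.categories)
            + "\n\n"
            + self.render(instance)
        )
        answer = self.llm(prompt).strip().lower()
        for category in self.categories:
            if category.lower() == answer:
                return category
        # Anything the model says outside the known labels is kept visible
        # rather than silently forced into a class.
        return self.fallback

    def summarize(self, instances: Iterable[dict]) -> Counter:
        """Count how many failures fall into each category."""
        return Counter(self.classify(i) for i in instances)
```

Supporting a new benchmark would then mean writing only a `render` function and a category list for it.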

Ubuntu and others added 30 commits May 23, 2024 16:46

add ml-bench w/o exec env

a0bdeae

fix

efd0bc5

fix

5729199

Merge branch 'main' of https://github.com/super-dainiu/OpenDevin

55c60d2

fix typos (All-Hands-AI#1956)

01f81e1

no functional change

Refactored Logs (All-Hands-AI#1939)

a42dcdb

Update README.md SWE-bench score (All-Hands-AI#1959)

64c500f

* Update README.md SWE-bench score Our most recent results on swe-bench lite are 25%, so this updates the README accordingly. * Update

fix: llm is_local function logic error (All-Hands-AI#1961)

7f3b4b7

Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

doc: update documentation about poetry update (All-Hands-AI#1962)

638da19

* add doc * Update Development.md --------- Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

doc: add more cmd in unit test documentation (All-Hands-AI#1963)

4bcacc3

--- (All-Hands-AI#1975)

e8fb4dd

updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

fix All-Hands-AI#1960 (All-Hands-AI#1964)

fa80ebe

Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (All-Hands-AI#1987)

9f6eed4

Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

Save CI cycles for backend tests (All-Hands-AI#1985)

a08f8c0

Fix typo in prompt (All-Hands-AI#1992)

56eb1a7

fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (All-Hands-AI#1994)

5b0af10

add ml-bench in readme

9f89abc

super-dainiu and others added 22 commits June 3, 2024 05:53

use try except

785c0cc

modify raise exception

3251f34

add int

c0fe06c

update README

fdc8a7a

longer time

e278e8c

fix existing issues

6a47e00

Merge branch 'main' into main

1e63622

fix existing issue

03be34a

Merge branch 'main' of https://github.com/super-dainiu/OpenDevin into main

4f2c91f

new docker image

f89b293

add metrics of cost

b13bdfd

add result parsing cost

9fa1c5b

fix

e5d879f

Merge branch 'main' into main

f287a87

fix

f2b9d7c

update summarize

2049365

fix

4137a73

fix continued inference

b9c89f8

add analyze

9057460

Merge branch 'OpenDevin:main' into main

54b4ec8

Merge branch 'main' of https://github.com/super-dainiu/OpenDevin into main

eff2244

update readme

624737b

super-dainiu marked this pull request as ready for review June 11, 2024 17:56

super-dainiu and others added 4 commits June 11, 2024 18:14

use 4o

561f31f

Merge branch 'main' into main

1cf2e39

add eval output

805e856

Merge branch 'main' of https://github.com/super-dainiu/OpenDevin into main

8289014

xingyaoww approved these changes Jun 12, 2024


xingyaoww merged commit 563bc41 into All-Hands-AI:main Jun 13, 2024
2 checks passed
