
Minimal Docker Sandbox with GPT-3.5 Execution Example #48

Merged (22 commits), Mar 21, 2024

Conversation

@xingyaoww (Contributor) commented Mar 18, 2024

A minimalistic implementation of a Docker sandbox in fewer than 100 LOC, requiring only built-in libraries. It exposes an .execute API to run arbitrary bash commands inside the Docker container.

It requires Docker to be installed on the system, and it will likely need some adjustment when switching to other OSes. I hope it can serve as a starting point for future development.

Run it: python3 opendevin/sandbox/docker.py
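
For context, here is a minimal sketch of the approach using only built-in libraries. The class and method names follow the description above, but the body is an assumption, not the PR's actual docker.py:

```python
# Sketch only: a persistent `docker run -i` bash process whose stdin/stdout
# back an .execute() API. The real docker.py differs in details.
import subprocess
import uuid


class DockerInteractive:
    CONTAINER_IMAGE = "ubuntu:22.04"  # assumed default image

    def __init__(self, container_image: str | None = None):
        self.container_name = f"sandbox-{uuid.uuid4().hex[:8]}"
        image = container_image or self.CONTAINER_IMAGE
        # -i keeps stdin open so we can stream commands into the shell.
        self.proc = subprocess.Popen(
            ["docker", "run", "--rm", "-i", "--name", self.container_name,
             image, "/bin/bash"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True,
        )

    def execute(self, cmd: str) -> str:
        # Echo a unique sentinel after the command so we know where output ends.
        sentinel = f"__CMD_DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{cmd}; echo {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if sentinel in line:
                break
            lines.append(line)
        return "".join(lines)

    def close(self) -> None:
        subprocess.run(["docker", "kill", self.container_name],
                       capture_output=True)
```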

Example screenshot:

[screenshot]

@xingyaoww (Contributor, Author)

Just checking some existing PRs, I realized this is somewhat similar to @geohotstan's #29, the difference being that the command we run is docker -it instead of /bin/bash, so it is containerized.

@geohotstan (Contributor)

I opened #29 because I saw you mention it in slack hehe 😄

@xingyaoww (Contributor, Author) commented Mar 18, 2024

I just tweaked the container a bit. Using the CodeAct idea, I also added a minimal working example (fewer than 100 LOC, using litellm) that prompts gpt-3.5-turbo-0125 to write a Flask server, install the Flask library, and start the server. Example screenshots:

[two screenshots]

Most things work as expected, except that at the end the model did not follow the instruction to stop the interaction by outputting <execute> exit </execute>. This should be fixable by either (1) including a complete in-context example like this, or (2) collecting some interaction data like this and fine-tuning a model (like this; a more complex route).
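
For reference, the loop being described is roughly the following (a sketch; the real example's prompt and parsing are more complete, and litellm's OpenAI-style completion API is the only external dependency):

```python
# Sketch of the CodeAct-style loop: the LLM acts by emitting
# <execute>...</execute> blocks, which we run in the sandbox and whose
# output we feed back as the next user message.
import re
import litellm

SYSTEM_PROMPT = (
    "You are a coding agent. To run a bash command, wrap it in "
    "<execute>...</execute>. Output <execute> exit </execute> when done."
)


def run_agent(task: str, sandbox) -> None:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    while True:
        response = litellm.completion(
            model="gpt-3.5-turbo-0125", messages=messages)
        content = response.choices[0].message.content
        messages.append({"role": "assistant", "content": content})
        match = re.search(r"<execute>(.*?)</execute>", content, re.DOTALL)
        if match is None or match.group(1).strip() == "exit":
            # The failure mode above: the model may never emit `exit`,
            # so a real loop would also cap the number of turns.
            break
        observation = sandbox.execute(match.group(1).strip())
        messages.append({"role": "user", "content": observation})
```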

@xingyaoww xingyaoww changed the title Minimal Docker Sandbox Minimal Docker Sandbox with GPT-3.5 Execution Example Mar 18, 2024
@penberg (Contributor) commented Mar 18, 2024

I am seeing the following error when I exit the container:

penberg@vonneumann OpenDevin % python3 opendevin/sandbox/docker.py
Interactive Docker container started. Type 'exit' or use Ctrl+C to exit.
root@1149541e85f2:/# exit
Exiting...
Container killed.
Exception ignored in: <function DockerInteractive.__del__ at 0x100a02b60>
Traceback (most recent call last):
  File "/Users/penberg/src/penberg/OpenDevin/opendevin/sandbox/docker.py", line 80, in __del__
  File "/Users/penberg/src/penberg/OpenDevin/opendevin/sandbox/docker.py", line 70, in close
OSError: [Errno 9] Bad file descriptor

@xingyaoww (Contributor, Author)

Hey @penberg, thanks for the review! I tried to address these issues; do you mind helping me test whether the "Bad file descriptor" issue goes away? My Linux setup does not throw similar errors at the moment.

@penberg (Contributor) commented Mar 18, 2024

Works great now, thanks @xingyaoww!

@neubig (Contributor) commented Mar 20, 2024

Thanks a bunch for this! Just a question for @xingyaoww and @rbren: how does this PR play together with #35? They seem to overlap a little, and I was wondering whether it'd be necessary to combine them to get the best of both worlds.

@xingyaoww (Contributor, Author) commented Mar 20, 2024

Thanks @neubig! That is actually a pretty good question! I feel it ultimately comes down to the roadmap/structure of this project.

  • First pass at a control loop #35 relies on open-source libraries like langchain and llama-index, which may be more familiar to the general community and could work better in terms of SWE-Bench performance.
  • On the other hand, my example script uses our recent CodeAct idea, which relies on the LLM to perform most actions and demands more from the LLM itself: it needs to be capable enough to do everything autonomously instead of getting stuck in an infinite loop. The benefit of this route is that it does not impose external prompting and a control loop, so the final implementation can be minimalistic and easy to understand. The downside is that its performance may not match the langchain-style / MetaGPT-style approaches, since it requires more from the model itself.

At this stage, and considering the amount of interest and potential number of contributors from the community, it might be beneficial for the project to consider both routes and work on both in parallel until some future evaluation milestones. What I think is important is how we structure the project to allow such parallel development.

Here are my two cents about how we might organize this project.

  1. We should first define a clear and straightforward Agent abstraction (e.g., a base class) that everyone agrees on. It should have every method necessary to reproduce Devin's operation (and of course, we can update it when needed, but we shouldn't unless absolutely necessary). For example: (1) it receives an initial instruction from the human (e.g., run(instruction: str) -> success: bool), and when it stops, the container should have all the necessary files modified per the instruction for a human or eval harness to test; (2) optionally, humans should be able to chat with the agent during execution (.run) to alter its plan - this would need some multi-threading in actual implementations. (A sketch of such a base class follows this list.)
  2. Based on that abstraction, we can structure this project into the following folders:
    • frontend: just like the current setup.
    • backend: will need to orchestrate with both the front-end (including the database for chat messages, user login & management, etc.) and the defined Agent abstraction (but it does not necessarily need to know what is under the hood).
    • opendevin: a Python package where we put all the shared abstractions (e.g., Agent), components, and tools (e.g., sandbox, web browser, search API, selenium).
    • eval: an evaluation harness that takes in an Agent object and produces a set of metrics, e.g., SWE-Bench success rate, cost (i.e., number of tokens, $ cost), etc.
    • research: In this folder, there may exist multiple implementations of Agent. For example, research/langchain, research/metagpt, research/codeact, etc. Contributors from different backgrounds and interests can choose to contribute to any (or all!) of these directions.
  3. Except for research, the other folders (frontend, backend, eval, opendevin) can be developed like a normal open-source project (people contributing to the same codebase). research can be more flexible (multiple ideas developed in parallel), and we welcome people with different ideas to explore them.
  4. We can set milestones. For example, the first critical step would be to (1) get the Agent abstraction out, and (2) set up a working (containerized) SWE-Bench evaluation harness against the Agent abstraction. Ideally (if not too costly), we can set these up as automated GitHub workflows for automatic evaluation. Eventually, we can fairly compare different agent implementations in research with the same evaluation harness and collectively choose one of them to go forward with. Or even better, all the mature agent implementations (defined by a lower-bound SWE-Bench performance requirement) can co-exist, since they might have different cost-effectiveness trade-offs: some may perform really well but are costly to run (i.e., consume too many tokens), while some frameworks may not score as high but can work smoothly with an OSS model on a local laptop.
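
To make point 1 concrete, here is one hypothetical shape for the base class (illustrative names and signatures, nothing settled):

```python
# Hypothetical Agent abstraction from point 1 above; names/signatures
# are illustrative, not a settled interface.
from abc import ABC, abstractmethod


class Agent(ABC):
    @abstractmethod
    def run(self, instruction: str) -> bool:
        """Carry out the instruction end-to-end, leaving all modified
        files in the workspace for a human or eval harness to check,
        and return whether the agent believes it succeeded."""

    @abstractmethod
    def chat(self, message: str) -> None:
        """Optionally accept a message from the human mid-run to alter
        the plan (a real implementation would need multi-threading)."""
```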

Going back to these two PRs: if we eventually adopt a project structure like this, I think we can safely merge both into main, re-organize all shared components into opendevin/, and put the different implementations of control loops under research, properly documented.

@rbren (Collaborator) commented Mar 21, 2024

There are two interesting things here that I think we should take advantage of, but it might be hard to merge this PR as-is:

  • Docker sandbox - we need to formalize how we sandbox things. Currently we just run everything inside a Docker container, which will be annoying for e.g. developing the backend. We'll want the server to start an agent, plus a sandbox container for running commands, and get them talking together.
  • Minimalist agent - would be great to get this adapted to the Agent interface that Xingyao put together!

@xingyaoww (Contributor, Author)

I updated a few things:

  • DockerInteractive now supports passing in workspace_dir, mounting it, setting it as the cwd, and switching to a user that has permission to write directly to that directory.
  • I set --network=host, so that a server started by the agent is accessible from outside the container (a sketch of the resulting docker run flags follows this list).
  • I ported the minimalistic agent into a general codeact_agent and can confirm that it works as expected.
  • While doing that, I adjusted a few arguments for Agent, which I also did for langchains_agent, and made sure it works.
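
Roughly, those changes translate into docker run flags like the following (a sketch; the actual argument handling in the PR may differ):

```python
# Sketch of how workspace mounting, user switching, and host networking
# might map onto the `docker run` invocation (Unix-only: os.getuid).
import os


def build_docker_cmd(container_image: str, workspace_dir: str) -> list[str]:
    uid = os.getuid()  # run as the host user so writes to the mount succeed
    return [
        "docker", "run", "--rm", "-i",
        "--network=host",  # servers started by the agent reachable from host
        "-v", f"{os.path.abspath(workspace_dir)}:/workspace",
        "-w", "/workspace",  # set cwd to the mounted workspace
        "-u", f"{uid}:{uid}",  # user with write permission on the mount
        container_image, "/bin/bash",
    ]
```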

Regarding @rbren's comment: I completely agree! Some of the adjustments I made above address these issues:

  • Minimalist agent - addressed.
  • For the Docker sandbox: currently, an agent (running outside Docker) can run commands by interacting with this sandbox, so I think it shouldn't be too hard to adapt langchains_agent to use it. Once this PR is merged, we can start an issue and PR to adapt langchains_agent to use the DockerInteractive component we have now, so that only the execution requests run inside the container while all the LLM requests are performed outside it.

@rbren (Collaborator) commented Mar 21, 2024

OK awesome, that's exactly what I'm looking for! I have a CommandManager in this PR which runs everything with subprocess; it will be great to adapt it to use the Docker sandbox.
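
One hypothetical shape for that adaptation (CommandManager's real interface may differ; this just shows the delegation from a raw subprocess call to the sandbox):

```python
# Hypothetical adapter: keep CommandManager's interface but delegate
# execution to the Docker sandbox instead of a raw subprocess.
class CommandManager:
    def __init__(self, sandbox: "DockerInteractive"):
        self.sandbox = sandbox

    def run(self, cmd: str) -> str:
        # Previously something like:
        #   subprocess.run(cmd, shell=True, capture_output=True)
        # Now the command executes inside the container instead.
        return self.sandbox.execute(cmd)
```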

output_str = output_str.lstrip(user_input).lstrip()
return output_str

def execute(self, cmd: str) -> str:
Review comment (Collaborator):

If I'm reading this right, to run multiple commands concurrently (e.g. node server.js and curl localhost:3000) we'll want to instantiate multiple sandboxes with different IDs. I think that will work well.

Reply (Contributor, Author):

That's a great TODO for the next step, I guess! We could also consider running the container in the background with docker run and starting multiple shell sessions that docker attach to the same container, so that we save some resources (though those processes could potentially interfere with each other).
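
A sketch of that shared-container idea (note: docker exec is used here to get separate shells; docker attach, as mentioned above, would share the container's single main shell instead):

```python
# Start one long-lived container in the background, then open multiple
# shell sessions against it. Sessions share the container's filesystem
# and network, which saves resources but allows interference.
import subprocess

subprocess.run(
    ["docker", "run", "-d", "--name", "shared-sandbox",
     "ubuntu:22.04", "sleep", "infinity"],
    check=True,
)


def open_session() -> subprocess.Popen:
    # Each `docker exec -i` gets its own bash inside the same container.
    return subprocess.Popen(
        ["docker", "exec", "-i", "shared-sandbox", "/bin/bash"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )


server_session = open_session()  # could run e.g. `node server.js`
client_session = open_session()  # could run e.g. `curl localhost:3000`
```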

self.timeout: int = timeout

if container_image is None:
container_image = self.CONTAINER_IMAGE
Review comment (Collaborator):

This is super helpful--I imagine folks are going to want to define custom images so that the LLM doesn't have to e.g. install nodejs or rust every time it starts a new task.

We'll probably want to make this configurable in the UI.

@xingyaoww xingyaoww merged commit 2de75d4 into All-Hands-AI:main Mar 21, 2024
@xingyaoww xingyaoww deleted the sandbox branch March 21, 2024 13:55
xcodebuild pushed a commit to xcodebuild/OpenDevin that referenced this pull request Mar 31, 2024
* minimal docker sandbox

* make container_image as an argument (fall back to ubuntu);
increase timeout to avoid return too early for long running commands;

* add a minimal working (imperfect) example

* fix typo

* change default container name

* attempt to fix "Bad file descriptor" error

* handle ctrl+D

* add Python gitignore

* push sandbox to shared dockerhub for ease of use

* move codeact example into research folder

* add README for opendevin

* change container image name to opendevin dockerhub

* move folder; change example to a more general agent

* update Message and Role

* update docker sandbox to support mounting folder and switch to user with correct permission

* make network as host

* handle errors when attrs are not set yet

* convert codeact agent into a compatible agent

* add workspace to gitignore

* make sure the agent interface adjustment works for langchain_agent
Labels: enhancement (New feature or request)