
Minimal Docker Sandbox with GPT-3.5 Execution Example #48

Merged (22 commits), Mar 21, 2024

Conversation

@xingyaoww (Contributor) commented Mar 18, 2024

A minimalistic implementation of a Docker sandbox in fewer than 100 LOC, requiring only built-in libraries. It exposes an .execute API to run arbitrary bash commands inside the Docker container.

It requires Docker to be installed on the system, and it will likely need some adjustment when switching to other OSes. I hope it can serve as a starting point for future development.

Run it: python3 opendevin/sandbox/docker.py
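
For context, here is a minimal sketch of the approach using only built-in libraries. The class and method names follow the description above, but the body is an assumption, not the PR's actual docker.py:

```python
# Sketch only: a persistent `docker run -i` bash process whose stdin/stdout
# back an .execute() API. The real docker.py differs in details.
import subprocess
import uuid


class DockerInteractive:
    CONTAINER_IMAGE = "ubuntu:22.04"  # assumed default image

    def __init__(self, container_image: str | None = None):
        self.container_name = f"sandbox-{uuid.uuid4().hex[:8]}"
        image = container_image or self.CONTAINER_IMAGE
        # -i keeps stdin open so we can stream commands into the shell.
        self.proc = subprocess.Popen(
            ["docker", "run", "--rm", "-i", "--name", self.container_name,
             image, "/bin/bash"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True,
        )

    def execute(self, cmd: str) -> str:
        # Echo a unique sentinel after the command so we know where output ends.
        sentinel = f"__CMD_DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{cmd}; echo {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if sentinel in line:
                break
            lines.append(line)
        return "".join(lines)

    def close(self) -> None:
        subprocess.run(["docker", "kill", self.container_name],
                       capture_output=True)
```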

Example screenshot:

[screenshot]

@xingyaoww (Contributor, Author)

Just checking some existing PRs, I realized this is somewhat similar to @geohotstan's #29, the difference being that the command we run is docker -it instead of /bin/bash, so it is containerized.

@geohotstan (Contributor)

I opened #29 because I saw you mention it in slack hehe 😄

@xingyaoww (Contributor, Author) commented Mar 18, 2024

I just tweaked the container a bit. Using the CodeAct idea, I also added a minimal working example (fewer than 100 LOC, using litellm) that prompts gpt-3.5-turbo-0125 to write a Flask server, install the Flask library, and start the server. Example screenshots:

[two screenshots]

Most things work as expected, except that at the end the model did not follow the instruction to stop the interaction by outputting <execute> exit </execute>. This should be fixable by either (1) including a complete in-context example like this, or (2) collecting some interaction data like this and fine-tuning a model (like this; a more complex route).
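
For reference, the loop being described is roughly the following (a sketch; the real example's prompt and parsing are more complete, and litellm's OpenAI-style completion API is the only external dependency):

```python
# Sketch of the CodeAct-style loop: the LLM acts by emitting
# <execute>...</execute> blocks, which we run in the sandbox and whose
# output we feed back as the next user message.
import re
import litellm

SYSTEM_PROMPT = (
    "You are a coding agent. To run a bash command, wrap it in "
    "<execute>...</execute>. Output <execute> exit </execute> when done."
)


def run_agent(task: str, sandbox) -> None:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    while True:
        response = litellm.completion(
            model="gpt-3.5-turbo-0125", messages=messages)
        content = response.choices[0].message.content
        messages.append({"role": "assistant", "content": content})
        match = re.search(r"<execute>(.*?)</execute>", content, re.DOTALL)
        if match is None or match.group(1).strip() == "exit":
            # The failure mode above: the model may never emit `exit`,
            # so a real loop would also cap the number of turns.
            break
        observation = sandbox.execute(match.group(1).strip())
        messages.append({"role": "user", "content": observation})
```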

@xingyaoww xingyaoww changed the title Minimal Docker Sandbox Minimal Docker Sandbox with GPT-3.5 Execution Example Mar 18, 2024
@penberg (Contributor) commented Mar 18, 2024

I am seeing the following error when I exit the container:

penberg@vonneumann OpenDevin % python3 opendevin/sandbox/docker.py
Interactive Docker container started. Type 'exit' or use Ctrl+C to exit.
root@1149541e85f2:/# exit
Exiting...
Container killed.
Exception ignored in: <function DockerInteractive.__del__ at 0x100a02b60>
Traceback (most recent call last):
  File "/Users/penberg/src/penberg/OpenDevin/opendevin/sandbox/docker.py", line 80, in __del__
  File "/Users/penberg/src/penberg/OpenDevin/opendevin/sandbox/docker.py", line 70, in close
OSError: [Errno 9] Bad file descriptor

@xingyaoww (Contributor, Author)

Hey @penberg, thanks for the review! I tried to address these issues; do you mind helping me test whether the "Bad file descriptor" issue goes away? My Linux setup does not throw similar errors at the moment.

@penberg (Contributor) commented Mar 18, 2024

Works great now, thanks @xingyaoww!

@neubig (Contributor) commented Mar 20, 2024

Thanks a bunch for this! Just a question for @xingyaoww and @rbren: how does this PR play together with #35? They seem to overlap a little, and I was wondering whether it'd be necessary to combine them to get the best of both worlds.

@xingyaoww (Contributor, Author) commented Mar 20, 2024

Thanks @neubig! That is actually a pretty good question! I feel it ultimately comes down to the roadmap/structure of this project.

  • First pass at a control loop #35 relies on open-source libraries like langchain and llama-index, which may be more familiar to the general community and could work better in terms of SWE-Bench performance.
  • On the other hand, my example script uses our recent CodeAct idea, which relies on the LLM to perform most actions and demands more from the LLM itself: it needs to be capable enough to do everything autonomously instead of getting stuck in an infinite loop. The benefit of this route is that it does not impose external prompting and a control loop, so the final implementation can be minimalistic and easy to understand. The downside is that its performance may not match the langchain-style / MetaGPT-style approaches, since it requires more from the model itself.

At this stage, and considering the amount of interest and potential number of contributors from the community, it might be beneficial for the project to consider both routes and work on both in parallel until some future evaluation milestones. What I think is important is how we structure the project to allow such parallel development.

Here are my two cents about how we might organize this project.

  1. We should first define a clear and straightforward Agent abstraction (e.g., a base class) that everyone agrees on. It should have every method necessary to reproduce Devin's operation (and of course, we can update it when needed, but we shouldn't unless absolutely necessary). For example: (1) it receives an initial instruction from the human (e.g., run(instruction: str) -> success: bool), and when it stops, the container should have all the necessary files modified per the instruction for a human or eval harness to test; (2) optionally, humans should be able to chat with the agent during execution (.run) to alter its plan - this would need some multi-threading in actual implementations. (A sketch of such a base class follows this list.)
  2. Based on that abstraction, we can structure this project into the following folders:
    • frontend: just like the current setup.
    • backend: will need to orchestrate with both the front-end (including the database for chat messages, user login & management, etc.) and the defined Agent abstraction (but it does not necessarily need to know what is under the hood).
    • opendevin: a Python package where we put all the shared abstractions (e.g., Agent), components, and tools (e.g., sandbox, web browser, search API, selenium).
    • eval: an evaluation harness that takes in an Agent object and produces a set of metrics, e.g., SWE-Bench success rate, cost (i.e., number of tokens, $ cost), etc.
    • research: In this folder, there may exist multiple implementations of Agent. For example, research/langchain, research/metagpt, research/codeact, etc. Contributors from different backgrounds and interests can choose to contribute to any (or all!) of these directions.
  3. Except for research, the other folders (frontend, backend, eval, opendevin) can be developed like a normal open-source project (people contributing to the same codebase). research can be more flexible (multiple ideas developed in parallel), and we welcome people with different ideas to explore them.
  4. We can set milestones. For example, the first critical step would be to (1) get the Agent abstraction out, and (2) set up a working (containerized) SWE-Bench evaluation harness against the Agent abstraction. Ideally (if not too costly), we can set these up as automated GitHub workflows for automatic evaluation. Eventually, we can fairly compare different agent implementations in research with the same evaluation harness and collectively choose one of them to go forward with. Or even better, all the mature agent implementations (defined by a lower-bound SWE-Bench performance requirement) can co-exist, since they might have different cost-effectiveness trade-offs: some may perform really well but are costly to run (i.e., consume too many tokens), while some frameworks may not score as high but can work smoothly with an OSS model on a local laptop.
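
To make point 1 concrete, here is one hypothetical shape for the base class (illustrative names and signatures, nothing settled):

```python
# Hypothetical Agent abstraction from point 1 above; names/signatures
# are illustrative, not a settled interface.
from abc import ABC, abstractmethod


class Agent(ABC):
    @abstractmethod
    def run(self, instruction: str) -> bool:
        """Carry out the instruction end-to-end, leaving all modified
        files in the workspace for a human or eval harness to check,
        and return whether the agent believes it succeeded."""

    @abstractmethod
    def chat(self, message: str) -> None:
        """Optionally accept a message from the human mid-run to alter
        the plan (a real implementation would need multi-threading)."""
```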

Going back to these two PRs: if we eventually adopt a project structure like this, I think we can safely merge both into main, re-organize all shared components into opendevin/, and put the different implementations of control loops under research, properly documented.

@rbren (Collaborator) commented Mar 21, 2024

There are two interesting things here that I think we should take advantage of, but it might be hard to merge this PR as-is:

  • Docker sandbox - we need to formalize how we sandbox things. Currently we just run everything inside a Docker container, which will be annoying for e.g. developing the backend. We'll want the server to start an agent, plus a sandbox container for running commands, and get them talking together.
  • Minimalist agent - would be great to get this adapted to the Agent interface that Xingyao put together!

@xingyaoww (Contributor, Author)

I updated a few things:

  • DockerInteractive now supports passing in workspace_dir, mounting it, setting it as the cwd, and switching to a user that has permission to write directly to that directory.
  • I set --network=host, so that a server started by the agent is accessible from outside the container (a sketch of the resulting docker run flags follows this list).
  • I ported the minimalistic agent into a general codeact_agent and can confirm that it works as expected.
  • While doing that, I adjusted a few arguments for Agent, which I also did for langchains_agent, and made sure it works.
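
Roughly, those changes translate into docker run flags like the following (a sketch; the actual argument handling in the PR may differ):

```python
# Sketch of how workspace mounting, user switching, and host networking
# might map onto the `docker run` invocation (Unix-only: os.getuid).
import os


def build_docker_cmd(container_image: str, workspace_dir: str) -> list[str]:
    uid = os.getuid()  # run as the host user so writes to the mount succeed
    return [
        "docker", "run", "--rm", "-i",
        "--network=host",  # servers started by the agent reachable from host
        "-v", f"{os.path.abspath(workspace_dir)}:/workspace",
        "-w", "/workspace",  # set cwd to the mounted workspace
        "-u", f"{uid}:{uid}",  # user with write permission on the mount
        container_image, "/bin/bash",
    ]
```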

Regarding @rbren's comment: I completely agree! Some of the adjustments I made above address these issues:

  • Minimalist agent - addressed.
  • For the Docker sandbox: currently, an agent (running outside Docker) can run commands by interacting with this sandbox, so I think it shouldn't be too hard to adapt langchains_agent to use it. Once this PR is merged, we can start an issue and PR to adapt langchains_agent to use the DockerInteractive component we have now, so that only the execution requests run inside the container while all the LLM requests are performed outside it.

@rbren (Collaborator) commented Mar 21, 2024

OK awesome, that's exactly what I'm looking for! I have a CommandManager in this PR which runs everything with subprocess; it will be great to adapt it to use the Docker sandbox.
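
One hypothetical shape for that adaptation (CommandManager's real interface may differ; this just shows the delegation from a raw subprocess call to the sandbox):

```python
# Hypothetical adapter: keep CommandManager's interface but delegate
# execution to the Docker sandbox instead of a raw subprocess.
class CommandManager:
    def __init__(self, sandbox: "DockerInteractive"):
        self.sandbox = sandbox

    def run(self, cmd: str) -> str:
        # Previously something like:
        #   subprocess.run(cmd, shell=True, capture_output=True)
        # Now the command executes inside the container instead.
        return self.sandbox.execute(cmd)
```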

output_str = output_str.lstrip(user_input).lstrip()
return output_str

def execute(self, cmd: str) -> str:
Review comment (Collaborator):

If I'm reading this right, to run multiple commands concurrently (e.g. node server.js and curl localhost:3000) we'll want to instantiate multiple sandboxes with different IDs. I think that will work well.

Reply (Contributor, Author):

That's a great TODO for the next step, I guess! We could also consider running the container in the background with docker run and starting multiple shell sessions that docker attach to the same container, so that we save some resources (though those processes could potentially interfere with each other).
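
A sketch of that shared-container idea (note: docker exec is used here to get separate shells; docker attach, as mentioned above, would share the container's single main shell instead):

```python
# Start one long-lived container in the background, then open multiple
# shell sessions against it. Sessions share the container's filesystem
# and network, which saves resources but allows interference.
import subprocess

subprocess.run(
    ["docker", "run", "-d", "--name", "shared-sandbox",
     "ubuntu:22.04", "sleep", "infinity"],
    check=True,
)


def open_session() -> subprocess.Popen:
    # Each `docker exec -i` gets its own bash inside the same container.
    return subprocess.Popen(
        ["docker", "exec", "-i", "shared-sandbox", "/bin/bash"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )


server_session = open_session()  # could run e.g. `node server.js`
client_session = open_session()  # could run e.g. `curl localhost:3000`
```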

self.timeout: int = timeout

if container_image is None:
container_image = self.CONTAINER_IMAGE
Review comment (Collaborator):

This is super helpful--I imagine folks are going to want to define custom images so that the LLM doesn't have to e.g. install nodejs or rust every time it starts a new task.

We'll probably want to make this configurable in the UI.

@xingyaoww xingyaoww merged commit 2de75d4 into All-Hands-AI:main Mar 21, 2024
@xingyaoww xingyaoww deleted the sandbox branch March 21, 2024 13:55
xcodebuild pushed a commit to xcodebuild/OpenDevin that referenced this pull request Mar 31, 2024
* minimal docker sandbox

* make container_image as an argument (fall back to ubuntu);
increase timeout to avoid return too early for long running commands;

* add a minimal working (imperfect) example

* fix typo

* change default container name

* attempt to fix "Bad file descriptor" error

* handle ctrl+D

* add Python gitignore

* push sandbox to shared dockerhub for ease of use

* move codeact example into research folder

* add README for opendevin

* change container image name to opendevin dockerhub

* move folder; change example to a more general agent

* update Message and Role

* update docker sandbox to support mounting folder and switch to user with correct permission

* make network as host

* handle errors when attrs are not set yet

* convert codeact agent into a compatible agent

* add workspace to gitignore

* make sure the agent interface adjustment works for langchain_agent
Labels: enhancement (New feature or request)