Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Don't merge][Review] A solution to prevent zombie containers locally and in CI #12276

Closed
wants to merge 3 commits into from

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Aug 21, 2018

Description

This PR adds mechanisms on the build scripts to cleanup docker containers when there is a cancellation, either from the user by sending SIGINT / SIGTERM or when the process inside the container ends with an error.

It also propagates the environment variables from Jenkins which are used by the process tree killer to identify runaway processes so processes inside the container are killed when the Jenkins job is stopped.

With this patch we catch SIGINT and SIGTERM and install a handler on atexit which cleans (stops) docker containers which were created during execution of the build.

We also switch to python docker API for better management of running containers.

@aaronmarkham @marcoabreu @lebeg @KellenSunderland

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@larroy larroy requested a review from szha as a code owner August 21, 2018 16:05
@larroy larroy force-pushed the zombies branch 5 times, most recently from 9d43167 to 55abd3c Compare August 21, 2018 16:46
@larroy
Copy link
Contributor Author

larroy commented Aug 21, 2018

@mxnet-label-bot: [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Aug 21, 2018
ci/build.py Outdated
@@ -23,7 +23,7 @@
"""

__author__ = 'Marco de Abreu, Kellen Sunderland, Anton Chernov, Pedro Larroy'
__version__ = '0.1'
__version__ = '0.8'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why bumping to 0.8 directly?

Copy link
Contributor Author

@larroy larroy Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it's getting close to 1.0, by features and maturity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

@KellenSunderland KellenSunderland Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: but I'd suggest 0.2. We can jump straight to 1.0 whenever we like but until then I'd just make minor bumps.

Copy link
Contributor

@KellenSunderland KellenSunderland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall feedback: can you do a pass with pylint and mypy, and try and separate concerns into different atomic PRs.

ci/build.py Outdated

def under_ci() -> bool:
""":return: True if we run in Jenkins."""
return 'JOB_NAME' in os.environ

def get_platforms(path: Optional[str] = "docker"):

def git_cleanup() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could remove useful files from a developers desktop silently. I suspect this could surprise many users. Could you potentially lose work if for example you have a new file you haven't checked in that you want tested.

This change should also be in its own PR so that we can review it effectively. I believe we could address docker zombies without making this change.

ci/build.py Outdated
cmd.extend(["--cache-from", tag,])
cmd.extend(["-t", tag, get_dockerfiles_path()])

@retry(subprocess.CalledProcessError, tries=num_retries)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool but should be in a different PR. Reverting your zombie containers fix would mean we'd have to re-implement your retry refactor and vice-versa.


return docker_run_cmd
docker_cmd_list.extend(command)
docker_cmd = ' \\\n\t'.join(docker_cmd_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be open to not putting these on different lines with \ ? I've had to copy and paste these and each time I have to paste it into a text editor, removed the \ and newlines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works now as expected with the space, can be cut and pasted no problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool, I'll give it a shot again.

ci/build.py Outdated
'-e', 'CCACHE_TEMPDIR=/tmp/ccache', # temp dir should be local and not shared
'-e', "CCACHE_DIR=/work/ccache", # this path is inside the container as /work/ccache is mounted
'-e', "CCACHE_LOGFILE=/tmp/ccache.log", # a container-scoped log, useful for ccache verification.
'-ti',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! Not specific to zombie containers, should be in its own PR.

ci/build.py Outdated
@@ -383,7 +604,8 @@ def use_cache():

./build.py -a

Builds for all platforms and leaves artifacts in build_<platform>
Builds for all platforms and leaves artifacts in build_<platform>. **WARNING** it performs git
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment on this.

ci/build.py Outdated

if ret != 0:
logging.critical("Execution of %s failed with status: %d", command, ret)
return(ret)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant ()

ci/build.py Outdated
logging.warning("Cleaning up containers")
else:
return
docker_client = docker.from_env()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is docker_client used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚡️

ci/build.py Outdated
docker_client = docker.from_env()
try:
stop_timeout = int(os.environ.get("DOCKER_STOP_TIMEOUT", self.docker_stop_timeout))
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e unused

dry_run: bool = False,
interactive: bool = False) -> str:
interactive: bool = False) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is interactive used?

ci/build.py Outdated
command=command, docker_registry=args.docker_registry,
local_ccache_dir=args.ccache_dir, cleanup=cleanup)

if ret != 0:
Copy link
Contributor

@KellenSunderland KellenSunderland Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we return 0 on success?

larroy pushed a commit to larroy/mxnet that referenced this pull request Aug 22, 2018
larroy pushed a commit to larroy/mxnet that referenced this pull request Aug 22, 2018
larroy pushed a commit to larroy/mxnet that referenced this pull request Aug 22, 2018
larroy pushed a commit to larroy/mxnet that referenced this pull request Aug 22, 2018
larroy pushed a commit to larroy/mxnet that referenced this pull request Aug 22, 2018
@aaronmarkham
Copy link
Contributor

Can you provide any usage/testing info?
What would be a good way to try this out locally: break/bite something and turn it into a zombie, and then kill it?

@larroy
Copy link
Contributor Author

larroy commented Aug 23, 2018

Depends on PR #12296

@larroy larroy force-pushed the zombies branch 2 times, most recently from 123e39f to 633ce92 Compare August 23, 2018 16:20
@larroy
Copy link
Contributor Author

larroy commented Aug 24, 2018

Made the change smaller with @KellenSunderland in a different PR: #12336

marcoabreu pushed a commit that referenced this pull request Aug 27, 2018
* Separate minor refactoring from #12276 in a prior PR

* Address CR comments
@larroy larroy closed this Aug 30, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
* Separate minor refactoring from apache#12276 in a prior PR

* Address CR comments
@larroy larroy deleted the zombies branch November 15, 2018 18:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants