Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: reload TLS certificate without restarting server #5516

Merged
merged 14 commits into from
Jun 8, 2020

Conversation

vcrfxia
Copy link
Contributor

@vcrfxia vcrfxia commented May 31, 2020

Description

Implements the equivalent functionality the ksqlDB server used to inherit from rest-utils (confluentinc/rest-utils#156) in Vert.x. When a change to the specified watch location is detected, the Vert.x HttpServers are recreated in order to reload the certificates. It's necessary to recreate rather than simply restart the servers because Vert.x only loads TLS certs at the time that the KeyStoreHelper is created, which is only when servers are created.

This PR also ports over the TRUST_STORE_TYPE and KEY_STORE_TYPE configs which were missed in the rest-utils -> Vert.x migration.

Testing done

Unit + manual tests.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vcrfxia vcrfxia requested a review from a team as a code owner May 31, 2020 00:51
@vcrfxia vcrfxia requested a review from purplefox May 31, 2020 00:51
public class FileWatcher implements Runnable {

private static final Logger log = LoggerFactory.getLogger(FileWatcher.class);
private static final ExecutorService executor = Executors.newFixedThreadPool(1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general design principle best not to rely on statics as they pollute the global namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also you probably don't need a custom executor at all - you can use a Vert.x worker thread for this.

try {
FileWatcher.onFileChange(
watchLocation,
() -> serverVerticles.forEach(ServerVerticle::restartServer)
Copy link
Contributor

@purplefox purplefox May 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more complex than it needs to be. Instead of individually starting and stopping each http server in each ServerVerticle I think it would be simpler just to add a restart() method to Server which simply stopped the Server and started it again, e.g.:

public void restart() {
  stop();
  start();
}

I haven't tried this, but seems like it should work (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work. (I had this initially.) As explained in the PR description:

It's necessary to recreate rather than simply restart the servers because Vert.x only loads TLS certs at the time that the KeyStoreHelper is created, which is only when servers are created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to Server.start/stop not ServerVerticle.start/stop.

If you do that, it will take care of creating new Http Servers for you, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand your suggestion now. We'll still need custom logic to ensure the HTTP servers are restarted with the same ports, in the event that port 0 is specified (as is the case in our tests). It's not clear to me that this is actually simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a commit where I've made your suggested change: vcrfxia@612b215
The new test, TlsTest#shouldReloadCert() fails because the server ports changed during the restart but the client still tries to connect to the old port.

If we're OK with the ports changing on restart, we can update the test to recreate the client after the restart.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. The only use case for setting port 0 at the moment is in our tests, so we're not going to worry about preserving ports across restarts. As such, I've moved the restart logic from ServerVerticle to Server as suggested, and updated the test to re-create clients in order to use the most up-to-date port.

@@ -99,10 +100,39 @@ public void stop(final Promise<Void> stopPromise) {
}
}

// Creates a new server, rather than simply stopping and restarting, in order to reload TLS certs
public void restartServer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previous comment, seems overcomplex - just stopped and starting the Server should do the trick?

Copy link
Contributor Author

@vcrfxia vcrfxia Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above -- I initially misunderstood the suggestion. I've made the switch now.

@purplefox
Copy link
Contributor

Generally looks good! I think it can be made a little simpler though :)

@vcrfxia
Copy link
Contributor Author

vcrfxia commented Jun 1, 2020

Hey @purplefox , now that I've moved the restart logic from ServerVerticle to Server, the newly added test is failing to stop the Server with:

io.confluent.ksql.util.KsqlException: Failure in stopping API server
	at io.confluent.ksql.api.server.Server.stop(Server.java:179)
	at io.confluent.ksql.api.server.Server.restart(Server.java:187)
	at io.confluent.ksql.api.server.FileWatcher.handleNextWatchNotification(FileWatcher.java:108)
	at io.confluent.ksql.api.server.FileWatcher.run(FileWatcher.java:78)
	at io.confluent.ksql.api.server.FileWatcher.lambda$onFileChange$0(FileWatcher.java:66)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:316)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:347)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
	at io.confluent.ksql.api.server.Server.stop(Server.java:177)
	... 9 more
[2020-06-01 11:59:08,297] ERROR Failed to shutdown server (io.confluent.ksql.api.BaseApiTest:119)
io.confluent.ksql.util.KsqlException: Failure in stopping API server
	at io.confluent.ksql.api.server.Server.stop(Server.java:179)
	at io.confluent.ksql.api.BaseApiTest.stopServer(BaseApiTest.java:117)
	at io.confluent.ksql.api.BaseApiTest.tearDown(BaseApiTest.java:108)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.RunAfters.invokeMethod(RunAfters.java:46)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Unknown deployment
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
	at io.confluent.ksql.api.server.Server.stop(Server.java:177)
	... 28 more
Caused by: java.lang.IllegalStateException: Unknown deployment
	at io.vertx.core.impl.DeploymentManager.undeployVerticle(DeploymentManager.java:262)
	at io.vertx.core.impl.VertxImpl.lambda$undeploy$27(VertxImpl.java:712)
	at io.vertx.core.Future.lambda$compose$3(Future.java:363)
	at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105)
	at io.vertx.core.impl.FutureImpl.onComplete(FutureImpl.java:83)
	at io.vertx.core.Future.compose(Future.java:359)
	at io.vertx.core.Future.compose(Future.java:331)
	at io.vertx.core.impl.VertxImpl.undeploy(VertxImpl.java:710)
	at io.confluent.ksql.api.server.Server.stop(Server.java:173)
	... 28 more

which is weird since start() and stop() are both synchronized, and start() waits for the futures to complete. Seems like we might be hitting eclipse-vertx/vert.x#2049, which hasn't had updates since 2017. Can you spot my mistake?

@purplefox
Copy link
Contributor

Hey @vcrfxia , I had a quick look at the failing test. What I think is happening here is:

The undeploy error is a bit of a red herring - this actually happens when the server is closed in the tearDown of the test, not in the restart, and I think it happens because the stop() method in the restart fails because the thread waiting for the futures to complete is interrupted. If you clear the deploymentIds in a finally you will see what I mean, and see the real exception.

I suspect the thread is being interrupted (haven't checked this), because a Vert.x worker thread is being used to run the file watcher so Server.stop() is actually being called from a Vert.x worker thread, and the stop() method also shuts down the worker executor!

If you change the file watcher to run on a different thread then I'm guessing it will fix it (my bad for suggesting to use a worker thread for this before!).

Also, as an aside, we should make sure the file watcher gets shuts down when the rest app is closed.

@vcrfxia
Copy link
Contributor Author

vcrfxia commented Jun 2, 2020

Thanks @purplefox , that explanation makes a lot of sense. The test is passing on the latest iteration now, though my approach for allowing the file watcher to restart the server without restarting the file watcher feels rather hacky. Let me know if you've got better ideas. Thanks!

@vcrfxia vcrfxia requested a review from purplefox June 3, 2020 17:23
Copy link
Contributor

@purplefox purplefox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks fine - if you're in a rush to get this merged I'd say it's ok to merge as is and follow up changes in a different PR, up to you.

public synchronized void shutdown() {
shutdown = true;
log.info("Stopped watching for TLS cert changes.");
if (thread != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to interrupt the thread here? If shutdown() is called from this same thread then you set the shutdown flag which means the run() method will return exit ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the redundancy as extra insurance that we wouldn't leak resources (in case something changes with the FileWatcher code and shutdown() is no longer foolproof) but I think that was overkill. I've removed the interrupt() and simplified the Server code as you suggested.

start(true);
}

private synchronized void start(final boolean startFileWatcher) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the startFileWatcher flag is necessary if you don't interrupt the watcher thread (see previous comment). This means you can simplify things and have a simple start/stop method as before and always create a new filewatcher in start().

stop(true);
}

private synchronized void stop(final boolean stopFileWatcher) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, parameter is unnecessary

*/
public synchronized void start() {
log.info("Starting file watcher to watch for changes: " + file);
thread = new Thread(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid this if FileWatcher extends Thread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, this is a lot better. Thanks for the suggestion! I thought what I was doing didn't feel right, heh :)

/**
* Stops watching the file and closes the watch service
*/
public synchronized void shutdown() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't need to be synchronized if we don't have a thread member and shutdown is volatile

try {
callback.run();
} catch (Exception e) {
log.warn("Hit error callback on file change", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should log at error unless this is expected in normal operation?

);
} finally { // restore cert regardless of failure above so as to not affect other tests
// When: load valid store
ServerKeyStore.loadValidServerKeyStore();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to test reloading the server more than once - i.e. changing the cert more than once, as there might be a bug in the filewatcher/reloading that only manifests after the reload has been done once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test already reloads the server twice: once with a bad cert, and then again with a good cert. Are you suggesting we should repeat the cycle twice, for a total of four reloads?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, if you're already reloading more than once that seems fine :)

/**
* Closes the file watcher
*/
public void shutdown() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @purplefox , now that FileWatcher extends Thread, do you think it makes sense to get rid of this shutdown() method and instead call interrupt() to terminate the file watcher? I'm wondering whether it's canonical to have a "clean" shutdown method like this, or if interrupt() is understood to be used for that purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging PR for now -- will make this improvement in a follow-up (if we think it'd be an improvement).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferable to shut cleanly without interrupting threads if you can. Interrupting threads is only necessary if the thread is blocked on something that is interruptable. In our case we only call shutdown() from the thread itself so we can guarantee it's not blocked so there's no need to interrupt anything.
If the method was intended to be used elsewhere from a different thread then we should call Thread.interrupt() but that's not the case here.
As a sanity check you could add this in shutdown() to make sure that is the case

if (this != Thread.currentThread()) {
   throw new IllegalStateException("Can only be called from file watcher thread");
}

@vcrfxia vcrfxia merged commit a5920b0 into confluentinc:master Jun 8, 2020
@vcrfxia vcrfxia deleted the auto-cert-reload branch June 8, 2020 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants