
Fix for WFCORE-7097 and Fix for WFCORE-7098 #6283

Open

wants to merge 2 commits into base: main
Conversation

jfdenise (Contributor)

@bstansberry (Contributor):

@jamezp Please review as you're kind of an SME on this.

@jfdenise @yersan @jamezp I put the 27.x label on this mostly to get your attention so you can think whether this needs to be in WF 35 or not. I suspect the only urgency around this is the intermittent failure WFCORE-7097 mentions, and then the bootable jar failures we are seeing in full WF in ts/int/elytron-oidc-client. But those don't force us to do something quickly if we don't think that's the right thing to do; those both may have workarounds.

Comment on lines +37 to +41
if (Files.notExists(cleanupMarker)) {
return;
}
Member:

I'm not sure I follow this. Can you explain why we were seeing an issue if the file does not exist here?

jfdenise (Contributor Author):

This covers the case where the process is started but the cleanup has already occurred (a timeout plus a previous process that was running).
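
A minimal sketch of that scenario, assuming `cleanupMarker` is the `java.nio.file.Path` field shown in the snippet above; the surrounding class name and the cleanup body are placeholders, not the PR's actual code:

```java
import java.nio.file.Files;
import java.nio.file.Path;

class CleanupGuardSketch {

    private final Path cleanupMarker;

    CleanupGuardSketch(Path cleanupMarker) {
        this.cleanupMarker = cleanupMarker;
    }

    void cleanupIfNeeded() {
        // A previous cleanup (e.g. one launched before a timeout fired) may
        // already have deleted the marker; in that case the installation is
        // gone and this run must do nothing.
        if (Files.notExists(cleanupMarker)) {
            return;
        }
        // ... otherwise delete the installation, then the marker ...
    }
}
```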

}
// Do a last cleanup, in case the cleanupMarker still exists (could have been deleted by running process).
if (Files.exists(cleanupMarker)) {
cleanup();
Member:

This could potentially launch another process. We should probably just invoke the deleteDirectory() at this point.

jfdenise (Contributor Author):

Yes, that is done on purpose. On Windows we need the external process. The cleanup waits until the process terminates (with a timeout).

Member:

Is the idea that if the cleanup process is running while the bootable JAR process is still running, we terminate that process and start a new one? I'm just a little confused about what we gain here.

jfdenise (Contributor Author):

That covers the case where the previous process didn't complete the deletion for some reason (a timeout, with the process forcibly terminated from the caller thread); we start a new process to finalize it.
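
Putting the two replies together, the intended flow looks roughly like the sketch below. Only `cleanupMarker` and the timeout idea come from the PR; the class and method names here are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

abstract class CleanupFlowSketch {

    private final Path cleanupMarker;
    private final long timeoutSeconds;

    CleanupFlowSketch(Path cleanupMarker, long timeoutSeconds) {
        this.cleanupMarker = cleanupMarker;
        this.timeoutSeconds = timeoutSeconds;
    }

    void cleanupWithTimeout() throws IOException, InterruptedException {
        // Fork the external deleter (required on Windows, where the bootable
        // JAR's own file locks prevent in-process deletion).
        Process process = launchCleanerProcess();

        // Bound the wait so the caller is never blocked forever.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
        }

        // If the forked process was cut short, the marker is still there:
        // launch one more cleanup to finalize the deletion.
        if (Files.exists(cleanupMarker)) {
            launchCleanerProcess();
        }
    }

    // Stand-in for the ProcessBuilder logic in the PR.
    abstract Process launchCleanerProcess() throws IOException;
}
```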

@wildfly-ci

Core -> Full Integration Build 14432 outcome was UNKNOWN using a merge of a7eede0
Summary: Canceled (Error while applying patch; cannot find commit 12f2330 in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:40

@wildfly-ci

Core -> Full Integration Build 14131 outcome was UNKNOWN using a merge of a7eede0
Summary: Canceled (Error while applying patch; cannot find commit 12f2330 in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:17

@wildfly-ci

Core -> WildFly Preview Integration Build 14213 outcome was UNKNOWN using a merge of a7eede0
Summary: Canceled (Error while applying patch; cannot find commit 12f2330 in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:16

@wildfly-ci

Core -> Full Integration Build 14437 outcome was UNKNOWN using a merge of 5e6e784
Summary: Canceled (Error while applying patch; cannot find commit 36fce9b in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:16

@wildfly-ci

Core -> Full Integration Build 14136 outcome was UNKNOWN using a merge of 5e6e784
Summary: Canceled (Error while applying patch; cannot find commit 36fce9b in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:22

@wildfly-ci

Core -> Full Integration Build 14137 outcome was FAILURE using a merge of 5e6e784
Summary: Tests failed: 1 (1 new), passed: 4407, ignored: 55 Build time: 03:37:39

Failed tests

org.jboss.as.test.clustering.cluster.ejb.stateful.StatefulTimeoutTestCase.timeout: java.lang.AssertionError: expected:<4> but was:<0>
	at org.jboss.as.test.clustering.cluster.ejb.stateful.StatefulTimeoutTestCase.timeout(StatefulTimeoutTestCase.java:88)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
------- Stdout: -------
14:34:15,207 INFO  [org.jboss.modules] (main) JBoss Modules version 2.1.6.Final
14:34:16,223 INFO  [org.jboss.msc] (main) JBoss MSC version 1.5.5.Final
14:34:16,239 INFO  [org.jboss.threads] (main) JBoss Threads version 2.4.0.Final
14:34:16,416 INFO  [org.jboss.as] (MSC service thread 1-3) WFLYSRV0049: WildFly 35.0.0.Final-SNAPSHOT (WildFly Core 27.0.0.Final-SNAPSHOT) starting
14:34:18,321 INFO  [org.wildfly.security] (Controller Boot Thread) ELY00001: WildFly Elytron version 2.6.0.Final
14:34:19,723 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0039: Creating http management service using socket-binding (management-http)
14:34:19,767 INFO  [org.xnio] (MSC service thread 1-2) XNIO version 3.8.16.Final
14:34:19,793 INFO  [org.xnio.nio] (MSC service thread 1-2) XNIO NIO Implementation Version 3.8.16.Final
14:34:19,844 INFO  [org.jboss.as.connector.subsystems.datasources] (ServerService Thread Pool -- 32) WFLYJCA0004: Deploying JDBC-compliant driver class org.h2.Driver (version 2.2)
14:34:19,968 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 40) WFLYCLINF0001: Activating Infinispan subsystem.
14:34:19,991 INFO  [org.jboss.remoting] (MSC service thread 1-4) JBoss Remoting version 5.0.30.Final
14:34:20,011 INFO  [org.wildfly.extension.io] (ServerService Thread Pool -- 41) WFLYIO001: Worker 'default' has auto-configured to 8 IO threads with 64 max task threads based on your 4 available processors
14:34:20,081 INFO  [org.jboss.as.jaxrs] (ServerService Thread Pool -- 42) WFLYRS0016: RESTEasy version 6.2.11.Final
14:34:20,101 INFO  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 44) WFLYCLJG0001: Activating JGroups subsystem. JGroups version 5.3.13
14:34:20,113 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-6) WFLYJCA0018: Started Driver service with driver-name = h2
14:34:20,125 INFO  [org.jboss.as.connector] (MSC service thread 1-4) WFLYJCA0009: Starting Jakarta Connectors Subsystem (WildFly/IronJacamar 3.0.10.Final)
14:34:20,134 INFO  [org.jboss.as.naming] (ServerService Thread Pool -- 48) WFLYNAM0001: Activating Naming Subsystem
14:34:20,217 WARN  [org.wildfly.extension.elytron] (MSC service thread 1-2) WFLYELY00023: KeyStore file '/opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-ejb-1/standalone/configuration/application.keystore' does not exist. Used blank.
14:34:20,279 WARN  [org.jboss.as.txn] (ServerService Thread Pool -- 53) WFLYTX0013: The node-identifier attribute on the /subsystem=transactions is set to the default value. This is a danger for environments running multiple servers. Please make sure the attribute value is unique.
14:34:20,295 INFO  [org.jboss.as.ejb3] (MSC service thread 1-5) WFLYEJB0482: Strict pool mdb-strict-max-pool is using a max instance size of 16 (per class), which is derived from the number of CPUs on this host.
14:34:20,298 INFO  [org.jboss.as.ejb3] (MSC service thread 1-7) WFLYEJB0481: Strict pool slsb-strict-max-pool is using a max instance size of 16 (per class), which is derived from thread worker pool sizing.
14:34:20,323 WARN  [org.wildfly.extension.elytron] (MSC service thread 1-5) WFLYELY01084: KeyStore /opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-ejb-1/standalone/configuration/application.keystore not found, it will be auto-generated on first use with a self-signed certificate for host localhost
14:34:20,426 INFO  [org.jboss.as.naming] (MSC service thread 1-5) WFLYNAM0003: Starting Naming Service
14:34:20,503 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-3) WFLYUT0003: Undertow 2.3.18.Final starting
14:34:20,760 WARN  [org.jboss.as.domain.http.api.undertow] (MSC service thread 1-4) WFLYDMHTTP0003: Unable to load console module for slot main, disabling console
14:34:20,793 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0012: Started server default-server.
14:34:20,802 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) Queuing requests.
14:34:20,803 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0018: Host default-host starting
14:34:20,857 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-1) WFLYUT0006: Undertow HTTP listener default listening on [::1]:8080
node-1 2024-12-17 14:34:21,005 INFO  [org.jboss.as.server.deployment.scanner] (MSC service thread 1-8) WFLYDS0013: Started FileSystemDeploymentService for directory /opt/buildAgent/work/e8e0dd9c7c4ba60/full/testsuite/integration/clustering/target/wildfly-clustering-ejb-1/standalone/deployments
node-1 2024-12-17 14:34:21,108 INFO  [org.jboss.as.ejb3] (MSC service thread 1-2) WFLYEJB0493: Jakarta Enterprise Beans subsystem suspension complete
node-1 2024-12-17 14:34:21,300 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-4) WFLYJCA0001: Bound data source [java:jboss/datasources/ExampleDS]
node-1 2024-12-17 14:34:21,601 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
node-1 2024-12-17 14:34:21,611 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://[::1]:9990/management
node-1 2024-12-17 14:34:21,611 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0054: Admin console is not enabled
node-1 2024-12-17 14:34:21,612 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly 35.0.0.Final-SNAPSHOT (WildFly Core 27.0.0.Final-SNAPSHOT) started in 7345ms - Started 286 of 660 services (449 services are lazy, passive or on-demand) - Server configuration file in use: standalone-full-ha.xml - Minimum feature stability level: community


@wildfly-ci

Core -> WildFly Preview Integration Build 14220 outcome was UNKNOWN using a merge of 5e6e784
Summary: Canceled (Error while applying patch; cannot find commit 36fce9b in the https://github.com/wildfly/wildfly-core.git repository, possible reason: refs/pull/6283/merge branch was updated and the commit selected for the ... Build time: 00:00:39

@@ -45,28 +46,13 @@ class InstallationCleaner implements Runnable {
}

@Override
-    public void run() {
+    public synchronized void run() {
@yersan (Collaborator) commented Dec 18, 2024:

I am missing something; this task is submitted by a SingleThreadExecutor in the BootableJar shutdown hook.
If we are marking it as synchronized, that only makes sense if we could have more than one Bootable JAR instance for the same server home, right?
If that's true, the same marker file is also shared across the multiple Bootable JAR instances launched from the same home; wouldn't that be an issue after all?

yersan (Collaborator):

> this task is submitted by a SingleThreadExecutor in the BootableJar shutdown hook.
> If we are marking it as synchronized, that only makes sense if we could have more than one Bootable JAR instance for the same server home, right?

Well, even in that case, we are creating new instances of this InstallationCleaner on each shutdown hook, so I don't get why the synchronized is required (or is nice to have) at this point.

jfdenise (Contributor Author):

@yersan, we could have a timeout on the calling thread. The task (running in its own thread) is not yet done, and the calling thread will attempt to do a cleanup again. We need to synchronize at this point to avoid multiple cleanups running in parallel; synchronized enforces that.

yersan (Collaborator):

@jfdenise OK, so it is not to allow dealing with multiple Bootable JARs from the same server home.

OK, in that case, shouldn't the InstallationCleaner.cleanup() method be the one that is synchronized?

That's the method common to InstallationCleaner.run() and InstallationCleaner.cleanupTimeout(), which are the entry points for the submitted task and the explicit cleaner.cleanupTimeout() call.

jfdenise (Contributor Author):

@yersan The key piece is Files.notExists(cleanupMarker).
We need all threads to share a common view of it, so all entry points to it should be synchronized.
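
A sketch of that conclusion, using the method names discussed in this thread; both entry points are synchronized so only one thread at a time can observe and act on the marker (the cleanup body itself is a placeholder):

```java
import java.nio.file.Files;
import java.nio.file.Path;

class SynchronizedEntryPointsSketch implements Runnable {

    private final Path cleanupMarker;

    SynchronizedEntryPointsSketch(Path cleanupMarker) {
        this.cleanupMarker = cleanupMarker;
    }

    // Entry point for the task submitted from the shutdown hook.
    @Override
    public synchronized void run() {
        cleanup();
    }

    // Entry point for the calling thread after its timeout elapses.
    public synchronized void cleanupTimeout() {
        cleanup();
    }

    private void cleanup() {
        // Because both entry points hold the same monitor, the second caller
        // sees the marker already deleted and returns without re-cleaning.
        if (Files.notExists(cleanupMarker)) {
            return;
        }
        // ... delete the installation, then the marker ...
    }
}
```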

@jfdenise (Contributor Author):

FYI, I am repeatedly re-running the 2 bootable JAR jobs; my goal is to get 10 runs on the 2 platforms with no issues.

@jfdenise (Contributor Author):

@yersan, 10 green runs on each platform. I will stop testing.

@yersan yersan requested a review from jamezp December 18, 2024 16:12
@yersan (Collaborator) commented Dec 18, 2024:

@jamezp Can you review again? Thanks!

@jamezp (Member) left a comment:

I'm approving, but I think we might need to revisit this. I don't want to block fixing CI, though.

I don't think this is necessarily wrong; I think we're just overusing resources. We have a shutdown hook which attempts to clean up the resources. On Windows these typically can't be cleaned up until the server process has ended. However, we attempt to wait for the process to end, then we launch another process as a final cleanup. I think we could streamline this a bit, but I'd need to think it through a little more.

@@ -153,7 +159,9 @@ private void newProcess() throws IOException {
.redirectError(ProcessBuilder.Redirect.INHERIT)
.redirectOutput(ProcessBuilder.Redirect.INHERIT)
.directory(new File(System.getProperty("user.dir")));
-        builder.start();
+        process = builder.start();
+        process.waitFor(environment.getTimeout(), TimeUnit.SECONDS);
yersan (Collaborator):

This also sounds a bit inappropriate, since this method could be invoked directly from the shutdown hook thread. The shutdown hook API says that it is inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.

I guess my question would be: if this is somehow killed completely by the JVM, could the started process be left running around?

In any case, if we so decide, we can move on and see how it behaves in CI.

jfdenise (Contributor Author):

@yersan, I was thinking about it more, and I think that we shouldn't merge it. Although I am confident in the Linux fix, it requires more work on the Windows front.

@jfdenise jfdenise marked this pull request as draft December 19, 2024 09:59
@yersan yersan added the hold Do not merge this PR label Dec 19, 2024
@jfdenise (Contributor Author):

@jamezp, when testing a lot of corner cases (with complex scheduling scenarios) on Windows, I came to the conclusion that, in the forked process, we need to wait for the server process to terminate prior to deleting the installation. That is the only way to ensure that nothing is left behind and the installation is actually deleted.
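
In other words, the forked cleaner should block on the server JVM's exit before touching the files. A sketch of that idea using ProcessHandle; passing the server PID and installation path as program arguments is an assumption for illustration, not necessarily how the PR wires it:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;

public class ForkedCleanerSketch {

    public static void main(String[] args) throws Exception {
        long serverPid = Long.parseLong(args[0]); // assumed: PID handed to the forked JVM
        Path installation = Paths.get(args[1]);   // assumed: installation dir to delete

        // On Windows, file locks are released only once the owning process is
        // really gone, so wait for the server JVM to exit first.
        ProcessHandle.of(serverPid).ifPresent(handle -> handle.onExit().join());

        // Delete children before parents.
        try (var paths = Files.walk(installation)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    // best effort: anything left behind indicates a lingering lock
                }
            });
        }
    }
}
```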

@jfdenise jfdenise marked this pull request as ready for review December 20, 2024 11:50
@jamezp (Member) commented Dec 20, 2024:

> @jamezp, when testing a lot of corner cases (with complex scheduling scenarios) on Windows, I came to the conclusion that, in the forked process, we need to wait for the server process to terminate prior to deleting the installation. That is the only way to ensure that nothing is left behind and the installation is actually deleted.

@jfdenise Yes. That is what it's supposed to be doing currently. I guess I should read the JIRAs to see what problem we're trying to solve.

One thing I'm not sure about is why I originally created an Executor in the shutdown hook and launched the deletion in a new thread. That seems odd to me. However, the new process is correct. It's the only way on Windows that file locks will be removed.
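
The simplification hinted at here could be as small as running the cleaner on the hook thread itself, with no intermediate executor. A hedged sketch of that shape; the Runnable body stands in for InstallationCleaner.run():

```java
public class ShutdownHookSketch {

    public static void main(String[] args) {
        // The shutdown hook already runs on its own dedicated thread, so the
        // cleanup task can execute there directly instead of being submitted
        // to a single-thread executor created inside the hook.
        Runnable cleaner = () -> {
            // stand-in for InstallationCleaner.run(); on Windows this is where
            // the forked deletion process would be launched
            System.out.println("cleanup on " + Thread.currentThread().getName());
        };
        Runtime.getRuntime().addShutdownHook(new Thread(cleaner, "installation-cleaner"));
    }
}
```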

@jamezp (Member) commented Dec 20, 2024:

Looking at the PermissionsDeploymentTestCase test, my guess is that testWithConfiguredMaxBootThreads is failing because it runs last on Windows. What is likely happening is that on ServerController.stop() the new process is launched, and the bootable JAR is still being deleted when the next test starts. The new process is deleting files that the second test is starting to extract.

The ServerController does some specific deleting of files for the bootable JAR. I think this is likely a timing issue.

@yersan yersan added 28.x and removed 27.x labels Dec 23, 2024