Runaway Opendeploy java process
Here's the situation:
On out TeamSite PROD server, we run TS 6.5/SP1 with OpenDeploy 6.0.1 on a Solaris 8 machine with 2 CPU's. Just before midnight on the 3rd August, we noticed that deployments started taking 10 -20 times longer than typical.
The attached image show some of the symptoms: we have this pretty typical hourly scheduled deployment that takes <10s normally (except for a small spike at about 6am every morning). However, the image clearly shows a problem after this 3rd Aug time, where this deployment is taking about 3 minutes. THERE ARE NO ERRORS in the logs, and deployments are completing. It is just that all our deployments are now painfully slow. Strange.
Normally, in near idle mode, the Load Average for this server is < 0.1. However, now the server has a load average >1 at all times. We have noticed that the following process is typically sitting at 50% of 1 CPU at all times:
/usr/iw/OpenDeployNG/jre/bin/java -Xmx256M -Xms64M -Djava.security.manager -Dja....etc.
Restarting the services (OpenDeploy and TeamSite) does not remove the problem. (I did not reboot the server...yet!)
Digging deeper reveals some interesting and confusing facts:
1. A TRUSS of this process, shows an immense amount of reading and writing. For example:
4395: psargs: /usr/iw/OpenDeployNG/jre/bin/java -Xmx256M -Xms64M -Djava.securi
4395: read(10, " 0 0 0 0 0 0 0 1 4 3 d 4".., 8192) = 8192
4395: read(10, " 0 0 0 0 1 1 4 2 7 d 6 f".., 8192) = 8192
4395: read(10, " 0 6 5 6 e 6 4 6 5 7 0 6".., 8192) = 8192
4395: read(10, " 0 0 0 0 0 1 0 0 0 0 0 0".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: read(10, " 6 7 4 e 6 1 6 d 6 5 7 4".., 8192) = 8192
4395: read(10, " d 6 5 7 3 7 3 6 1 6 7 6".., 8192) = 8192
4395: read(10, " 0 1 0 c 0 0 0 0 7 8 7 2".., 8192) = 8192
4395: read(10, " 7 4 7 5 7 3 7 4 0 0 0 7".., 8192) = 8192
4395: read(10, " 0 7 2 6 9 6 f 7 2 6 9 7".., 8192) = 8192
4395: read(10, " 4 0 0 0 a 4 9 5 7 4 d 6".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: poll(0xE4B015E8, 0, 100) = 0
4395: read(10, " 0 0 6 5 0 0 6 e 0 0 7 4".., 8192) = 8192
4395: read(10, " 3 7 3 6 1 6 7 6 5 2 e 5".., 8192) = 8192
4395: read(10, " 5 4 4 7 4 0 0 0 7 4 9 5".., 8192) = 8192
4395: read(10, " 6 a 6 d 7 3 2 e 6 3 6 c".., 8192) = 8192
4395: read(10, " 2 4 5 4 1 2 f 7 7 6 1 7".., 8192) = 8192
4395: read(10, " a 6 d 7 3 2 e 6 d 6 5 7".., 8192) = 8192
4395: read(10, " 6 7 6 5 2 e 4 d 6 5 7 3".., 8192) = 8192
4395: read(10, " 7 0 6 5 6 e 6 4 6 5 7 0".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: read(10, " f 7 0 6 9 6 3 0 0 0 0 0".., 8192) = 8192
4395: read(10, " 3 7 3 6 9 6 e 6 7 2 d 6".., 8192) = 8192
4395: read(10, " 7 4 6 9 6 5 7 3 0 0 0 0".., 8192) = 8192
4395: read(10, " 0 2 6 9 7 8 7 0 7 7 0 9".., 8192) = 8192
4395: read(10, " 4 1 5 f 3 1 2 e 6 7 6 9".., 8192) = 8192
4395: read(10, " 0 7 6 0 0 6 5 0 0 6 e 0".., 8192) = 8192
4395: poll(0xE4B015E8, 0, 100) = 0
4395: poll(0xF9781B68, 0, 50) = 0
4395: read(10, " 2 e 6 5 7 8 6 f 6 c 6 1".., 8192) = 8192
4395: read(10, " 7 4 6 9 6 f 6 e 7 4 0 0".., 8192) = 8192
4395: read(10, " 2 2 e 6 a 6 d 7 3 2 e 6".., 8192) = 8192
4395: read(10, " 2 4 b 4 1 5 2 4 5 4 1 2".., 8192) = 8192
4395: read(10, " 6 f 6 c 6 1 6 2 2 e 6 a".., 8192) = 8192
4395: poll(0xE3A015C8, 0, 300) = 0
4395: read(10, " 0 1 0 1 0 0 0 0 0 0 0 1".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: read(10, " 8 7 3 8 0 2 0 0 0 1 4 9".., 8192) = 8192
4395: read(10, " 7 4 2 e 4 a 6 d 7 3 4 4".., 8192) = 8192
4395: read(10, " 6 5 6 4 3 1 3 0 6 3 3 7".., 8192) = 8192
4395: read(10, " 0 7 8 7 3 7 2 0 0 2 8 6".., 8192) = 8192
4395: read(10, " 6 a 6 1 7 6 6 1 2 e 7 5".., 8192) = 8192
4395: read(10, " 7 0 6 1 7 4 6 8 7 4 0 0".., 8192) = 8192 etc.
and
4395: write(11, " 1 8 6 f 7 0 6 5 6 e 6 4".., 8192) = 8192
4395: write(11, " f 7 2 6 7 2 e 6 5 7 8 6".., 8192) = 8192
4395: write(11, " a c e d 0 0 0 5 7 3 7 2".., 8192) = 8192
4395: write(11, " 6 5 7 4 0 0 0 4 4 6 4 9".., 8192) = 8192
4395: write(11, " 0 0 0 0 0 0 0 0 1 0 0 0".., 8192) = 8192
4395: write(11, " 5 6 7 4 e 6 1 6 d 6 5 7".., 8192) = 8192
4395: write(11, " 6 5 7 3 7 3 6 1 6 7 6 5".., 8192) = 8192
4395: write(11, " a e d 7 3 7 2 0 0 2 4 6".., 8192) = 8192
4395: write(11, " 1 7 6 6 1 2 e 6 c 6 1 6".., 8192) = 8192
4395: write(11, " 2 4 6 f 7 2 6 7 2 e 6 5".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: write(11, " 3 5 3 0 2 d 3 4 3 1 3 2".., 8192) = 8192
4395: write(11, " 0 0 0 7 8 7 0 7 7 1 0 0".., 8192) = 8192
4395: poll(0xE4B015E8, 0, 100) = 0
4395: write(11, " 0 0 0 0 0 0 0 0 0 1 0 1".., 8192) = 8192
4395: write(11, " 4 5 5 4 4 5 7 4 0 0 0 a".., 8192) = 8192
4395: write(11, " 6 3 6 c 6 9 6 5 6 e 7 4".., 8192) = 8192
4395: write(11, " S S A G E S V A L U E".., 8192) = 8192
4395: write(11, " 1 2 e 7 5 7 4 6 9 6 c 2".., 8192) = 8192
4395: write(11, " 7 8 7 2 0 0 0 e 6 a 6 1".., 8192) = 8192
4395: write(11, " 7 4 0 0 0 d 4 9 6 e 6 9".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: write(11, " 0 0 0 0 0 0 0 0 1 0 0 0".., 8192) = 8192
4395: write(11, " 7 2 0 0 2 2 6 f 7 2 6 7".., 8192) = 8192
4395: write(11, " 0 0 0 6 4 1 6 3 7 4 6 9".., 8192) = 8192
4395: write(11, " 8 6 f 6 c 6 1 6 2 2 e 6".., 8192) = 8192
4395: write(11, " 5 7 4 f 5 2 4 b 4 1 5 2".., 8192) = 8192
4395: write(11, " 5 4 9 6 d 7 0 6 c 0 0 0".., 8192) = 8192
4395: write(11, " 0 0 0 0 0 0 0 0 0 1 0 1".., 8192) = 8192
4395: write(11, " 4 c 4 5 5 4 4 5 7 4 0 0".., 8192) = 8192
4395: write(11, " 3 2 e 6 3 6 c 6 9 6 5 6".., 8192) = 8192
4395: poll(0xF9781B68, 0, 50) = 0
4395: write(11, " T O M E S S A G E S ".., 8192) = 8192
4395: poll(0xE3A015C8, 0, 300) = 0
4395: write(11, " 4 9 0 0 0 9 7 4 6 8 7 2".., 8192) = 8192
4395: poll(0xE4B015E8, 0, 100) = 0
4395: write(11, " d 7 3 2 e 6 d 6 5 7 3 7".., 8192) = 8192
4395: write(11, " c 6 9 7 6 6 5 7 2 7 9 2".., 8192) = 8192
4395: write(11, " 6 d 6 5 7 3 7 3 6 1 6 7".., 8192) = 8192
4395: write(11, " 3 3 6 4 3 2 7 4 0 0 0 c".., 8192) = 8192
4395: write(11, " 4 3 a 5 0 3 5 3 0 3 8 3".., 8192) = 8192
4395: write(11, " 0 0 0 0 0 0 0 0 0 0 0 1".., 8192) = 8192
4395: write(11, " 7 4 4 6 9 7 2 7 4 0 0 1".., 8192) = 8192 ... etc.
Whereas a TRUSS of the equivalent process on our test machine is just a lot of polls: (this is a fair test, as the TRUS was obtained when there were no deployments running)!
1558: psargs: /usr/iw/OpenDeployNG/jre/bin/java -Xmx256M -Xms64M -Djava.securi
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: lwp_cond_wait(0x00036CB0, 0x00036C98, 0xFDF01AD0) Err#62 ETIME
1558: poll(0xE3B01530, 0, 300) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE3B01530, 0, 300) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE3B01530, 0, 300) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE35015E8, 0, 100) = 0
1558: poll(0xF9781B68, 0, 50) = 0
1558: poll(0xE3B01530, 0, 300) = 0
2. On the PROD server, it seems that this process is the CHILD of the iwod60 process:
ptree 4379
4379 /usr/iw/OpenDeployNG/jre/bin/java -Xss2m -Xms32m -Xmx256m -classpath /usr/iw/Op
---> 4395 /usr/iw/OpenDeployNG/jre/bin/java -Xmx256M -Xms64M -Djava.security.manager -Dja
(it is 4395 that is running consistently at 50%).
Now I know that 4379 is the iwod60 process, however I have no idea what the 4395 process is (any help here would be appreciated). I have been through all the OpenDeploy startup scripts and I have no idea what this script actually is, how it is started and why is it a child of the iwod60 process.
HOWEVER, on our TEST server (which is pretty similar to our PROD server), the equivalent processes are not related, per:
ptree 13307
13307 /usr/iw/OpenDeployNG/jre/bin/java -Xss2m -Xms32m -Xmx256m -classpath /usr/iw/Op
ptree 1558
1558 /usr/iw/OpenDeployNG/jre/bin/java -Xmx256M -Xms64M -Djava.security.manager -Dja
Incidentally, no config files have changed since well before this problem started.
So any help would be greatly appreciated....even a clue as to what this second (non-iwod60) process could be??
Cheers and thanks.