Can't say any solutions come to mind, but I would ask these follow-up questions:
1 - Is this only happening on one box? Do you have multiple boxes with the same setup?
2 - During times of UI slowness, which process is running hot? iwserver, jboss, or something else?
3 - Is turning off Samba for a couple of days an option? You could try that and see if the issue persists.
4 - How big/old is your content store? Have you ever done any cleanup, such as removing ancient editions?
5 - Have you run an iwfsck?
6 - Have you done any performance tuning in iw.cfg, whether or not with help from support?
Did you get clarification on what versions of TS are affected, whether this is resolved in 7.4.1(.1), and what conditions trigger the problem? I've got TS 7.3.2 running on Linux and haven't come across those issues.
An update on this. The build 360 version of iwserver.linux did not solve this. Nor did a subsequent install of UVFS 2.0.7 so that we could use its driver files, which was also recommended by support for better performance.
However, when the Unix team turned off snapvault for one night, there were no problems - the TeamSite mounts stayed up and I could work in the TS UI fine. Go figure.
Has anyone else seen TeamSite having an issue with snapvault?
I am now involved in a war of words with our particularly sensitive Unix team on whose problem this is - TeamSite or the keepers of the servers. It would help greatly if anyone else has had some experience of this.
Cheers
HF
Hi there,
We're running TeamSite 7.3.2 pretty stably over here. We're using Red Hat 6.5 64-bit with the latest uvfs drivers, connected to an external Oracle DB for the TeamSite messaging. IMPORTANT: TeamSite Search is installed on a separate server. Make sure you do that! Red Hat is installed on a VMware instance; the OS runs on a virtual disk, but the content store is on an ext3 raw device connected with fibre to a SAN.
The important things to get right here are:
- make sure TeamSite Search is not installed on the same server
- make sure enough inodes are assigned to the filesystem where the store is located
- make sure you have the latest uvfs drivers, properly compiled for the OS
- make sure throughput to the disk where the store is located is as high as possible (quick checks for this and the inode point are sketched below)
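For the inode and throughput points, these are quick spot checks (the mount point is a placeholder - point the commands at wherever your store's filesystem is mounted):

    # inode usage on the store's filesystem - IUse% near 100 means trouble
    df -i /path/to/store

    # rough sequential write throughput to the same disk (writes and removes a 1 GB test file)
    dd if=/dev/zero of=/path/to/store/ddtest bs=1M count=1024 conv=fsync
    rm -f /path/to/store/ddtest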
What you can do for now (and how we discovered latency problems in the past):
Create a small shell script that measures the time it takes to create a few thousand files and delete a few thousand files, and let it run in a recurring loop. You can use 'time' to see how long each run of the script takes. If the result changes according to whatever change you make in your infrastructure, you will have caught the culprit.
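A minimal sketch of such a script (the test directory, file count and interval are placeholders - point it at the filesystem that holds your store):

    #!/bin/bash
    # Rough latency probe: repeatedly time creating and then deleting a few
    # thousand small files, so runs can be compared before/after infrastructure changes.
    TESTDIR=/path/to/store/latency-test   # placeholder - use a directory on the store's filesystem
    COUNT=2000
    mkdir -p "$TESTDIR"
    while true; do
        echo "=== $(date) ==="
        time for i in $(seq 1 $COUNT); do touch "$TESTDIR/f$i"; done
        time rm -f "$TESTDIR"/f*
        sleep 60
    done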
Hope this helps.
Greetings,
De Smet Koen - Belgacom
Hi Koen,
it's me, Peter!
We have tried moving the data store to another NetApp filer.
I have uninstalled TS Search from the TS server, and I will install it on another server I have set-up, once we get through these problems.
I've put in uvfs2.0.7 and we are now using its drivers, and I also tried the sync_content_writes switch in iw.cfg (and took it out again when it didn't work).
The latest change is that we have moved the iw-store mount definition from autofs to fstab, so we will see how that goes.
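For the record, the change looks roughly like this - the filer name and export path below are placeholders rather than our real ones, and your NFS options may differ:

    # before: indirect autofs map entry (in the map referenced from auto.master), now removed
    #   iw-store   -rw,hard,intr   filer01:/vol/iwstore

    # after: static mount in /etc/fstab
    filer01:/vol/iwstore   /iw-store   nfs   rw,hard,intr,bg   0 0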
Yes I think I do need to check general server response times. I had been concentrating on TeamSite or mount responses such as iwstat or df -k, because I had allowed myself to be convinced that was where the problem lies. What I notice is that the first sign of a slowdown is when iwstat or df -k are slow to return, followed by a buildup of TeamSite processes, followed by an increasing load (iwstat and top). I need to prove or disprove that the Linux server as a whole has slowed.
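One way to check that would be to log timings of a TeamSite command and a plain OS command side by side, something like the sketch below (assumes GNU time at /usr/bin/time; the log path and interval are arbitrary):

    #!/bin/bash
    # Log how long iwstat and a plain df -k take once a minute, so a TeamSite-only
    # slowdown can be told apart from the whole server slowing down.
    LOG=/var/tmp/ts-response.log
    while true; do
        {
            echo "=== $(date) ==="
            echo "--- iwstat ---"; /usr/bin/time -p iwstat > /dev/null
            echo "--- df -k ---";  /usr/bin/time -p df -k  > /dev/null
        } >> "$LOG" 2>&1
        sleep 60
    done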
PJ
A further update and query.
Since we have moved the mount definition for iw-store from autofs to a hard static mount via fstab, we seem to be having fewer performance issues - fewer slowdowns or freezes.
I don't see anything in the documentation on autofs versus fstab, so I was wondering: Has anybody else found they needed to change from autofs to fstab (or vice versa)?
Hey Peter,
Good to hear from you. Thanks again for your recommendation on Link.
Good to hear you kind of sorted it.
AutoFS is usually a bad idea. As a matter of fact, every sentence that starts with 'automatic' is. Pronouncing the word should set off an alarm. What I usually call it is 'automatic catastrophe'.
Do check these too:
- make sure you're on an ext4 filesystem: ext2/3 has a limit on subdirectories (around 32,000 per directory)
- do check you've got enough inodes (see http://unix.stackexchange.com/questions/26598/how-can-i-increase-the-number-of-inodes-in-an-ext4-filesystem) - both of these points can be checked with the commands sketched below this list
- test for yourself how much time it takes to write a few thousand files, and compare it between systems with different specs => this is how we discovered a problem with a fibre card cable in the past.
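Quick ways to check those first two points (the mount point and device below are placeholders):

    # which filesystem type is the store actually on?
    df -T /path/to/store

    # total vs. free inodes on the underlying device
    tune2fs -l /dev/sdX1 | grep -i inode

    # note: on ext3/ext4 the inode count is fixed when the filesystem is created,
    # so if you are short the fix is a rebuild with a smaller bytes-per-inode ratio,
    # e.g. mkfs.ext4 -i 8192 /dev/sdX1 (destructive - only on a new/empty device)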
Hope that helps.
Koen
I missed your response coming in unfortunately.
Changing to fstab certainly reduced the length and magnitude of the slowdowns, which became more momentary but still occurred - in particular during three nightly timeslots at 6:10, 9:10 and 10:10.
However, I think I have fixed these, and the cause/solution is a little bizarre, so I will post it here in case it helps anybody, or in case anyone can shed more light:
We have users (people and non-people) accessing TeamSite via Samba and via other mounts of TeamSite on remote servers as well. At times of slowdown I would often see users like '801' or '820' with iwstat or iwrecentusers. I then got a tip from HP-Aut Support that pstacks of iwserver.linux taken during slowdowns showed users not being found in the local cache, so a lookup was being performed. When I looked, I saw getpwuid calls in the pstacks when it was slow, and not when it wasn't.
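If anyone wants to do the same spot check, it is roughly this (assumes pstack, which ships with gdb on Red Hat, is installed):

    # grab a stack snapshot of iwserver.linux while things are slow
    PID=$(pidof iwserver.linux | awk '{print $1}')
    pstack "$PID" > /var/tmp/iwserver.pstack.$(date +%H%M%S)

    # count stack frames sitting in a user lookup - for us this was non-zero only during slowdowns
    grep -c getpwuid /var/tmp/iwserver.pstack.*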
So I made fake TeamSite users out of all the phantom remote users I saw when running iwstat and iwrecentusers. Some were in the NIS passwd, and others were just local users on remote Linux servers that mounted TS. But I added them all to the TS server's passwd file and then made them TS users, so that they would be in the TS user/group cache and not need to be looked up with the getpwuid function. In effect that is tricking the system, but it worked. Specifically, the nightly slowdowns at 6:10, 9:10 & 10:10 stopped, and I could almost tell which users were responsible for those individual events ('cfusion' and 'httpd' users on remote servers).
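The passwd half of that, sketched for one phantom user (the name/UID pairing here is made up for illustration - substitute whatever you actually see in iwstat/iwrecentusers):

    # add a local account for the phantom user if it isn't already in /etc/passwd
    grep -q '^cfusion:' /etc/passwd || useradd -u 801 -M -s /sbin/nologin cfusion

    # then add the same account as a TeamSite user through the normal user admin step,
    # so it lands in the TS user/group cache and getpwuid never has to go off-box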
We are still scratching our heads as to why lookups of non-local users should have such a drastic effect on the system, but it appears this is what was happening and it was backing up processes and increasing the load so that TS froze for a while each time.
Touch wood, but hopefully our performance issues are over.