This morning, I noticed that one of my TSM 6.2 servers had stopped running overnight. Neither dsmserv nor any DB2 processes were running.
I checked the TSM logs, but only found errors from the dsmadmc client:
08/11/11 09:08:52 ANS5216E Could not establish a TCP/IP connection with address 'TSM1.LANIGERA.COM:1500'. The TCP/IP error is 'Connection refused' (errno = 111).
08/11/11 09:08:52 ANS9020E Could not establish a session with a TSM server or client agent. The TSM return code is -50.
08/11/11 09:08:52 ANS1017E Session rejected: TCP/IP connection failure
The DB2 logs seemed to show a clean, orderly shutdown. In retrospect, this should have been my first clue.
2011-08-11-10.09.52.122534-300 I51702288E456 LEVEL: Warning
PID : 27587 TID : 47393806477632 PROC : db2wdog 0
INSTANCE: tsm NODE : 000
EDUID : 2 EDUNAME: db2wdog 0
FUNCTION: DB2 UDB, routine_infrastructure, sqlerKillAllFmps, probe:5
MESSAGE : Bringing down all db2fmp processes as part of db2stop
DATA #1 : Hexdump, 4 bytes
0x00002B1ABAFFC3D0 : 0000 0000 ....
Next, I checked to see if any of the filesystems had filled, but again, everything looked good, with many gigabytes free on each partition:
[tsm@tsm1 ~]$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-Vol01   30G  9.5G   19G  34% /t01
/dev/mapper/VolGroup00-Vol02  6.9T  6.1T  662G  91% /t02
/dev/mapper/VolGroup00-Vol03   40G   34G  4.0G  90% /t03
/dev/mapper/VolGroup00-Vol04   40G   17G   22G  44% /t04
/dev/mapper/VolGroup00-Vol05   40G   17G   22G  44% /t05
/dev/mapper/VolGroup00-Vol06   40G  3.4G   35G   9% /t06
/dev/mapper/VolGroup00-Vol07   20G  156M   19G   1% /t07
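A quick way to make high-usage filesystems stand out in output like this is to filter on the Use% column. The following is only a sketch: the function name flag_full and the default threshold of 90 are my own choices, and it assumes df -P style output (one line per filesystem, no spaces in mount points).

```shell
# Print "mountpoint use%" for every filesystem at or above a threshold.
# Reads df -P style output on stdin; flag_full and the default of 90
# are illustrative, not part of any TSM or DB2 tooling.
flag_full() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {                 # skip the header line
            use = $5
            sub(/%/, "", use)    # "90%" -> "90"
            if (use + 0 >= t) print $6, $5
        }'
}

# Usage: df -P | flag_full 90
```

A percentage threshold says nothing about how much space a given workload will actually need, so treat a hit as a prompt to look closer rather than as a diagnosis.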
I restarted the service and did a tail -f on db2diag.log. Everything looked fine until I connected using the admin client, dsmadmc. There was a flurry of activity, and then TSM and DB2 shut down. What the — ?
Looking over db2diag.log again, I noticed references to a full disk:
2011-08-10-184.108.40.2068710-300 E18843682E763 LEVEL: Error
PID : 4581 TID : 47601961396544 PROC : db2sysc 0
INSTANCE: tsm NODE : 000
EDUID : 23 EDUNAME: db2pclnr (TSMDB1) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbClnrAsyncWriteCompletion, probe:0
MESSAGE : ADM6017E The table space "TEMPSPACE1" (ID "1") is full. Detected on container "/t03/tsm/data/tsm/NODE0000/TSMDB1/T0000001/C0000000.TMP" (ID "0"). The underlying file system is full or the maximum allowed space usage for the file system has been reached. It is also possible that there are user limits in place with respect to maximum file size and these limits have been reached.

2011-08-09-10.37.42.879000-300 E18844446E731 LEVEL: Error (OS)
PID : 4581 TID : 47601961396544 PROC : db2sysc 0
INSTANCE: tsm NODE : 000
EDUID : 23 EDUNAME: db2pclnr (TSMDB1) 0
FUNCTION: DB2 UDB, oper system services, sqloLioAIOCollect, probe:100
MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
DIA8312C Disk was full.
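Entries like these are easy to miss in a long db2diag.log. A minimal sketch of a search for the disk-full markers that appear in the two entries above; the function name find_disk_full is made up for illustration, and the log path is left as an argument because it depends on the instance.

```shell
# Show (with line numbers) every line in a db2diag.log that carries one
# of the disk-full markers from the entries above.
# find_disk_full is a hypothetical helper name; pass the log path as $1.
find_disk_full() {
    grep -n -E 'ADM6017E|DIA8312C|SQLO_DISK' "$1"
}

# Usage: find_disk_full path/to/db2diag.log
```

The line numbers make it easier to jump to the full entry and read the container path, which is what points at the affected filesystem.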
I checked my df output again, and noticed that the /t03 filesystem had a nice, round 10% free. Ah. I resized the LV and grew the ext3 filesystem with system-config-lvm, and was up and running again.
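I used the GUI, but the same resize can be done from the command line with lvextend and resize2fs. This is a sketch, not what I actually ran: the helper name grow_lv, the DRY_RUN switch, and the 10G in the usage line are illustrative, and it assumes the volume group has free extents and that the kernel supports growing a mounted ext3 filesystem.

```shell
# Grow a logical volume, then grow the ext3 filesystem on it to match.
# grow_lv and DRY_RUN are illustrative; with DRY_RUN=1 (the default)
# the commands are only printed, not executed.
grow_lv() {
    lv_path="$1"    # e.g. /dev/VolGroup00/Vol03
    extra="$2"      # amount to add, e.g. 10G (hypothetical)
    run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi
    $run lvextend -L "+$extra" "$lv_path" &&
    $run resize2fs "$lv_path"   # with no size given, grows to fill the LV
}

# Usage (as root): DRY_RUN=0 grow_lv /dev/VolGroup00/Vol03 10G
```

Running it once in dry-run mode first shows exactly which commands would be issued against which device.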