TSM 6.2 shut down, ADM6017E in db2diag.log

Posted on Aug 11, 2011 in Databases, DB2, Infrastructure, Storage, TSM

This morning, I noticed that one of my TSM 6.2 servers had stopped running overnight.  Neither dsmserv nor any DB2 processes were running.

I checked the TSM logs, but only found errors from the dsmadmc client:

08/11/11   09:08:52 ANS5216E Could not establish a TCP/IP connection with address 'TSM1.LANIGERA.COM:1500'. The TCP/IP error is 'Connection refused' (errno = 111).
08/11/11   09:08:52 ANS9020E Could not establish a session with a TSM server or client agent.  The TSM return code is -50.
08/11/11   09:08:52 ANS1017E Session rejected: TCP/IP connection failure

The DB2 logs seemed to show a clean, orderly shutdown.  In retrospect, this should have been my first clue.

 2011-08-11- I51702288E456 LEVEL: Warning

PID     : 27587                TID  : 47393806477632  PROC : db2wdog 0
INSTANCE: tsm                  NODE : 000
EDUID   : 2                    EDUNAME: db2wdog 0
FUNCTION: DB2 UDB, routine_infrastructure, sqlerKillAllFmps, probe:5
MESSAGE : Bringing down all db2fmp processes as part of db2stop
DATA #1 : Hexdump, 4 bytes
0x00002B1ABAFFC3D0 : 0000 0000                                  ....

Next, I checked to see if any of the filesystems had filled, but again, everything looked good, with many gigabytes free on each partition:

[tsm@tsm1 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
                       30G  9.5G   19G  34% /t01
                      6.9T  6.1T  662G  91% /t02
                       40G   34G  4.0G  90% /t03
                       40G   17G   22G  44% /t04
                       40G   17G   22G  44% /t05
                       40G  3.4G   35G   9% /t06
                       20G  156M   19G   1% /t07

I restarted the service and did a tail -f on db2diag.log.  Everything looked fine until I connected using the admin client, dsmadmc.  There was a flurry of activity, and then TSM and DB2 shut down.  What the — ?

Looking over db2diag.log again, I noticed references to a full disk:

2011-08-10- E18843682E763       LEVEL: Error
PID     : 4581                 TID  : 47601961396544  PROC : db2sysc 0
INSTANCE: tsm                  NODE : 000
EDUID   : 23                   EDUNAME: db2pclnr (TSMDB1) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbClnrAsyncWriteCompletion, probe:0
MESSAGE : ADM6017E  The table space "TEMPSPACE1" (ID "1") is full. Detected on
          container "/t03/tsm/data/tsm/NODE0000/TSMDB1/T0000001/C0000000.TMP"
          (ID "0").  The underlying file system is full or the maximum allowed
          space usage for the file system has been reached. It is also possible
          that there are user limits in place with respect to maximum file size
          and these limits have been reached.

2011-08-09- E18844446E731       LEVEL: Error (OS)
PID     : 4581                 TID  : 47601961396544  PROC : db2sysc 0
INSTANCE: tsm                  NODE : 000
EDUID   : 23                   EDUNAME: db2pclnr (TSMDB1) 0
FUNCTION: DB2 UDB, oper system services, sqloLioAIOCollect, probe:100
MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
          DIA8312C Disk was full.
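
Searching the log for ADM6017E directly would have surfaced this on the first pass.  Here is a minimal Python sketch of that idea, not a DB2 tool: the function name find_full_tablespaces and the regular expression are illustrative assumptions, and the pattern only assumes the message wording shown in the excerpt above.

```python
import re

# Matches the ADM6017E wording seen in the excerpt above. Other DB2
# versions may phrase the message differently (assumption).
ADM6017E = re.compile(
    r'ADM6017E\s+The table space "(?P<tablespace>[^"]+)"'
    r'.*?container\s+"(?P<container>[^"]+)"',
    re.DOTALL,  # the message wraps across several lines in db2diag.log
)


def find_full_tablespaces(log_text):
    """Return (tablespace, container_path) pairs, one per ADM6017E entry."""
    return [
        (m.group("tablespace"), m.group("container"))
        for m in ADM6017E.finditer(log_text)
    ]
```

Feeding it the contents of db2diag.log should give back the table space name and the container path, and the container path tells you which filesystem to look at.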

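I checked my df output again, and noticed that /t03, the filesystem holding the TEMPSPACE1 container, had a nice, round 10% free: only 4.0G, evidently not enough for the temporary table space to grow into.  Ah.  I resized the LV and grew the ext3 filesystem with system-config-lvm, and was up and running again.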
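
The trap here was reading free space as a percentage instead of as an absolute number on the one filesystem that mattered.  A rough Python sketch of a check that looks at absolute free bytes for whatever filesystem holds a given container path follows.  The function name free_space_report and the threshold are hypothetical; how much headroom TSM's temporary table space really needs is not something this post establishes.

```python
import os
import shutil


def free_space_report(path, min_free_bytes):
    """Return (free_bytes, ok) for the filesystem that holds `path`."""
    # Walk up to the nearest existing directory so the check still works
    # when the container file itself is not present.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    usage = shutil.disk_usage(probe)
    # usage.free is the space available to the calling user, in bytes.
    return usage.free, usage.free >= min_free_bytes
```

For example, free_space_report("/t03/tsm/data/tsm/NODE0000/TSMDB1/T0000001/C0000000.TMP", 10 * 1024**3) would flag the filesystem if fewer than 10 GiB were free, where 10 GiB is an arbitrary figure chosen only for illustration.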