Project

General

Profile

Actions

Bug #4600

closed

Netbackup Catalog verification hangs waiting for NDMP server

Added by Jan Kryl over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
cmd - userland programs
Start date:
2014-02-14
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
NDMP
Gerrit CR:
External Bug:

Description

Steps to Reproduce:

1. Run Manual backup
2. Click on the Catalog in the tree view
3. In "Date / time ranges" set Between 11/21/2013 And "11/25/2013" ( The time backup was done)
4. Click on the "Search Now " button
5. Right click on the found "Backup ID" (nx-ndmp01_1385143328) and select "Verify"

Expected Results: Verify success
Actual Results: "Netbackup catalog verify" hang, and NDMP device fail

Actions #1

Updated by Jan Kryl over 9 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 100

Symantec catalog verify is in fact NDMP restore from our ndmpd's point of view. DMA requests data from particular backup image and does some kind of verification on them. The problem is that at some point in time the verification gets stucked. We have gathered all possible data at the point when the verification is stuck and will try to reconstruct what happened.

There are two active ndmp sessions. The newer session without initialized mover probably probes for tapes in other two drives (/dev/rmt/1n and /dev/rmt/2n). The older session has initialized mover, mover state is active - it is neither paused nor halted - and current mover record number is 33489. Control connection for this session uses port 36117 on client side. When we filter this port number from snoop file we can see what happened in this session. It begins as any other restore session:

1) Connection is established
2) DMA opens the tape device
3) DMA reads label from first block on the tape
4) DMA jumps over first 10 backup images stored on the tape
5) DMA read header for 11th backup image (the image which it wants to verify)
6) DMA sets configuration parameters for mover (around frame 5813)
7) DMA tells mover to establish data connection to the other end (which is DMA in this case)
8) DMA tells mover to send data
9) after 5 minutes DMA requests mover status - no response
10) after 8 minutes DMA requests mover status again - no response
11) DMA closes the tape device
12) DMA requests mover status again - no response
13) our ndmpd sends message that mover has been halted (probably because DMA closed data connection)
14) DMA initiates control connection shutdown
15) out ndmpd suddenly "wakes-up" and sends two mover get state replies informing DMA that current state of mover is halted because the data connection was closed. Reported current mover record number is 36790.
16) control tcp connection is closed

pfiles output confirms that tcp connection with descriptor #14 is established at the time of hang:

14: S_IFSOCK mode:0666 dev:617,0 ino:62463 uid:0 gid:0 size:0
O_RDWR
SOCK_STREAM
SO_KEEPALIVE,SO_SNDBUF(61440),SO_RCVBUF(131768)
sockname: AF_INET 10.80.80.83 port: 43938
peername: AF_INET 10.81.33.217 port: 4512

pstack output shows that there is one thread stuck in write syscall at the time of hang:

----------------- lwp# 59 / thread# 59 --------------------
feee5805 write (e, 83a4008, 10000)
08082e05 mover_data_write_v3 (fddef110, e, 2, 1) + 20d
08091ae1 ndmpd_select (fddef110, 1, ffffffff, 806c33a) + 295
0806c359 connection_handler (82bd288) + 211
0806baca ndmpd_worker (8374c98, fef73000, fddeffe8, feedfe0e) + 56
feedfe63 _thrp_setup (feda5200) + 9b
feee00f0 _lwp_start (feda5200, 0, 0, 0, 0, 0)

More information about mover_data_write_v3 thread. It is a function which is executed from main ndmpd's event loop whenever a tcp file descriptor is ready to accept more data. If it's ready then mover_data_write_v3 reads data from tape and sends them over tcp. In this case the thread is stuck in write syscall, which most likely means that flow-control on tcp connection was activated, because the other end isn't receiving data at sufficient speed or there is a problem with connection itself. This is weird because mover sockets should have N_DELAY flag set, which means that write or read never blocks on them.

in ndmpd's log file following debug messages deserve attention as they describe what thread #59 was doing (this is the thread which is stuck in write syscall):

12/10 12:40:33 DEBUG: [59][mover_data_write_v3:3922]:read 65536 bytes from tape
12/10 12:40:33 DEBUG: [59][mover_data_write_v3:3969]:wlen window restrictions: 2846162944
12/10 12:40:33 DEBUG: [59][mover_data_write_v3:3980]:buffer restrictions: wlen 65536 len 65536
...
12/10 12:50:35 DEBUG: [59][mover_data_write_v3:3994]:n -1 errno 32
12/10 12:50:35 DEBUG: [59][ndmp_process_messages:1199]:reply_expected: FALSE
12/10 12:50:35 DEBUG: [59][ndmp_process_messages:1279]:received request: 0xa00
12/10 12:50:35 DEBUG: [59][ndmp_process_messages:1279]:received request: 0xa00
12/10 12:50:35 DEBUG: [59][ndmp_process_messages:1279]:received request: 0x301
12/10 12:50:35 DEBUG: [59][ndmp_open_list_del:1532]:Removed dev: /dev/rmt/0n, sid: 0, lun: 0
12/10 12:50:35 DEBUG: [59][ndmp_process_messages:1307]:no more messages in stream buffer
12/10 12:50:35 DEBUG: [59][connection_handler:921]:Connection terminated

In the first section we can see that a block of data was successfully read from tape. Then the data should have been sent over tcp connection and another debug message should have been printed. We see that this doesn't happen until 10 minutes later when the write returns with error status EPIPE. That means that the data could not be sent because the tcp connection has been closed.
All debug data perfectly fit together and there are no inconsistencies among them. Questions:

1) Why write blocks?
2) Why write blocks even though that the socket was considered to be ready to accept more data?
3) Why write blocks even though that it should be a non-blocking call?
4) Why ndmpd doesn't react to mover get state requests?

Add 1) To answer this we would need snoop of data connection traffic, but resulting snoop file would be too big. Also we could perhaps deduce answer if we knew internals of NBU (not possible).

Add 2) Possible explanation is that socket was ready to accept new data but not as much data as we tried to send.

Add 3) non-blocking option should be set but it isn't as one could see in pfiles output. This is a bug in ndmpd which should be fixed.

Add 4) Because the handler for data connection is stuck, the file descriptor for data connection is removed from a set of descriptors monitored by select(). This explains why ndmpd doesn't react to any events on data connection. Furthermore because the thread which should process events from select() is stuck in write syscall it means that no other NDMP messages are processed until the data connection is closed and the thread exits from write() syscall.

Actions #2

Updated by Jan Kryl over 9 years ago

Fix description: setting of socket options was consolidated to one place (function), which is called from all places where socket options were set. More over for mover sockets which are served from callback from main event loop extra socket option (NON_BLOCKING) was added. All sockets will now respect setting of send/receive buffer size set in SMF properties of ndmpd service. Previously only control connection socket respected these settings.

Actions #3

Updated by Jan Kryl over 9 years ago

  • Status changed from In Progress to Pending RTI
Actions #4

Updated by Robert Mustacchi over 9 years ago

  • Status changed from Pending RTI to Resolved
Actions #5

Updated by Electric Monk over 9 years ago

git commit 97f7c4750238e3c507904468e4e9f17b15a08e2d

Author: Jan Kryl <jan.kryl@nexenta.com>

4600 Netbackup Catalog verification hangs waiting for NDMP server
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>

Actions

Also available in: Atom PDF