Bug #4871

t_sync(3nsl) is not thread safe

Added by Marcel Telka about 6 years ago.

In Progress
lib - userland libraries
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


The t_sync(3nsl) function is not thread safe.

The t_sync() synchronizes the data structures maintained by libnsl with the information from the transport provider (kernel). It does so by calling _t_create() via _tx_sync() and _t_checkfd(). When the _t_create() is executed, the _ti_userlock mutex is held.

The _t_create() function first calls add_tilink() to allocate new _ti_user structure and to add it into hash_bucket. Or, if there already is a _ti_user structure for this file descriptor in the hash_bucket, the existing _ti_user structure is reused. After that the _ti_user structure is initialized - _t_reinit_tiptr() is called.

Once the _ti_user structure (either newly allocated or reused) is returned by add_tilink(), the _t_create() acquires ti_lock mutex associated with it. Then the _ti_user structure initialization is completed. Right before the _t_create() is returned, the ti_lock mutex is released.

In a case something fails during the _ti_user structure initialization in _t_create(), _t_delete_tilink() is called to remove the _ti_user structure from hash_bucket and to free it.

OTOH, the typical use case of the _ti_user structure in other TLI/XTI functions could be visible in _tx_look():

53    if ((tiptr = _t_checkfd(fd, 0, api_semantics)) == NULL)
54        return (-1);
55    sig_mutex_lock(&tiptr->ti_lock);

IOW, the _t_checkfd() is called to get the _ti_user structure. Usually, the _t_checkfd() just uses find_tilink() to find the _ti_user structure in the hash_bucket. Once the _ti_user structure is obtained via _t_checkfd(), the ti_lock is acquired in _tx_look().

All of the above shows two issues related to t_sync():

Issue 1

t_sync() could modify the _ti_user structure without holding the ti_lock mutex in add_tilink(). This operation is protected only by _ti_userlock, but this is not sufficient. There could easily be some other thread without the _ti_userlock held, but with ti_lock acquired (see _tx_look() example above). This effectively causes that the add_tilink() modifies the _ti_user structure used by some other thread.

To fix this issue we need to lock the ti_lock in add_tilink() before we touch it.

Issue 2

Since there is a gap between the _t_checkfd() call and ti_lock acquiring in almost all TLI/XTI functions (see _tx_look() example above), it is possible that some other thread will call t_sync() in the gap, reinitialize the _ti_user structure, and destroy it in _t_create() by calling _t_delete_tilink() due to some failure. This will lave the first thread (in _tx_look(), for example) to work on deallocated _ti_user structure.

To fix this we should eliminate the gap by acquiring the ti_lock in _t_checkfd() before the _ti_userlock is released.

Reproduction steps

The Issue 1 is reproducible using the attached t_sync-test.c:

$ gcc -lnsl -o t_sync-test t_sync-test.c
$ ./t_sync-test
t_accept failed (s): System error: I/O error
c_syn: 35
c_con: 3

The test uses three threads. The first thread, implemented in function srv(), does in a cycle t_listen() and t_accept(). The second thread, the function clnt(), is the client connecting to the server in srv() using t_connect(). The third thread is sync(), calling t_sync() in a cycle on the server's file descriptor.

The number of successful connections from clnt() to srv() is counted in c_con counter. The number of successful t_sync() calls is in c_syn.

The test is run for 10 seconds, or until a failure happens.

As we could see in the example above the t_accept() call failed after three successful connections. The problem is reproducible very fast, in less than one second.

To compare, when there is no third thread with t_sync() call, the test is able to finish successfully, with 30k+ successful connections established in 10 seconds:

$ gcc -lnsl -DNOSYNC -o t_sync-test-nosync t_sync-test.c
$ ./t_sync-test-nosync 
c_syn: 0
c_con: 32429


t_sync-test.c (2.3 KB) t_sync-test.c Marcel Telka, 2014-05-20 08:23 PM

Related issues

Related to illumos gate - Bug #4545: _t_create(): Use after free in error code pathsClosed2014-01-29


Also available in: Atom PDF