t_sync(3nsl) is not thread safe
The t_sync(3nsl) function is not thread safe.
The t_sync() synchronizes the data structures maintained by libnsl with the information from the transport provider (kernel). It does so by calling _t_create() via _tx_sync() and _t_checkfd(). When the _t_create() is executed, the _ti_userlock mutex is held.
The _t_create() function first calls add_tilink() to allocate new _ti_user structure and to add it into hash_bucket. Or, if there already is a _ti_user structure for this file descriptor in the hash_bucket, the existing _ti_user structure is reused. After that the _ti_user structure is initialized - _t_reinit_tiptr() is called.
Once the _ti_user structure (either newly allocated or reused) is returned by add_tilink(), the _t_create() acquires ti_lock mutex associated with it. Then the _ti_user structure initialization is completed. Right before the _t_create() is returned, the ti_lock mutex is released.
In a case something fails during the _ti_user structure initialization in _t_create(), _t_delete_tilink() is called to remove the _ti_user structure from hash_bucket and to free it.
OTOH, the typical use case of the _ti_user structure in other TLI/XTI functions could be visible in _tx_look():
53 if ((tiptr = _t_checkfd(fd, 0, api_semantics)) == NULL) 54 return (-1); 55 sig_mutex_lock(&tiptr->ti_lock);
IOW, the _t_checkfd() is called to get the _ti_user structure. Usually, the _t_checkfd() just uses find_tilink() to find the _ti_user structure in the hash_bucket. Once the _ti_user structure is obtained via _t_checkfd(), the ti_lock is acquired in _tx_look().
All of the above shows two issues related to t_sync():
t_sync() could modify the _ti_user structure without holding the ti_lock mutex in add_tilink(). This operation is protected only by _ti_userlock, but this is not sufficient. There could easily be some other thread without the _ti_userlock held, but with ti_lock acquired (see _tx_look() example above). This effectively causes that the add_tilink() modifies the _ti_user structure used by some other thread.
To fix this issue we need to lock the ti_lock in add_tilink() before we touch it.
Since there is a gap between the _t_checkfd() call and ti_lock acquiring in almost all TLI/XTI functions (see _tx_look() example above), it is possible that some other thread will call t_sync() in the gap, reinitialize the _ti_user structure, and destroy it in _t_create() by calling _t_delete_tilink() due to some failure. This will lave the first thread (in _tx_look(), for example) to work on deallocated _ti_user structure.
To fix this we should eliminate the gap by acquiring the ti_lock in _t_checkfd() before the _ti_userlock is released.
The Issue 1 is reproducible using the attached t_sync-test.c:
$ gcc -lnsl -o t_sync-test t_sync-test.c $ ./t_sync-test t_accept failed (s): System error: I/O error c_syn: 35 c_con: 3 $
The test uses three threads. The first thread, implemented in function srv(), does in a cycle t_listen() and t_accept(). The second thread, the function clnt(), is the client connecting to the server in srv() using t_connect(). The third thread is sync(), calling t_sync() in a cycle on the server's file descriptor.
The number of successful connections from clnt() to srv() is counted in c_con counter. The number of successful t_sync() calls is in c_syn.
The test is run for 10 seconds, or until a failure happens.
As we could see in the example above the t_accept() call failed after three successful connections. The problem is reproducible very fast, in less than one second.
To compare, when there is no third thread with t_sync() call, the test is able to finish successfully, with 30k+ successful connections established in 10 seconds:
$ gcc -lnsl -DNOSYNC -o t_sync-test-nosync t_sync-test.c $ ./t_sync-test-nosync c_syn: 0 c_con: 32429 $