Bug #12694


race between write() and shutdown() for unix sockets

Added by John Levon about 2 years ago. Updated almost 2 years ago.

Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:



I found a race condition when requesting stats over haproxy's local UDS socket.
The client sends out a string command, and expects data back, but very occasionally
will get back just an EOF. Much dtracing confirmed that neither side were doing anything
incorrect at the system call layer. The race goes as follows:

The client by this point has sent the command out over the socket, and done a shutdown(SHUT_WR).

Now it's waiting for some data from haproxy:

 37  76382                 kstrgetmsg:entry ------------------ T1604397695044427 vnode fffffe27a2f52000 stp:fffffe282c511350
 37  75699                     strget:entry T1604397695049725 stp:fffffe282c511350 q:fffffe2719643848 (next:0) sd_flag: 2800
 36 115409                    tl_wput:entry T1604397695049815 wq:fffffe277761b138 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 type 1 mtype 28
 36 115493         tl_wput_common_ser:entry T1604397695052687 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 ->te_wq:fffffe277761b138
 37  75700                    strget:return T1604397695053836 ret:0
 37  78194                   strwaitq:entry T1604397695056500 sd_flag: 2800

It goes to sleep.

Now haproxy sends some data:

 36  78747            strwrite_common:entry T1604397695666377 stp:fffffe274c2dc800
 36  62143                    putnext:entry T1604397695672796 wqp:ffffff18098043e8 mp:fffffe27058e6660 stp:fffffe274c2dc800 putproc:tl`tl_wput
 36 115409                    tl_wput:entry T1604397695676466 wq:fffffe2719646418 mp:fffffe27058e6660 tep:fffffebc7f1908c8 type 0 mtype 2020614179
 36 115409                    tl_wput:entry T1604397695677777 wq:fffffe2719646418 mp:fffffe27058e6660
 36  68062         serializer_enqueue:entry T1604397695680837 proc:tl`tl_wput_data_ser mp:fffffe27058e6660
 36  78748           strwrite_common:return T1604397695684977

Note that often we'd be able to strrput() directly to the peer (haproxy-socket) and all is good. However,
this time, the serializer is busy, so we place ourselves on it to run tl_wput_data_ser later.

Next, haproxy does a shutdown(SHUT_WR) as is its occasional wont:

 36  62143                    putnext:entry T1604397695706917 T_ORDREL_REQ for q:ffffff18098043e8 q_next:fffffe2719646418 mp:fffffe30b0735000 stp:fffffe274c2dc800 putproc:tl`tl_wput exec haproxy


 36 115409                    tl_wput:entry T1604397695715599 wq:fffffe2719646418 mp:fffffe30b0735000 tep:fffffebc7f1908c8 type 1 mtype 10

This gets queued into the serializer via a tl_putq_ser() callback.

Next, haproxy calls close:

 36  78152                   strclose:entry T1604397695739902 stp:fffffe274c2dc800
 36  75705                    qdetach:entry T1604397695743477 stp:fffffe274c2dc800 close:tl`tl_close

And we run through the serializer. First, we have our pending data write, tl_wput_data_ser():

 46 115475           tl_wput_data_ser:entry T1604397695747558 mp:fffffe27058e6660 tep:fffffebc7f1908c8
tep fffffebc7f1908c8 te_state 11
peer fffffebc7eefdbd0 te_state 10


 46  71082                       putq:entry T1604397695752711 q:fffffe2719646418 mp:fffffe27058e6660
 46 115476          tl_wput_data_ser:return T1604397695763091

Rather than taking the "fast path" there, however, the peer state is
TS_WIND_ORDREL, so we place it back onto haproxy's write queue. Presumably the
idea is that we're later going to process it via the service routine again
(tl_wput_ser()->tl_wput_common_ser()), but it's not clear.

Next on the serializer queue we have the tl_putq_ser() callback - our T_ORDREL_REQ from
above. But this routine does this:

1903         if (tep->te_closing) {
1904                 tl_wput_ser(mp, tep);
1905         } else {
1906                 TL_PUTQ(tep, mp);

And as we can see:

 46 115469                tl_putq_ser:entry T1604397695766517 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_closing 1 type 1 mtype 10

qdetach()->tl_close() ran enough to set ->te_closing. This now gets processed immediately:

 46 115495                tl_do_proto:entry T1604397695784218 handling T_ORDREL_REQ for mp:fffffe30b0735000 tep:fffffebc7f1908c8  ->wq:fffffe2719646418 next:0 stp:fffffe274c2dc800
 46 115501                  tl_ordrel:entry T1604397695788117
 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_wq:fffffe2719646418
 peer tep:fffffebc7eefdbd0 peer_rq:fffffe271403bad0 to get T_ORDREL_IND
 46  62143                    putnext:entry T1604397695799793 T_ORDREL_IND for q:fffffe271403bad0 q_next:fffffe2719643848 mp:fffffe30b0735000 stp:fffffe282c511350
 46  77566                    strrput:entry T1604397695803390 handling T_ORDREL_IND for q:fffffe2719643848 next:0 stp:fffffe282c511350 mp:fffffe30b0735000
 46  77566                    strrput:entry T1604397695804446:q:fffffe2719643848, mp:fffffe30b0735000 type 1 stp:fffffe282c511350 sd_flag 10242 mtype 23
 46  99006              strsock_proto:entry T1604397695809474  T_ORDREL_IND
 46  76788                  strseteof:entry T1604397695812531 vnode fffffe27a2f52000 stp fffffe282c511350

Since the handling of T_ORDREL_REQ is to effectively tell the other side that there's no more writes coming, this translates into strseteof() on the client/read side. Now we wake up from strget() in haproxy-socket, see that STREOF is set, and return a read of zero.

The short version of this race is that the server's shutdown(SHUT_WR) manages to jump
the queue in front of the last data mblk because ->te_closing happens to get set
in between.

Actions #1

Updated by John Levon almost 2 years ago

Given the above, the apparent fix is relatively simple:

1938 2003  static void
1939 2004  tl_wput_data_ser(mblk_t *mp, tl_endpt_t *tep)
1940 2005  {
1941 2006          tl_endpt_t      *peer_tep = tep->te_conp;
1942 2007          queue_t         *peer_rq;
1943 2008  
1944 2009          ASSERT(DB_TYPE(mp) == M_DATA);
1945 2010          ASSERT(IS_COTS(tep));
1946 2011  
1947 2012          IMPLY(peer_tep, tep->te_serializer == peer_tep->te_serializer);

1948 2013  
1949 2014          /*
1950 2015           * fastpath for data. Ignore flow control if tep is closing.
1951 2016           */
1952 2017          if ((peer_tep != NULL) &&
1953 2018              !peer_tep->te_closing &&
1954 2019              ((tep->te_state == TS_DATA_XFER) ||
1955 2020              (tep->te_state == TS_WREQ_ORDREL)) &&
1956 2021              (tep->te_wq != NULL) &&
1957 2022              (tep->te_wq->q_first == NULL) &&
1958      -            ((peer_tep->te_state == TS_DATA_XFER) ||
1959      -            (peer_tep->te_state == TS_WREQ_ORDREL))     &&
     2023 +            (peer_tep->te_state == TS_DATA_XFER ||
     2024 +            peer_tep->te_state == TS_WIND_ORDREL ||
     2025 +            peer_tep->te_state == TS_WREQ_ORDREL) &&

That is, when we come to push out the last piece of data in the walk through above, we take the "fast path" for a peer that's in state TS_WIND_ORDREL after the shutdown(SHUT_WR). That way, we'll go straight ahead and do the strrput(). Then later when the close comes in and sets EOF, we're OK, because we'll have already dealt with the data.

While we're here, we'll also clean up the TE_ state stable so it's more readable.

Actions #2

Updated by John Levon almost 2 years ago

Aside from testing this fix against the haproxy test case and confirming it lasts for hours without the issue (along with general Triton/Manta operation),
I also booted an OI system with it, ran ostest successfully, and ran stress-ng UDS socket tests for some while.

Actions #3

Updated by Electric Monk almost 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit b8ffbd3111b44e6c578236149ddc6140effd3dc2

commit  b8ffbd3111b44e6c578236149ddc6140effd3dc2
Author: John Levon <>
Date:   2020-08-04T18:28:18.000Z

    12694 race between write() and shutdown() for unix sockets
    Reviewed by: Robert Mustacchi <>
    Approved by: Dan McDonald <>


Also available in: Atom PDF