Project

General

Profile

Bug #10449

wsdiff hangs forever after an exception is thrown

Added by Joshua M. Clulow 10 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
tools - gate/build tools
Start date:
2019-02-25
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:

Description

I was running a wsdiff between two workspaces where the comparison of a particular file induced an encoding error. That error was thrown but not caught in the worker thread. The process wrote no further output to the result file, and appeared to be basically idle.

Before hanging, this was the output:

18:34:51 + /opt/onbld/bin/wsdiff -r artefacts/wsdiff.txt previous/proto/root_i386 current/proto/root_i386
18:37:48 platform/i86xpv/kernel/amd64/unix
18:38:27 platform/i86pc/kernel/amd64/unix
18:41:54 etc/motd
18:45:40 usr/sbin/iasl
18:46:14 usr/sbin/acpixtract
18:46:15 usr/sbin/acpidump
19:38:36 usr/lib/spell/hlistb
19:38:36 Exception in thread Thread-2:
19:38:36 Traceback (most recent call last):
19:38:36   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
19:38:36     self.run()
19:38:36   File "/opt/onbld/bin/wsdiff", line 1233, in run
19:38:36     compareOneFile(base, ptch, False)
19:38:36   File "/opt/onbld/bin/wsdiff", line 1277, in compareOneFile
19:38:36     return compareBasic(base, ptch, quiet, fileType)
19:38:36   File "/opt/onbld/bin/wsdiff", line 1112, in compareBasic
19:38:36     difference(fileName, fileType, diffs)
19:38:36   File "/opt/onbld/bin/wsdiff", line 195, in difference
19:38:36     log_difference(f, dtype, diffs)
19:38:36   File "/opt/onbld/bin/wsdiff", line 212, in log_difference
19:38:36     print(diffs[:diffs_sz_thresh], file=log)
19:38:36 UnicodeEncodeError: 'ascii' codec can't encode characters in position 64-66: ordinal not in range(128)
19:38:36 
19:47:31 usr/share/lib/java/javadoc/dtrace/api/org/opensolaris/os/dtrace/class-use/StackFrame.html
19:47:31 usr/share/lib/java/javadoc/dtrace/api/org/opensolaris/os/dtrace/class-use/ProcessEvent.html
19:47:31 usr/share/lib/java/javadoc/dtrace/api/org/opensolaris/os/dtrace/class-use/ErrorEvent.html

The stuck wsdiff process has three worker threads (from the original total of 4, for a 4 CPU VM?) remaining, all of which are waiting to acquire a Python-level lock:

root@jenkins-oi0:~# pstack 16110 
16110:    /usr/bin/python2.7 /opt/onbld/bin/wsdiff -r artefacts/wsdiff.txt previ
--------------------- thread# 1 / lwp# 1 ---------------------
 fed265d5 pollsys  (8045590, 0, 8045628, 0)
 fecbad39 pselect  (0, fed9f020, fed9f020, fed9f020, 8045628, 0) + 232
 fecbb03b select   (0, 0, 0, 0, 8045688) + 8e
 fe951d8c time_sleep (0, fe3dd02c, 0, feeba79b) + ac
 feeb9f84 PyEval_EvalFrameEx (1ff, 3cfa8ed9, fef55000, feebdbcd) + 6024
 feebdbcd PyEval_EvalCodeEx (fe8a0430, fe8a0430, fedf97ab, feeb9c9f, fea11410, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fed95804, ffffffff, 8045958, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (fed95000, fed97870, 8045a08, feebdf0e, fea11458, feb97714) + 7bd
 feebdf0e PyEval_EvalCode (fea11458, feb97714, feb97714, 83ddbc0, fed95000, fed97870) + 2e
 feee2373 PyRun_FileExFlags (fed97870, 8045dc3, 101, feb97714, feb97714, 1) + 73
 feee3905 PyRun_SimpleFileExFlags (fed97870, 8045dc3, 1, 8045b3c) + d5
 feee400d PyRun_AnyFileExFlags (fed97870, 8045dc3, 1, 8045b3c, fef693c0, 0) + 6d
 feefc494 Py_Main  (6, 8045c6c, 8045c6c, 8051044) + dd4
 08051057 main     (8045c0c, feda2308, 8045c48, 8050f18, 6, 8045c6c) + 27
 08050f18 _start_crt (6, 8045c6c, f7d9bdb0, 0, 0, 0) + 97
 08050dea _start   (6, 8045db0, 8045dc3, 8045dd9, 8045ddc, 8045df1) + 1a
--------------------- thread# 2 / lwp# 2 ---------------------
 fed21ec9 lwp_park (0, 0, 0)
 fed1575e sema_wait (8429fe0) + 19
 fed07ecd sem_wait (8429fe0) + 35
 feef7dd9 PyThread_acquire_lock (8429fe0, 1, fdf5f178, feefe23d, feb9b140, fe8f10e0) + 99
 feefe250 lock_PyThread_acquire_lock (feb9b140, feb6102c, 0, fed95000) + 50
 feeb9f84 PyEval_EvalFrameEx (fed9a340, fed95000, fdf5f2b8, feebdbcd) + 6024
 feebdbcd PyEval_EvalCodeEx (1ff, 641dd4a6, feb62020, feeb9c9f, febe39b0, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (24, 4, fdf5f488, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 7747505b, fdf5f4b8, feeb9c9f, febe34e8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (c, 1, fdf5f628, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 72d3e7f1, fdf5f658, feeb9c9f, fea11140, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fed9a340, fed95000, fdf5f798, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 2e0f5864, feb62020, feeb9c9f, fea113c8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fe8f0d10, fe8a82d4, fdf5f938, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (fe3dd82c, fe3ec36c, fdf5f998, feeb9c9f, fea112a8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fe87a98c, fe88172c, fdf5fb68, feeb9ee9, fe3ec36c, 0) + 5d3f
 feeb9ee9 PyEval_EvalFrameEx (0, 0, 0, feebdbcd) + 5f89
 feebdbcd PyEval_EvalCodeEx (0, 0, 0, fee2e785, fe91fde8, fe9178ac) + 7bd
 fee2e785 function_call (fe8816bc, fe9165ec, 0, fee64ee2, fe9165f8, 0) + 75
 fedf8fde PyObject_Call (fe8816bc, fe9165ec, 0, 0, 0, fe3dd82c) + 4e
 fee0b4a8 instancemethod_call (fe898cac, feb6102c, 0, fed07ecd, 8459da0, fef55000) + a8
 fedf8fde PyObject_Call (fe898cac, feb6102c, 0, feef7d4a, fef55000, 842c728) + 4e
 feeb391c PyEval_CallObjectWithKeywords (fe898cac, feb6102c, 0, feefe65f) + 4c
 feefe68b t_bootstrap (83ddfd0, 0, 0, 0) + 4b
 fed21cdb _thrp_setup (fdf70240) + 88
 fed21e70 _lwp_start (fdf70240, 0, 0, 0, 0, 0)
--------------------- thread# 4 / lwp# 4 ---------------------
 fed21ec9 lwp_park (0, 0, 0)
 fed1575e sema_wait (8429fe0) + 19
 fed07ecd sem_wait (8429fe0) + 35
 feef7dd9 PyThread_acquire_lock (8429fe0, 1, fdb21178, feefe23d, feb9b140, fe8f10e0) + 99
 feefe250 lock_PyThread_acquire_lock (feb9b140, feb6102c, 0, fed95000) + 50
 feeb9f84 PyEval_EvalFrameEx (fed95000, 0, 0, feebdbcd) + 6024
 feebdbcd PyEval_EvalCodeEx (1ff, 641dd4a6, feb62020, feeb9c9f, febe39b0, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (24, 4, fdb21488, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 7747505b, fdb214b8, feeb9c9f, febe34e8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (c, 1, fdb21628, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 72d3e7f1, fdb21658, feeb9c9f, fea11140, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fed9a340, fed95000, fdb21798, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 2e0f5864, feb62020, feeb9c9f, fea113c8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fe8f0d10, fe8a8acc, fdb21938, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (fe3ddcec, fe3d433c, fdb21998, feeb9c9f, fea112a8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fd26616c, fe88172c, fdb21b68, feeb9ee9, fe3d433c, 0) + 5d3f
 feeb9ee9 PyEval_EvalFrameEx (0, 0, 0, feebdbcd) + 5f89
 feebdbcd PyEval_EvalCodeEx (0, 0, 0, fee2e785, fe91fde8, fe9178ac) + 7bd
 fee2e785 function_call (fe8816bc, fe3dd90c, 0, fee64ee2, fe3dd918, 0) + 75
 fedf8fde PyObject_Call (fe8816bc, fe3dd90c, 0, 0, 0, fe3ddcec) + 4e
 fee0b4a8 instancemethod_call (fd94f414, feb6102c, 0, fed07ecd, 8459da0, fef55000) + a8
 fedf8fde PyObject_Call (fd94f414, feb6102c, 0, feef7d4a, fef55000, 840d1d8) + 4e
 feeb391c PyEval_CallObjectWithKeywords (fd94f414, feb6102c, 0, feefe65f) + 4c
 feefe68b t_bootstrap (83de010, 0, 0, 0) + 4b
 fed21cdb _thrp_setup (fdf71240) + 88
 fed21e70 _lwp_start (fdf71240, 0, 0, 0, 0, 0)
--------------------- thread# 5 / lwp# 5 ---------------------
 fed21ec9 lwp_park (0, 0, 0)
 fed1575e sema_wait (8429fe0) + 19
 fed07ecd sem_wait (8429fe0) + 35
 feef7dd9 PyThread_acquire_lock (8429fe0, 1, fd54e178, feefe23d, feb9b140, fe8f10e0) + 99
 feefe250 lock_PyThread_acquire_lock (feb9b140, feb6102c, 0, fed95000) + 50
 feeb9f84 PyEval_EvalFrameEx (fed9a340, fed95000, fd54e2b8, feebdbcd) + 6024
 feebdbcd PyEval_EvalCodeEx (1ff, 641dd4a6, feb62020, feeb9c9f, febe39b0, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (24, 4, fd54e488, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 7747505b, fd54e4b8, feeb9c9f, febe34e8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (c, 1, fd54e628, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 72d3e7f1, fd54e658, feeb9c9f, fea11140, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fed9a340, fed95000, fd54e798, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (1ff, 2e0f5864, feb62020, feeb9c9f, fea113c8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fe8f0d10, fdb4646c, fd54e938, feebdbcd) + 5d3f
 feebdbcd PyEval_EvalCodeEx (fe3dddac, fe3d44c4, fd54e998, feeb9c9f, fea112a8, feb97714) + 7bd
 feeb9c9f PyEval_EvalFrameEx (fd26658c, fe88172c, fd54eb68, feeb9ee9, fe3d44c4, 0) + 5d3f
 feeb9ee9 PyEval_EvalFrameEx (0, 0, 0, feebdbcd) + 5f89
 feebdbcd PyEval_EvalCodeEx (0, 0, 0, fee2e785, fe91fde8, fe9178ac) + 7bd
 fee2e785 function_call (fe8816bc, fe8a9c0c, 0, fee64ee2, fe8a9c18, 0) + 75
 fedf8fde PyObject_Call (fe8816bc, fe8a9c0c, 0, 0, 0, fe3dddac) + 4e
 fee0b4a8 instancemethod_call (fd94f43c, feb6102c, 0, fed07ecd, 8459da0, fef55000) + a8
 fedf8fde PyObject_Call (fd94f43c, feb6102c, 0, feef7d4a, fef55000, 845fb58) + 4e
 feeb391c PyEval_CallObjectWithKeywords (fd94f43c, feb6102c, 0, feefe65f) + 4c
 feefe68b t_bootstrap (83de030, 0, 0, 0) + 4b
 fed21cdb _thrp_setup (fdf71a40) + 88
 fed21e70 _lwp_start (fdf71a40, 0, 0, 0, 0, 0)

The stuck process appears to be sleeping for one second in a loop:

root@jenkins-oi0:~# truss -D -p 16110 
/1:     0.1207    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/5:    lwp_park(0x00000000, 0)        (sleeping...)
/4:    lwp_park(0x00000000, 0)        (sleeping...)
/2:    lwp_park(0x00000000, 0)        (sleeping...)
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0011    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0006    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0006    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:     1.0004    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0006    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0006    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0004    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0005    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0
/1:    pollsys(0x08045590, 0, 0x08045628, 0x00000000) (sleeping...)
/1:     1.0006    pollsys(0x08045590, 0, 0x08045628, 0x00000000)    = 0

Which I suspect aligns with this code:

...
        # Wait for the threads to finish and do cleanup if interrupted
        debug("Waiting for the threads to finish")
        while True:
                if not True in [thread.isAlive() for thread in mythreads]:
                    break
                else:
                    # Some threads are still going
                    time.sleep(1)
...

Looking at the log_difference function, where the original exception occurred, it's not difficult to see what's happened:

 197 #
 198 # Do the actual logging of the difference to the results file
 199 #
 200 def log_difference(f, dtype, diffs) :
 201 
 202         if logging :
 203                 log_lock.acquire()
 204                 print(f, file=log)
 205                 print("NOTE: " + dtype + " difference detected.", file=log)
 206 
 207                 difflen = len(diffs)
 208                 if difflen > 0 :
 209                         print('', file=log)
 210 
 211                         if not vdiffs and difflen > diffs_sz_thresh :
 212                                 print(diffs[:diffs_sz_thresh], file=log)
 213                                 print("... truncated due to length: " +
 214                                       "use -v to override ...", file=log)
 215                         else :
 216                                 print(diffs, file=log)
 217                         print('\n', file=log)
 218                 log.flush()
 219                 log_lock.release()

We're logging to a result file, so we take log_lock on 203. When we attempt to print the differences on line 212, an encoding exception is thrown. The exception is not caught within the function, or indeed anywhere in the worker thread, so the thread exits at this point while holding log_lock. The other three threads get to work on at most one more item, which they print to stdout (using output_lock, not held) and then when they go to write differences to the file, they hang forever -- this explains the three additional HTML files mentioned before the process came to rest.

There are quite a lot of places where we use the same pattern for different locks. The whole program should exit in error if an exception is not handled before unwinding to the processing loop in workerThread().


Related issues

Related to illumos gate - Bug #10448: wsdiff explodes on encoding errorClosed2019-02-25

Actions

History

#1

Updated by Joshua M. Clulow 10 months ago

  • Related to Bug #10448: wsdiff explodes on encoding error added

Also available in: Atom PDF