Bug #13700

closed

pollhead_delete trips over bad pointer

Added by Olaf Bohlen over 2 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

I get the following panic when killing hanging postgresql processes from the postgres-HEAD (upcoming 14) test suite:

panic[cpu3]/thread=fffffec85ed5b860: 
BAD TRAP: type=d (#gp General protection) rp=fffffe0116d67a70 addr=fffffec8f1f90a60

postgres: 
#gp General protection
addr=0xfffffec8f1f90a60
pid=857, pc=0xfffffffffba35475, sp=0xfffffe0116d67b60, eflags=0x10203
cr0: 80050033<pg,wp,ne,et,mp,pe>  cr4: 26f8<vmxe,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 80da0d0  
cr3: 2474c58000  
cr8: 0

        rdi: 6f7078012f746f6f rsi: fffffec86ba571a8 rdx: fffffec85ed5b860
        rcx:              100  r8:               a5  r9:                0
        rax:                0 rbx: fffffec86ba571a8 rbp: fffffe0116d67b90
        r10: fffffffffb8762dc r11: fffffec85ed5b860 r12: fffffffffbcce900
        r13: fffffec857cd9248 r14: fffffec87699e000 r15: fffffec8f1f90a60
        fsb:                0 gsb: fffffec83097f000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                d err:                0 rip: fffffffffba35475
         cs:               30 rfl:            10203 rsp: fffffe0116d67b60
         ss:               38

fffffe0116d67970 unix:real_mode_stop_cpu_stage2_end+c40d ()
fffffe0116d67a60 unix:trap+c72 ()
fffffe0116d67a70 unix:cmntrap+e9 ()   
fffffe0116d67b90 genunix:pollhead_delete+55 ()
fffffe0116d67bf0 poll:dpclose+81 ()
fffffe0116d67c20 genunix:dev_close+27 ()
fffffe0116d67c70 specfs:device_close+c0 ()
fffffe0116d67cf0 specfs:spec_close+19d ()
fffffe0116d67d70 genunix:fop_close+66 ()
fffffe0116d67db0 genunix:closef+5e ()
fffffe0116d67df0 genunix:closeall+57 ()
fffffe0116d67e80 genunix:proc_exit+429 ()
fffffe0116d67ea0 genunix:exit+b ()
fffffe0116d67ec0 genunix:rexit+15 ()
fffffe0116d67f10 unix:brand_sys_sysenter+1d2 ()

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
> ::status
debugging crash dump vmcore.3 (64-bit) from sin
operating system: 5.11 illumos-ffe7853a48 (i86pc)
build version: heads/master-0-gffe7853a48-dirty

image uuid: c609e9f2-75b0-cd19-b04b-add90c63e31d
panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe0116d67a70 addr=fffffec8f1f90a60
dump content: kernel pages only
> ::stackregs
fffffe0116d67b90 pollhead_delete+0x55(fffffec857cd9248, fffffec86ba571a8)
fffffe0116d67bf0 dpclose+0x81(a500000064, 3, 2, fffffec870da9cd0)
fffffe0116d67c20 dev_close+0x27(a500000064, 3, 2, fffffec870da9cd0)
fffffe0116d67c70 device_close+0xc0(fffffec8be0a0d40, 3, fffffec870da9cd0)
fffffe0116d67cf0 spec_close+0x19d(fffffec8be0a0d40, 3, 1, 10, fffffec870da9cd0, 0)
fffffe0116d67d70 fop_close+0x66(fffffec8be0a0d40, 3, 1, 10, fffffec870da9cd0, 0)
fffffe0116d67db0 closef+0x5e(fffffec8dd0f5a40)
fffffe0116d67df0 closeall+0x57(fffffec86ff88fc0)
fffffe0116d67e80 proc_exit+0x429(1, 2)
fffffe0116d67ea0 exit+0xb(1, 2)
fffffe0116d67ec0 rexit+0x15(2)
fffffe0116d67f10 _sys_sysenter_post_swapgs+0x14f()

I have hit this crash twice already, both times at the same event. Unfortunately, the following output is from the previous panic, but it should be equivalent:

elmer@ailbhein:~$ cat typescript                                                
Script started on March 18, 2021 at 03:30:08 PM CET
elmer@ailbhein:~$ ps -fu elmer
     UID   PID  PPID   C    STIME TTY         TIME CMD
   elmer   838   239   0   Mar 16 ?           0:00 psql -X -a -q -d regression -v HIDE_TABLEAM=on
   elmer 27983 27980   0 15:30:13 pts/3       0:00 ps -fu elmer
   elmer   373   364   0   Mar 16 ?           0:00 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer   364   239   0   Mar 16 ?           0:06 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer   857   364   0   Mar 16 ?           0:00 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer 20166 20165   0   Mar 16 ?           0:00 /usr/bin/perl ./run_branches.pl --run-all --verbose
   elmer   372   364   0   Mar 16 ?           0:11 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer 26210 26209   0   Mar 16 ?           0:00 gmake NO_LOCALE=1 check
   elmer   369   364   0   Mar 16 ?           0:01 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer   370   364   0   Mar 16 ?           0:01 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer   238 26210   0   Mar 16 ?           0:00 /bin/sh -c PATH="/export/home/elmer/c12/buildroot/HEAD/pgsql.build/tmp_install/
   elmer 26209 20326   0   Mar 16 ?           0:00 sh -c { cd pgsql.build/src/test/regress && gmake NO_LOCALE=1 check; echo $? > /
   elmer 27073 27069   0 14:17:08 pts/2       0:00 -ksh
   elmer   368   364   0   Mar 16 ?           0:00 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer 20326 20166   0   Mar 16 ?           0:00 /usr/perl5/5.22/bin/perl ./run_build.pl --config ./build-farm.conf --verbose HE
   elmer 20165  9097   0   Mar 16 ?           0:00 sh -c ( cd /export/home/elmer/c12 && PATH=/usr/gnu/bin:$PATH ./run_branches.pl 
   elmer   371   364   0   Mar 16 ?           0:04 postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
   elmer   239   238   0   Mar 16 ?           0:00 ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. -
   elmer 27979 27978   0 15:30:08 pts/2       0:00 script
   elmer 27980 27979   0 15:30:08 pts/3       0:00 /bin/ksh -i
   elmer 27978 27073   0 15:30:08 pts/2       0:00 script
elmer@ailbhein:~$ pgrep psql                                                    
838
elmer@ailbhein:~$ pstack 838                                                    
838:    psql -X -a -q -d regression -v HIDE_TABLEAM=on
 fe81c5d5 pollsys  (8047160, 1, 0, 0)
 fe7a747d poll     (8047160, 1, ffffffff) + 61
 fef2f0d7 pqSocketPoll (6, 1, 0, ffffffff) + ac
 fef2efb9 pqSocketCheck (80fea38, 1, 0, ffffffff) + ae
 fef2ee73 pqWaitTimed (1, 0, 80fea38, ffffffff) + 23
 fef2ee48 pqWait   (1, 0, 80fea38, fef29329) + 23
 fef2939d PQgetResult (80fea38, fef6d124, 8047368, fef29a6a) + 81
 fef29acd PQexecFinish (80fea38, 8107d50, 80473a8, fef29788) + 6f
 fef297d1 PQexec   (80fea38, 8107d50, 25, 8128f90) + 55
 08067ca7 SendQuery (8107d50, 80fdbe0, 8047430, 807b4a2) + 3df
 0807b626 MainLoop (80fc1c0, 8, 80479d4, 80b088d) + 9ce
 0806383e process_file (0, 0, 8047968, 808bbf4) + 182
 0808be74 main     (804796c, fe8985c8, 80479a8, 805d31b) + 9c2
 0805d31b _start_crt (8, 80479d4, fefd0c6f, 0, 0, 0) + 9a
 0805d1ea _start   (8, 8047af8, 8047afd, 8047b00, 8047b03, 8047b06) + 1a
elmer@ailbhein:~$ pfiles 838                                                    
838:    psql -X -a -q -d regression -v HIDE_TABLEAM=on
  Current rlimit: 65536 file descriptors
   0: S_IFREG mode:0644 dev:272,65872 ino:508845 uid:5432 gid:5432 size:3356
      O_RDONLY|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/sql/regproc.sql
      offset:3356
   1: S_IFREG mode:0644 dev:272,65872 ino:531226 uid:5432 gid:5432 size:1868
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/results/regproc.out
      offset:1868
   2: S_IFREG mode:0644 dev:272,65872 ino:531226 uid:5432 gid:5432 size:1868
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/results/regproc.out
      offset:1868
   3: S_IFREG mode:0644 dev:272,65872 ino:528826 uid:5432 gid:5432 size:84
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/regression.out
      offset:84
   4: S_IFREG mode:0644 dev:272,65872 ino:509211 uid:5432 gid:5432 size:4944
      O_RDONLY|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/parallel_schedule
      offset:4944
   5: S_IFDOOR mode:0444 dev:537,0 ino:192 uid:0 gid:0 rdev:539,0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[8540]
   6: S_IFSOCK mode:0666 dev:546,0 ino:16958 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peername: AF_UNIX /tmp/pg_regress-oKa4Da/.s.PGSQL.5678
        peer: postgres[364] zone: ailbhein[4]
elmer@ailbhein:~$ pstack 364                                                    
364:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c5d5 pollsys  (8043720, 1, 80437e8, 0)
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047a9d, 8047aa0, 8047af4, 8047af7) + 1a
elmer@ailbhein:~$ pfiles 364                                                    
364:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
  Current rlimit: 65536 file descriptors
   0: S_IFCHR mode:0666 dev:536,5 ino:201223015 uid:0 gid:3 rdev:134,2
      O_RDONLY|O_LARGEFILE
      /dev/null
      offset:0
   1: S_IFREG mode:0644 dev:272,65872 ino:528229 uid:5432 gid:5432 size:2344339
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log
      offset:2344339
   2: S_IFREG mode:0644 dev:272,65872 ino:528229 uid:5432 gid:5432 size:2344339
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/log/postmaster.log
      offset:2344339
   3: S_IFREG mode:0644 dev:272,65872 ino:528826 uid:5432 gid:5432 size:84
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/regression.out
      offset:84
   4: S_IFIFO mode:0000 dev:533,0 ino:60317 uid:5432 gid:5432 rdev:0,0
      O_RDWR|O_NONBLOCK
   5: S_IFIFO mode:0000 dev:533,0 ino:60317 uid:5432 gid:5432 rdev:0,0
      O_RDWR
   6: S_IFSOCK mode:0666 dev:546,0 ino:19348 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX /tmp/pg_regress-oKa4Da/.s.PGSQL.5678
   7: S_IFDOOR mode:0444 dev:537,0 ino:192 uid:0 gid:0 rdev:539,0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[8540]
   8: S_IFSOCK mode:0666 dev:546,0 ino:25650 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK
        SOCK_DGRAM
        SO_DGRAM_ERRIND,SO_SNDBUF(57344),SO_RCVBUF(102400)
        sockname: AF_INET6 ::1  port: 59134
        peername: AF_INET6 ::1  port: 59134
elmer@ailbhein:~$ pargs 364                                                     
pargs: Couldn't determine locale of target process.
pargs: Some strings may not be displayed properly.
364:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/
argv[0]: postgres
argv[1]: -D
argv[2]: /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test/regress/./tmp_check/data
argv[3]: -F
argv[4]: -c
argv[5]: listen_addresses=
argv[6]: -k
argv[7]: /tmp/pg_regress-oKa4Da
elmer@ailbhein:~$ # all postgres server stacks follow                           
elmer@ailbhein:~$ for p in $(pgrep postgres); do pstack $p; done                
373:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (a, d001, 8043010)
 0853c949 WaitEventSetWaitBlock (895cfd8, 2bf20, 80430d0, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, 2bf20, 80430d0, 1, 5000006, 0) + 12e
 0853bd8c WaitLatch (fe5816dc, 29, 2bf20, 5000006) + a6
 084db16b ApplyLauncherMain (0, 895d190, 8043378, 84ac205) + 295
 084ac238 StartBackgroundWorker (0, 80434cc, 80433c8, 0, 8957d20, fe98b000) + 225
 084c0837 do_start_bgworker (897ffc8, 0, fe91bb35, fe8f1b69) + 1d5
 084c0c24 maybe_start_bgworkers (3, 0, 16f, 0, fe8efa50, fe98b000) + 1a2
 084bd638 reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 39c
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
364:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c5d5 pollsys  (8043720, 1, 80437e8, 0)
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047a9d, 8047aa0, 8047af4, 8047af7) + 1a
857:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (b, d001, 8042ee0)
 0853c949 WaitEventSetWaitBlock (895cfd8, ffffffff, 8042fa0, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, ffffffff, 8042fa0, 1, 8000007, 0) + 12e
 0853bd8c WaitLatch (fe57c804, 21, 0, 8000007) + a6
 08396b39 gather_readnext (89f9e7c, 0, 8043028, 870560b) + 1dd
 08396890 gather_getnext (89f9e7c, 89fbc5c, 4, 83db151) + 48
 0839679f ExecGather (89f9e7c, 89fa764, 8043088, 837d68e) + 1f9
 0837d6be ExecProcNodeFirst (89f9e7c, 80430a8, 81f578b, 1) + 3b
 08373a16 ExecProcNode (89f9e7c, 0, 80430f8, 8375e97) + 2c
 08375ec1 ExecutePlan (89f9d5c, 89f9e7c, 1, 1, 1, 0) + a3
 08373fc7 standard_ExecutorRun (898105c, 1, 0, 0, 1, 8962b8c) + 1c1
 08373e00 ExecutorRun (898105c, 1, 0, 0, 1, 8047920) + 5c
 085701b6 PortalRunSelect (89b1164, 1, 0, 8962d14) + ed
 0856fede PortalRun (89b1164, 7fffffff, 1, 1, 8962d14, 8962d14) + 1ea
 0856a3bf exec_simple_query (8961034, 0, 80437a8, 856e3c3) + 406
 0856e41f PostgresMain (1, 80437c4, 895ec20, 898a8ec, 8047920, 885dd07) + 7cc
 084bf972 ExitPostmaster (8980a00, 8980a00, 8043808, 84bf2fc, 3, 89595e8)
 084bf322 BackendStartup (8980a00, 89595e8, 0, 84bbafd) + 1bf
 084bb8b6 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + 1de
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
372:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (5, d001, 8042ee0)
 0853c949 WaitEventSetWaitBlock (8962c2c, ffffffff, 8042fac, 1) + 26
 0853c804 WaitEventSetWait (8962c2c, ffffffff, 8042fac, 1, 5000007, 8042fbc) + 12e
 084b5cae PgstatCollectorMain (0, 0, 80433e8, 84aff95, fe98b000, fe812a40) + 4eb
 084affa1 pgstat_start (3, 0, 16f, 0, fe8efa50, fe98b000) + cf
 084bd62e reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 392
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
369:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (a, d001, 8041c00)
 0853c949 WaitEventSetWaitBlock (895cfd8, 2710, 8041cc0, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, 2710, 8041cc0, 1, 5000002, 0) + 12e
 0853bd8c WaitLatch (fe58377c, 29, 2710, 5000002) + a6
 084acf0d BackgroundWriterMain (3, 89595e8, 8043358, 870440c, 8960fa0, ffffffff) + 341
 0821c6cb AuxiliaryProcessMain (2, 804339c, 80433d8, 84bfe70, 4, 0) + 568
 084bfeb8 StartChildProcess (3, 80433f4, 40, 11) + cb
 084bd528 reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 28c
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
370:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (a, d001, 8043020)
 0853c949 WaitEventSetWaitBlock (895cfd8, 1388, 80430e0, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, 1388, 80430e0, 1, 500000c, 0) + 12e
 0853bd8c WaitLatch (fe583a34, 29, 1388, 500000c) + a6
 084c2cb4 WalWriterMain (3, 89595e8, 8043358, 870440c, 8960fa0, ffffffff) + 2a0
 0821c6da AuxiliaryProcessMain (2, 804339c, 80433d8, 84bfe70, 4, 0) + 577
 084bfeb8 StartChildProcess (6, 80433f4, 40, 11) + cb
 084bd543 reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 2a7
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
368:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (a, d001, 8043000)
 0853c949 WaitEventSetWaitBlock (895cfd8, 493e0, 80430c0, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, 493e0, 80430c0, 1, 5000004, 0) + 12e
 0853bd8c WaitLatch (fe5834c4, 29, 493e0, 5000004) + a6
 084ad5fb CheckpointerMain (3, 89595e8, 8043358, 870440c, 8960fa0, ffffffff) + 662
 0821c6d0 AuxiliaryProcessMain (2, 804339c, 80433d8, 84bfe70, 4, 0) + 56d
 084bfeb8 StartChildProcess (5, 80433f4, 40, 11) + cb
 084bd50d reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 271
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
371:    postgres -D /export/home/elmer/c12/buildroot/HEAD/pgsql.build/src/test
 fe91c345 ioctl    (a, d001, 8043090)
 0853c949 WaitEventSetWaitBlock (895cfd8, 749d, 8043150, 1) + 26
 0853c804 WaitEventSetWait (895cfd8, 749d, 8043150, 1, 5000001, 0) + 12e
 0853bd8c WaitLatch (fe58011c, 29, 749d, 5000001) + a6
 084a7985 AutoVacLauncherMain (0, 0, 80433e8, 84a753a, fe98b000, fe812a40) + 42d
 084a7553 StartAutoVacLauncher (3, 0, 16f, 0, fe8efa50, fe98b000) + 7e
 084bd570 reaper   (12, 0, 80434cc, fe98b000, fe812a40, fe98b000) + 2d4
 fe917d15 __sighndlr (12, 0, 80434cc, 84bd29c, fe8524f8, fe990f80) + 15
 fe90b832 call_user_handler (12, 0, 80434cc) + 1e9
 fe90ba40 sigacthandler (12, 0, 80434cc) + e4
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe91c5d5 __pollsys (8043720, 1, 80437e8, 0, 6, 40) + 15
 fe8ac772 pselect  (7, 804382c, fe995240, fe995240, 80437e8, 0) + 272
 fe8acb08 select   (7, 804382c, 0, 0, 804582c) + 89
 084bb7b3 ServerLoop (fef70530, fef70530, 8962330, 1, 8962330, 0) + db
 084bb259 PostmasterMain (8, 895cba0, 8047908, 83e08d2, 8047970, 89312e0) + 10c8
 083e0a99 startup_hacks (804790c, fe9985c8, 8047948, 811f57b)
 0811f57b _start_crt (8, 8047970, fefd0c6f, 0, 0, 0) + 9a
 0811f44a _start   (8, 8047a94, 8047f7e, 8047f7e, 8047f7e, 8047f7e) + 1a
elmer@ailbhein:~$                                                               

script done on March 18, 2021 at 03:31:31 PM CET
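
A side note on the stacks above: most of the postgres backends are parked in `ioctl(fd, 0xd001, ...)` under `WaitEventSetWaitBlock`. The sketch below decodes that command number, assuming the `/dev/poll` ioctl values are as I remember them from `sys/devpoll.h` (`DPIOC` as `0xD0 << 8`, `DP_POLL` as `DPIOC | 1`, and so on). The table is an assumption and should be checked against the header; `ioctl_name` is a hypothetical helper, not part of any tool.

```python
# Assumed values, recalled from illumos sys/devpoll.h; verify against the header.
DPIOC = 0xD0 << 8

DP_COMMANDS = {
    DPIOC | 1: "DP_POLL",
    DPIOC | 2: "DP_ISPOLLED",
    DPIOC | 3: "DP_PPOLL",
    DPIOC | 4: "DP_EPOLLCOMPAT",
}


def ioctl_name(cmd: int) -> str:
    """Map an ioctl command number to its /dev/poll name, or return it as hex."""
    return DP_COMMANDS.get(cmd, hex(cmd))


# Second argument of ioctl() in the pstack output (pstack prints hex).
print(ioctl_name(0xD001))
```

If that mapping holds, the backends were waiting in the `/dev/poll` driver, which would fit the `poll:dpclose` frame in the panic stack when pid 857 exited.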

I still have the crash dump and am willing to provide it.

How to reproduce:

Set up a PostgreSQL buildfarm member using this config:

https://www.eenfach.de/~olbohlen/build-farm.conf

Then run it from cron with this crontab entry:

35 0,3,6,9,12,18,22 * * * ( cd /export/home/elmer/c12 && PATH=/usr/gnu/bin:$PATH ./run_branches.pl --run-all --verbose >/export/home/elmer/c12/cron.out 2>&1 )

It will hang on the regression tests for postgresql-HEAD, 32-bit build.


Files

typescript.mdb-k (20.7 KB) typescript.mdb-k mdb -k session Olaf Bohlen, 2021-04-10 02:42 PM
typescript (41.5 KB) typescript shell session Olaf Bohlen, 2021-04-10 02:42 PM

Related issues

Related to illumos gate - Bug #14892: pollhead lifetime too short in signalfd (Closed, Patrick Mooney)

Actions #1

Updated by Olaf Bohlen over 2 years ago

I uploaded a vmdump here: https://www.eenfach.de/~olbohlen/share/vmdump.3
compressed 6.1G, uncompressed 18.1G

The download requires a user/password, drop me a /msg on freenode (Agnar) or drop me a mail to

Actions #2

Updated by Olaf Bohlen over 2 years ago

So, I hit this again. This time I have updated outputs (note: the c12 directory indicates a -m32 build; c12x is the 64-bit build).
This time I have also included kernel stack traces for all relevant processes.
(see attached files)

Actions #3

Updated by Olaf Bohlen over 2 years ago

and of course I have the matching crash dump, let me know if someone is interested in it.

Actions #4

Updated by Olaf Bohlen almost 2 years ago

any news here?

Actions #5

Updated by Dan McDonald almost 2 years ago

Sorry for not seeing this earlier.

In your initial posting, register `rdi` has some STRANGE contents. This suggests to me that you may be seeing a use-after-free bug.

Have you reproduced this using a DEBUG kernel by any chance? DEBUG kernels turn on the full kmem debugging AND all of the ASSERTs fire. A dump from a DEBUG kernel would be most useful, I think.
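
To make the point about `rdi` concrete, here is a minimal sketch that renders the register value as the bytes it would occupy in little-endian memory and checks whether it is a canonical x86-64 address. The helper names are hypothetical, and the check assumes 48-bit virtual addressing. Whether the faulting instruction at `pollhead_delete+0x55` actually dereferenced `rdi` is also an assumption, since no disassembly appears in this report.

```python
def decode_register(value: int, width: int = 8) -> str:
    """Show a register as little-endian bytes, with '.' for non-printable ones."""
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


rdi = 0x6F7078012F746F6F          # from the register dump in the description
pollhead = 0xFFFFFEC857CD9248     # first argument to pollhead_delete in ::stackregs

print(decode_register(rdi))       # mostly printable ASCII, e.g. "oot/"
print(is_canonical(rdi))          # False: not a usable kernel pointer
print(is_canonical(pollhead))     # True: looks like an ordinary kernel address
```

A value that reads as text fragments and is not canonical is what one might expect if freed memory had been reused for string data, and a non-canonical access would raise #gp rather than a page fault. That is consistent with a use-after-free, though it does not prove one.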

Actions #6

Updated by Olaf Bohlen almost 2 years ago

No, unfortunately it's not a debug kernel. If I remember correctly, Andy Fiddaman had more insight into this issue; we chatted about it.
Also I think Robert should know about this one.

Actions #7

Updated by Dan McDonald about 1 year ago

We have OS-5886 in SmartOS that might be this same problem. https://smartos.org/bugview/OS-5886

Actions #8

Updated by Patrick Mooney about 1 year ago

  • Subject changed from illumos-ffe7853a48 panics on genunix:pollhead_delete+55 for posgresql-HEAD buildfarm tests to pollhead_delete trips over bad pointer
Actions #9

Updated by Patrick Mooney about 1 year ago

I've confirmed that the program and D script listed in OS-5886 are still able to trigger the race on current bits (OmniOSCE r42) when kmem_flags is set to 0xf like on DEBUG.
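
For readers unfamiliar with `kmem_flags`, the sketch below breaks `0xf` into its individual bits. The flag names and values are assumptions recalled from `sys/kmem_impl.h` and should be verified against the source tree; `decode_kmem_flags` is a hypothetical helper. On a non-DEBUG kernel the value is typically set with `set kmem_flags=0xf` in `/etc/system`, followed by a reboot.

```python
# Assumed bit assignments, recalled from illumos sys/kmem_impl.h; verify before relying on them.
KMF_BITS = {
    0x1: "KMF_AUDIT",      # record allocation/free transactions
    0x2: "KMF_DEADBEEF",   # fill freed buffers with a known pattern
    0x4: "KMF_REDZONE",    # guard word after each buffer
    0x8: "KMF_CONTENTS",   # keep a copy of freed buffer contents
}


def decode_kmem_flags(flags: int) -> list[str]:
    """Return the names of the low four kmem debugging flags set in `flags`."""
    return [name for bit, name in sorted(KMF_BITS.items()) if flags & bit]


print(decode_kmem_flags(0xF))
```

With freed buffers overwritten by a known pattern, a stale pointer tends to fault immediately instead of silently reading reused memory, which is presumably why the race is easier to trigger with these flags set.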

Actions #10

Updated by Electric Monk about 1 year ago

  • Gerrit CR set to 2269
Actions #11

Updated by Patrick Mooney about 1 year ago

Since epoll is backed by /dev/poll, I ran the epoll test suite on bits before and after the fix was applied to check for a change in behavior. The results were the same. Similarly, the os-test suite results were the same before and after.

I used some simplified bug reproducer programs (including the one from OS-5886) to check that at least the known pathological cases were now safe.

For further smoke testing, I ran the tokio test suite, which heavily relies on our epoll emulation for its event handling. The results were the same before and after.

Actions #12

Updated by Patrick Mooney about 1 year ago

Per rm's suggestion, I circled back to test a few more things:

I booted up a system featuring an ipmi device to confirm that ipmitool, which appears to poll on the ipmi device, continues to work after the change. (This was checked via `ipmitool sensor` and `ipmitool sel`.)

Regarding event ports, I ran the libuv test suite (in v1.41, prior to its switch to using epoll on illumos), and confirmed that the results were the same before and after the change.

For ctfs, I restarted the fmd service while tracing pollhead_clean, to see that it had, in fact, cleaned up after itself and successfully restarted. I repeated the same tracing while running sleep under ctrun in order to confirm similar activity.

Actions #13

Updated by Electric Monk about 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 2c76d75129011c98e79463bb84917b828f922a11

commit  2c76d75129011c98e79463bb84917b828f922a11
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2022-08-04T15:59:49.000Z

    13700 pollhead_delete trips over bad pointer
    Reviewed by: Dan McDonald <danmcd@mnx.io>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Actions #14

Updated by Patrick Mooney about 1 year ago

  • Related to Bug #14892: pollhead lifetime too short in signalfd added