Bug #16087 (closed)

lseek(SEEK_DATA) on ZFS races with writes and can briefly miss data before it is flushed in a txg

Added by Gordon Ross 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

From https://github.com/openzfs/zfs/pull/15571

Description
Over its history, the dirty dnode test has alternated between checking whether a dnode is on os_dirty_dnodes (via dn_dirty_link) and checking dn_dirty_records.

de198f2 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
2531ce3 Revert "Report holes when there are only metadata changes"
ec4f9b8 Report holes when there are only metadata changes
454365b Fix dirty check in dmu_offset_next()
66aca24 SEEK_HOLE should not block on txg_wait_synced()

Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9

It turns out both are actually required.

In the case of appending data to a newly created file, the dnode proper is dirtied (at least to change the blocksize) and dirty records are added. Thus, a single logical operation is represented by separate dirty indicators, and must not be separated.

The incorrect dirty check becomes a problem when the first block of a file is being appended to while another process is calling lseek to skip holes. It can happen that the dnode part is undirtied, while dirty records are still on the dnode for the next txg. In this case, lseek(fd, 0, SEEK_DATA) would not know that the file is dirty, and would go to dnode_next_offset(). Since the object has no data blocks yet, it returns ESRCH, indicating no data found, which results in ENXIO being returned to lseek()'s caller.

Since coreutils 9.2, cp performs sparse copies by default, that is, it uses SEEK_DATA and SEEK_HOLE against the source file and attempts to replicate the holes in the target. When it hits the bug, its initial search for data fails, and it goes on to call fallocate() to create a hole over the entire destination file.
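
(For illustration, here is a minimal sketch of the userland pattern involved. This is not coreutils' actual code; the helper name is made up. A SEEK_DATA-based copier treats lseek() failing with ENXIO at offset 0 as "the file contains no data", which is exactly the wrong answer the race above can produce.)

#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

/*
 * Hypothetical probe used by a sparse-copy tool: returns 1 if the source
 * appears to contain data, 0 if it looks like one big hole, -1 on error.
 * Under the bug described above, a file that is actively being appended
 * to can incorrectly land in the "one big hole" case.
 */
static int
source_has_data(int fd)
{
	off_t off = lseek(fd, 0, SEEK_DATA);

	if (off >= 0)
		return (1);	/* data found at or after offset 0 */
	if (errno == ENXIO)
		return (0);	/* no data before EOF: treated as all-hole */
	return (-1);		/* e.g. SEEK_DATA unsupported */
}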

This has come up more recently as users upgrade their systems, getting OpenZFS 2.2 as well as a newer coreutils. However, this problem has been reproduced against 2.1, as well as on FreeBSD 13 and 14.

This change simply updates the dirty check to check both types of dirty. If there's anything dirty at all, we immediately go to the "wait for sync" stage. It doesn't really matter after that; both changes are on disk, so the dirty fields should be correct.
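
(In outline, the fix makes the check consult both indicators before deciding whether to wait for the txg to sync. A minimal sketch of that shape, based on the description above; the helper name is made up and the field spellings should be verified against the actual source.)

/*
 * Sketch only: consider the dnode dirty if EITHER the dnode itself is
 * linked onto a dirty list (dn_dirty_link active) OR it still has dirty
 * records queued for any txg.  Either one means "wait for sync".
 */
static boolean_t
dnode_appears_dirty(dnode_t *dn)
{
	for (int i = 0; i < TXG_SIZE; i++) {
		if (list_link_active(&dn->dn_dirty_link[i]) ||
		    !list_is_empty(&dn->dn_dirty_records[i]))
			return (B_TRUE);
	}
	return (B_FALSE);
}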

[There are notes about "How has this been tested" after that, but it's not clear those apply to us].
In fact, it's not yet clear what the exposure of illumos systems to this problem is.
Analysis and testing are underway to determine that.

Here are some of the more helpful links with more info about this issue:
https://github.com/openzfs/zfs/pull/15571
https://github.com/openzfs/zfs/issues/15526
https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574d30dc73


Files

reproducer.sh (843 Bytes) - Rich Ercolani, 2023-11-30 01:23 AM

Related issues

Related to illumos gate - Bug #16092: ZFS implementation of _FIO_SEEK_HOLE could use more work (New)

Actions #1

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 3165
Actions #2

Updated by Gordon Ross 3 months ago

It's worth noting that SEEK_HOLE/SEEK_DATA have been exposed in SMB since:
#11020 want SMB2 ioctl FSCTL_SRV_COPYCHUNK
commit 55f0a249fd3511728b02627190771a4ce4ddf20e
Author: Gordon Ross <>
Date: Mon May 2 13:07:54 2016 -0400

Actions #3

Updated by Rich Ercolani 3 months ago

Nothing up my sleeve...

$ for i in `seq 1 100`;do (bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh & bash ~/reproducer.sh) 2>&1 | grep differ;done;
Binary files reproducer_2249_0 and reproducer_2249_314 differ
Binary files reproducer_2249_0 and reproducer_2249_629 differ
Binary files reproducer_2249_0 and reproducer_2249_630 differ
$ uname -a
SunOS oivm 5.11 illumos-676abcb77c i86pc i386 i86pc

When I screwed up and didn't remember that illumos dd won't take =1M, and it generated a bunch of slightly over 1k files, I instead convinced it my pool was deadlocked and it kernel panicked, so that was also exciting.

(Both of the above were with GNU coreutils 9.2's cp, the former with GNU dd, and the listed kernel version on an OI VM from April 2023.)

Subjectively, I'd put the time it took to break at about where early ZoL was - it's very uncommon, in my limited testing, but it can happen.

I didn't try more with illumos cp to test how that would play, yet, I wanted to first see how it worked with something I knew would tickle the fun.

Actions #4

Updated by Dan McDonald 3 months ago

Rich Ercolani wrote in #note-3:

Nothing up my sleeve...

[...]

When I screwed up and didn't remember that illumos dd won't take =1M, and it generated a bunch of slightly over 1k files, I instead convinced it my pool was deadlocked and it kernel panicked, so that was also exciting.

(Both of the above was with GNU coreutils 9.2's cp, the former with GNU dd, and the listed kernel version on an OI VM from April 2023.)

Subjectively, I'd put the time it took to break at about where early ZoL was - it's very uncommon, in my limited testing, but it can happen.

I didn't try more with illumos cp to test how that would play, yet, I wanted to first see how it worked with something I knew would tickle the fun.

Can you please attach your reproducer.sh to this ticket? I tried a version but had little luck. I see you used it on OI, as well.

EDIT/UPDATE: I believe using GNU cp (gcp in pkgsrc) will do the trick. Will update script, verify, and try here.

Actions #5

Updated by Rich Ercolani 3 months ago

Dan McDonald wrote in #note-4:

Rich Ercolani wrote in #note-3:

Nothing up my sleeve...

[...]

When I screwed up and didn't remember that illumos dd won't take =1M, and it generated a bunch of slightly over 1k files, I instead convinced it my pool was deadlocked and it kernel panicked, so that was also exciting.

(Both of the above was with GNU coreutils 9.2's cp, the former with GNU dd, and the listed kernel version on an OI VM from April 2023.)

Subjectively, I'd put the time it took to break at about where early ZoL was - it's very uncommon, in my limited testing, but it can happen.

I didn't try more with illumos cp to test how that would play, yet, I wanted to first see how it worked with something I knew would tickle the fun.

Can you please attach your reproducer.sh to this ticket? I tried a version but had little luck. I see you used it on OI, as well.

It took a while to reproduce, but sure, it's just Tony's reproducer.sh from the OpenZFS bug with the sysctl check removed.

There are almost certainly better ways to do it, but it's worked sufficiently.

Actions #6

Updated by Dan McDonald 3 months ago

I've reproduced an all-zero output ONCE out of a dozen trials. illumos dd does take 1M. I use gnu cp on coreutils-9.4 via pkgsrc trunk.

#!/bin/bash
#
# Run this script multiple times in parallel inside your pool's mount
# to reproduce https://github.com/openzfs/zfs/issues/15526.  Like:
#
# ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & wait
#

# We aren't linux...
#if [ $(cat /sys/module/zfs/parameters/zfs_bclone_enabled) != "1" ] ; then
#    echo "please set /sys/module/zfs/parameters/zfs_bclone_enabled = 1" 
#    exit
#fi

prefix="reproducer_${BASHPID}_" 
dd if=/dev/urandom of=${prefix}0 bs=1M count=1 status=none

echo "writing files" 
end=1000
h=0
for i in `seq 1 2 $end` ; do
    let "j=$i+1" 
    # Gnu DD might use hole-seeking
    # /opt/local/bin/dd if=${prefix}$h of=${prefix}$i bs=1M status=none
    # Our cp doesn't IIRC.
    gcp  ${prefix}$h ${prefix}$i
    gcp --reflink=never ${prefix}$i ${prefix}$j
    let "h++" 
done

echo "checking files" 
for i in `seq 1 $end` ; do
    diff ${prefix}0 ${prefix}$i
done

Actions #7

Updated by Rich Ercolani 3 months ago

It produced slightly over 1k-sized files for me, which was not my expected behavior for bs=1M.

Actions #8

Updated by Dan McDonald 3 months ago

Using my own version of the Rich E. invocation:

for i in `seq 1 100`; do ( ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ) 2>&1 | grep differ ; done 

I've encountered a few files that get nothing-but-zeroes in them.

Binary files reproducer_45363_0 and reproducer_45363_68 differ
Binary files reproducer_45363_0 and reproducer_45363_137 differ
Binary files reproducer_45363_0 and reproducer_45363_138 differ
Binary files reproducer_45363_0 and reproducer_45363_275 differ
Binary files reproducer_45363_0 and reproducer_45363_276 differ
Binary files reproducer_45363_0 and reproducer_45363_277 differ
Binary files reproducer_45363_0 and reproducer_45363_278 differ
Binary files reproducer_45363_0 and reproducer_45363_551 differ
Binary files reproducer_45363_0 and reproducer_45363_552 differ
Binary files reproducer_45363_0 and reproducer_45363_553 differ
Binary files reproducer_45363_0 and reproducer_45363_554 differ
Binary files reproducer_45363_0 and reproducer_45363_555 differ
Binary files reproducer_45363_0 and reproducer_45363_556 differ
Binary files reproducer_45363_0 and reproducer_45363_557 differ
Binary files reproducer_45363_0 and reproducer_45363_558 differ
^C
[root@coal /zones/global/zfs-hole]# for a in 0 68 137 ; do ls -lt reproducer_45363_$a ; od -tx1 reproducer_45363_$a | wc -l ; done
-rw-r--r--   1 root     root     1048576 Nov 30 01:54 reproducer_45363_0
   65537
-rw-r--r--   1 root     root     1048576 Nov 30 01:54 reproducer_45363_68
       3
-rw-r--r--   1 root     root     1048576 Nov 30 01:54 reproducer_45363_137
       3
[root@coal /zones/global/zfs-hole]# od -tx1 reproducer_45363_68
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
4000000
[root@coal /zones/global/zfs-hole]# 
Actions #9

Updated by Dan McDonald 3 months ago

Due to my limited VM disk space, I invoke this now:

for i in `seq 1 17`; do echo "Try $i"; ( ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ) 2>&1 | grep differ ; done ;  rm reproducer_1* ; rm reproducer_2* ; rm reproducer_3*; rm reproducer_4* ; rm reproducer_5* ; rm reproducer_6*; rm reproducer_7* ; rm reproducer_8* ; rm reproducer_9*

Without this fix, I get all-zero files as shown earlier in this bug. So far, one invocation of 17 x 5-concurrent didn't find any diffs. I'm running 34 more (17 + 17) and do not expect corruption.

Also, so we can anticipate an obvious performance hit if there is one, I'm building smartos-live on a platform where prior to this it'd take 65 minutes. Note that the illumos build process does NOT use utilities that tickle dmu_object_wait_synced() (_FIO_SEEK_DATA, _FIO_SEEK_HOLE, and _FIO_COUNT_FILLED handling are the only obvious paths to it).

Actions #10

Updated by Rob Norris 3 months ago

I've just been doing the patch for FreeBSD 12.4 (the last before they rebased on ZoL). It's basically the same code, so I did one for illumos as well:

https://github.com/robn/illumos-gate/commit/45f4321f9cba5c98457fc0bac46c96b295ddb477

It's untested on illumos, but I have written this patch for different systems and ZFS versions six times this week, so I'd be surprised if it didn't work. Happy for you to take it if you want it!

Actions #11

Updated by Joshua M. Clulow 3 months ago

First, dd should definitely be producing a 1MiB file if you invoke it like this:

$ dd if=/dev/urandom of=anotherrandfile bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1 MiB) transferred in 0.031013 secs (32 MiB/sec)

$ ls -lh anotherrandfile
-rw-r--r--   1 username groupname     1M Nov 30 07:38 anotherrandfile

$ xxd anotherrandfile | head -3
00000000: d252 7753 2cf0 b879 b228 2c61 f3f9 79d3  .RwS,..y.(,a..y.
00000010: ce2f 2a90 828b feb7 314a 3326 6019 6566  ./*.....1J3&`.ef
00000020: 5054 4fe4 f796 9998 d4c1 7d1e a8fa 8542  PTO.......}....B

$ xxd anotherrandfile | tail -3
000fffd0: b978 a9c4 9f6c 9f62 7326 98ec 1045 54ee  .x...l.bs&...ET.
000fffe0: 01b8 91ea 7a52 7da3 519a dff8 91be dcce  ....zR}.Q.......
000ffff0: 6e9e b3bc 0005 d917 8f52 7186 2762 d739  n........Rq.'b.9

I have been digging a little into the reproduction script in the comment above. I'm trying to understand the mechanism of action here, and I am extremely unclear on how seeking and holes are involved at all. Using gcp from coreutils-9.4 lifted from pkgsrc:

$ ./gcp --version
cp (GNU coreutils) 9.4
Copyright (C) 2023 Free Software Foundation, Inc.
...

First, looking at gcp --help, which covers the --sparse option a little:

...
      --sparse=WHEN            control creation of sparse files. See below
...
By default, sparse SOURCE files are detected by a crude heuristic and the
corresponding DEST file is made sparse as well.  That is the behavior
selected by --sparse=auto.  Specify --sparse=always to create a sparse DEST
file whenever the SOURCE file contains a long enough sequence of zero bytes.
Use --sparse=never to inhibit creation of sparse files.

I suspect you're never going to see a long enough run of zero bytes in a correctly generated random file to trip that heuristic? And, indeed, looking at the truss output for the portion of the gcp run that copies the actual file:

$ truss -o out.gcp.txt ./gcp randfile randfile6
$ cat out.gcp.txt
...
open("randfile6", O_SEARCH|O_DIRECTORY)         Err#2 ENOENT
stat("randfile", 0xFFFFFC7FFFDFF3D0)            = 0
open("randfile", O_RDONLY)                      = 3
fstat(3, 0xFFFFFC7FFFDFF550)                    = 0
open("randfile6", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
fstat(4, 0xFFFFFC7FFFDFF4D0)                    = 0
fstat(3, 0xFFFFFC7FFFDFF1E0)                    = 0
brk(0x004B4000)                                 = 0
read(3, " ?019E j J eF1D1CE11 uAA".., 131072)   = 131072
write(4, " ?019E j J eF1D1CE11 uAA".., 131072)  = 131072
read(3, "\v1C96CDEEAA +ACE3C8AE U".., 131072)   = 131072
write(4, "\v1C96CDEEAA +ACE3C8AE U".., 131072)  = 131072
read(3, "16CFAA s i OB5179FEC m 9".., 131072)   = 131072
write(4, "16CFAA s i OB5179FEC m 9".., 131072)  = 131072
read(3, "DDF9139517FD nADB9\r A8E".., 131072)   = 131072
write(4, "DDF9139517FD nADB9\r A8E".., 131072)  = 131072
read(3, " NE7DC91F3B2C8C2F1BF C j".., 131072)   = 131072
write(4, " NE7DC91F3B2C8C2F1BF C j".., 131072)  = 131072
read(3, "B1E814EA 6 "1197AA8D yD7".., 131072)   = 131072
write(4, "B1E814EA 6 "1197AA8D yD7".., 131072)  = 131072
read(3, "AD z gB7 Y f 9FD M q NB2".., 131072)   = 131072
write(4, "AD z gB7 Y f 9FD M q NB2".., 131072)  = 131072
read(3, "CDE3 &8DF6C4AC / z )E1\n".., 131072)   = 131072
write(4, "CDE3 &8DF6C4AC / z )E1\n".., 131072)  = 131072
read(3, 0x00486000, 131072)                     = 0
close(4)                                        = 0
close(3)                                        = 0
...

I don't see any seeking behaviour at all, here. It's just opening the file, and in a loop it is doing regular 128K reads and then 128K writes (this happens to match the record size of the dataset, but I don't know if that is a coincidence or not) until it's done and then it closes the file. I assume it is looking at the contents to do the sparse heuristic, and finding no sufficient runs of zeroes at the end of a record to bother.

Note that this is pretty different to what our native cp does. To avoid copying the data through userland, we actually just mmap() the whole file and direct the kernel to do one large write from the mapping into the target file:

$ truss -o out.cp.txt cp randfile randfile5
$ cat out.cp.txt
...
pathconf("randfile", _PC_ACL_ENABLED)           = 2
acl("randfile", ACE_GETACLCNT, 0, 0x00000000)   = 3
stat64("randfile", 0x0804708C)                  = 0
acl("randfile", ACE_GETACL, 3, 0x0806AA30)      = 3
stat64("randfile5", 0x080685E0)                 Err#2 ENOENT
open64("randfile", O_RDONLY)                    = 3
open64("randfile5", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 4
stat64("randfile5", 0x080685E0)                 = 0
fstat64(3, 0x08068680)                          = 0
stat64("/usr/gcc/10/lib/libcmdutils.so.1", 0x08046788) Err#2 ENOENT
stat64("/lib/libcmdutils.so.1", 0x08046788)     = 0
resolvepath("/lib/libcmdutils.so.1", "/lib/libcmdutils.so.1", 1023) = 21
open("/lib/libcmdutils.so.1", O_RDONLY)         = 5
mmapobj(5, MMOBJ_INTERPRET, 0xFED806B0, 0x08046804, 0x00000000) = 0
close(5)                                        = 0
memcntl(0xFED60000, 7784, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
lwp_cond_broadcast(0xFED80894)                  = 0
mmap64(0x00000000, 1048576, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC5F000
write(4, " ?019E j J eF1D1CE11 uAA".., 1048576) = 1048576
munmap(0xFEC5F000, 1048576)                     = 0
close(3)                                        = 0
close(4)                                        = 0
...

So a few things:

  • cp is doing an mmap() thing that is quite different
  • gcp in this case is really doing the most pedestrian I/O system calls you could imagine; why are we not seeing this all the time?

Dan, can you confirm what your gcp processes are doing to interact with the files?

Actions #12

Updated by Rich Ercolani 3 months ago

Joshua M. Clulow wrote in #note-11:

First, dd should definitely be producing a 1MiB file if you invoke it like this:

[...]

I believe you, but I definitely have a dataset full of 1040b files from this.

It's not about the --sparse flag. That determines its behavior on the destination.

coreutils 9+ tries various methods to notice sparse files; among them is SEEK_HOLE/SEEK_DATA when it doesn't have Linux's FIEMAP. There's a race in the dnode dirtying where ZFS doesn't always notice the dnode is dirty and needs to sync out, and if that sync doesn't happen, you can get back ENXIO as though the file were empty, and you end up with an empty result. Surprise.

On illumos, I wouldn't be astonished if the result is something like stat saying the file is however long and not the same size, and that triggers it sometimes and not others.

But if you're syncing out for some other reason, you won't hit this; you only hit it if you sync out some things and not others, and you hit the flaw in the dirty check before it flushes again.

I'd recommend you talk to robn or read his writeup in the linked issues/PRs.

Actions #13

Updated by Joshua M. Clulow 3 months ago

Yeah my issue is I haven't yet seen it actually doing any SEEK_DATA calls. I have written some simpler software to do what's suggested as the problem (append to a file in one thread, concurrently with frenetic lseek from another thread) and so far I have only seen lseek() starting to block around the time the write is being issued, and then return the correct result some time later.

I think tomorrow we'll try and dig in with the proposed addition to the dirty check condition (3165 on Gerrit) using DTrace and try and catch any cases where the behaviour of dmu_object_wait_synced() will materially change as a result. That is: any cases where the existing code would not go through txg_wait_synced() even though at least one of the four dn_dirty_records lists is not empty and suggests that we should. I think it should be relatively coherent to do this: in DTrace probe context, we hold up the current thread when the probe fires, and that thread will be between dnode_hold() and dnode_rele() calls, which presumably protect the lists we'll be reading at the time.

I think that if we can catch that, then we can go and investigate how dn_dirty_records is being populated on our system without also influencing dn_dirty_link, and we can see the problem at the source.

Actions #14

Updated by Gordon Ross 3 months ago

  • Description updated (diff)
Actions #15

Updated by Dan McDonald 3 months ago

Rob Norris wrote in #note-10:

I've just been doing the patch for FreeBSD 12.4 (the last before they rebased on ZoL). Its basically the same code, so I did one for illumos as well:

https://github.com/robn/illumos-gate/commit/45f4321f9cba5c98457fc0bac46c96b295ddb477

Its untested on illumos, but I have written this patch for different systems and ZFS versions six times this week, so I'd be surprised if it didn't work. Happy for you to take it if you want it!

You're the third person to independently arrive at this conclusion (I'd proposed this Monday amongst the illumos-core team, and the Gerrit link above was created 6pm US/Eastern yesterday by the team at Racktop). This makes me feel good that this is the consensus fix.

Actions #16

Updated by Joshua M. Clulow 3 months ago

  • Subject changed from import OpenZFS 15571 check dnode and its data for dirtiness to lseek(SEEK_DATA) on ZFS races with writes and can briefly miss data before it is flushed in a txg
Actions #17

Updated by Gordon Ross 3 months ago

Re #note-16
It's not only lseek(SEEK_DATA) that has problems, as the new synopsis might suggest.
The observed problem (in our tests) is that copying sparse files can copy holes where there should be data, when the file is being modified.
As for "why" (the analysis), yes, there are races in the check for dnode and its data for dirtiness.

Actions #18

Updated by Rob Norris 3 months ago

I finished my own analysis; see https://github.com/openzfs/zfs/pull/15615. I'll try to translate it to illumos ZFS, but it may have diverged in other ways that make this wrong.

The root cause is in dmu_objset_sync_dnodes(). The dnode is removed from os_dirty_dnodes (which has dn_dirty_link as its linkage) before dnode_sync() is called. The check in dmu_object_wait_synced() (or dnode_is_dirty() in OpenZFS) uses "is the dnode on a list" as a proxy for "dnode is dirty", but that doesn't hold here, and so the wrong answer is served up.

In OpenZFS, the dnode is quickly placed onto os_synced_dnodes, so the gap is very small. illumos does this too, but appears to have an option to disable useraccounting, in which case it doesn't put it on another list before syncing, so it will appear "clean" right up until it's actually clean (all dbufs flushed).
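
(A simplified sketch of the ordering being described; this is not the literal dmu_objset_sync_dnodes() code, just the shape of the window it opens.)

/*
 * Assumed shape of the dnode sync path, per the analysis above (names
 * and details are illustrative, not the actual illumos code).
 */
while ((dn = list_head(dirty_list)) != NULL) {
	list_remove(dirty_list, dn);	/* dn_dirty_link now looks "clean" */

	if (synced_list != NULL)	/* user accounting enabled */
		list_insert_tail(synced_list, dn);
	/* else: the dnode is on no list at all until it is fully synced */

	dnode_sync(dn, tx);		/* dirty records actually written here */
}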

The best fix seems to be to query the dirty state directly from the dnode, rather than infer it from an implementation detail. OpenZFS#15615 does this, checking dn_dirty_txg. That may not be possible on illumos: the "no useraccounting" branch zeroes dn_dirty_txg before calling dnode_sync(), so it will run into the same problem. There might be a different dirty state or flag you can use instead; anything under dn_mtx is ideal.
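
(What "query the dirty state directly" might look like; the field and locking here are taken from the description above and are assumptions as far as illumos is concerned.)

/*
 * Sketch only: assumes dn_dirty_txg is nonzero while the dnode has
 * unsynced changes and is protected by dn_mtx, per the note above.
 * As noted, illumos' "no useraccounting" path may clear it too early
 * for this to be usable as-is.
 */
static boolean_t
dnode_dirty_direct(dnode_t *dn)
{
	boolean_t dirty;

	mutex_enter(&dn->dn_mtx);
	dirty = (dn->dn_dirty_txg != 0);
	mutex_exit(&dn->dn_mtx);

	return (dirty);
}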

An alternate fix is to take the os_dirty_dnodes sublist lock in dmu_object_wait_synced() before checking the linkage, as dmu_objset_sync_dnodes() is called with that lock held. However, that sublist may have multiple dnodes on it waiting to be synced, so the wait could be long.

Regardless, I think the current fix of checking dn_dirty_records is solid; it just isn't the best fix, because it's still relying on another implementation detail.

Actions #19

Updated by Andy Fiddaman 3 months ago

I was interested in confirming that this is definitely what Dan's reproducer is hitting, and why it is using lseek() in the first place given the truss output that Josh gathered.

On an unpatched system, I ran the following dtrace script (it could be simpler - in its final form it definitely doesn't need to be using speculative tracing) in conjunction with a variant of the reproduction script from earlier in this ticket:

#!/usr/sbin/dtrace -Cs

#pragma D option nspec=500

#define list_empty(a) ((a)->list_head.list_next == &(a)->list_head)

dmu_object_wait_synced:entry
{
        self->spec = speculation();
        self->dirty = 0;
        speculate(self->spec);
        printf("PPID %d (%s)\n", ppid, execname);
        stack();
        ustack();
}

dnode_rele:entry
/self->spec/
{
        self->dirty =
            !list_empty(&args[0]->dn_dirty_records[0]) ||
            !list_empty(&args[0]->dn_dirty_records[1]) ||
            !list_empty(&args[0]->dn_dirty_records[2]) ||
            !list_empty(&args[0]->dn_dirty_records[3]);
}

txg_wait_synced:entry
/self->spec/
{
        self->dirty = 0;
}

dmu_object_wait_synced:return
/self->spec && self->dirty/
{
        commit(self->spec);
}

dmu_object_wait_synced:return
/self->spec && !self->dirty/
{
        discard(self->spec);
}

dmu_object_wait_synced:return
{
        self->spec = 0;
}

When I ran this, I saw hits that coincided with the reproducer reporting an issue.

# JOBMAX=4; while :; do ./rep &; done
...
Binary files reproducer_292123_0 and reproducer_292123_726 differ

which coincided with the only output from the d script:
CPU     ID                    FUNCTION:NAME
 87  44581     dmu_object_wait_synced:entry PPID 292123 (gcp)

              zfs`dmu_offset_next+0x45
              zfs`zfs_holey+0x6b
              zfs`zfs_ioctl+0x18d
              genunix`fop_ioctl+0x40
              genunix`lseek64+0x203
              unix`sys_syscall+0x283

              libc.so.1`lseek+0xa
              gcp`copy+0x84
              gcp`do_copy+0x33c
              gcp`main+0x592
              gcp`_start_crt+0x87
              gcp`_start+0x18

I also repeated this with a recompiled gcp binary that was built without optimisation and some other compiler options that help debugging (-msave-args) and could then see that that lseek() call is coming from gcp's infer_scantype(), which is returning LSEEK_SCANTYPE in this case.

Effectively, gcp determines that the source file is one big hole because lseek(SEEK_DATA) -> zfs_holey() says so, having gotten in before the transaction is completely synced. gcp then proceeds to create an empty destination file by calling fallocate() to create a hole over the entire thing.

Actions #20

Updated by Gordon Ross 3 months ago

We've reproduced this with internal builds, and verified that the "reproducer" script no longer reports differences after the fix.

Actions #21

Updated by Dan McDonald 3 months ago

An issue came up a little on Gerrit, and a little more amongst the illumos core team.

The newly inspected field dn_dirty_records is protected by dn_mtx. In OpenZFS, the inspection is refactored into a function whose caller ACTUALLY HOLDS dn_mtx. Before our inspection of the dirty-dnode indicators, we should do the same.

To make sure it wasn't a problem, I retested with mutex-holds for dn_mtx.
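
(Roughly, the retested shape looks like this: the same combined check as before, but performed with dn_mtx held. This is a sketch, not the exact change that went in.)

/* Sketch: inspect both dirty indicators while holding dn_mtx. */
boolean_t dirty = B_FALSE;

mutex_enter(&dn->dn_mtx);
for (int i = 0; i < TXG_SIZE; i++) {
	if (list_link_active(&dn->dn_dirty_link[i]) ||
	    !list_is_empty(&dn->dn_dirty_records[i])) {
		dirty = B_TRUE;
		break;
	}
}
mutex_exit(&dn->dn_mtx);

if (dirty)
	txg_wait_synced(dmu_objset_pool(os), 0);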

1.) Reproducer (40 runs) showed:
- NO errors
- NO triggering of Andy's DTrace script
Reproduction:

for i in `seq 1 100`; do echo "Try $i"; ( ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ./zfs-hole-problem-reproducer.sh & ) 2>&1 | grep differ ; done

2.) Concurrent build races.
To see if performance hits were noticeable, I ran two concurrent smartos builds on two machines. With the same OS, one machine performs slightly slower than the other (64-65 mins vs 63 mins). With the slower machine running with the mutex-hold, and the faster one without, the relative drop was similar:

No-mutex:

real  1:16:43.095818495
user  4:20:58.448337595
sys   1:27:26.058707065

and
real  1:17:13.817462447
user  4:30:12.299283531
sys   1:29:39.490629186

Yes-mutex:

real  1:20:38.183843879
user  4:22:07.471662732
sys   1:30:45.963439684

and
real  1:21:23.678631331
user  4:32:31.666361782
sys   1:33:01.983779257
Actions #22

Updated by Dan McDonald 3 months ago

For the record, this originated from openzfs/zfs commit 30d581121bb122c90959658e7b28b1672d342897.

Actions #23

Updated by Dan McDonald 3 months ago

Some lockstat output as well, possibly showing some noticeable upticks in locks on the with-mutex machine. (It may also be about timing relative to build activity, as we were sampling while two smartos-live builds were concurrently running.)

https://gist.github.com/danmcd/f13da2cdefc06ede833a559efb31e37c

Actions #24

Updated by Dan McDonald 3 months ago

  • Related to Bug #16092: ZFS implementation of _FIO_SEEK_HOLE could use more work added
Actions #25

Updated by Dan McDonald 3 months ago

  • Subject changed from lseek(SEEK_DATA) on ZFS races with writes and can briefly miss data before it is flushed in a txg to import OpenZFS 15571 check dnode and its data for dirtiness

(Keeping the newer summary, as it's descriptive of the problem. The solution is to import openzfs/zfs#15571.)

Actions #26

Updated by Dan McDonald 3 months ago

  • Subject changed from import OpenZFS 15571 check dnode and its data for dirtiness to lseek(SEEK_DATA) on ZFS races with writes and can briefly miss data before it is flushed in a txg
Actions #27

Updated by Dan McDonald 3 months ago

  • % Done changed from 0 to 90
Actions #28

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

git commit 2a6a1d830ddfebbe8d533e8495df0a3c4bca636f

commit  2a6a1d830ddfebbe8d533e8495df0a3c4bca636f
Author: Rob Norris <rob.norris@klarasystems.com>
Date:   2023-12-03T06:20:22.000Z

    16087 lseek(SEEK_DATA) on ZFS races with writes and can briefly miss data before it is flushed in a txg
    Portions contributed by: Dan McDonald <danmcd@mnx.io>
    Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Jerry Jelinek <gjelinek@racktopsystems.com>
    Reviewed by: Paul Zuchowski <p.zuchowski98@gmail.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>
