oi_151a7 ZFS deduplication problem

Added by Stephen Rondeau almost 7 years ago. Updated about 5 years ago.

My storage server running oi_151a7 has 16GB of RAM.

zpool status data1

  pool: data1
 state: ONLINE
  scan: scrub repaired 0 in 1h37m with 0 errors on Tue Aug 20 07:09:55 2013

        NAME        STATE     READ WRITE CKSUM
        data1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0

errors: No known data errors

I enabled deduplication on my zpool for a volume called data1/itfiles/home, which is defined to be 900GB. Via iSCSI, this volume is attached to a Windows Server 2008 file server.

From the file server I attempted to copy about 100GB of data onto that remote volume. Most of it deduped fine, with a dedupratio of 2.30x. However, copying three fairly deep directories caused the volume to detach from the file server. I was able to reproduce the detach by copying those directories again. With dedup turned off for the volume, the same directories copy without the volume detaching. I tried copying the directories to a different ZFS volume with dedup on and off, with the same effect. Hence, I think there is a problem with deduping.

I tried to trim the test case down, but it seems to take several hundred MB before the problem occurs, and it doesn't consistently fail with smaller test cases.

I don't appear to have run out of RAM for the DDTs. I have no way of clearing the DDTs, so I can't start over with a clean system without destroying and re-creating my pool. I have turned off dedup on all volumes for now.
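For anyone wanting to sanity-check the RAM claim, here is a rough back-of-the-envelope sketch. It assumes one DDT entry per unique block, the commonly quoted rule of thumb of about 320 bytes of RAM per entry, and an 8K volume block size. Both of those figures are assumptions, not values measured on this pool. The 100GB and 2.30x figures come from the report above.

```python
def ddt_ram_estimate(logical_bytes, block_size, dedup_ratio, bytes_per_entry=320):
    """Rough in-core DDT size in bytes, assuming one entry per unique block.

    bytes_per_entry=320 is a rule-of-thumb assumption, not a measured value.
    """
    total_blocks = logical_bytes / block_size
    unique_blocks = total_blocks / dedup_ratio
    return unique_blocks * bytes_per_entry


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 100GB copied, dedupratio 2.30x (from the report); 8K blocks (assumed).
    estimate = ddt_ram_estimate(100 * GIB, 8 * 1024, 2.30)
    print(f"approx. DDT size: {estimate / GIB:.2f} GiB")
```

Under these assumptions the table comes to roughly 1.7 GiB, which fits the observation that 16GB of RAM was not exhausted. A smaller block size or a different per-entry cost would change the result proportionally.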

The only unusual thing I could find with that RAIDZ1 pool was that the I/O counts seemed uneven -- I thought they would be about the same across all RAID disks:

zpool iostat -v data1

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data1        321G  3.31T     35    298   188K  1.45M
  raidz1     321G  3.31T     35    298   188K  1.45M
    c4t2d0      -      -      9      7  77.6K   738K
    c4t3d0      -      -      6      6  46.0K   610K
    c4t4d0      -      -      9      7  76.8K   738K
    c4t5d0      -      -      6      5  46.8K   601K
----------  -----  -----  -----  -----  -----  -----

So, I am wondering if I have some kind of disk error on c4t3d0 and c4t5d0. I don't know how to diagnose that. I tried smartmontools, but it did not yield any useful info.
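To put a number on the imbalance, the per-disk write bandwidth can be compared with the mean across the four disks. This is only a sketch: the helper names are made up for illustration, and the sample text is the zpool iostat output pasted above. Note that zpool iostat run without an interval reports averages since boot, so these figures do not describe the copy operation alone.

```python
SAMPLE = """\
data1        321G  3.31T     35    298   188K  1.45M
  raidz1     321G  3.31T     35    298   188K  1.45M
    c4t2d0      -      -      9      7  77.6K   738K
    c4t3d0      -      -      6      6  46.0K   610K
    c4t4d0      -      -      9      7  76.8K   738K
    c4t5d0      -      -      6      5  46.8K   601K
"""


def parse_size(text):
    """Convert a value such as '738K' or '1.45M' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    if text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)


def write_bandwidth_by_disk(iostat_text):
    """Return {disk: write bandwidth in bytes/s} for leaf devices only."""
    result = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        # Leaf disks show "-" in the alloc and free columns.
        if len(fields) == 7 and fields[1] == "-" and fields[2] == "-":
            result[fields[0]] = parse_size(fields[6])
    return result


def relative_to_mean(values):
    """Express each value as a fraction of the mean of all values."""
    mean = sum(values.values()) / len(values)
    return {name: value / mean for name, value in values.items()}


if __name__ == "__main__":
    ratios = relative_to_mean(write_bandwidth_by_disk(SAMPLE))
    for disk, ratio in sorted(ratios.items()):
        print(f"{disk}: {ratio:.2f}x mean write bandwidth")
```

With the numbers above, c4t3d0 and c4t5d0 come out roughly 10% below the mean and the other two disks roughly 10% above it. A spread of that size may simply reflect how small blocks are laid out across a RAIDZ1 vdev and does not by itself indicate a disk fault.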



Updated by Stephen Rondeau almost 7 years ago

Additional information:

I restarted the storage server to clear any in-memory data, then re-enabled dedup and tried copying the problem directories to two different ZFS volumes via iSCSI. I could not get the problem to reproduce, and performance was much better than before, by an order of magnitude.


Updated by Stephen Rondeau almost 7 years ago

The performance improvement was largely due to turning off the antivirus scanning.


Updated by Stephen Rondeau almost 7 years ago

When the problem occurred, I had AVS remotely replicating to another storage server over the network (i.e., a remote mirror). After the recent restart, it was simply logging (sndradm -l). I changed it to synchronize (sndradm -u) to the remote mirror, and then tried copying the problem directories with deduping on. The volume failed again, reproducing the problem.

So, I turned off ZFS deduplication. The performance is slow (back to original rates; antivirus is off), but the problem directories copied without a volume failure.

Consequently, this appears to be a problem when AVS is replicating a ZFS volume undergoing deduplication.


Updated by Ken Mays almost 7 years ago

  • Assignee set to OI illumos

Updated by Ken Mays about 6 years ago

  • Status changed from New to Closed
  • Target version deleted (oi_151_stable)
  • % Done changed from 0 to 100

