Feature #14407

unix, genunix partial warning cleanup

Added by Robert Mustacchi 9 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

While doing some development work, I took a pass through a number of the warnings that are gagged in unix and genunix and that genuinely make building harder. While things aren't 100% clean, this has been sitting around long enough that it's worth getting it back. Although this is all being put back as one commit, the associated review breaks out the individual changes, which should make it easier to review piece by piece if preferred. Because of the testing and risk involved, I've tried to combine it all into one large chunk.
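For illustration only (this is not one of the actual diffs in the change), here is a hedged sketch of the general shape of warning being cleaned up, e.g. a variable that is set but never read, which would trip something like -Wunused-but-set-variable once the gag is removed:

/*
 * Illustrative sketch only -- not code from this commit.  'before' shows
 * the kind of construct a gagged warning flags; 'after' shows a typical
 * cleanup that lets the warning be re-enabled.
 */
#include <stdio.h>

static int
do_thing(void)
{
    return (42);
}

/* Before: 'rv' is assigned but never read, so the warning fires. */
static void
before(void)
{
    int rv;

    rv = do_thing();
}

/* After: the return value is explicitly discarded; the warning is gone. */
static void
after(void)
{
    (void) do_thing();
}

int
main(void)
{
    before();
    after();
    printf("done\n");
    return (0);
}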

In particular, to test this on debug bits I ran the following test suites to build up confidence (along with some basic use of the system myself):

  • crypto-tests
  • libc-tests
  • util-tests
  • os-tests
  • ksh93-tests
#1

Updated by Electric Monk 9 months ago

  • Gerrit CR set to 1950
#2

Updated by Robert Mustacchi 8 months ago

As part of the review process, a colleague provided some additional analysis and asked for more to be written up on the lgrp and mnode changes. Here are my notes from re-evaluating my changes and why I believe they're safe:

A while back, we had a discussion on both the mnode and lgrp changes. After my analysis, I believe the change to lgrp.c is relatively safe. Importantly, we need to make sure that we don't go beyond MAX_NODES, which is 8. Because lgrp_handle_t is a uintptr_t, the main cases we have are LGRP_NONE and prior values like LGRP_NULL_HANDLE and LGRP_DEFAULT_HANDLE. The latter two are easy cases because they are always going to be > MAX_NODES here. The real question is LGRP_NONE. My reasoning that the change is OK is that when LGRP_NONE is cast to a uintptr_t, it becomes UINTPTR_MAX, which will always be outside the valid range. Similarly, any negative value, once in the uintptr_t space, becomes a large value outside that range, so we correctly catch it and return 0. Note that this presumes the choice of returning 0 for non-existent or out-of-range nodes is correct, but I believe we still capture things correctly and there isn't a source change to make right now. Though I may have forgotten something else here.
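To make the casting argument concrete, here is a minimal sketch. It is not the actual lgrp.c code: the helper name, the lookup body, and the assumption that LGRP_NONE is defined as -1 are all mine. It only illustrates why a single unsigned comparison against MAX_NODES rejects LGRP_NONE and any other negative value.

/*
 * Hypothetical sketch only -- not the lgrp.c change itself.  It assumes
 * LGRP_NONE is -1 and MAX_NODES is 8, per the notes above.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES   8
#define LGRP_NONE   (-1)

typedef uintptr_t lgrp_handle_t;

/* Return 0 for anything outside [0, MAX_NODES); assumed behavior. */
static unsigned int
node_for_handle(lgrp_handle_t hand)
{
    if (hand >= MAX_NODES)
        return (0);
    /* The real code would index a per-node table here. */
    return (1);
}

int
main(void)
{
    /* LGRP_NONE (-1) becomes UINTPTR_MAX once cast, so it is caught. */
    printf("%u\n", node_for_handle((lgrp_handle_t)LGRP_NONE));
    /* Any other negative value also wraps to a huge unsigned value. */
    printf("%u\n", node_for_handle((lgrp_handle_t)-5));
    /* A valid node index passes the check. */
    printf("%u\n", node_for_handle(3));
    return (0);
}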

Similarly, with the mnode change, I went back through and looked at this. There are two observations here. The first is that all the surrounding code, not just in this function but in others, is already assuming this value is a signed integer, and the dereferences are guarded on that. The remaining question is about the number of ranges and negative values. It turns out nranges is a global variable that corresponds to the array size of 4, so we always enter this code with a value of 3. I therefore believe this doesn't regress or change how we end up doing memory node initialization.
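Again as illustration only, here is a hedged sketch of the guarded-signed-index pattern described above. nranges and the array size of 4 come from the notes; the helper and array names (mem_range_lookup, mem_range_data) and the exact shape of the guard are assumptions for this example, not the actual genunix code.

/*
 * Hedged sketch -- not the actual genunix code.  It models the pattern
 * described above: the memory-node index is a signed integer, every
 * dereference is guarded on it, and callers enter with nranges - 1.
 */
#include <stdio.h>

#define MAX_MEM_RANGES  4               /* array size from the notes above */

static int nranges = MAX_MEM_RANGES;    /* global, per the discussion */
static int mem_range_data[MAX_MEM_RANGES];

/* Hypothetical helper: only dereference when the signed index is valid. */
static int
mem_range_lookup(int mnode)
{
    if (mnode < 0 || mnode >= nranges)
        return (-1);                    /* guarded: no out-of-bounds access */
    return (mem_range_data[mnode]);
}

int
main(void)
{
    /* Callers walk down from nranges - 1, i.e. they enter with 3. */
    for (int mnode = nranges - 1; mnode >= 0; mnode--)
        (void) mem_range_lookup(mnode);
    return (0);
}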

#3

Updated by Robert Mustacchi 8 months ago

As an additional check beyond the test suites, I wanted to make sure that I had actually triggered the kmem callbacks that changed, to confirm that we still execute cleanly in there and don't end up in a bad place. To do this, I used debug bits and loaded up a small Rome server with several concurrent illumos builds and some larger Rust cargo package builds. To see that the callbacks fired, I set up the following DTrace one-liner (more or less):

# pfexec dtrace -n 'fbt::kmem_cache_scan:entry{ self->t = 1; }' -n 'fbt::kmem_cache_scan:return{ self->t = 0; }' -n 'fbt::kmem_move_buffers:return/self->t/{ @[arg1] = count(); }' -n 'tick-5m{ printa(@); }'
...
 15  82757                         :tick-5m
       4294967295                6
                0               90
                1             5065

The above is the final output I saw (note that there is no use of trunc(), so these counts represent the total over the hour or two the workload ran). Debug bits were used deliberately: they leave less memory available and, if we had screwed up some kind of buffer management, kmem debugging would be more likely to catch it. The system seemed fine. I also took a crash dump of the system and then verified that all the caches were in a good state with ::kmem_verify in mdb on the dump.

#4

Updated by Electric Monk 8 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 3df2e8b2fd61f45437285750d2880d6416a9200c

commit  3df2e8b2fd61f45437285750d2880d6416a9200c
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2022-02-10T01:02:41.000Z

    14407 unix, genunix partial warning cleanup
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Rich Lowe <richlowe@richlowe.net>
