Project

General

Profile

Actions

Bug #1983

closed

sshd problems

Added by Hugo Fraga over 11 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Start date:
2012-01-13
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
sshd
Gerrit CR:
External Bug:

Description

Greetings , hopefully someone can help me as i am at a total loss and feeling highly frustrated for not being able to understand what's happening.

So i've been a linux user and recently tried openindiana and fell in love with it , i downloaded openindiana 151a and noticed that every time i would connect to the ssh server for the first time after installation and every first time after rebooting i would get a 2-3 second delay before getting password prompt.

This started worrying me so i went to irc and people said it had to be a dns issue , so i added the following to sshd_config :

LookupClientHostnames no
VerifyReverseMapping no

But the delay persisted , so then i saw that there was a illumos joyment patch for sshd to make sure that it wasn't really trying to do the reverse mapping and i patched illumos-gate sshd and recompiled it and replaced the default sshd and still nothing.

I've tried on 3 different machines with very different hardware specs , tried even without even setting up a network interface and just typing ssh localhost from the server itself and i still got the delay the first time i did
that , i notice that the hard drive reads a lot for that first time , as in if it's loading modules or so but when i asked people said they didn't have this delay for the first time after reboot etc.

My machines have decent hardware , core i7 920 on the flagship one where i first experienced the issues , a hp micro server with a amd dual core and my iMac which i also installed openindiana on just to make sure it wasn't some hardware issue.

This happens on solaris 11 as well , does not happen on nexenta core or nexenta stor , happens on illumian , doesn't happen on the linux ones.

Any help would be greatly appreciated , here is a sshd -ddd log before i had added the reverselookup no etc since i can't recall how to output it to a file was it sshd -ddd 2&>1 > /tmp/blah or ? sigh :( can't recall but that log shows exactly the same symptoms as it does now with the dns lookup off , it halts on the "Waiting for monitor" part.

Thank you very much in advance , hope we can figure out what this is , unless this is normal of course but everyone i asked said it wasn't.

http://pastebin.com/x12Lb9Tv sshd server log

Actions #1

Updated by Bayard Bell over 11 years ago

Could you provide some further indication of where in these operations you saw delay? Also, could you provide the corresponding client output (ssh -vvv) with a similar indication of where the delays were noticeable?

Actions #2

Updated by John Wiegley over 11 years ago

I can confirm this is happening to me too, immediately after upgrading to 151a2. My other zones, which are still on 151a1, do not have this problem.

To diagnose, run on the server:

sshd -d -d -d -p 5050

Then on the client:

ssh -v -v -v -p 5050 $host

The client hangs after writing "debug1: SSH2_MSG_KEXINIT sent". The server hangs after writing "debug2: Waiting for monitor".

None of the typical SSH slowness tricks (GSSApi, no DNS, editing /etc/hosts, etc) have had any effect on this problem whatsoever.

Immediately after logging in, SSH becomes responsive to further logins for around 30 secs, but if you wait a couple of minutes then it should stall again waiting for the monitor.

Actions #3

Updated by Bayard Bell about 11 years ago

  • Project changed from OpenIndiana Distribution to illumos gate

Additional info is that sshd appears to make an attempt to talk to tcsd over loopback and gets a SYN_SENT, which is why the behaviour is different. It's not clear why there's a listener that's not really there under these conditions. This doesn't look like a problem with sshd per se.

Actions #4

Updated by Bayard Bell about 11 years ago

  • Tags changed from needs-triage to sshd

This may or may not be a problem with illumos-gate per se. First need to look into trousers side and determine whether the issue should be fixed there. Moved to illumos-gate until sshd is completely in the clear.

Actions #5

Updated by Yuri Pankov about 11 years ago

This could (or not) be related to #1848. However, I'm seeing the same disk thrashing when trying to ssh in after the reboot.

Actions #6

Updated by Rich Lowe about 11 years ago

There's also #1906, where trousers is concerned.

If the conversation with tcsd is the culprit, remove pkcs11_tpm (unless by some strange chance you're using it, which is unlikely).

Yuri, you could probably pretty easily dtrace the disk access (disable sshd, reboot, start sshd -d under dtrace looking at its disk use, ssh in once).

Actions #7

Updated by Jon Strabala about 11 years ago

I noticed this slow ssh connect issue too for several weeks especialy after a reboot (or a long dormancy), first login sort of slow, subsequent logins much faster. Three different machines all behave the same.

Of interest to me is when I run ZFS scrubs and/or do disk benchmarks I really start to notice the initial loggin lag (like disks too busy to load shared libs?) which can go up to 7 sec on a disk I/O bound system with only 4% CPU used and 18GB free memory ( I limited L2ARC in /etc/system as part of my what ifs).

The slowness of establishing initial ssh connections has been more of an annoyance to me, but after reading this I would be willing to help by running some tests. So nothing much technical from me, but if I can run any tests to help resolve this let me know

Actions #8

Updated by Hugo Fraga about 11 years ago

  • Status changed from New to Feedback

Hello, sorry i've been somewhat busy so haven't been able to post here since i haven't even been touching my server, i do have logs on my other machine, i believe it used to hang on "sending client info".

The server part would hang on : "debug1: SSH2_MSG_KEXINIT sent". The server hangs after writing "debug2: Waiting for monitor" like in John's case.

This was with openindiana default , tried disabling gssapi, no dns etc.

I mean after the first connect it's something unnoticeable but i think it would be nice to have it working 'normally'.

Now i have some free time and will see if i can go back to testing

Actions #9

Updated by Jon Strabala about 11 years ago

I have similar results with a client and server setup (described above) I also provide truss outputs for three tests the first two on separate systems (client and server) and the last on the same system. I would be more than happy to repeat with a trace script if needed (as truss only does system calls)

My setup is as follows

The client
opteron02% uname -a
SunOS opteron02 5.10 Generic_142901-11 i86pc i386 i86pc

The server
admin@ukserver:~$ uname -a
SunOS ukserver 5.11 oi_151a2 i86pc i386 i86pc Solaris

===============
TEST 1 as above - full trace http://pastie.org/3771846

The client (SunOS 5.10) hangs at

  debug1: SSH2_MSG_KEXINIT sent
  debug3: kex_reset_dispatch -- should we dispatch_set(KEXINIT) here? 0 && !0

The server (the oi_151a2 system) hangs at

  Connection from 192.168.3.5 port 54398
  debug1: Client protocol version 2.0; client software version Sun_SSH_1.1.3
  debug1: match: Sun_SSH_1.1.3 pat Sun_SSH_1.1*
  debug1: Enabling compatibility mode for protocol 2.0
  debug1: Local version string SSH-2.0-Sun_SSH_1.5
  monitor debug1: list_hostkey_types: 
  debug2: Waiting for monitor

===============
TEST 2 as above, but running "truss" - full trace http://pastie.org/3771880

The client (SunOS 5.10) hangs at

  getpid()                                        = 25560 [25559]
  getpid()                                        = 25560 [25559]
  getpid()                                        = 25560 [25559]
  getpid()                                        = 25560 [25559]
  getpid()                                        = 25560 [25559]
  getpid()                                        = 25560 [25559]
  debug1: SSH2_MSG_KEXINIT sentwrite(2, " d e b u g 1 :   S S H 2".., 29) = 29

  write(2, "\r\n", 2)                             = 2
  debug3: kex_reset_dispatch -- should we dispatch_set(KEXINIT) here? 0 && !0write(2, " d e b u g 3 :   k e x _".., 75)   = 75

  write(2, "\r\n", 2)                             = 2
  write(4, "\0\001C4\t14D4931BC4A49E".., 456)     = 456
  pollsys(0x08045AD0, 1, 0x00000000, 0x00000000) (sleeping...)

  **** delay 2-7 seconds here *****

  pollsys(0x08045AD0, 1, 0x00000000, 0x00000000)  = 1
  read(4, "\0\0\t04\v149E yEBC3149F".., 8192)     = 2312
  debug1: SSH2_MSG_KEXINIT receivedwrite(2, " d e b u g 1 :   S S H 2".., 33)     = 33

The server (the oi_151a2 system) hangs at

  getpid()                                        = 2757 [2756]
  getpid()                                        = 2757 [2756]
  chdir("/")                                      = 0
  sigaction(SIGPIPE, 0x080473E0, 0x08047460)      = 0
  so_socket(PF_INET6, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 3
  fcntl(3, F_SETFL, FNONBLOCK)                    = 0
  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, 0x080474DC, 4, SOV_DEFAULT) = 0
  debug1: Bind to port 5050 on ::.write(2, " d e b u g 1 :   B i n d".., 32)      = 32

  write(2, "\r\n", 2)                             = 2
  bind(3, 0x080C4A58, 32, SOV_SOCKBSD)            = 0
  Server listening on :: port 5050.write(2, " S e r v e r   l i s t e".., 33)     = 33

  write(2, "\r\n", 2)                             = 2
  listen(3, 5, SOV_DEFAULT)                       = 0
  sigaction(SIGHUP, 0x080473E0, 0x08047460)       = 0
  sigaction(SIGTERM, 0x080473E0, 0x08047460)      = 0
  sigaction(SIGQUIT, 0x080473E0, 0x08047460)      = 0
  sigaction(SIGCLD, 0x080473E0, 0x08047460)       = 0
  pollsys(0x080473E0, 1, 0x00000000, 0x00000000) (sleeping...)

  **** delay 2-7 seconds here *****

  pollsys(0x080473E0, 1, 0x00000000, 0x00000000)  = 1
  accept(3, 0x08047920, 0x080474E4, SOV_DEFAULT)  = 4
  fcntl(4, F_SETFL, (no flags))                   = 0

===============
*Test 3 - running both on the SAME oi_151a2 system - not paste.org outputs *

My final setup (for test 3) is as follows

The client
root@ukserver:~$ uname -a
SunOS ukserver 5.11 oi_151a2 i86pc i386 i86pc Solaris

The server
root@ukserver:~$ uname -a
SunOS ukserver 5.11 oi_151a2 i86pc i386 i86pc Solaris

The client on oi_151a command "truss ssh -v -v -v -p 5050 192.168.3.4"

  getpid()                                        = 2988 [2987]
  getpid()                                        = 2988 [2987]
  debug1: SSH2_MSG_KEXINIT sentwrite(2, " d e b u g 1 :   S S H 2".., 29) = 29

  write(2, "\r\n", 2)                             = 2
  debug3: kex_reset_dispatch -- should we dispatch_set(KEXINIT) here? 0 && !0write(2, " d e b u g 3 :   k e x _".., 75)   = 75

  write(2, "\r\n", 2)                             = 2
  write(4, "\0\001EC0514EBF7AC - 3FE".., 496)     = 496

  pollsys(0x08045A60, 1, 0x00000000, 0x00000000) (sleeping...)

  **** delay 2-7 seconds here *****

  pollsys(0x08045A60, 1, 0x00000000, 0x00000000)  = 1
  read(4, "\0\0\t L\n14EDF2A7E3 A 9".., 8192)     = 2384
  debug1: SSH2_MSG_KEXINIT receivedwrite(2, " d e b u g 1 :   S S H 2".., 33)     = 33

  write(2, "\r\n", 2)                             = 2
  debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha1,diffie-hellman-group1-sha1write(2, " d e b u g 2 :   k e x _".., 88)      = 88

The server on oi_151a command "truss /usr/lib/ssh/sshd -d -d -d -p 505"

  write(2, "\r\n", 2)                             = 2
  access("/usr/bin/sh", X_OK)                     = 0
  open("/dev/null", O_RDWR)                       = 9
  fcntl(9, F_GETFD, 0x080473D8)                   = 0
  pipe()                                          = 10 [11]
  access("/usr/bin/sh", X_OK)                     = 0
  lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0xFFFFFFFF, 0x000000FF, 0x00000000) = 0x00000000 [0x00000000]
  vforkx(FORK_NOSIGCHLD|FORK_WAITPID)             = 2990
  lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
  close(11)                                       = 0
  fcntl(10, F_DUP2FD, 0x00000009)                 = 9
  close(10)                                       = 0
  fstat64(9, 0x08047330)                          = 0
  fstat64(9, 0x08047240)                          = 0
  ioctl(9, TCGETA, 0x080472E0)                    Err#22 EINVAL
  cread(9, 0x080CECDC, 5120)       (sleeping...)

  **** delay 2-7 seconds here *****

  read(9, " C\n P O S I X\n a f _ Z".., 5120)     = 2907
  brk(0x080DA228)                                 = 0
  brk(0x080DC228)                                 = 0
  read(9, 0x080CECDC, 5120)                       = 0
  llseek(9, 0, SEEK_CUR)                          Err#29 ESPIPE
  close(9)                                        = 0
Actions #10

Updated by Thomas Gouverneur about 11 years ago

Try to add -ff to follow forked pids while using truss.
If my assumptions are correct, you should see somewhere three connections attempt to port 30003 which are refused, but after an amount of time.

This is where the delay is caused under Solaris 11 at least... (which i've tested on...)

Actions #11

Updated by Thomas Gouverneur about 11 years ago

FYI, as Rich Lowe suggested, runnning two following commands fixed my problem with it:

cryptoadm disable provider=/usr/lib/security/\$ISA/pkcs11_tpm.so mechanism=all
cryptoadm unload provider=/usr/lib/security/\$ISA/pkcs11_tpm.so

Still, it's on solaris 11 and if someone can confirm it's the same on OI...

Actions #12

Updated by Thomas Gouverneur about 11 years ago

s/unload/uninstall/g in my previous post, sorry for spam encountered...

Actions #13

Updated by Hugo Fraga almost 11 years ago

Hello, sorry not being able to reply sooner as i got caught up in moving and having the server offline for a while due to no time.

Anyhow i tried :

cryptoadm disable provider=/usr/lib/security/\$ISA/pkcs11_tpm.so mechanism=all
cryptoadm uninstall provider=/usr/lib/security/\$ISA/pkcs11_tpm.so

But the login is still slow first time after reboot, i also disabled the common stuff again just for measure.

LookupClientHostnames no
VerifyReverseMapping no
GSSAPIAuthentication no

This is on a fresh openindiana latest version (151a5)

Any suggestions?

Actions #14

Updated by Predrag Zečević over 10 years ago

Hi,

i was also very unhappy with slow response. Also spent hours looking for solution.

I have 2 OI_151a7 installations: one at home and one in office (= 2 different DNS 'setup').

For both, this looks like solution (at least works for my setup):

a) modify /etc/nsswitch.conf and remove 'mdns' from hosts and ipnodes entries
b) disable 'svc:/network/dns/multicast:default' service
c) [not sure if this has to be done, but] restart ssh service

After that, getting login prompt was fast (even in Kerberized environment).

Regards.

BTW, i cannot explain this, because i couldn't find man page for mdns (even on http://illumos.org/man/all page), although (from nsswitch.conf):

# server lookup.  See resolv.conf(4). For lookup via mdns  
# svc:/network/dns/multicast:default must also be enabled. See mdnsd(1M)

Actions #15

Updated by Jason Matthews over 10 years ago

i see this problem within zones but not on the hosts themselves. i can make access available to these zones if a reference system is needed.

Actions #16

Updated by Udo Grabowski over 10 years ago

We had this problem when setting primarycache and secondarycache to
'metadata' for the zfs root file system on small machines. ssh reads all locales
over and over, and when setting that property back to default 'all', these files
were cached. arcstat shows the misses on every ssh login if they are not
cached. Reboot needed on change of this setting, first login will always be slow.
Caching saves 2-3 seconds on every round (aka ssh lwp) on the locales.

Actions #17

Updated by Yuri Pankov about 6 years ago

  • Status changed from Feedback to Closed

closing this as it has no activity for 4 years and sunssh is gone.

Actions

Also available in: Atom PDF