x64 mutex_enter only clears half of %rax, may block unnecessarily
Looking further at the x86 mutex_enter, I notice that we clear %eax before issuing a locked cmpxchgq. cmpxchgq compares against %rax, of which we only cleared the low 32 bits (via %eax), so we may not have actually zeroed the register before the comparison.
I think this will push us into the slow path if any of the high 32 bits of %rax are set, so we end up going through vector_enter just long enough to find that the lock is actually free (and always has been).
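To make the suspected failure mode concrete, here is a minimal user-space sketch in C11 of the comparison step only. It is not the kernel code: fast_path_succeeds, the owner value, and the stale comparand are hypothetical names and values chosen for illustration, and atomic_compare_exchange_strong merely stands in for the locked cmpxchgq. It takes the premise above (that the comparand may still carry non-zero upper bits) as an assumption and does not test whether that can actually happen.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the fast path: the lock word is 0 when free, and the
 * compare-and-swap installs the owner only if the lock word equals
 * the comparand (the role %rax plays for cmpxchgq).
 */
bool fast_path_succeeds(uint64_t comparand)
{
    _Atomic uint64_t lock = 0;      /* a free lock */
    uint64_t expected = comparand;  /* stand-in for %rax */
    uint64_t owner = 0x1000;        /* hypothetical owner value */

    /* Succeeds only when expected == lock, i.e. when the comparand is 0. */
    return atomic_compare_exchange_strong(&lock, &expected, owner);
}
```

With a comparand of 0 the swap succeeds and the lock is taken on the fast path. With any comparand whose upper 32 bits are set (for example 0xdeadbeef00000000), the swap fails even though the lock is free, which is the situation that would send the caller into the slow path.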
Updated by Rich Lowe about 10 years ago
While searching for the processor manuals to confirm this theory, one of the hits was an old bugs.opensolaris.org bug (6958602) that states basically what I said above and contains a DTrace script to test the unnecessary-block theory.
The Google cached copy is: http://webcache.googleusercontent.com/search?q=cache:_TH8yRL3TvYJ:bugs.opensolaris.org/bugdatabase/view_bug.do%3Bjsessionid%3D4cbbe2b16c2f2f9397445cead61c%3Fbug_id%3D6958602