Bug #3572

open

Openindiana fails to boot

Added by Ivan Stanojevic over 8 years ago. Updated over 8 years ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: kernel
Start date: 2013-02-18
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

When booting OpenIndiana 151a7 from the live DVD (desktop installation), the system reboots after 20-30 seconds and the whole process repeats. This happens regardless of the selected boot option or additional kernel boot parameters. It only happens on certain machines, one of which is a Fujitsu brand machine with the following data:

Model: Fujitsu Esprimo P400 E85+
Mainboard: D2990-A21
BIOS revision: R1.21.0
CPU: Intel i5-2500

The reboot happens while dots are appearing on the initial screen after GRUB. No key press causes any diagnostic text messages to be displayed.

After a more thorough investigation in the kernel debugger, I found the position at which the reboot occurs - file usr/src/uts/intel/io/pci/pci_resource.c, function mps_probe(), line 429:

428     ctp = (struct mps_ct_hdr *)(uintptr_t)fpp->fps_mpct_paddr;
429     if (ctp->ct_sig != 0x504d4350) { /* check "PCMP" signature */
430         dprintf("MP Configuration Table signature is wrong");
431         return;
432     }

Here, fpp is a pointer to a struct mps_fps_hdr (MP Floating Pointer Structure). This structure has the correct signature (field fps_sig == 0x5f504d5f, i.e. "_MP_") and the correct checksum, both of which have previously been verified. However, its field fps_mpct_paddr (pointer to a struct mps_ct_hdr, the MP Configuration Table), which is assigned to the ctp pointer in line 428, is set to an address in unmapped memory. Accessing memory at that address while checking the "PCMP" signature in line 429 causes a page fault and, consequently, a reboot.
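The checksum claim can be checked by hand from the field values that kmdb reports for this machine later in the thread. A minimal sketch in C, assuming the MP specification's rule that all bytes of the floating pointer structure (16 bytes here, since fps_len is 1) sum to zero modulo 256. The names fps_checksum and fps_bytes are illustrative and are not taken from pci_resource.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum all bytes modulo 256; a valid MP structure sums to 0. */
static uint8_t
fps_checksum(const uint8_t *p, size_t len)
{
	uint8_t sum = 0;

	assert(p != NULL);
	while (len-- > 0)
		sum += *p++;
	return (sum);
}

/* Bytes at 0xfd430 in memory (little-endian) order, per the ::print output. */
static const uint8_t fps_bytes[16] = {
	0x5f, 0x4d, 0x50, 0x5f,	/* fps_sig "_MP_" */
	0x98, 0xfa, 0x91, 0xc9,	/* fps_mpct_paddr = 0xc991fa98 */
	0x01,			/* fps_len, in 16-byte units */
	0x04,			/* fps_spec_rev */
	0xb4,			/* fps_cksum */
	0, 0, 0, 0, 0		/* fps_featinfo1..5 */
};
```

fps_checksum(fps_bytes, sizeof (fps_bytes)) returns 0 (the bytes add up to 1280, which is 5 x 256), so the header is internally consistent. The checksum covers the bytes of the pointer, not the validity of its target.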

The stack trace at the critical instruction (pci_autoconfig`mps_probe+0xd0) is:

fec40244 pci_autoconfig`mps_probe+0xd0()
fec40274 pci_autoconfig`find_bus_res+0x2f(0,2,fec402b4,f082a602)
fec402a4 pci_autoconfig`populate_bus_res+0x18(0,c2826d68,fec402f4,f082b9da)
fec402f4 pci_autoconfig`pci_reprogram+0xeb()
fec40304 pci_autoconfig`pci_enumerate+0x22(1,1,0,fe815845)
fec40324 impl_bus_reprobe+0x1b()
fec40334 configure+0x62()
fec40364 startup_end+0x77()
fec40374 startup+0x3f()
fec40384 main+0x30()
fec0b824 _locore_start+0x2d8()

The values of the pointers are:

fpp = %ebx = 0x000fd430
ctp = %edi = 0xc991fa98

No BIOS setting changes this behaviour, and it also occurs in OpenSolaris 2009.06 and in OpenIndiana 151 (probably in all OpenIndiana releases).

I have seen the same behaviour on a Dell Inspiron 17R - N1770 laptop, but I cannot check whether the cause is the same since I don't have that laptop available anymore.


Files

acpidump-pciexpress-graphics.txt (206 KB) - Output of acpidump when the PCI Express graphics card is active - Ivan Stanojevic, 2013-02-23 11:00 PM
acpidump-integrated-graphics.txt (207 KB) - Output of acpidump when the integrated graphics card is active - Ivan Stanojevic, 2013-02-23 11:00 PM
x32__rsdp.dsl (1.05 KB) - 1: RSDP - Aakash Saini, 2013-02-24 06:07 PM
x32__rsdt.dsl (1.73 KB) - 2: RSDT - Aakash Saini, 2013-02-24 06:07 PM
x32__xsdt.dsl (1.96 KB) - 3: XSDT - Aakash Saini, 2013-02-24 06:07 PM
x32__facp1.dsl (9.04 KB) - 4: FACP-1 - Aakash Saini, 2013-02-24 06:07 PM
x32__facp2.dsl (5.35 KB) - 5: FACP-2 - Aakash Saini, 2013-02-24 06:07 PM
x32__dsdt.dsl (349 KB) - Aakash Saini, 2013-02-26 06:25 PM
Actions #1

Updated by Aakash Saini over 8 years ago

  • Status changed from New to Feedback

Try the following boot options, one at a time, via GRUB:
acpi-user-options=0x8
acpi-user-options=0x4
acpi-user-options=0x2 (try this one last)

This is probably an ACPI/CA compatibility issue with a laptop/PC that is not supported.

Actions #2

Updated by Ivan Stanojevic over 8 years ago

I've tried each of the suggested options and none of them helps. The behaviour is exactly the same and the position of the trap, the stack frame and the register values are identical in all cases.

I've noticed that I had made a slight mistake in retyping the details of the stack trace, so one of the lines should be corrected:

fec402a4 pci_autoconfig`populate_bus_res+0x18(0,c2826d60,fec402f4,f082b9da)

(Only the second argument was wrong here, if this is of any importance.)

Actions #3

Updated by Aakash Saini over 8 years ago

Not sure whether you are able to get beyond the dots to _kobj_boot(). Try to capture the message trail.
If so, could you update the stack details?
(console) -B kbm_debug=true,prom_debug=true,map_debug=true -kdv
(remote-console) -B console=ttya,kbm_debug=true,prom_debug=true,map_debug=true -kdv
::bp -n 1 attach_drivers
:c
...
:c

Actions #4

Updated by Ivan Stanojevic over 8 years ago

The execution reaches neither _kobj_boot() nor attach_drivers(). I've set breakpoints at both of these functions and also at impl_bus_reprobe(), and the first breakpoint reached is at impl_bus_reprobe(). After that it goes on the usual way and page faults inside pci_autoconfig`mps_probe() (line 429).

The only difference now is that I get some more diagnostic output at the beginning of the debugger session, due to the boot options kbm_debug=true,prom_debug=true,map_debug=true.

It seems to me that the problem is one of the following:

1. The BIOS sets the member fps_mpct_paddr of the structure mps_fps_hdr to an invalid address, even though the signature and the checksum of this structure are correct.

2. The value of fps_mpct_paddr is not a pointer and should be treated differently.

3. Some tampering occurs with mps_fps_hdr before it is read. (This is highly unlikely, since it would probably invalidate the checksum.)

Regardless of what the problem is, pci_autoconfig`mps_probe() should be able to deal with it by performing some additional checks.

Actions #5

Updated by Aakash Saini over 8 years ago

_kobj_boot() cannot be given a :bp under kmdb; it executes as part of _start/krtld.

I don't understand the point about tampering. Could you verify?
::bp -n 1 pci_autoconfig`mps_probe
:c
pci_autoconfig`pci_boot_debug/W 0x1
::bp -n 1 pci_autoconfig`checksum
:c
output: Found MP Floating Pointer Structure at 9fff0
I suppose a little ::step will help to understand the register/value changes, including xxxx::print -ta struct mps_fps_hdr and mps_ct_hdr.
This will also verify any table signature related issues.

Actions #6

Updated by Ivan Stanojevic over 8 years ago

I followed your instructions and here is the result:

[0]> ::bp -n 1 pci_autoconfig`mps_probe
[0]> :c
[0]> pci_autoconfig`pci_boot_debug/W 0x1
[0]> ::bp -n 1 pci_autoconfig`checksum
[0]> :c

o: Found MP Floating Pointer Structure at fd430
o: kmdb: stop at pci_autoconfig`checksum
o: kmdb: target stopped at:
o: pci_autoconfig`checksum: pushl %ebp

Now the execution is at line 584 (checksum() entry).

[0]> :u
[0]> ::regs

o: ...
o: %eax = 0x00000000
o: ...

This is the result of checksum(), meaning that the checksum is correct.

[0]> :e
[0]> :e
[0]> :e
[0]> :e

Now it executed line 428 and is at line 429.

[0]> ::regs

o: ...
o: %ebx = 0x000fd430
o: %edi = 0xc991fa98
o: %eip = 0xf0830d70 pci_autoconfig`mps_probe+0xd0
o: ...

Here %ebx = fpp and %edi = ctp.

[0]> 0x000fd430::print -ta "struct mps_fps_hdr"

o: fd430 struct mps_fps_hdr {
o: fd430 uint32_t fps_sig = 0x5f504d5f
o: fd434 uint32_t fps_mpct_paddr = 0xc991fa98
o: fd438 uchar_t fps_len = 0x1
o: fd439 uchar_t fps_spec_rev = 0x4
o: fd43a uchar_t fps_cksum = 0xb4
o: fd43b uchar_t fps_featinfo1 = 0
o: fd43c uchar_t fps_featinfo2 = 0
o: fd43d uchar_t fps_featinfo3 = 0
o: fd43e uchar_t fps_featinfo4 = 0
o: fd43f uchar_t fps_featinfo5 = 0
o: }

[0]> 0xc991fa98::print -ta "struct mps_ct_hdr"

o: c991fa98 struct mps_ct_hdr {
o: kmdb: failed to read 4 bytes at c991fa98: no mapping for address
o: }

[0]> :e

o: kmdb: single-step stop on page fault (#pf) trap
o: kmdb: target stopped at:
o: pci_autoconfig`mps_probe+0xd0: cmpl $0x504d4350,(%edi)

Actions #7

Updated by Aakash Saini over 8 years ago

could you verify again before:step-in & after:step-out of checksum().
...
before:step:in checksum()
[0]>$?
%eax = 0x00000010 <<< checksum()-before-step:in
%ebx = 0x0009fff0 <<< you-system: 0x000fd430
%eip = 0xf982e2c8 pci_autoconfig`checksum
...
[0]> 0x0009fff0::print -ta struct mps_fps_hdr <<<
9fff0 uint32_t fps_sig = 0x5f504d5f <<< sig:_MP_
9fff4 uint32_t fps_mpct_paddr = 0xe1300 <<< "0xc991fa98"

"0xc991fa98":: if this is what you get before-step-in, its incorrect value!
it should be less than 0x000ffff0.
...
[0]> 0xe1300::print -ta struct mps_ct_hdr <<< address:0xc991fa98
e1300 uint32_t ct_sig = 0x504d4350 <<< sig:PCMP

after:step:out checksum()
[0]>::step out
[0]>$?
%eax = 0x00000000 <<< checksum()-after-step:out
%ebx = 0x0009fff0 <<<
%eip = 0xf982de2f pci_autoconfig`mps_probe+0xeb
...
[0]> 0x0009fff0::print -ta struct mps_fps_hdr
9fff0 uint32_t fps_sig = 0x5f504d5f <<< sig:_MP_
9fff4 uint32_t fps_mpct_paddr = 0xe1300 <<< this should not change before/after step:in/out

It's not a bug.

It shouldn't make any difference, but here are some thoughts:
1. Try disabling on-board features in the BIOS, i.e. graphics, network, ACPI, power, suspend, delays, USB, floppy, hyper-threading, etc. You mention 'no setting in bios can change this...'; just a guess, but this could be an APIC issue.
2. Try replacing the RAM modules with new ones, or test individual RAM modules in the first DIMM slot (h/w address faults).
3. The BIOS is doing something, I don't know what, but the address is incorrect; it shouldn't be that high. If you get different values in 64-bit and 32-bit boot, then it is strictly an OS issue, but I have checked both 64-bit and 32-bit.
4. ???? ... I don't have Fujitsu h/w.

By the way, did you try any other OS (Solaris, illumian v1.0) besides OI 151 and OpenSolaris 2009.06?

Actions #8

Updated by Ivan Stanojevic over 8 years ago

The mps_fps_hdr structure is incorrect from the beginning.

[0]> ::bp pci_autoconfig`mps_probe
[0]> :c
[0]> ::bp pci_autoconfig`find_sig
[0]> :c
[0]> :u

Now it tried to find the mps_fps_hdr structure in the first memory region, lines 389-401, and is at line 402.

[0]> ::regs

o: ...
o: %eax = 0x00000000
o: ...

The result of find_sig() is zero, so it didn't find the structure.

[0]> :c
[0]> :u

Now it tried to find the structure in the second memory region, lines 402-409, and is at line 410.

[0]> ::regs

o: ...
o: %eax = 0x00000000
o: ...

Again, the result of find_sig() is zero, so it didn't find the structure.

[0]> :c
[0]> :u

Now it tried to find the structure in the third memory region, lines 410-413, and is at line 415.

[0]> ::regs

o: ...
o: %eax = 0x000fd430
o: ...

Finally, it found the structure at 0xfd430.

[0]> 0xfd430::print -ta "struct mps_fps_hdr"

o: fd430 struct mps_fps_hdr {
o: fd430 uint32_t fps_sig = 0x5f504d5f
o: fd434 uint32_t fps_mpct_paddr = 0xc991fa98
o: fd438 uchar_t fps_len = 0x1
o: fd439 uchar_t fps_spec_rev = 0x4
o: fd43a uchar_t fps_cksum = 0xb4
o: fd43b uchar_t fps_featinfo1 = 0
o: fd43c uchar_t fps_featinfo2 = 0
o: fd43d uchar_t fps_featinfo3 = 0
o: fd43e uchar_t fps_featinfo4 = 0
o: fd43f uchar_t fps_featinfo5 = 0
o: }

fps_mpct_paddr has an incorrect value.

[0]> ::bp pci_autoconfig`checksum
[0]> :c
[0]> :u

Now it calculated the checksum of the structure, line 423, and is about to check whether it is correct.

[0]> ::regs

o: ...
o: %eax = 0x00000000
o: ...

The result indicates the correct checksum.

[0]> 0xfd430::print -ta "struct mps_fps_hdr"

o: fd430 struct mps_fps_hdr {
o: fd430 uint32_t fps_sig = 0x5f504d5f
o: fd434 uint32_t fps_mpct_paddr = 0xc991fa98
o: fd438 uchar_t fps_len = 0x1
o: fd439 uchar_t fps_spec_rev = 0x4
o: fd43a uchar_t fps_cksum = 0xb4
o: fd43b uchar_t fps_featinfo1 = 0
o: fd43c uchar_t fps_featinfo2 = 0
o: fd43d uchar_t fps_featinfo3 = 0
o: fd43e uchar_t fps_featinfo4 = 0
o: fd43f uchar_t fps_featinfo5 = 0
o: }

Nothing in the structure changed after calculating the checksum.

If I'm not wrong, this structure is in ROM so this seems to be a bug in the BIOS. Just to test:

[0]> 0xfd434/W 0

o: kmdb: failed to write 0 at address 0xfd434: operation not supported by target

As you suggested, I tried changing ACPI related settings in the BIOS (there are very few of them) but it didn't make any difference.

The hardware of the machine is quite new and of good quality (it is a brand machine). I tested the RAM for 2 days with memtest and no errors were reported.

I also tried illumian 1.0a5 and the behaviour is the same. Illumian boots well on other machines.

Just a suggestion: Couldn't mps_probe() check whether ctp has a valid value before dereferencing it at line 429? For example, it could check whether it points to a mapped address, or whether it is less than 0xfffff. (I'm not sure whether the latter test is justified, but you mentioned that ctp's value should be much lower than it is.)

Actions #9

Updated by Aakash Saini over 8 years ago

Nothing in the structure changed after calculating the checksum.
If I'm not wrong, this structure is in ROM so this seems to be a bug in the BIOS. Just to test:
[0]> 0xfd434/W 0
o: kmdb: failed to write 0 at address 0xfd434: operation not supported by target

Yes, it's ROM BIOS. You cannot write, only read:
0xfd434/X

As you suggested, I tried changing ACPI related settings in the BIOS (there are very few of them) but it didn't make any difference.
The hardware of the machine is quite new and of good quality (it is a brand machine). I tested the RAM for 2 days with memtest and no errors were reported.

Some time ago I used memtest, but it failed to find the fault. Surprisingly, the Microsoft tool caught my attention, and it did find the faults (faulty RAM).
link:: http://technet.microsoft.com/en-us/magazine/2008.09.utilityspotlight.aspx

I also tried illumian 1.0a5 and the behaviour is the same. Illumian boots well on other machines.

I only asked because I'm using illumian for our discussion, not oi-151; I thought the OI build might have had compilation issues (corrupt binaries).

Just a suggestion: Couldn't mps_probe() check whether ctp has a valid value before dereferencing it at line 429? For example, it could check whether it points to a mapped address, or whether it is less than 0xfffff. (I'm not sure whether the latter test is justified, but you mentioned that ctp's value should be much lower than it is.)

Regarding the range up to 0xfffff: it is defined by the x86 standard for MP.
link:: https://docs.google.com/viewer?url=http://www.intel.com/design/pentium/datashts/24201606.pdf
chapter/section: 4 (page: 4-2)
mps_fps_hdr (range): 0F0000h and 0FFFFFh
mps_ct_hdr (range): 0E0000h and 0FFFFFh
figures: 4.2 (mps_fps), 4.3 (mps_ct)

you might want to check manually;
[0]> 0x9fff0/X
0x9fff0: 5f504d5f /sig: _MP_ (hdr-start)

[0]> 0x9fff0::dump
\/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
9fff0: 5f4d505f 00130e00 01047f00 00000000 _MP_............

[0]> e1300/X
0xe1300: 504d4350 /sig: PCMP (hdr-start)

[0]> e1300::dump
\/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
e1300: 50434d50 e0000437 56424f58 43505520 PCMP........
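
::dump lists bytes in memory order, while ::print shows each 32-bit field as the little-endian x86 CPU loads it. That is why the same signature appears as 5f4d505f in the dump and as 0x5f504d5f in the ::print output. A small sketch of that decoding in C, using the first eight bytes of the dump line at 9fff0; le32 and dump_9fff0 are illustrative names, not mdb or kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant first. */
static uint32_t
le32(const uint8_t *p)
{
	assert(p != NULL);
	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* First eight bytes of the dump line at 9fff0: 5f4d505f 00130e00 */
static const uint8_t dump_9fff0[8] = {
	0x5f, 0x4d, 0x50, 0x5f,		/* "_MP_" in memory order */
	0x00, 0x13, 0x0e, 0x00		/* fps_mpct_paddr */
};
```

le32(dump_9fff0) yields 0x5f504d5f and le32(dump_9fff0 + 4) yields 0xe1300, matching the fps_sig and fps_mpct_paddr values printed for that system earlier in the thread.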

You can find the actual address of PCMP on your system; a little searching through the screen output is required. You might see _MP_ and PCMP at the correct addresses.
[0]> 0x10000,0xfffff::dump

don't know if your system is actually showing... PCMP hdr at 0xc991fa98..!!!

Actions #10

Updated by Ivan Stanojevic over 8 years ago

There is no mps_ct_hdr structure with the corresponding signature ("PCMP") in the region 0xf0000-0xfffff on the tested machine.

[0]> 0/L 0x504d4350

o: 0x5f94dd8

This is the first address where the debugger found the "PCMP" string. It also found that string at 0x747dda0 and at 0x7708380, but none of these are mps_ct_hdr structures. I concluded that by observing the byte at offset +6 from the "PCMP" address, which is the ct_spec_rev (SPEC_REV) field and should be either 1 or 4, according to the Intel MultiProcessor Specification.
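
The filtering described above can be sketched as follows: scan a memory image for the "PCMP" byte sequence and keep only hits whose spec revision byte (offset +6) is 1 or 4. This is an illustration in user-space C over a plain buffer. find_pcmp_candidate is a hypothetical helper; it does not correspond to the kmdb /L search or to any function in pci_resource.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	CT_SPEC_REV_OFF	6	/* offset of ct_spec_rev within mps_ct_hdr */

/*
 * Return the offset of the first "PCMP" hit at or after 'start' whose
 * spec revision byte is 1 or 4, or -1 if there is none.
 */
static long
find_pcmp_candidate(const uint8_t *mem, size_t len, size_t start)
{
	size_t off;

	assert(mem != NULL);
	for (off = start; off + CT_SPEC_REV_OFF < len; off++) {
		if (memcmp(mem + off, "PCMP", 4) != 0)
			continue;	/* no signature here */
		if (mem[off + CT_SPEC_REV_OFF] == 1 ||
		    mem[off + CT_SPEC_REV_OFF] == 4)
			return ((long)off);
	}
	return (-1);
}
```

A hit with a plausible revision byte is still only a candidate; the table checksum would have to be verified as well before trusting it.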

The "_MP_" header is found at the already known address:

[0]> 0/L 0x5f504d5f

o: 0xfd430

It is also found at multiple higher addresses, but this is the only address in the range 0xf0000-0xfffff where it is found.

The address 0xc991fa98 points to unmapped memory:

[0]> 0xc991fa98::dump

o: kmdb: failed to read data at 0xc991fa98: no mapping for address

If I understood correctly the introductory part of the 4th chapter of the Intel MultiProcessor Specification, the MP Configuration Table Header (struct mps_ct_hdr) may be at the memory regions suggested there (including 0xe0000-0xfffff), but it is not mandatory to put it in one of them. Hence, it wouldn't be justified to test whether ctp<0xfffff. On the other hand, if mps_probe() could just test whether ctp points to mapped memory and return immediately if not, this problem would be solved.
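
A rough sketch of such a guard, written as stand-alone C so that it compiles outside the kernel. The "mapped" test is modelled as a lookup in a toy list of address ranges. struct range, range_is_mapped and checked_ct_ptr are all hypothetical names, no specific illumos API is implied, and the struct copies only the leading fields of mps_ct_hdr shown in the ::print output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MPS_CT_SIG	0x504d4350u	/* "PCMP" as a little-endian uint32 */

struct ct_hdr_prefix {		/* leading fields of struct mps_ct_hdr */
	uint32_t ct_sig;
	uint16_t ct_len;
	uint8_t  ct_spec_rev;
	uint8_t  ct_cksum;
};

struct range {			/* toy stand-in for "this memory is mapped" */
	uintptr_t base;
	size_t len;
};

/* Nonzero if [addr, addr + len) lies entirely inside one known range. */
static int
range_is_mapped(const struct range *map, size_t n, uintptr_t addr, size_t len)
{
	size_t i;

	assert(len > 0);
	for (i = 0; i < n; i++) {
		if (addr >= map[i].base && len <= map[i].len &&
		    addr - map[i].base <= map[i].len - len)
			return (1);
	}
	return (0);
}

/*
 * Return the table pointer only if it is non-null, mapped, and carries
 * the "PCMP" signature. Unmapped addresses are rejected before any read.
 */
static const struct ct_hdr_prefix *
checked_ct_ptr(uintptr_t paddr, const struct range *map, size_t n)
{
	const struct ct_hdr_prefix *ctp;

	if (paddr == 0 || !range_is_mapped(map, n, paddr, sizeof (*ctp)))
		return (NULL);
	ctp = (const struct ct_hdr_prefix *)paddr;
	if (ctp->ct_sig != MPS_CT_SIG)
		return (NULL);
	return (ctp);
}
```

With a value such as 0xc991fa98 and no range covering it, checked_ct_ptr() returns NULL without dereferencing anything, which is the behaviour requested for mps_probe().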

Actions #11

Updated by Aakash Saini over 8 years ago

There is no mps_ct_hdr structure with the corresponding signature ("PCMP") in the region 0xf0000-0xfffff on the tested machine.
[0]> 0/L 0x504d4350
o: 0x5f94dd8

It should be there somewhere!
0x5f94dd8::print -ta struct mps_ct_hdr !!

This is the first address where the debugger found the "PCMP" string. It also found that string at 0x747dda0 and at 0x7708380, but none of these are mps_ct_hdr structures. I concluded that by observing the byte at offset +6 from the "PCMP" address, which is the ct_spec_rev (SPEC_REV) field and should be either 1 or 4, according to the Intel MultiProcessor Specification.

yes, it should be either 1 or 4.

could you also check with PMCP, instead of PCMP; never know!
[0]> 0/L 0x50434d50

The "_MP_" header is found at the already known address:
[0]> 0/L 0x5f504d5f
o: 0xfd430

It is also found at multiple higher addresses, but this is the only address in the range 0xf0000-0xfffff where it is found.
The address 0xc991fa98 points to unmapped memory:
[0]> 0xc991fa98::dump
o: kmdb: failed to read data at 0xc991fa98: no mapping for address

only fujitsu specs can explain the phenomenon.

If I understood correctly the introductory part of the 4th chapter of the Intel MultiProcessor Specification, the MP Configuration Table Header (struct mps_ct_hdr) may be at the memory regions suggested there (including 0xe0000-0xfffff), but it is not mandatory to put it in one of them. Hence, it wouldn't be justified to test whether ctp<0xfffff. On the other hand, if mps_probe() could just test whether ctp points to mapped memory and return immediately if not, this problem would be solved.

Yes. The spec talks about the EBDA, 640K, e0000-fffff... for mps_fps, and for mps_ct it also includes the top of physical memory, and so on. It was just a casual remark, because most systems have the structure defined between 0xe0000 and 0xffffe. I have not used Fujitsu x86 hardware, so I don't know.

Rule of thumb: _MP_ is searched for, and the PCMP address is extracted from it.

My focus would be on why the PCMP address location is defined at all. If Fujitsu considers the mps_ct_hdr optional (I don't know the Fujitsu motherboard/BIOS specs), shouldn't they have declared it as 0x0 (zero) instead of specifying the value 0xc991fa98?

Regarding returning immediately: if the value in fps_mpct_paddr were 0x0, then yes; that can be considered a new point of discussion for the implementation. The BIOS is doing something, I don't know what; there has to be a reason for that value.

Actions #12

Updated by Ivan Stanojevic over 8 years ago

Some additional tests:

[0]> 0x5f94dd8::print -ta "struct mps_ct_hdr"

o: 5f94dd8: struct mps_ct_hdr {
o: 5f94dd8 uint32_t ct_sig = 0x504d4350
o: 5f94ddc uint16_t ct_len = 0x1374
o: 5f94dde uchar_t ct_spec_rev = 0x83
o: 5f94ddf uchar_t ct_cksum = 0xec
o: 5f94de0 char [8] ct_oem_id = [ '\b', 'j', ',', 'V', '\350', '\374', '\377', '\377' ]
o: 5f94de8 char [12] ct_prod_id = [ '\377', '\203', '\304', '\020', '\351', '\327', '\0', '\0', '\0', '\213', 'F', '\004' ]
o: 5f94df4 uint32_t ct_oem_ptr = 0xfe44589
o: 5f94df8 uint16_t ct_oem_tbl_len = 0x45b7
o: 5f94dfa uint16_t ct_entry_cnt = 0x89e4
o: 5f94dfc uint32_t ct_local_apic = 0xec83e045
o: 5f94e00 uint16_t ct_ext_tbl_len = 0x560c
o: 5f94e02 uchar_t ct_ext_cksum = 0xe8
o: }

The output at other addresses is similar.

[0]> 0/L 0x50434d50

<no output>

Actions #13

Updated by Aakash Saini over 8 years ago

Some additional tests:
[0]> 0x5f94dd8::print -ta "struct mps_ct_hdr"
o: 5f94dd8: struct mps_ct_hdr {
o: 5f94dd8 uint32_t ct_sig = 0x504d4350
o: 5f94ddc uint16_t ct_len = 0x1374
o: 5f94dde uchar_t ct_spec_rev = 0x83
o: 5f94ddf uchar_t ct_cksum = 0xec
o: 5f94de0 char [8] ct_oem_id = [ '\b', 'j', ',', 'V', '\350', '\374', '\377', '\377' ]
o: 5f94de8 char [12] ct_prod_id = [ '\377', '\203', '\304', '\020', '\351', '\327', '\0', '\0', '\0', '\213', 'F', '\004' ]
o: 5f94df4 uint32_t ct_oem_ptr = 0xfe44589
o: 5f94df8 uint16_t ct_oem_tbl_len = 0x45b7
o: 5f94dfa uint16_t ct_entry_cnt = 0x89e4
o: 5f94dfc uint32_t ct_local_apic = 0xec83e045
o: 5f94e00 uint16_t ct_ext_tbl_len = 0x560c
o: 5f94e02 uchar_t ct_ext_cksum = 0xe8
o: }

It's all garbage values.

The output at other addresses is similar.
[0]> 0/L 0x50434d50
<no output>

Try this option; I don't know how your system/BIOS will behave.
::bp -n 1 pci_autoconfig`find_bus_res
:c
pci_autoconfig`tbl_init/X
tbl_init 0
pci_autoconfig`tbl_init/W 0x1
:c
...

Actions #14

Updated by Ivan Stanojevic over 8 years ago

There is some progress.

[0]> ::bp pci_autoconfig`find_bus_res
[0]> :c
[0]> pci_autoconfig`tbl_init/X

o: pci_autoconfig`tbl_init: 0

[0]> pci_autoconfig`tbl_init/W 1
[0]> :c
[0]> :c
[0]> :c
[0]> :c

Now the system boots successfully from the live DVD. (Additional :c's are necessary because it stops several times at pci_autoconfig`find_bus_res()).

I managed to install OpenIndiana, but when I want to boot it from the hard drive I must repeat the former procedure, otherwise it keeps rebooting. It is more difficult to type the debugger commands in that case, since there is no visual feedback (no console appears) and I must be very careful to type them correctly.

I've noticed another strange phenomenon. The machine has both an integrated graphics card and a PCI Express graphics card. When I set the video output to the integrated card in the BIOS, the kernel boots in the 32-bit mode and I see 32-bit instructions in the kernel debugger. When the output is set to the PCI Express card, the kernel boots in the 64-bit mode and the kernel debugger shows 64-bit instructions. In both cases I must set pci_autoconfig`tbl_init to 1 so that the system could complete the boot process.

For some reason (probably driver related) there is no graphics output when the integrated card is used.

I've attached outputs of acpidump for different active graphics cards, just in case that might be relevant to the problem.

Actions #15

Updated by Aakash Saini over 8 years ago

There is some progress.
[0]> ::bp pci_autoconfig`find_bus_res
[0]> :c
[0]> pci_autoconfig`tbl_init/X

o: pci_autoconfig`tbl_init: 0

[0]> pci_autoconfig`tbl_init/W 1
[0]> :c
[0]> :c
[0]> :c
[0]> :c

Now the system boots successfully from the live DVD. (Additional :c's are necessary because it stops several times at pci_autoconfig`find_bus_res()).

I managed to install OpenIndiana, but when I want to boot it from the hard drive I must repeat the former procedure, otherwise it keeps rebooting. It is more difficult to type the debugger commands in that case, since there is no visual feedback (no console appears) and I must be very careful to type them correctly.

Oh, that's great! You can simply add this to /etc/system, once and for all:
set pci_autoconfig:tbl_init=0x1

I've noticed another strange phenomenon. The machine has both an integrated graphics card and a PCI Express graphics card. When I set the video output to the integrated card in the BIOS, the kernel boots in the 32-bit mode and I see 32-bit instructions in the kernel debugger. When the output is set to the PCI Express card, the kernel boots in the 64-bit mode and the kernel debugger shows 64-bit instructions. In both cases I must set pci_autoconfig`tbl_init to 1 so that the system could complete the boot process.

again! i don't know about fujitsu x86 h/w specs, maybe instruction sets to 64bit.

For some reason (probably driver related) there is no graphics output when the integrated card is used.

could be..

yup! may be driver issue.

I've attached outputs of acpidump for different active graphics cards, just in case that might be relevant to the problem.

I will have a look at them, but I still feel there is some issue with the BIOS, because the address 0xc991fa98 is not mapped. You need to talk to the Fujitsu people and file a bug with them regarding PCMP.

Actions #16

Updated by Aakash Saini over 8 years ago

(file 4) FACP-1 shows an invalid checksum. You need to talk to Fujitsu about this ACPI bug.
The bug applies to both 32-bit and 64-bit operating modes.

When the system boots, it checks these tables (FACP and RSDT/XSDT) to verify that the system is ACPI compliant. If the entries don't exist, it switches to legacy mode. But in your case, FACP is corrupted (invalid checksum). I suppose this is what leads to the invalid (unmapped) PCMP location; there could be another reason too.

Furthermore, all the addresses of the ACPI tables (files 2/3: RSDT/XSDT) are close to 0xC92..., not 0xC99...
There is definitely some problem with your BIOS. It's a buggy BIOS.

Actions #17

Updated by Ivan Stanojevic over 8 years ago

I confirm that when the line

set pci_autoconfig:tbl_init = 0x1

is added to /etc/system, OpenIndiana boots normally without the need to do anything in the kernel debugger. Thank you for the tip. So far I haven't noticed any hardware configuration problems.

I still think that an additional sanity check should be implemented in mps_probe(), before line 429. If ctp points to unmapped memory, the function should return immediately. That would prevent repeated reboots for any BIOS which might have this problem.

With this workaround, an illumos based OS can be installed, but it's a rather difficult procedure, even for more experienced users. It requires entering the kernel debugger at least twice and setting pci_autoconfig`tbl_init to 1. I'm not sure whether skipping other parts of the initialization process in find_bus_res() that this setting implies can have any effect on the proper initialization of all hardware on the machine.
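
For readers unfamiliar with the pattern, the effect of the workaround can be pictured as follows. This assumes that tbl_init is a run-once flag checked on entry to find_bus_res(), which is consistent with the behaviour reported in this thread but is not quoted from pci_resource.c. All names other than tbl_init are illustrative stand-ins.

```c
#include <assert.h>

static int tbl_init = 0;	/* 0: firmware tables not probed yet */
static int probes_run = 0;	/* counts probe passes, for illustration */

/* Stand-in for the firmware table probes (MP tables and others). */
static void
probe_firmware_tables(void)
{
	probes_run++;
}

/* Stand-in for find_bus_res(): probe once, then serve lookups. */
static void
lookup_bus_resources(void)
{
	if (tbl_init == 0) {
		tbl_init = 1;
		probe_firmware_tables();
	}
	assert(probes_run <= 1);	/* the guard lets probes run at most once */
	/* ... the per-bus resource lookup would follow here ... */
}
```

Under that assumption, setting tbl_init to 1 beforehand, whether from kmdb or /etc/system, makes the first call skip everything inside the guarded block, not only the probe that faults. That is the concern raised in the paragraph above.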

Actions #18

Updated by Aakash Saini over 8 years ago

I confirm that when the line
set pci_autoconfig:tbl_init = 0x1

is added to /etc/system, OpenIndiana boots normally without the need to do anything in the kernel debugger. Thank you for the tip. So far I haven't noticed any hardware configuration problems.

I still think that an additional sanity check should be implemented in mps_probe(), before line 429. If ctp points to unmapped memory, the function should return immediately. That would prevent repeated reboots for any BIOS which might have this problem.

This additional check is absolutely not needed.

Let me term this bad trap "0xDEAFCABB" (deaf-cabby): a pointer to an unmapped address (like a null pointer), i.e. a dereferencing issue. You ask a deaf cab driver (the BIOS) to take you to San Francisco (_MP_ -> PCMP), but he takes you to some other location that is not on the map (unmapped memory). That is what the BIOS is doing: the PCMP data structure is not mapped/initialized at that location.

Because it's not a kernel memory allocation, I'm not calling this bad trap 0xbaddcafe; it is rather a h/w fault, even though it is a BIOS issue (BIOS issues belong to the h/w fault category, not the OS).

If this is incorrect: ctp = (struct mps_ct_hdr *)(uintptr_t)fpp->fps_mpct_paddr; then how would an additional check make it correct? It's a bad trap on the h/w end (0xDEAFCABB, unmapped location), because the value (fpp->fps_mpct_paddr = 0xc991fa98) is defined by the BIOS, and it can only be corrected if the BIOS corrects it.

If we bypass this whole process, then the whole multi-bus system design needs correction. The Solaris design is based on multi-bus systems. If a similar problem occurs on a multi-bus system (same BIOS 0xdeafcabb issue), the system will be in chaos, unable to select the parent bus. This would require multiple changes, solely for desktop considerations.

With this workaround, an illumos based OS can be installed, but it's a rather difficult procedure, even for more experienced users. It requires entering the kernel debugger at least twice and setting pci_autoconfig`tbl_init to 1.

Well, it's a bug in the BIOS; you cannot deny that. If Microsoft's OS boots without correction (or with predefined values within the OS), it doesn't mean we should do the same. I'm not sure whether their server edition will boot the same way!

I'm not sure whether skipping other parts of the initialization process in find_bus_res() that this setting implies can have any effect on the proper initialization of all hardware on the machine.

Well, it does have effects if you are on a server board with multiple buses. Since your system is a desktop, it should not bother you much, as you have just a single system bus, i.e. \_SB.PCI0 (check the attachment DSDT.dsl). It will be a problem on a multi-bus system, where finding/assigning the parent bus will be an issue.

Actions #19

Updated by Aakash Saini over 8 years ago

Correction: I just realized that "multibus" might confuse things; what I mean is multiple PCI buses.
This also relates to the PCI bus hierarchy.

Actions #20

Updated by Ivan Stanojevic over 8 years ago

I agree with you that this is a BIOS problem.

On the other hand 64 bit versions of Ubuntu 12.10, FreeBSD 9.1 and Windows Server 2008 install and boot on this machine out of the box.

So there are two options now:

1. The code in mps_probe() can stay as it is and effectively prevent installation of illumos based OS's on machines such as this one, whereas other OS's install without problems.

2. Just a few lines of code can be added before line 429 in pci_resource.c, where it could be tested whether ctp points to mapped memory, and if not mps_probe() could return. I'm not familiar with kernel memory organization and API, but I suppose this test can be done programmatically (since the kernel debugger can detect unmapped memory) at least by reading page tables. mps_probe() checks the structure pointed to by ctp anyway and considers it invalid and returns if the signature or the checksum of that structure are incorrect, so ctp pointing to unmapped memory could also indicate an invalid structure.

This machine is probably not the only one with this problem.

Actions #21

Updated by Aakash Saini over 8 years ago

Will update review this weekend.
