Skiboot Versions

OPAL boot and runtime firmware for POWER

v7.1

7 months ago

v6.2

5 years ago

v6.0.3

5 years ago

skiboot-6.0.3

skiboot 6.0.3 was released on Wednesday May 23rd, 2018. It replaces :ref:skiboot-6.0.2 as the current stable release in the 6.0.x series.

It is recommended that 6.0.3 be used instead of any previous 6.0.x version.

Over :ref:skiboot-6.0.2, we have bug fixes related to i2c booting in secure mode, and general functionality with a TPM present. These changes are:

p8-i2c: Remove force reset

Force reset was added as an attempt to work around some issues with TPM devices locking up their I2C bus. In that particular case the problem was that the device would hold the SCL line down permanently due to a device firmware bug. The force reset doesn't actually do anything to alleviate the situation here, it just happens to reset the internal master state enough to make the I2C driver appear to work until something tries to access the bus again.

On P9 systems with secure boot enabled there is the added problem of the "diagnostic mode" not being supported on I2C masters A, B, C and D. Diagnostic mode allows the SCL and SDA lines to be driven directly by software. Without this, force reset is impossible to implement.

This patch removes the force reset functionality entirely since:

a) it doesn't do what it's supposed to, and b) it's butt ugly code

Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port(). There's no need to reset every port on a master in response to an error that occurred on a specific port.
libstb/i2c-driver: Bump max timeout

We have observed some TPMs clock stretching the I2C bus for significant amounts of time when processing commands. The same TPMs also have errata that can result in permanently locking up a bus in response to an I2C transaction they don't understand. Use an excessively long timeout to prevent this in the field.
Add TPM timeout workaround

Set the default timeout for any bus containing a TPM to one second. This is needed to work around a bug in the firmware of certain TPMs that will clock stretch the I2C port for up to a second. Additionally, when the TPM is clock stretching it responds to a STOP condition on the bus by bricking itself. Clearing this error requires a hard power cycle of the system since the TPM is powered by standby power.

v6.0.2

5 years ago

skiboot-6.0.2

skiboot 6.0.2 was released on Friday May 18th, 2018. It replaces :ref:skiboot-6.0.1 as the current stable release in the 6.0.x series.

It is recommended that 6.0.2 be used instead of any previous 6.0.x version.

Over :ref:skiboot-6.0.1, we have one bug fix:

cpu: Clear PCR SPR in opal_reinit_cpus()

Currently, if Linux boots with a non-zero PCR, things can go bad: some early userspace programs can take illegal instruction exceptions. This is being fixed in Linux, but in the meantime, we should clean up in skiboot also.

This could exhibit itself as petitboot getting killed with SIGILL and no boot devices showing up, but only in a situation where you've done a kdump from a kernel running a p8 compat guest.

v6.0.1

5 years ago

skiboot-6.0.1

skiboot 6.0.1 was released on Wednesday May 16th, 2018. It replaces :ref:skiboot-6.0 as the current stable release in the 6.0.x series.

It is recommended that 6.0.1 be used instead of any previous 6.0.x version due to the bug fixes and debugging enhancements in it.

Over :ref:skiboot-6.0, we have two bug fixes:

OpenBMC: use 0x3a as OEM command for partial add esel.

This fixes the bug where skiboot would never send an eSEL to the BMC.
Add location code to NPU2 HMI logging

The current HMI error message does not specify where the HMI error occurred.

The original error message was ::

NPU: FIR#0 FIR 0x0080100000000000 mask 0x009a48180f01ffff

The enhanced error message is ::

NPU2: [Loc: UOPWR.0000000-Node0-Proc0] P:0 FIR#0 FIR 0x0000100000000000 mask 0x009a48180f03ffff

v6.0

5 years ago

skiboot-6.0

skiboot v6.0 was released on Friday May 11th 2018. It is the first release of skiboot 6.0, which is the new stable release of skiboot following the 5.11 release, first released April 6th 2018.

Skiboot 6.0 is the basis for op-build v2.0 and is required for POWER9 systems.

skiboot v6.0 contains all bug fixes as of :ref:skiboot-5.11, :ref:skiboot-5.10.5, and :ref:skiboot-5.4.9 (the currently maintained stable releases). We do not expect any further stable releases in the 5.10.x series, nor in the 5.11.x series.

For how the skiboot stable releases work, see :ref:stable-rules.

Over skiboot-5.11, we have the following changes:

New Features

Since 6.0-rc1:

Update default stop-state-disable mask to cut only stop11

Stability improvements in microcode for stop4/stop5 are available in upstream hcode images. Stop4 and stop5 can be safely enabled by default.

Use ~0xE0000000 to cut all but stop0,1,2 in case there are any issues with stop4/5.

example: ::

nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0x1FFFFFFF

Note that DD2.1 chips with a frequency below 1867MHz may need to run an hcode image different from the default in op-build (set BR2_HCODE_LATEST_VERSION=y in your config).
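
The mask values quoted in these notes can be checked with a little arithmetic. The sketch below assumes, as the "~0xE0000000 = all but stop0,1,2" wording implies, that the mask is 32 bits wide and that bit n counted from the most significant bit stands for stop state n. The function names are illustrative and are not part of skiboot.

```python
MASK_BITS = 32  # assumption: the masks quoted in the text are 32-bit values


def stop_state_bit(level: int) -> int:
    """Mask bit for stop state `level`, counting from the most significant bit."""
    if not 0 <= level < MASK_BITS:
        raise ValueError(f"stop level out of range: {level}")
    return 1 << (MASK_BITS - 1 - level)


def disable_mask(keep_enabled) -> int:
    """Build a mask that disables every stop state except those listed."""
    keep = 0
    for level in keep_enabled:
        keep |= stop_state_bit(level)
    return ~keep & 0xFFFFFFFF


def enabled_states(mask: int) -> list:
    """List the stop states that a given disable mask leaves enabled."""
    return [lvl for lvl in range(MASK_BITS) if not mask & stop_state_bit(lvl)]


# Keeping only stop0, stop1 and stop2 reproduces the value in the example.
print(hex(disable_mask([0, 1, 2])))  # 0x1fffffff
```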
ibm,firmware-versions: add hcode to device tree

op-build commit 736a08b996e292a449c4996edb264011dfe56a40 added hcode to the VERSION partition, let's parse it out and let the user know.
ipmi: Add BMC firmware version to device tree

The BMC Get Device ID command gives BMC firmware version details. Let's add this to the device tree. User space tools will use this information to display BMC version details.

Since 5.11:

Disable stop states from OPAL

On ZZ, stop4,5,11 are enabled for PowerVM, even though doing so may cause problems with OPAL due to bugs in hcode.

For other platforms, this isn't so much of an issue as we can just control stop states by the MRW. However the rebuild-the-world approach to changing values there is a bit annoying if you just want to rule out a specific stop state from being problematic.

Provide an nvram option to override what's disabled in OPAL.

The OPAL mask is currently ~0xE0000000 (i.e. all but stop 0,1,2)

You can set an NVRAM override with: ::
```
nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0xFFFFFFFF
```
This nvram override will disable all stop states.
interrupts: Create an "interrupts" property in the OPAL node

Deprecate the old "opal-interrupts". It's still there, but the new property follows the standard and allows us to specify whether an interrupt is level or edge sensitive.

Similarly create "interrupt-names" whose content is identical to "opal-interrupts-names".
SBE: Add timer support on POWER9

SBE on P9 provides a one-shot programmable timer facility. We can use this to implement OPAL timers and hence limit the reliance on the Linux heartbeat (similar to the HW timer facility provided by SLW on P8).
Add SBE driver support

SBE (Self Boot Engine) on P9 has two different jobs:
- Boot the chip up to the point the core is functional
- Provide various services like timer, scom, stash MPIPL, etc., at runtime
We will use SBE for various purposes like timer, MPIPL, etc.
opal:hmi: Add missing processor recovery reason string.

With this patch now we see reason string printed for CORE_WOF[43] bit. ::

[ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred.
[ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error:
[ 477.352242181,7] HMI: PC - Thread hang recovery
Add DIMM actual speed to device tree

Recent HDAT provides the DIMM actual speed. Let's add this to the device tree.
Fix DIMM size property

Today we parse the VPD blob to get DIMM size information. This is limited to FSP based systems. HDAT provides the DIMM size value, so let's use that to populate the device tree. That way we can get size information on BMC based systems as well.
PCI: Set slot power limit when supported

The PCIe slot capability can be implemented in a root or switch downstream port to set the maximum power a card is allowed to draw from the system. This patch adds support for setting the power limit when the platform has defined one.
hdata/spira: parse vpd to add part-number and serial-number to xscom@ node

Expected by FWTS and associates our processor with the part/serial number, which is obviously a good thing for one's own sanity.

Improved HMI Handling

opal/hmi: Add documentation for opal_handle_hmi2 call
opal/hmi: Generate hmi event for recovered HDEC parity error.
opal/hmi: check thread 0 tfmr to validate latched tfmr errors.

Due to P9 errata, HDEC parity and TB residue errors are latched for non-zero threads 1-3 even if they are cleared. But these are not latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr value and ignore them on non-zero threads if they are not present on thread 0.
opal/hmi: Print additional debug information in rendezvous.
opal/hmi: Fix handling of TFMR parity/corrupt error.

While testing TFMR parity/corrupt errors, it has been observed that HMIs are delivered twice for this error:
- First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
- Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB.
On second HMI we end up throwing "HMI: TB invalid without core error reported" even though TB is in a valid state.
opal/hmi: Stop flooding HMI event for TOD errors.

Fix the issue where every thread on the chip sends an HMI event to the host for TOD errors. TOD errors are reported to all the cores/threads on the chip. Any one thread can fix the error and send the event. The rest of the threads don't need to send an HMI event.
opal/hmi: Fix soft lockups during TOD errors

There are some TOD errors which do not affect the working of the TOD and TB; they stay in a valid state. Hence we don't need a rendez-vous for TOD errors that do not affect the working of the TB.

TOD errors that affect the TOD/TB will report a global error on TFMR[44] along with bit 51, and they will go down the rendez-vous path as expected.

But TOD errors that do not affect the TB register set only TFMR bit 51. TFMR bit 51 is cleared when any single thread clears the TOD error. Once cleared, bit 51 is reflected to all the cores on that chip. Any thread that reads the TFMR register after the error is cleared will see TFMR bit 51 reset. Hence the threads that see TFMR[51]=1 fall through to the rendez-vous path, and the threads that see TFMR[51]=0 return doing nothing. This ends up causing soft lockups in the host kernel.

This patch fixes this issue by not considering the TOD interrupt (TFMR[51]) as a core-global error, hence avoiding the rendez-vous path completely. Instead, threads that see TFMR[51]=1 will now take a different path that just does the TOD error recovery.
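
The path selection described above can be sketched as a small decision function. This is a simplified illustration only: it looks at just the two TFMR bits named in the text, uses IBM bit numbering (bit 0 is the most significant bit of the 64-bit register), and the constant and function names are hypothetical rather than taken from skiboot.

```python
def ppc_bit(n: int) -> int:
    """IBM bit numbering: bit 0 is the most significant bit of a 64-bit value."""
    return 1 << (63 - n)


# Hypothetical names for the two TFMR bits mentioned in the text.
TFMR_GLOBAL_ERROR = ppc_bit(44)   # TOD error that also affects the TB
TFMR_TOD_INTERRUPT = ppc_bit(51)  # TOD error, possibly harmless to the TB


def tod_recovery_path(tfmr: int) -> str:
    """Pick the recovery path for one thread from its view of TFMR."""
    if tfmr & TFMR_GLOBAL_ERROR:
        # Core-global error: every thread must synchronise.
        return "rendez-vous"
    if tfmr & TFMR_TOD_INTERRUPT:
        # Not core-global: recover locally, no waiting on sibling threads.
        return "tod-recovery-only"
    # Another thread already cleared the error.
    return "nothing-to-do"


# Two threads with different views of bit 51 no longer disagree about
# whether to enter the rendez-vous, so neither waits on the other.
print(tod_recovery_path(TFMR_TOD_INTERRUPT))  # tod-recovery-only
print(tod_recovery_path(0))                   # nothing-to-do
```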
opal/hmi: Do not send HMI event if no errors are found.

For TOD errors, all the cores in the chip get HMIs. Any one thread from any core can fix the issue, and the TFMR will have its error conditions cleared. The rest of the threads need not take any action if the TOD errors are already cleared. Hence thread 0 of every core should get a fresh copy of the TFMR before going down the recovery path. Initialize recover = -1, so that if no errors are found, that thread need not send an HMI event to Linux. This helps stop every thread flooding the host with HMI events even when no errors are found.
opal/hmi: Initialize the hmi event with old value of HMER.

Do this before we check for TFAC errors. Otherwise the event at host console shows no error reported in HMER register.

Without this patch the console event shows HMER with all zeros ::

[ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
[ 216.753498] Error detail: Timer facility experienced an error
[ 216.753509] HMER: 0000000000000000
[ 216.753518] TFMR: 3c12000870e04000

After this patch it shows old HMER values on host console: ::

[ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
[ 2237.652651] Error detail: Timer facility experienced an error
[ 2237.652766] HMER: 0840000000000000
[ 2237.652837] TFMR: 3c12000870e04000
opal/hmi: Rework HMI handling of TFAC errors

This patch reworks the HMI handling for TFAC errors by introducing 4 rendez-vous points to improve thread synchronization while handling timebase errors that require all threads to clear dirty data from the TB/HDEC registers before clearing the errors.
opal/hmi: Don't bother passing HMER to pre-recovery cleanup

The test for TFAC error is now redundant so we remove it and remove the HMER argument.
opal/hmi: Move timer related error handling to a separate function

Currently no functional change. This is a first step to completely rewriting how these things are handled.
opal/hmi: Add a new opal_handle_hmi2 that returns direct info to Linux

It returns a 64-bit flags mask currently set to provide info about which timer facilities were lost, and whether an event was generated.
opal/hmi: Remove races in clearing HMER

Writing to HMER acts as an "AND". The current code writes back the value we originally read with the bits we handled cleared. This is racy: if a new bit gets set in HW after the original read, we'll end up clearing it without handling it.

Instead, use an all 1's mask with only the bit handled cleared.
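
The race is easy to reproduce with a toy model. The sketch below simulates a register whose writes are ANDed into its contents and compares the two ways of clearing a handled bit. The class and function names are illustrative, and the bit numbers are arbitrary examples in IBM bit numbering (bit 0 is the most significant bit).

```python
MASK64 = (1 << 64) - 1


def ppc_bit(n: int) -> int:
    """IBM bit numbering: bit 0 is the most significant bit of a 64-bit value."""
    return 1 << (63 - n)


class AndOnWriteRegister:
    """Toy register: a write ANDs the written value into the current contents."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def read(self) -> int:
        return self.value

    def write(self, value: int) -> None:
        self.value &= value & MASK64

    def hw_set(self, bits: int) -> None:
        """Simulate hardware raising new bits behind the software's back."""
        self.value |= bits


def clear_racy(reg: AndOnWriteRegister, snapshot: int, handled: int) -> None:
    # Old approach: write back the stale snapshot minus the handled bits.
    reg.write(snapshot & ~handled)


def clear_safe(reg: AndOnWriteRegister, handled: int) -> None:
    # New approach: all ones except the handled bits.
    reg.write(~handled & MASK64)


def run(use_racy_clear: bool) -> int:
    handled, late = ppc_bit(4), ppc_bit(17)
    reg = AndOnWriteRegister(handled)
    snapshot = reg.read()
    reg.hw_set(late)  # a new error arrives after the read
    if use_racy_clear:
        clear_racy(reg, snapshot, handled)
    else:
        clear_safe(reg, handled)
    return reg.read()


print(hex(run(use_racy_clear=True)))   # 0x0: the late bit was lost
print(hex(run(use_racy_clear=False)))  # 0x400000000000: the late bit survives
```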
opal/hmi: Don't re-read HMER multiple times

We want to make sure all reporting and actions are based upon the same snapshot of HMER in case bits get added by HW while we are in OPAL.

libflash and ffspart

Many improvements to the ffspart utility and libflash have come in this release, making ffspart suitable for building PNOR images that are bit-identical to those produced by the existing tooling used by op-build. The plan is to switch op-build to use this infrastructure in the not too distant future.

libflash/blocklevel: Make read/write be ECC agnostic for callers

The blocklevel abstraction allows for regions of the backing store to be marked as ECC protected so that blocklevel can decode/encode the ECC bytes into the buffer automatically without the caller having to be ECC aware.

Unfortunately this abstraction is far from perfect: it is only useful if reads and writes are performed at the start of the ECC region, or in some circumstances at an ECC aligned position, which requires the caller to be aware of the ECC regions.

The problem that has arisen is that the blocklevel abstraction is initialised somewhere, but when it is later called the caller is unaware if ECC exists in the region it wants to arbitrarily read and write to. This should not have been a problem since blocklevel knows. Currently misaligned reads will fail ECC checks and misaligned writes will overwrite ECC bytes and the backing store will become corrupted.

This patch adds the smarts to blocklevel_read() and blocklevel_write() to cope with the problem. Note that ECC can always be bypassed by calling blocklevel_raw_() functions.

All this work means that the gard tool can safely call blocklevel_read() and blocklevel_write(), and as long as the blocklevel knows of the presence of ECC it will deal with all cases.

This commit also removes code in the gard tool which compensated for inadequacies no longer present in blocklevel.
libflash/blocklevel: Return region start from ecc_protected()

Currently all ecc_protected() does is say if a region is ECC protected or not. Knowing a region is ECC protected is one thing but there isn't much that can be done afterwards if this is the only known fact. A lot more can be done if the caller is told where the ECC region begins.

Knowing where the ECC region starts allows the caller to align its reads and writes. This allows for more flexibility in calling read and write without knowing exactly how the backing store is organised.
libflash/ecc: Add helpers to align a position within an ecc buffer

As part of ongoing work to make ECC invisible to higher levels up the stack this function converts a 'position' which should be ECC agnostic to the equivalent position within an ECC region starting at a specified location.
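
To make the idea of converting an ECC-agnostic position concrete, here is a minimal sketch. It assumes a layout in which every 8 data bytes are followed by 1 ECC byte; that layout is not stated in these notes, and the function names are hypothetical rather than the libflash helpers themselves.

```python
DATA_BYTES = 8                 # assumption: data bytes per ECC-protected block
BLOCK_BYTES = DATA_BYTES + 1   # assumption: each block carries one ECC byte


def raw_position(region_start: int, data_offset: int) -> int:
    """Map a data offset (ECC bytes not counted) to its offset in the backing store."""
    blocks, within = divmod(data_offset, DATA_BYTES)
    return region_start + blocks * BLOCK_BYTES + within


def aligned_block_start(region_start: int, data_offset: int) -> int:
    """Backing-store offset of the start of the block holding data_offset.

    A read has to begin here for the ECC byte of that block to be checkable.
    """
    return region_start + (data_offset // DATA_BYTES) * BLOCK_BYTES


# Data byte 10 of a region starting at 0x1000 sits in the second block,
# one ECC byte further along than a naive start + offset would suggest.
print(hex(raw_position(0x1000, 10)))         # 0x100b
print(hex(aligned_block_start(0x1000, 10)))  # 0x1009
```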
libflash/ecc: Add functions to deal with unaligned ECC memcpy
external/ffspart: Improve error output
libffs: Fix bad checks for partition overlap

Not all TOCs are written at zero
libflash/libffs: Allow caller to specifiy header partition

An FFS TOC is comprised of two parts. A small header which has a magic and very minimal information about the TOC which will be common to all partitions, things like number of partitions, block sizes and the like. Following this small header are a series of entries. Importantly there is always an entry which encompasses the TOC itself; this is usually called the 'part' partition.

Currently libffs always assumes that the 'part' partition is at zero. While there is always a TOC at zero, there doesn't actually have to be. PNORs may have multiple TOCs within them, therefore libffs needs to be flexible enough to allow callers to specify TOCs not at zero.

The 'part' partition is otherwise a regular partition which may have flags associated with it. libffs should allow the user to set the flags for the 'part' partition.

This patch achieves both by allowing the caller to specify the 'part' partition. The caller can choose not to, and libffs will provide a sensible default.
libflash/libffs: Refcount ffs entries

Currently consumers can add a new ffs entry to multiple headers. This is fine, but freeing any of the headers will cause the entry to be freed, which causes double free problems.

Even if only one header is used, the consumer of the library still has a reference to the entry, which they may well reuse at some other point.

libffs will now refcount entries and only free when there are no more references.

This patch also removes the pointless return value of ffs_hdr_free()
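
The reference counting scheme can be illustrated with a short model. This is a sketch of the general technique only; the class and function names are hypothetical and do not mirror the libffs API. The creator of an entry holds one reference, each header that stores the entry takes another, and the entry is released only when the last reference is dropped.

```python
class Entry:
    """A partition entry that may be shared between several headers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.refs = 1        # the creator holds the first reference
        self.freed = False


def entry_get(entry: Entry) -> Entry:
    """Take an additional reference."""
    entry.refs += 1
    return entry


def entry_put(entry: Entry) -> None:
    """Drop a reference, releasing the entry when none remain."""
    entry.refs -= 1
    if entry.refs == 0:
        entry.freed = True   # stands in for the real free()


class Header:
    """A table of contents holding references to its entries."""

    def __init__(self) -> None:
        self.entries = []

    def add(self, entry: Entry) -> None:
        self.entries.append(entry_get(entry))

    def free(self) -> None:
        for entry in self.entries:
            entry_put(entry)
        self.entries = []


shared = Entry("PAYLOAD")
first, second = Header(), Header()
first.add(shared)
second.add(shared)

first.free()
print(shared.freed)   # False: second header and the creator still hold it
second.free()
entry_put(shared)     # the creator drops its own reference last
print(shared.freed)   # True
```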
libflash/libffs: Switch to storing header entries in an array

Since libffs no longer needs to sort the entries as they get added, it makes little sense to have the complexity of a linked list when an array will suffice.
libflash/libffs: Remove backup partition from TOC generation code

It turns out this code was messy and not all that reliable. Doing it at the library level adds complexity to the library and restrictions to the caller.

A simpler approach can be achieved by just instantiating multiple ffs_header structures pointing to different parts of the same file.
libflash/libffs: Remove the 'sides' from the FFS TOC generation code

It turns out this code was messy and not all that reliable. Doing it at the library level adds complexity to the library and restrictions to the caller.

A simpler approach can be achieved by just instantiating multiple ffs_header structures pointing to different parts of the same file.
libflash/libffs: Always add entries to the end of the TOC

It turns out that sorted order isn't the best idea. This removes flexibility from the caller. If the user wants their partitions in sorted order, they should insert them in sorted order.
external/ffspart: Remove side, order and backup options

These options are currently flaky in libflash/libffs so there isn't much point in being able to use them in ffspart.

Future reworks planned for libflash/libffs will render these options redundant anyway.
libflash/libffs: ffs_close() should use ffs_hdr_free()
libflash/libffs: Add setter for a partitions actual size
pflash: Use ffs_entry_user_to_string() to standardise flag strings
libffs: Standardise ffs partition flags

It seems we've developed a character representation for ffs partition flags. Currently only pflash really prints them so it hasn't been a problem, but now ffspart wants to read them in from user input.

It is important that what libffs reads and what pflash prints remain consistent, so we should move the code into libffs to avoid problems.
external/ffspart: Allow # comments in input file

p9dsu Platform changes

The p9dsu platform from SuperMicro (also known as 'Boston') has received a number of updates, and the patches once carried by SuperMicro are now upstream.

Since 6.0-rc1:

p9dsu: timeout for variant detection, default to 2uess

Since 5.11:

p9dsu: detect p9dsu variant even when hostboot doesn't tell us

The SuperMicro BMC can tell us what riser type we have, which dictates the PCI slot tables. Usually, in an environment that a customer would experience, Hostboot will do the query with an SMC specific patch (not upstream as there's no platform specific code in hostboot) and skiboot knows what variant it is based on the compatible string.

However, if you're using upstream hostboot, you only get the bare 'p9dsu' compatible type. We can work around this by asking the BMC ourselves and setting the slot table appropriately. We do this synchronously in platform init so that we don't start probing PCI before we set up the slot table.
p9dsu: add slot power limit.
p9dsu: add pci slot table for Boston LC 1U/2U and Boston LA/ESS.
p9dsu HACK: fix system-vpd eeprom
p9dsu: change esel command from AMI to IBM 0x3a.

ZZ Platform Changes

hdata/i2c: Fix up pci hotplug labels

These labels are used on the devices used to do PCIe slot power control for implementing PCIe hotplug. I'm not sure how they ended up as "eeprom-pgood" and "eeprom-controller" since that doesn't make any sense.
hdata/i2c: Ignore multi-port I2C devices

Recent FSP firmware builds add support for multi-port I2C devices such as the GPIO expanders used for the presence detect of OpenCAPI devices and the PCIe hotplug controllers used to power cycle PCIe slots on ZZ.

The OpenCAPI driver inside of skiboot currently uses a platform-specific method to talk to the relevant I2C device rather than relying on HDAT since not all platforms correctly report the I2C devices (hello Zaius). Additionally the nature of multi-port devices requires that we have a device-specific handler so that we generate the correct DT bindings. Currently we don't, and there is no immediate need for this support, so just ignore the multi-port devices for now.
hdata/i2c: Replace i2c_ prefix with dev_

The current naming scheme makes it easy to conflate "i2cm_port" and "i2c_port." The latter is used to describe multi-port I2C devices such as GPIO expanders and multi-channel PCIe hotplug controllers. Rename i2c_port to dev_port to make the two a bit more distinct.

Also rename i2c_addr to dev_addr for consistency.
hdata/i2c: Ignore CFAM I2C master

Recent FSP firmware builds put in information about the CFAM I2C master in addition to the host I2C masters accessible via XSCOM. Odds are this information should not be there since there's no handshaking between the FSP/BMC and the host over who controls that I2C master, but it is, so we need to deal with it.

This patch adds filtering to the HDAT parser so it ignores the CFAM I2C master. Without this it will create a bogus i2cm@ which might cause issues.
ZZ: hw/imc: Add support to load imc catalog lid file

Add support to load the imc catalog from a lid file packaged as part of the system firmware. Lid number allocated is 0x80f00103.lid.

Bugs Fixed

Since 6.0-rc2:

core/opal: Fix recursion check in opal_run_pollers()

An earlier commit introduced a counter variable poller_recursion to limit the number of error messages shown when opal_pollers are run recursively. However, the check for the counter value was placed in a way that the poller recursion was only detected the first 16 times and then allowed afterwards.

This patch fixes this by moving the check for the counter value inside the conditional branch, with some refactoring, so that opal_poller recursion is not erroneously allowed after poller_recursion has been detected the first 16 times.
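
The corrected control flow can be sketched as follows. This is an illustrative model, not the skiboot code: the class name, the log text and the way pollers are invoked are all assumptions. The point is that the warning counter only limits logging, while the recursive call is refused every time.

```python
MAX_WARNINGS = 16  # the text says recursion was reported the first 16 times


class PollerRunner:
    """Toy model of a poller loop that refuses to run recursively."""

    def __init__(self) -> None:
        self.in_poller = False
        self.recursion_warnings = 0
        self.warnings = []

    def run(self, poller) -> bool:
        if self.in_poller:
            # The counter check sits inside the recursion branch, so it
            # only rate-limits the message.
            if self.recursion_warnings < MAX_WARNINGS:
                self.warnings.append("poller recursion detected")
                self.recursion_warnings += 1
            return False  # recursion is always rejected
        self.in_poller = True
        try:
            poller(self)
        finally:
            self.in_poller = False
        return True


runner = PollerRunner()
attempts = []


def misbehaving_poller(r: PollerRunner) -> None:
    # Tries to re-enter the poller loop 20 times from inside a poller.
    for _ in range(20):
        attempts.append(r.run(misbehaving_poller))


runner.run(misbehaving_poller)
print(any(attempts))          # False: every recursive attempt was refused
print(len(runner.warnings))   # 16: only the first 16 were logged
```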
phb4: Print WOF registers on fence detect

Without the WOF registers it's hard to figure out what went wrong first, so print those when we print the FIRs when a fence is detected.
p9dsu: detect variant in init only if probe fails to found.

Currently the slot table init happens twice, in both the probe and init functions, because the variant detection logic is called with an incorrect condition check.

Since 6.0-rc1:

core/direct-controls: improve p9_stop_thread error handling

p9_stop_thread should fail the operation if it finds the thread was already quiesced. This implies something else is doing direct controls on the thread (e.g., pdbg) or there is some exceptional condition we don't know how to deal with. Proceeding here would cause things to trample on each other, for example the hard lockup watchdog trying to send a sreset to the core while it is stopped for debugging with pdbg will end in tears.

If p9_stop_thread times out waiting for the thread to quiesce, do not hit it with a core_start direct control, because we don't know what state things are in and doing more things at this point is worse than doing nothing. There is no good recipe described in the workbook to de-assert the core_stop control if it fails to quiesce the thread. After timing out here, the thread may eventually quiesce and get stuck, but that's simpler to debug than undefined behaviour.
core/direct-controls: fix p9_cont_thread for stopped/inactive threads

Firstly, p9_cont_thread should check that the thread actually was quiesced before it tries to resume it. Anything could happen if we try this from an arbitrary thread state.

Then, when resuming a quiesced thread that is inactive or stopped (in a stop idle state), we must not send a core_start direct control; clear_maint must be used in these cases.

hmi: Clear unknown debug trigger

On some systems, we were seeing hangs like this when Linux starts: ::

[ 170.027252763,5] OCC: All Chip Rdy after 0 ms
[ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes)
[ 171.238270428,5] OPAL: Switch to little-endian OS

If you look at the in memory skiboot console (or do nvram -p ibm,skiboot --update-config log-level-driver=7) we see the console get spammed with: ::

[ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000

We're taking the debug trigger (bit 17) early on, before the hmi_debug_trigger function in the kernel is set up.

This clears the HMI in Skiboot and reports to the kernel instead of bringing down the machine.
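
The HMER value in the log can be decoded by hand to confirm which bit is set. The helper below is a small sketch using IBM bit numbering, where bit 0 is the most significant bit of the 64-bit register; the function name is illustrative.

```python
def ppc_bits_set(value: int, width: int = 64) -> list:
    """Return the IBM-numbered bits (bit 0 = most significant) set in value."""
    return [n for n in range(width) if value & (1 << (width - 1 - n))]


# HMER value from the console spam above.
print(ppc_bits_set(0x0000400000000000))  # [17]
```

Only bit 17 is set, which matches the debug trigger named in the text.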

core/hmi: assign flags=0 in case nothing set by handle_hmi_exception

Theoretically we could have returned junk to the OS in this parameter.
SLW: Fix mambo boot to use stop states

After commit 35c66b8ce5a2 ("SLW: Move MAMBO simulator checks to slw_init"), mambo boot no longer calls add_cpu_idle_state_properties() and as such we never enable stop states.

After adding the call back, we get more testing coverage as well as faster mambo SMT boots.
phb4: Hardware init updates

CFG Write Request Timeout was incorrectly set to informational and not fatal for both non-CAPI and CAPI, so set it to fatal. This was a mistake in the specification. Correcting this fixes a niche bug in escalation (which is necessary on pre-DD2.2) that can cause a checkstop due to a NCU timeout.

In addition, set the values in the timeout control registers to match. This fixes an extremely rare and unreproducible bug, though the current timings don't make sense since they're higher than the NCU timeout (16) which will checkstop the machine anyway.
SLW: quieten 'Configuring self-restore' for DARN,NCU_SPEC_BAR and HRMOR

Since 5.11:

core: Fix iteration condition to skip garded cpu
uart: fix uart_opal_flush to take console lock over uart_con_flush

This bug meant that OPAL_CONSOLE_FLUSH didn't take the appropriate locks. Luckily, this call is currently only used in the crash path.
xive: fix missing unlock in error path
OPAL_PCI_SET_POWER_STATE: fix locking in error paths

Otherwise we could exit OPAL holding locks, potentially leading to all sorts of problems later on.
hw/slw: Don't assert on a unknown chip

For some reason skiboot populates nodes in /cpus/ for the cores on chips that are deconfigured. As a result Linux includes the threads of those cores in its set of possible CPUs in the system and attempts to set the SPR values that should be used when waking a thread from a deep sleep state.

However, in the case where we have a deconfigured chip we don't create an xscom node for that chip, and as a result we don't have a proc_chip structure for that chip either. In turn, this results in an assertion failure when calling opal_slw_set_reg() since it expects the chip structure to exist. Fix this up and print an error instead.
opal/hmi: Generate one event per core for processor recovery.

Processor recovery is a per-core error. All threads on that core receive the HMI, but they don't all need to generate an HMI event for the same error.

Let thread 0 only generate the event.
sensors: Dont add DTS sensors when OCC inband sensors are available

There are two sets of core temperature sensors today. One is DTS scom based core temperature sensors and the second group is the sensors provided by OCC. DTS is the highest temperature among the different temperature zones in the core while OCC core temperature sensors are the average temperature of the core. DTS sensors are read directly by the host by SCOMing the DTS sensors while OCC sensors are read and updated by OCC to main memory.

Reading DTS sensors by SCOMing is a heavier and slower operation than reading OCC sensors, which is as good as reading memory. So don't add DTS sensors when OCC sensors are available.
core/fast-reboot: Increase timeout for dctl sreset to 1sec

Direct control xscom can take more time to complete. We seem to wait too little on Boston, failing fast-reboot for no good reason.

Increase timeout to 1 sec as a reasonable value for sreset to be delivered and core to start executing instructions.
occ: sensors-groups: Add DT properties to mark HWMON sensor groups

Fix the sensor type to match HWMON sensor types. Add compatible flag to indicate the environmental sensor groups so that operations on these groups can be handled by HWMON linux interface.
core: Correctly load initramfs in stb container

Skiboot does not calculate the actual size and start location of the initramfs if it is wrapped by an STB container (for example if loading an initramfs from the ROOTFS partition).

Check if the initramfs is in an STB container and determine the size and location correctly in the same manner as the kernel. Since load_initramfs() is called after load_kernel() move the call to trustedboot_exit_boot_services() into load_and_boot_kernel() so it is called after both of these.
hdat/i2c.c: quieten "v2 found, parsing as v1"
hw/imc: Check for pause_microcode_at_boot() return status

pause_microcode_at_boot() loops through every chip's ucode control block and pauses the ucode if it is in the running state. But it does not fail if any chip's ucode is not initialised.

Add code to return a failure if the ucode is not initialised on any of the chips. Since pause_microcode_at_boot() is called just before attaching the IMC device nodes in imc_init(), add code to check the function's return value.

Slot location code fixes:

npu2: Use ibm,loc-code rather than ibm,slot-label

The ibm,slot-label property is to name the slot that appears under a PCIe bridge. In the past we (ab)used the slot tables to attach names to GPU devices and their corresponding NVLinks which resulted in npu2.c using slot-label as a location code rather than as a way to name slots.

Fix this up since it's confusing.
hdata/slots: Apply slot label to the parent slot

Slot names only really make sense when applied to an actual slot rather than a device. On witherspoon the GPU devices have a name associated with the device rather than the slot for the GPUs. Add a hack that moves the slot label to the parent slot rather than on the device itself.
pci-dt-slot: Big ol' cleanup

The underlying data that we get from HDAT can only really describe a PCIe system. As such we can simplify the devicetree slot lookup code by only caring about the important cases, namely, root ports and switch downstream ports.

This also fixes a bug where root port didn't get a Slot label applied which results in devices under that port not having ibm,loc-code set. This results in the EEH core being unable to report the location of EEHed devices under that port.

opal-prd
^^^^^^^^

opal-prd: Insert powernv_flash module

Explicitly load the powernv_flash module on BMC based systems so that we are sure the flash device is created before starting the opal-prd daemon.

Note that the pnor_available() check has been replaced with is_fsp_system(), as we want to load the module on BMC systems only. pnor_init also has enough logic to detect the flash device, so the pnor_available() check becomes redundant.

NPU2/NVLINK2
^^^^^^^^^^^^

npu2/hw-procedures: fence bricks on GPU reset

The NPU workbook defines a way of fencing a brick and getting the brick out of fence state. We do have an implementation of bringing the brick out of fenced/quiesced state. We do the latter in our procedures, but to support run time reset we need to do the former.

The fencing ensures that access to memory behind the links will not lead to HMI's, but instead SUE's will be populated in cache (in the case of speculation). The expectation is then that prior to and after reset, the operating system components will flush the cache for the region of memory behind the GPU.

This patch does the following:
1. Implements a npu2_dev_fence_brick() function to set/clear fence state
2. Clears FIR bits prior to clearing the fence status
3. Clears the fence status
4. Takes the powerbus out of CQ fence much later now, in credits_check(), which is the last hardware procedure called after link training.
hw/npu2.c: Remove static configuration of NPU2 register

The NPU_SM_CONFIG0 register currently needs to be configured in Skiboot to select NVLink mode, however Hostboot should configure other bits in this register.

For some reason Skiboot was explicitly clearing bit-6 (CONFIG_DISABLE_VG_NOT_SYS). It is unclear why this bit was getting cleared as recent Hostboot versions explicitly set it to the correct value based on the specific system configuration. Therefore Skiboot should not alter it.

Bit-58 (CONFIG_NVLINK_MODE) selects if NVLink mode should be enabled or not. Hostboot does not configure this bit so Skiboot should continue to configure it.
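
As a sketch of what "configure only this bit" means in practice, the following read-modify-write helper touches bit 58 and leaves every other bit, including bit 6, as Hostboot left it. It assumes IBM bit numbering (bit 0 is the most significant bit of the 64-bit register); the helper name `sm_config0_enable_nvlink()` is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

#define CONFIG_DISABLE_VG_NOT_SYS PPC_BIT(6)  /* left to Hostboot */
#define CONFIG_NVLINK_MODE        PPC_BIT(58) /* set by Skiboot */

/* Return the register value with only the NVLink mode bit changed. */
uint64_t sm_config0_enable_nvlink(uint64_t val)
{
	uint64_t out = val | CONFIG_NVLINK_MODE;

	/* Every bit other than NVLink mode must be preserved. */
	assert((out & ~CONFIG_NVLINK_MODE) == (val & ~CONFIG_NVLINK_MODE));
	return out;
}
```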

npu2: Improve log output of GPU-to-link mapping

Debugging issues related to unconnected NVLinks can be a little less irritating if we use the NPU2DEV{DBG,INF}() macros instead of prlog().

In short, change this: ::

NPU2: comparing GPU 'GPU2' and NPU2 'GPU1'
NPU2: comparing GPU 'GPU3' and NPU2 'GPU1'
NPU2: comparing GPU 'GPU4' and NPU2 'GPU1'
NPU2: comparing GPU 'GPU5' and NPU2 'GPU1'
      :
npu2_dev_bind_pci_dev: No PCI device for NPU2 device 0006:00:01.0 to bind to. If you expect a GPU to be there, this is a problem.

to this: ::

NPU6:0:1.0 Comparing GPU 'GPU2' and NPU2 'GPU1'
NPU6:0:1.0 Comparing GPU 'GPU3' and NPU2 'GPU1'
NPU6:0:1.0 Comparing GPU 'GPU4' and NPU2 'GPU1'
NPU6:0:1.0 Comparing GPU 'GPU5' and NPU2 'GPU1'
      :
NPU6:0:1.0 No PCI device found for slot 'GPU1'

npu2: Move NPU2_XTS_BDF_MAP_VALID assignment to context init

A bad GPU or other condition may leave us with a subset of links that never get initialized. If an ATSD is sent to one of those bricks, it will never complete, leaving us waiting forever for a response: ::

watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [acos:2050]
...
Modules linked in: nvidia_uvm(O) nvidia(O)
CPU: 23 PID: 2050 Comm: acos Tainted: G W O 4.14.0 #2
task: c0000000285cfc00 task.stack: c000001fea860000
NIP: c0000000000abdf0 LR: c0000000000acc48 CTR: c0000000000ace60
REGS: c000001fea863550 TRAP: 0901 Tainted: G W O (4.14.0)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28004484 XER: 20040000
CFAR: c0000000000abdf4 SOFTE: 1
GPR00: c0000000000acc48 c000001fea8637d0 c0000000011f7c00 c000001fea863820
GPR04: 0000000002000000 0004100026000000 c0000000012778c8 c00000000127a560
GPR08: 0000000000000001 0000000000000080 c000201cc7cb7750 ffffffffffffffff
GPR12: 0000000000008000 c000000003167e80
NIP [c0000000000abdf0] mmio_invalidate_wait+0x90/0xc0
LR [c0000000000acc48] mmio_invalidate.isra.11+0x158/0x370

ATSDs are only sent to bricks which have a valid entry in the XTS_BDF table. So to prevent the hang, don't set NPU2_XTS_BDF_MAP_VALID unless we make it all the way to creating a context for the BDF.

Secure and Trusted Boot ^^^^^^^^^^^^^^^^^^^^^^^

hdata/tpmrel: detect tpm not present by looking up the stinfo->status

Skiboot detects whether a TPM is present by checking if a secureboot_tpm_info entry exists. However, hostboot also creates a secureboot_tpm_info entry when no TPM is present. In that case, hostboot creates an empty entry with the tpm_status field set to TPM_NOT_PRESENT.

This detects if tpm is not present by looking up the stinfo->status.

This fixes the "TPMREL: TPM node not found for chip_id=0 (HB bug)" issue, reproduced when skiboot is running on a system that has no tpm.

PCI
^^^

phb4: Restore bus numbers after CRS

Currently we restore PCIe bus numbers right after the link is up. Unfortunately, at this point we haven't done CRS, so config space may not be accessible.

This moves the bus number restore till after CRS has happened.
romulus: Add a barebones slot table
phb4: Quieten and improve "Timeout waiting for electrical link"

This happens normally if a slot doesn't have a working HW presence detect and relies instead on inband presence detect.

The message we display is scary and not very useful unless you are debugging, so quieten it and change it to something more meaningful.
pcie-slot: Don't fail powering on an already on switch

If the power state is already the required value, return OPAL_SUCCESS rather than OPAL_PARAMETER to avoid spurious errors during boot.
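
A minimal sketch of the behaviour described, assuming a simplified slot structure. `struct slot`, `SLOT_POWER_OFF`/`SLOT_POWER_ON` and `slot_set_power_state()` are hypothetical names for illustration; only the two return code names come from the text above, and their numeric values here should be treated as illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return codes named in the text; values are illustrative for this sketch. */
#define OPAL_SUCCESS    0
#define OPAL_PARAMETER (-1)

#define SLOT_POWER_OFF 0
#define SLOT_POWER_ON  1

/* Hypothetical, simplified slot state. */
struct slot {
	uint8_t power_state;
};

int64_t slot_set_power_state(struct slot *slot, uint8_t val)
{
	assert(slot != NULL);

	if (val != SLOT_POWER_OFF && val != SLOT_POWER_ON)
		return OPAL_PARAMETER;

	/* Already in the requested state: nothing to do, and not an error. */
	if (slot->power_state == val)
		return OPAL_SUCCESS;

	slot->power_state = val; /* real code would program the hardware here */
	return OPAL_SUCCESS;
}
```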

CAPI/OpenCAPI
^^^^^^^^^^^^^

capi: Keep the current mmio windows in the mbt cache table.

When the PHB is used as a CAPI interface, the current mmio windows list is cleaned before adding the capi and the prefetchable memory (M64) windows, which means that the non-prefetchable BAR is no longer configured. This patch sets only the mbt bar needed to pass the capi mmio window and keeps the other mmio values (M32 and M64) as defined.
npu2-opencapi: Fix 'link internal error' FIR, take 2

When setting up an opencapi link, we set the transport muxes first, then set the PHY training config register, which includes disabling nvlink mode for the bricks. That's the order of the init sequence, as found in the NPU workbook.

In reality, doing so works, but it raises 2 FIR bits in the PowerBus OLL FIR Register for the 2 links when we configure the transport muxes. Presumably because nvlink is not disabled yet and we are configuring the transport muxes for opencapi.

bit 60: link0 internal error
bit 61: link1 internal error

Overall the current setup ends up being correct and everything works, but we raise 2 FIR bits.

So tweak the order of operations to disable nvlink before configuring the transport muxes. Incidentally, this is what the scripts from the opencapi enablement team were doing all along.
npu2-opencapi: Fix 'link internal error' FIR, take 1

When we set up a link, we always enable ODL0 and ODL1 at the same time in the PHY training config register, even though we are setting up only one OTL/ODL, so it raises a "link internal error" FIR bit in the PowerBus OLL FIR Register for the second link. The error is harmless, as we'll eventually set up the second link, but there's no reason to raise that FIR bit.

The fix is simply to only enable the ODL we are using for the link.
phb4: Do not set the PBCQ Tunnel BAR register when enabling capi mode.

The cxl driver will set the capi value, like other drivers already do.
phb4: set TVT1 for tunneled operations in capi mode

The ASN indication is used for tunneled operations (as_notify and atomics). Tunneled operation messages can be sent in PCI mode as well as CAPI mode.

The address field of as_notify messages is hijacked to encode the LPID/PID/TID of the target thread, so those messages should not go through address translation. Therefore bit 59 is part of the ASN indication.

This patch sets TVT#1 in bypass mode when capi mode is enabled, to prevent as_notify messages from being dropped.

Debugging/Testing improvements

Since 6.0-rc1:

mambo: Enable XER CA32 and OV32 bits on P9

POWER9 adds 32 bit carry and overflow bits to the XER, but we need to set the relevant CTRL1 bit to enable them.
Makefile: Fix building natively on ppc64le

When on ppc64le and CROSS is not set by the environment, make assumes ppc64 and sets a default CROSS. Check for ppc64le as well, so that 'make' works out of the box on ppc64le.
Experimental support for building with Clang
Improvements to testing and Travis CI

Since 5.11:

core/stack: backtrace unwind basic OPAL call details

Put OPAL callers' r1 into the stack back chain, and then use that to unwind back to the OPAL entry frame (as opposed to boot entry, which has a 0 back chain).

From there, dump the OPAL call token and the caller's r1. A backtrace looks like this: ::
```
CPU 0000 Backtrace:
 S: 0000000031c03ba0 R: 000000003001a548   ._abort+0x4c
 S: 0000000031c03c20 R: 000000003001baac   .opal_run_pollers+0x3c
 S: 0000000031c03ca0 R: 000000003001bcbc   .opal_poll_events+0xc4
 S: 0000000031c03d20 R: 00000000300051dc   opal_entry+0x12c
 --- OPAL call entry token: 0xa caller R1: 0xc0000000006d3b90 ---
```
This is pretty basic for the moment, but it does give you the bottom of the Linux stack. It will allow some interesting improvements in future.

First, with the eframe, all the call's parameters can be printed out as well. The ___backtrace / ___print_backtrace API needs to be reworked in order to support this, but it's otherwise very simple (see opal_trace_entry()).

Second, it will allow Linux's stack to be passed back to Linux via a debugging opal call. This will allow Linux's BUG() or xmon to also print the Linux back trace in case of a NMI or MCE or watchdog lockup that hits in OPAL.
asm/head: implement quiescing without stack or clobbering regs

Quiescing is currently implemented in C in opal_entry before the opal call handler is called. This works well enough for simple cases like fast reset when one CPU wants all others out of the way.

Linux would like to use it to prevent an sreset IPI from interrupting firmware, which could lead to deadlocks when crash dumping or entering the debugger. Linux interrupts do not recover well when returning back to general OPAL code, due to r13 not being restored. OPAL also can't be re-entered, which may happen e.g., from the debugger.

So move the quiesce hold/reject to entry code, before the stack or r1 or r13 registers are switched. OPAL can be interrupted and returned to or re-entered during this period.

This does not completely solve all such problems. OPAL will be interrupted with sreset if the quiesce times out, and it can be interrupted by MCEs as well. These still have the issues above.
core/opal: Allow poller re-entry if OPAL was re-entered

If an NMI interrupts the middle of running pollers and the OS invokes pollers again (e.g., for console output), the poller re-entrancy check will prevent it from running and spam the console.

That check was designed to catch a poller calling opal_run_pollers. OPAL re-entrancy is something different and is detected elsewhere. Avoid the poller recursion check if OPAL has been re-entered. This is a best-effort attempt to cope with errors.
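
The logic can be sketched as follows. This is an illustration only: `struct cpu_state`, its fields and `run_pollers()` are hypothetical names, and the real poller loop is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU flags; names are illustrative, not skiboot's. */
struct cpu_state {
	bool in_poller;      /* set while pollers are running on this CPU */
	bool opal_reentered; /* set by the entry path when OPAL was re-entered */
	unsigned int runs;   /* how many times the pollers actually ran */
};

/* Returns true if pollers ran, false if the recursion check rejected the call. */
bool run_pollers(struct cpu_state *cpu)
{
	assert(cpu != NULL);

	/* Only treat this as poller recursion if OPAL itself was not re-entered. */
	if (cpu->in_poller && !cpu->opal_reentered)
		return false;

	bool was_in_poller = cpu->in_poller;

	cpu->in_poller = true;
	cpu->runs++; /* stand-in for walking the poller list */
	cpu->in_poller = was_in_poller;
	return true;
}
```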
core/opal: Emergency stack for re-entry

This detects OPAL being re-entered by the OS, and switches to an emergency stack if it was. This protects the firmware's main stack from re-entrancy and allows the OS to use NMI facilities for crash / debug functionality.

Further nested re-entry will destroy the previous emergency stack and prevent returning, but those should be rare cases.

This stack is sized at 16kB, which doubles the size of CPU stacks, so as not to introduce a regression in primary stack size. The 16kB stack originally had a 4kB machine check stack at the top, which was removed by 80eee1946 ("opal: Remove machine check interrupt patching in OPAL."). So it is possible the size could be tightened again, but that would require further analysis.
hdat_to_dt: hash_prop the same on all platforms

Fixes this unit test on ppc64le hosts.
mambo: Add persistent memory disk support

This adds support for mapping disk images using persistent memory. Disks can be added by setting this ENV variable:

PMEM_DISK="/mydisks/disk1.img,/mydisks/disk2.img"

These will show up in Linux as /dev/pmem0 and /dev/pmem1.

This uses a new feature in mambo "mysim memory mmap .." which is only available since mambo commit 0131f0fc08 (from 24/4/2018).

This also needs the of_pmem.c driver in Linux which is only available since v4.17. It works with powernv_defconfig + CONFIG_OF_PMEM.

external/mambo: Add di command to decode instructions

By default you get 16 instructions but you can specify the number you want. i.e. ::

systemsim % di 0x100 4
0x0000000000000100: Enc:0xA64BB17D : mtspr   HSPRG1,r13
0x0000000000000104: Enc:0xA64AB07D : mfspr   r13,HSPRG0
0x0000000000000108: Enc:0xF0092DF9 : std     r9,0x9F0(r13)
0x000000000000010C: Enc:0xA6E2207D : mfspr   r9,PPR

Using di since it's what xmon uses.

mambo/mambo_utils.tcl: Inject an MCE at a specified address

Currently we don't support injecting an MCE on a specific address. This is useful for testing functionality like memcpy_mcsafe() (see https://patchwork.ozlabs.org/cover/893339/)

The core of the functionality is a routine called inject_mce_ue_on_addr, which takes an addr argument and injects an MCE (load/store with UE) when the specified address is accessed by code. This functionality can easily be enhanced to cover instruction UE's as well.

A sample use case to create an MCE on stack access would be ::

set addr [mysim display gpr 1]
inject_mce_ue_on_addr $addr

This would cause an MCE on any r1 or r1-based access.
external/mambo: improve helper for machine checks

Improve workarounds for stop injection, because mambo often will trigger on 0x104/204 when injecting sreset/mces.

This also adds a workaround to skip injecting on reservations to avoid infinite loops when doing inject_mce_step.
travis: Enable ppc64le builds

At least on the IBM Travis Enterprise instance, we can now do ppc64le builds!

We can only build a subset of our matrix due to availability of ppc64le distros. The Dockerfiles need some tweaking to only attempt to install (x86_64 only) Mambo binaries, as well as the build scripts.
external: Add "lpc" tool

This is a little front-end to the lpc debugfs files to access the LPC bus from userspace on the host.
core/test/run-trace: fix on ppc64el

v6.0-rc2

skiboot-6.0-rc2

skiboot v6.0-rc2 was released on Wednesday May 9th 2018. It is the second release candidate of skiboot 6.0, which will become the new stable release of skiboot following the 5.11 release, first released April 6th 2018.

Skiboot 6.0 will mark the basis for op-build v2.0 and will be required for POWER9 systems.

skiboot v6.0-rc2 contains all bug fixes as of :ref:skiboot-5.11, :ref:skiboot-5.10.5, and :ref:skiboot-5.4.9 (the currently maintained stable releases). Once 6.0 is released, we do not expect any further stable releases in the 5.10.x series, nor in the 5.11.x series.

For how the skiboot stable releases work, see :ref:stable-rules for details.

The current plan is to cut the final 6.0 in early May (maybe in a day or two after this -rc if things look okay), with skiboot 6.0 being for all POWER8 and POWER9 platforms in op-build v2.0.

Over skiboot-6.0-rc1, we have the following changes:

Update default stop-state-disable mask to cut only stop11

Stability improvements in microcode for stop4/stop5 are available in upstream hcode images. Stop4 and stop5 can be safely enabled by default.

Use ~0xE0000000 to cut all but stop0,1,2 in case there are any issues with stop4/5.

example: ::

nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0x1FFFFFFF

Note that DD2.1 chips with a frequency below 1867MHz possibly need to run an hcode image different from the default in op-build (set BR2_HCODE_LATEST_VERSION=y in your config).
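
The mask arithmetic above can be checked with a few lines of C. The sketch assumes, based on the ~0xE0000000 example, that stop level n corresponds to bit n counted from the most significant bit of a 32-bit mask, and that a set bit disables that level. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: stop level n is bit n counted from the MSB of the mask. */
uint32_t stop_level_bit(unsigned int level)
{
	assert(level < 32);
	return UINT32_C(0x80000000) >> level;
}

/* Mask that disables every stop level except stop0, stop1 and stop2. */
uint32_t mask_keep_stop0_to_2(void)
{
	return ~(stop_level_bit(0) | stop_level_bit(1) | stop_level_bit(2));
}
```

Under that assumption, `mask_keep_stop0_to_2()` evaluates to 0x1FFFFFFF, the value used in the nvram example, and a mask that cuts only stop11 would be `stop_level_bit(11)`, i.e. 0x00100000.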
ibm,firmware-versions: add hcode to device tree

op-build commit 736a08b996e292a449c4996edb264011dfe56a40 added hcode to the VERSION partition, let's parse it out and let the user know.
ipmi: Add BMC firmware version to device tree

The BMC Get Device ID command returns BMC firmware version details. Let's add this to the device tree. User space tools will use this information to display BMC version details.
mambo: Enable XER CA32 and OV32 bits on P9

POWER9 adds 32 bit carry and overflow bits to the XER, but we need to set the relevant CTRL1 bit to enable them.
Makefile: Fix building natively on ppc64le

When on ppc64le and CROSS is not set by the environment, make assumes ppc64 and sets a default CROSS. Check for ppc64le as well, so that 'make' works out of the box on ppc64le.
p9dsu: timeout for variant detection, default to 2uess
core/direct-controls: improve p9_stop_thread error handling

p9_stop_thread should fail the operation if it finds the thread was already quiesced. This implies something else is doing direct controls on the thread (e.g., pdbg) or there is some exceptional condition we don't know how to deal with. Proceeding here would cause things to trample on each other, for example the hard lockup watchdog trying to send a sreset to the core while it is stopped for debugging with pdbg will end in tears.

If p9_stop_thread times out waiting for the thread to quiesce, do not hit it with a core_start direct control, because we don't know what state things are in and doing more things at this point is worse than doing nothing. There is no good recipe described in the workbook to de-assert the core_stop control if it fails to quiesce the thread. After timing out here, the thread may eventually quiesce and get stuck, but that's simpler to debug than undefined behaviour.
core/direct-controls: fix p9_cont_thread for stopped/inactive threads

Firstly, p9_cont_thread should check that the thread actually was quiesced before it tries to resume it. Anything could happen if we try this from an arbitrary thread state.

Then when resuming a quiesced thread that is inactive or stopped (in a stop idle state), we must not send a core_start direct control; clear_maint must be used in these cases.
occ: Use major version number while checking the pstate table format

Minor version increments of the pstate table are backward compatible. The minor version is changed when the pstate table layout remains the same and existing reserved bytes are used to point to new data. So use only the major version number while parsing the pstate table. This will allow old skiboot to parse the pstate table and handle minor version updates.
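
A sketch of a major-version-only check. It assumes, purely for illustration, that the version is a single byte with the major version in the high nibble and the minor version in the low nibble; `struct pstate_table_hdr` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; assumes major in the high nibble, minor in the low. */
struct pstate_table_hdr {
	uint8_t valid;
	uint8_t version;
};

uint8_t pstate_major(uint8_t version)
{
	return version >> 4;
}

/* Accept any table whose major version matches what this parser understands. */
bool pstate_table_supported(const struct pstate_table_hdr *hdr, uint8_t parser_major)
{
	assert(hdr != NULL);
	return hdr->valid && pstate_major(hdr->version) == parser_major;
}
```

With this shape, a parser written for major version 9 accepts both 0x90 and a later 0x91 table, but rejects 0xA0.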

hmi: Clear unknown debug trigger

On some systems, seeing hangs like this when Linux starts: ::

[ 170.027252763,5] OCC: All Chip Rdy after 0 ms
[ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes)
[ 171.238270428,5] OPAL: Switch to little-endian OS

If you look at the in memory skiboot console (or do nvram -p ibm,skiboot --update-config log-level-driver=7) we see the console get spammed with: ::

[ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000

We're taking the debug trigger (bit 17) early on, before the hmi_debug_trigger function in the kernel is set up.

This clears the HMI in Skiboot and reports to the kernel instead of bringing down the machine.

core/hmi: assign flags=0 in case nothing set by handle_hmi_exception

Theoretically we could have returned junk to the OS in this parameter.
SLW: Fix mambo boot to use stop states

After commit 35c66b8ce5a2 ("SLW: Move MAMBO simulator checks to slw_init"), mambo boot no longer calls add_cpu_idle_state_properties() and as such we never enable stop states.

After adding the call back, we get more testing coverage as well as faster mambo SMT boots.
phb4: Hardware init updates

CFG Write Request Timeout was incorrectly set to informational and not fatal for both non-CAPI and CAPI, so set it to fatal. This was a mistake in the specification. Correcting this fixes a niche bug in escalation (which is necessary on pre-DD2.2) that can cause a checkstop due to a NCU timeout.

In addition, set the values in the timeout control registers to match. This fixes an extremely rare and unreproducible bug, though the current timings don't make sense since they're higher than the NCU timeout (16) which will checkstop the machine anyway.
SLW: quieten 'Configuring self-restore' for DARN,NCU_SPEC_BAR and HRMOR
Experimental support for building with Clang
Improvements to testing and Travis CI

v5.11

skiboot-5.11

skiboot v5.11 was released on Friday April 6th 2018. It is the first release of skiboot 5.11, which is now the new stable release of skiboot following the 5.10 release, first released February 23rd 2018.

It is not expected to keep the 5.11 branch around for long, and instead quickly move onto a 6.0, which will mark the basis for op-build v2.0 and will be required for POWER9 systems.

It is expected that skiboot 6.0 will follow very shortly. Consider 5.11 more of a beta release to 6.0 than anything. For POWER9 systems it should certainly be more solid than previous releases though.

skiboot v5.11 contains all bug fixes as of skiboot-5.10.4 and skiboot-5.4.9 (the currently maintained stable releases). There may be more 5.10.x stable releases, it will depend on demand.

For how the skiboot stable releases work, see Skiboot stable tree rules and releases for details.

Over skiboot-5.10, we have the following changes:

New Platforms

Add VESNIN platform support

The Vesnin platform from YADRO is a 4 socket POWER8 system with up to 8TB of memory with 460GB/s of memory bandwidth in only 2U. Many kudos to the team from Yadro for submitting their code upstream!

New Features

fast-reboot: enable by default for POWER9
- Fast reboot is disabled if NPU2 is present or CAPI2/OpenCAPI is used
PCI tunneled operations on PHB4
- phb4: set PBCQ Tunnel BAR for tunneled operations
  
  P9 supports PCI tunneled operations (atomics and as_notify) that are initiated by devices.
  
  A subset of the tunneled operations require a response, that must be sent back from the host to the device. For example, an atomic compare and swap will return the compare status, as the swap will only be performed in case of success. Similarly, as_notify reports if the target thread has been woken up or not, because the operation may fail.
  
  To enable tunneled operations, a device driver must tell the host where it expects tunneled operation responses, by setting the PBCQ Tunnel BAR Response register with a specific value within the range of its BARs.
  
  This register is currently initialized by enable_capi_mode(). But, as tunneled operations may also operate in PCI mode, a new API is required to set the PBCQ Tunnel BAR Response register, without switching to CAPI mode.
  
  This patch provides two new OPAL calls to get/set the PBCQ Tunnel BAR Response register.
  
  Note: as there is only one PBCQ Tunnel BAR register, shared between all the devices connected to the same PHB, only one of these devices will be able to use tunneled operations, at any time.
- phb4: set PHB CMPM registers for tunneled operations
  
  P9 supports PCI tunneled operations (atomics and as_notify) that require setting the PHB ASN Compare/Mask register with a 16-bit indication.
  
  This register is currently initialized by enable_capi_mode(). But, as tunneled operations may also work in PCI mode, the ASN Compare/Mask register should rather be initialized in phb4_init_ioda3().
  
  This patch also adds “ibm,phb-indications” to the device tree, to tell Linux the values of CAPI, ASN, and NBW indications, when supported.
  
  Tunneled operations tested by IBM in CAPI mode, by Mellanox Technologies in PCI mode.
Tie tm-suspend fw-feature and opal_reinit_cpus() together

Currently opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED) always returns OPAL_UNSUPPORTED.

This ties the tm suspend fw-feature to the opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED) so that when tm suspend is disabled, we correctly report it to the kernel. For backwards compatibility, it’s assumed tm suspend is available if the fw-feature is not present.

Currently hostboot will clear fw-feature(TM_SUSPEND_ENABLED) on P9N DD2.1. P9N DD2.2 will set fw-feature(TM_SUSPEND_ENABLED). DD2.0 and below has TM disabled completely (not just suspend).

We are using opal_reinit_cpus() to determine this setting (rather than the device tree/HDAT) as some future firmware may let us change this dynamically after boot. That is not the case currently though.

Power Management

SLW: Increase stop4-5 residency by 10x

Using the DGEMM benchmark we observed a 5-9% drop in throughput with stop4/5 enabled compared to running without. In this benchmark the GPU waits on the CPU to wake up and provide the subsequent data block to compute. The wakeup latency accumulates over the run and shows up as a performance drop.

Linux enters stop4/5 more aggressively for its wakeup latency. Increasing the residency from 1ms to 10ms reduces the performance drop to <1%.
occ: Set up OCC messaging even if we fail to setup pstates

This means that we no longer hit this bug if we fail to get valid pstates from the OCC.

[console-pexpect]#echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
[ 94.019971181,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 94.020098392,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 10.318805] Disabling lock debugging due to kernel taint
[ 10.318808] Severe Machine check interrupt [Not recovered]
[ 10.318812] NIP [000000003003e434]: 0x3003e434
[ 10.318813] Initiator: CPU
[ 10.318815] Error type: Real address [Load/Store (foreign)]
[ 10.318817] opal: Hardware platform error: Unrecoverable Machine Check exception
[ 10.318821] CPU: 117 PID: 2745 Comm: sh Tainted: G M 4.15.9-openpower1 #3
[ 10.318823] NIP: 000000003003e434 LR: 000000003003025c CTR: 0000000030030240
[ 10.318825] REGS: c00000003fa7bd80 TRAP: 0200 Tainted: G M (4.15.9-openpower1)
[ 10.318826] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 48002888 XER: 20040000
[ 10.318831] CFAR: 0000000030030258 DAR: 394a00147d5a03a6 DSISR: 00000008 SOFTE: 1

mbox based platforms

For platforms using the mbox protocol for host flash access (all BMC based OpenPOWER systems, most OpenBMC based systems) there have been some hardening efforts in the event of the BMC being poorly behaved.

mbox: Reduce default BMC timeouts

Rebooting a BMC can take 70 seconds. Skiboot cannot possibly spin for 70 seconds waiting for a BMC to come back. This also makes the current default of 30 seconds a bit pointless: it is far too short to be a worst case wait time but too long to avoid hitting hardlockup detectors and wreaking havoc inside host Linux.

Just change it to three seconds so that host Linux will survive. Reads and writes will fail, but at least the host stays up.

Also refactored the waiting loop just a bit so that it’s easier to read.
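
A bounded wait loop of this kind might look like the sketch below. It runs against a simulated BMC and a simulated clock so that it can be exercised without hardware; `struct fake_bmc`, `MBOX_TIMEOUT_MS` and `mbox_wait_for_bmc()` are hypothetical names, and only the three second figure comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MBOX_TIMEOUT_MS 3000u /* the three second default described above */

/* Simulated BMC: becomes responsive once now_ms reaches ready_at_ms. */
struct fake_bmc {
	uint32_t now_ms;
	uint32_t ready_at_ms;
};

/* Poll until the BMC responds, giving up after timeout_ms. */
bool mbox_wait_for_bmc(struct fake_bmc *bmc, uint32_t timeout_ms, uint32_t step_ms)
{
	assert(bmc && step_ms > 0);
	const uint32_t start = bmc->now_ms;

	while (bmc->now_ms < bmc->ready_at_ms) {
		if (bmc->now_ms - start >= timeout_ms)
			return false; /* caller fails the read or write */
		bmc->now_ms += step_ms; /* stand-in for a real delay */
	}
	return true;
}
```

A BMC that answers after 500ms is waited for; one that is 70 seconds into a reboot makes the call return after three seconds so the host can carry on.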
mbox: Harden against BMC daemon errors

Bugs present in the BMC daemon mean that skiboot gets presented with mbox windows of size zero. These windows cannot be valid and skiboot already detects these conditions.

Currently skiboot warns quite strongly about the occurrence of these problems. The problem for skiboot is that it doesn’t take any action. Initially I wanted to avoid putting policy like this into skiboot, but since these bugs aren’t going away and skiboot barfing is leading to lockups and ultimately the host going down, something needs to be done.

I propose that when we detect the problem we fail the mbox call and punt the problem back up to Linux. I don’t like it but at least it will cause errors to cascade and won’t bring the host down. I’m not sure how Linux is supposed to detect this or what it can even do but this is better than a crash.

Diagnosing a failure to boot if skiboot itself fails to read flash may be marginally more difficult with this patch. This is because skiboot will now only print one warning about the zero sized window rather than continuously spitting it out.

Fast Reboot Improvements

Around fast-reboot we have made several improvements to harden the fast reboot code paths and resort to a full IPL if something doesn’t look right.

core/fast-reboot: zero memory after fast reboot

This improves the security and predictability of the fast reboot environment.

There can not be a secure fence between fast reboots, because a malicious OS can modify the firmware itself. However a well-behaved OS can have a reasonable expectation that OS memory regions it has modified will be cleared upon fast reboot.

The memory is zeroed after all other CPUs come up from fast reboot, just before the new kernel is loaded and booted into. This allows image preloading to run concurrently, and will allow parallelisation of the clearing in future.
core/fast-reboot: verify mem regions before fast reboot

Run the mem_region sanity checkers before proceeding with fast reboot.

This is the beginning of proactive sanity checks on opal data for fast reboot (which complements the reactive disable_fast_reboot cases). Re-using and sharing any kind of debug code and unit test code here is encouraged.
fast-reboot: occ: Only delete /ibm,opal/power-mgt nodes if they exist
core/fast-reboot: disable fast reboot upon fundamental entry/exit/locking errors

This disables fast reboot in several more cases where serious errors like lock corruption or call re-entrancy are detected.
capp: Disable fast-reboot whenever enable_capi_mode() is called

This patch updates phb4_set_capi_mode() to disable fast-reboot whenever enable_capi_mode() is called, irrespective of its return value. This guards against the possibility of fast-reboot not being disabled when a change to enable_capi_mode() causes it to return an error while leaving the CAPP in enabled mode.
fast-reboot: occ: Delete OCC child nodes in /ibm,opal/power-mgt

Fast-reboot in P8 fails to re-init OCC data as there are chipwise OCC nodes which are already present in the /ibm,opal/power-mgt node. These per-chip nodes hold the voltage IDs for each pstate and these can be changed on OCC pstate table biasing. So delete these before calling the re-init code to re-parse and populate the pstate data.

Debugging/SRESET improvements

Since skiboot-5.11-rc1:

core/cpu: Prevent clobbering of stack guard for boot-cpu

Commit 90d53934c2da (“core/cpu: discover stack region size before initialising memory regions”) introduced memzero for struct cpu_thread in init_cpu_thread(). This has the unintended side effect of clobbering the stack-guard canary of the boot_cpu stack. This results in opal failing to init with this failure message:

CPU: P9 generation processor (max 4 threads/core)
CPU: Boot CPU PIR is 0x0004 PVR is 0x004e1200
Guard skip = 0
Stack corruption detected !
Aborting!
CPU 0004 Backtrace:
 S: 0000000031c13ab0 R: 0000000030013b0c .backtrace+0x5c
 S: 0000000031c13b50 R: 000000003001bd18 ._abort+0x60
 S: 0000000031c13be0 R: 0000000030013bbc .__stack_chk_fail+0x54
 S: 0000000031c13c60 R: 00000000300c5b70 .memset+0x12c
 S: 0000000031c13d00 R: 0000000030019aa8 .init_cpu_thread+0x40
 S: 0000000031c13d90 R: 000000003001b520 .init_boot_cpu+0x188
 S: 0000000031c13e30 R: 0000000030015050 .main_cpu_entry+0xd0
 S: 0000000031c13f00 R: 0000000030002700 boot_entry+0x1c0

So the patch provides a fix by tweaking the memset() call in init_cpu_thread() to skip over the stack-guard canary.
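
The idea of zeroing a structure while skipping a leading canary can be sketched as below. The layout is an assumption made for the example: `struct cpu_thread_sketch` is hypothetical and places the guard as the first member, which may not match the real struct cpu_thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the guard canary is assumed to be the first member. */
struct cpu_thread_sketch {
	uint64_t stack_guard;
	uint32_t pir;
	uint32_t state;
};

/* Zero everything after the canary, leaving stack_guard intact. */
void init_cpu_thread_sketch(struct cpu_thread_sketch *t)
{
	const size_t skip = offsetof(struct cpu_thread_sketch, stack_guard)
			    + sizeof(t->stack_guard);

	assert(t != NULL);
	memset((char *)t + skip, 0, sizeof(*t) - skip);
}
```

This only zeroes the whole remainder of the structure because the guard comes first; with a guard in the middle, two memset() calls would be needed.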
core/lock.c: ensure valid start value for lock spin duration warning

The previous fix in a8e6cc3f4 only addressed half of the problem, as we could also get an invalid value for start, causing us to fail in a weird way.

This was caught by the testcases.OpTestHMIHandling.HMI_TFMR_ERRORS test in op-test-framework.

You’d get to this part of the test and get the erroneous lock spinning warnings:

PATH=/usr/local/sbin:$PATH putscom -c 00000000 0x2b010a84 0003080000000000
0000080000000000
[ 790.140976993,4] WARNING: Lock has been spinning for 790275ms
[ 790.140976993,4] WARNING: Lock has been spinning for 790275ms
[ 790.140976918,4] WARNING: Lock has been spinning for 790275ms

This patch checks the validity of timebase before setting start, and only checks the lock timeout if we got a valid start value.
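A sketch of the logic described here, under stated assumptions: the timer is only armed once a valid time reading is available, and the timeout is only evaluated against a valid start. The names spin_timer and lock_spin_should_warn() are hypothetical, and representing "timebase not valid" as a reading of 0 is a simplification for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_TIMEOUT_MS 5000	/* warn after 5 seconds of spinning */

/* Hypothetical spin-tracking state; not skiboot's real data structure. */
struct spin_timer {
	uint64_t start_ms;	/* 0 means no valid start recorded yet */
	bool warned;
};

/*
 * Call on each spin iteration. now_ms is the current time in milliseconds,
 * or 0 if the timebase is not yet valid. Returns true exactly once, when
 * the spin has lasted longer than LOCK_TIMEOUT_MS from a valid start.
 */
bool lock_spin_should_warn(struct spin_timer *st, uint64_t now_ms)
{
	assert(st);
	if (now_ms == 0)
		return false;		/* timebase invalid: do nothing */
	if (st->start_ms == 0) {
		st->start_ms = now_ms;	/* first valid reading arms the timer */
		return false;
	}
	if (!st->warned && now_ms - st->start_ms > LOCK_TIMEOUT_MS) {
		st->warned = true;
		return true;
	}
	return false;
}
```

With an invalid start of 0, a later valid reading of around 790000 ms would look like a 790 second spin, which matches the shape of the erroneous warnings quoted above.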

Since skiboot-5.10:

core/opal: allow some re-entrant calls

This allows a small number of OPAL calls to succeed despite re-entering the firmware, and rejects others rather than aborting.

This allows a system reset interrupt that interrupts OPAL to do something useful: sreset other CPUs, use the console (which allows xmon to work or stack traces to be printed), or reboot the system.

Use OPAL_INTERNAL_ERROR when rejecting, rather than OPAL_BUSY, which is used for many other things that do not indicate a serious permanent error.
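The allow-list behaviour can be sketched roughly as below. All names here are hypothetical: the tokens are placeholders rather than real OPAL token numbers, the per-CPU flag is an invented stand-in for however firmware tracks "already inside OPAL", and the numeric value of OPAL_INTERNAL_ERROR is an assumption taken from my reading of the OPAL API headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OPAL_SUCCESS		0
#define OPAL_INTERNAL_ERROR	(-11)	/* assumed value, see lead-in */

/* Hypothetical tokens for the sketch; not real OPAL token numbers. */
enum sketch_token {
	TOK_CONSOLE_WRITE,
	TOK_SIGNAL_SYSTEM_RESET,
	TOK_CEC_REBOOT,
	TOK_PCI_CONFIG_READ,
};

/* Hypothetical per-CPU state: set while the CPU is executing firmware. */
struct cpu_state_sketch {
	bool in_opal_call;
};

/* Calls that remain useful from a system reset handler. */
bool reentry_allowed(enum sketch_token tok)
{
	switch (tok) {
	case TOK_CONSOLE_WRITE:
	case TOK_SIGNAL_SYSTEM_RESET:
	case TOK_CEC_REBOOT:
		return true;
	default:
		return false;
	}
}

/* Reject a re-entrant call unless it is on the allow-list. */
int64_t opal_entry_check_sketch(const struct cpu_state_sketch *cpu,
				enum sketch_token tok)
{
	assert(cpu);
	if (cpu->in_opal_call && !reentry_allowed(tok))
		return OPAL_INTERNAL_ERROR;
	return OPAL_SUCCESS;
}
```

Returning a distinct error rather than a busy code lets the caller tell "try again later" apart from "this call cannot be serviced in this state".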
core/opal: abort in case of re-entrant OPAL call

The stack is already destroyed by the time we get here, so there is not much point continuing.
core/lock: Add lock timeout warnings

There are currently no timeout warnings for locks in skiboot. We assume that the lock will eventually become free, which may not always be the case.

This patch adds timeout warnings for locks. Any lock which spins for more than 5 seconds will throw a warning and stacktrace for that thread. This is useful for debugging situations where a thread hangs waiting for a lock to be freed.
core/lock: Add deadlock detection

This adds simple deadlock detection. The detection looks for circular dependencies in the lock requests. It will abort and display a stack trace when a deadlock occurs. The detection is enabled by DEBUG_LOCKS (enabled by default). While the detection may have a slight performance overhead, as there are not a huge number of locks in skiboot this overhead isn’t significant.
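One way to look for a circular dependency in lock requests is to follow the chain from the wanted lock to its owner, then to the lock that owner is itself waiting on, and so on. The sketch below illustrates that idea only; the structures and would_deadlock() are hypothetical and are not claimed to match skiboot's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 8	/* arbitrary bound for the sketch */

struct cpu_sketch;

struct lock_sketch {
	struct cpu_sketch *owner;	/* NULL when the lock is free */
};

struct cpu_sketch {
	struct lock_sketch *requested;	/* lock this CPU is spinning on, or NULL */
};

/*
 * Would 'self' deadlock by waiting for 'want'? Walk lock -> owner ->
 * lock the owner waits on, and report a cycle if the walk reaches
 * 'self'. The hop count is bounded so a corrupted chain cannot loop
 * forever.
 */
bool would_deadlock(const struct cpu_sketch *self,
		    const struct lock_sketch *want)
{
	const struct lock_sketch *l = want;
	int hops;

	assert(self);
	for (hops = 0; l && hops < MAX_CPUS; hops++) {
		const struct cpu_sketch *owner = l->owner;

		if (!owner)
			return false;	/* lock is free, chain ends */
		if (owner == self)
			return true;	/* circular dependency */
		l = owner->requested;
	}
	return false;
}
```

The walk is short because there are few locks and each CPU waits on at most one lock at a time, which is consistent with the note that the overhead is not significant.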
core/hmi: report processor recovery reason from core FIR bits on P9

When an error is encountered that causes processor recovery, HMI is generated if the recovery was successful. The reason is recorded in the core FIR, which gets copied into the WOF.

In this case dump the WOF register and an error string into the OPAL msglog.

A broken init setting led to HMIs reported in Linux as:

[ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 3.591648] Error detail: Processor Recovery done
[ 3.591714] HMER: 2040000000000000

This patch would have been useful because it tells us exactly that the problem is in the d-side ERAT:

[ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
[ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred.
[ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error:
[ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc)
[ 414.489702341,7] HMI: LSU - ERAT multi hit

In future it will be good to unify this reporting, so Linux could print something more useful. Until then, this gives some good data.

NPU2/NVLink2 Fixes

npu2: Add performance tuning SCOM inits

Peer-to-peer GPU bandwidth latency testing has produced some tunable values that improve performance. Add them to our device initialization.

File these under things that need to be cleaned up with nice #defines for the register names and bitfields when we get time.

A few of the settings are dependent on the system’s particular NVLink topology, so introduce a helper to determine how many links go to a single GPU.
hw/npu2: Assign a unique LPARSHORTID per GPU

This gets used elsewhere to index items in the XTS tables.
NPU2: dump NPU2 registers on npu2 HMI

Due to the nature of debugging npu2 issues, folk are wanting the full list of NPU2 registers dumped when there’s a problem.
npu2: Remove DD1 support

Major changes in the NPU between DD1 and DD2 necessitated a fair bit of revision-specific code.

Now that all our lab machines are DD2, we no longer test anything on DD1 and it’s time to get rid of it.

Remove DD1-specific code and abort probe if we’re running on a DD1 machine.
npu2: Disable fast reboot

Fast reboot does not yet work right with the NPU. It’s been disabled on NVLink and OpenCAPI machines. Do the same for NVLink2.

This amounts to a port of 3e4577939bbf (“npu: Fix broken fast reset”) from the npu code to npu2.
npu2: Use unfiltered mode in XTS tables

The XTS_PID context table is limited to 256 possible pids/contexts. To relieve this limitation, make use of “unfiltered mode” instead.

If an entry in the XTS_BDF table has the bit for unfiltered mode set, we can just use one context for that entire bdf/lpar, regardless of pid. Instead of searching the XTS_PID table, the NMMU checkout request will simply use the entry indexed by lparshort id.

Change opal_npu_init_context() to create these lparshort-indexed wildcard entries (0-15) instead of allocating one for each pid. Check that multiple calls for the same bdf all specify the same msr value.

In opal_npu_destroy_context(), continue validating the bdf argument, ensuring that it actually maps to an lpar, but no longer remove anything from the XTS_PID table. If/when we start supporting virtualized GPUs, we might consider actually removing these wildcard entries by keeping a refcount, but keep things simple for now.
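The bookkeeping described above can be sketched as a small table indexed by lparshort id. This is an illustration under assumptions: the structures and function names are hypothetical, the table is a plain software array rather than the hardware XTS_PID table, and the "entry exists" test in the destroy path stands in for the real check that the bdf maps to an lpar.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_LPARSHORT 16	/* wildcard entries 0-15 */

/* Hypothetical software view of one wildcard context entry. */
struct xts_wildcard {
	bool valid;
	uint64_t msr;
};

struct xts_table_sketch {
	struct xts_wildcard ent[NR_LPARSHORT];
};

/*
 * Create (or re-use) the wildcard entry for an lparshort id. Returns the
 * entry index, or -1 if the id is out of range or an earlier call for
 * the same id specified a different msr value.
 */
int init_context_sketch(struct xts_table_sketch *t, unsigned int lparshort,
			uint64_t msr)
{
	assert(t);
	if (lparshort >= NR_LPARSHORT)
		return -1;
	if (t->ent[lparshort].valid)
		return t->ent[lparshort].msr == msr ? (int)lparshort : -1;
	t->ent[lparshort].valid = true;
	t->ent[lparshort].msr = msr;
	return (int)lparshort;
}

/* Destroy only validates its argument; the wildcard entry is kept. */
int destroy_context_sketch(const struct xts_table_sketch *t,
			   unsigned int lparshort)
{
	assert(t);
	if (lparshort >= NR_LPARSHORT || !t->ent[lparshort].valid)
		return -1;
	return 0;
}
```

Keeping the entry on destroy avoids the need for a refcount, at the cost of never reclaiming wildcard entries, which is the trade-off the notes accept for now.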

CAPI/OpenCAPI

Since skiboot-5.11-rc1:

capi: Poll Err/Status register during CAPP recovery

This patch updates do_capp_recovery_scoms() to poll the CAPP Err/Status control register, check for CAPP-Recovery to complete/fail based on indications of BITS-1,5,9 and then proceed with the CAPP-Recovery scoms only if recovery completed successfully. This prevents cases where we bring up the PCIe link while the recovery sequencer on CAPP is still busy casting out cache lines.

In case CAPP-Recovery didn’t complete successfully an error is returned from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4 fenced and mark it as broken.

The loop that implements polling of Err/Status register will also log an error on the PHB when it continues for more than 168ms which is the max time to failure for CAPP-Recovery.
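A bounded polling loop of this kind might look like the sketch below. Several things are assumptions: the register accessor is a hypothetical callback (real code would perform an xscom read), the done and fail masks are passed in by the caller because the notes do not say which of bits 1, 5 and 9 means what, and the sketch simply gives up at the 168ms limit whereas the patch is described as logging an error at that point.

```c
#include <assert.h>
#include <stdint.h>

#define CAPP_RECOVERY_TIMEOUT_MS 168	/* max time to failure for recovery */

enum poll_result { POLL_DONE, POLL_FAILED, POLL_TIMEOUT };

/* Hypothetical register accessor supplied by the caller. */
typedef uint64_t (*read_status_fn)(void *ctx);

/* Poll once per nominal millisecond until done, failed, or timed out. */
enum poll_result poll_recovery_sketch(read_status_fn rd, void *ctx,
				      uint64_t done_mask, uint64_t fail_mask)
{
	unsigned int ms;

	assert(rd);
	for (ms = 0; ms <= CAPP_RECOVERY_TIMEOUT_MS; ms++) {
		uint64_t st = rd(ctx);

		if (st & fail_mask)
			return POLL_FAILED;
		if (st & done_mask)
			return POLL_DONE;
		/* real firmware would delay about 1ms here before re-reading */
	}
	return POLL_TIMEOUT;
}

/* Fake register for demonstration: reads 0 'delay' times, then 'value'. */
struct fake_reg {
	unsigned int delay;
	uint64_t value;
};

uint64_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	if (r->delay) {
		r->delay--;
		return 0;
	}
	return r->value;
}
```

A caller would continue with the recovery scoms only on POLL_DONE, and treat the other two results as a reason to keep the PHB fenced.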

Since skiboot-5.10:

npu2-opencapi: Add OpenCAPI OPAL API calls

Add three OPAL API calls that are required by the ocxl driver.
- OPAL_NPU_SPA_SETUP
  
  The Shared Process Area (SPA) is a table containing one entry (a “Process Element”) per memory context which can be accessed by the OpenCAPI device.
- OPAL_NPU_SPA_CLEAR_CACHE
  
  The NPU keeps a cache of recently accessed memory contexts. When a Process Element is removed from the SPA, the cache for the link must be cleared.
- OPAL_NPU_TL_SET
  
  The Transaction Layer specification defines several templates for messages to be exchanged on the link. During link setup, the host and device must negotiate what templates are supported on both sides and at what rates those messages can be sent.
npu2-opencapi: Train OpenCAPI links and setup devices

Scan the OpenCAPI links under the NPU, and for each link, reset the card, set up a device, train the link and register a PHB.

Implement the necessary operations for the OpenCAPI PHB type.

For bringup, test and debug purposes, we allow an NVRAM setting, “opencapi-link-training” that can be set to either disable link training completely or to use the prbs31 test pattern.

To disable link training:

nvram -p ibm,skiboot --update-config opencapi-link-training=none

To use prbs31:

nvram -p ibm,skiboot --update-config opencapi-link-training=prbs31
npu2-hw-procedures: Add support for OpenCAPI PHY link training

Unlike NVLink, which uses the pci-virt framework to fake a PCI configuration space for NVLink devices, the OpenCAPI device model presents us with a real configuration space handled by the device over the OpenCAPI link.

As a result, we have to train the OpenCAPI link in skiboot before we do PCI probing, so that config space can be accessed, rather than having link training being triggered by the Linux driver.
npu2-opencapi: Configure NPU for OpenCAPI

Scan the device tree for NPUs with OpenCAPI links and configure the NPU per the initialisation sequence in the NPU OpenCAPI workbook.
capp: Make error in capp timebase sync a non-fatal error

Presently when we encounter an error while synchronizing capp timebase with chip-tod at the end of enable_capi_mode() we return an error. This has two unintended consequences. First, it prevents disabling of fast-reboot even though CAPP is already enabled by this point. Second, failure during timebase sync is a non-fatal error for capp initialization, as CAPP/PSL can continue working after this and an AFU will only see an error when it tries to read the timebase value from PSL.

So this patch updates enable_capi_mode() to not return an error in case the call to chiptod_capp_timebase_sync() fails. The function will now just log an error and continue with the capp init sequence. This aligns the current implementation with the one in the kernel ‘cxl’ driver, which also treats PSL timebase sync errors as non-fatal init errors.
npu2-opencapi: Fix assert on link reset during init

We don’t support resetting an opencapi link yet.

Commit fe6d86b9 (“pci: Make fast reboot creset PHBs in parallel”) tries resetting any PHB whose slot defines a ‘run_sm’ callback. It raises an assert when applied to an opencapi PHB, as ‘run_sm’ calls the ‘freset’ callback, which is not yet defined for opencapi.

Fix it for now by removing the currently useless definition of ‘run_sm’ on the opencapi slot. It will print a message in the skiboot log because the PHB cannot be reset, which is correct. It will all go away when we add support for resetting an opencapi link.
capp: Add lid definition for P9 DD-2.2

Update fsp_lid_map to include CAPP ucode lid for phb4-chipid == 0x202d1 that corresponds to P9 DD-2.2 chip.
capp: Disable fast-reboot when capp is enabled

PCI

Since skiboot-5.11-rc1:

phb4: Reset FIR/NFIR registers before PHB4 probe

The function phb4_probe_stack() resets “ETU Reset Register” to unfreeze the PHB before it performs mmio access on the PHB. However in case the FIR/NFIR registers are set while entering this function, the reset of “ETU Reset Register” won’t unfreeze the PHB and it will remain fenced. This leads to failure during initial CRESET of the PHB as mmio access is still not enabled and an error message of the form below is logged:

PHB#0000[0:0]: Initializing PHB4...
PHB#0000[0:0]: Default system config: 0xffffffffffffffff
PHB#0000[0:0]: New system config : 0xffffffffffffffff
PHB#0000[0:0]: Initial PHB CRESET is 0xffffffffffffffff
PHB#0000[0:0]: Waiting for DLP PG reset to complete...
PHB#0000[0:0]: Timeout waiting for DLP PG reset !
PHB#0000[0:0]: Initialization failed

This is especially seen during the MPIPL flow, where the SBE quiesces and fences the PHB so that it doesn’t stomp on main memory. However when skiboot enters phb4_probe_stack() after MPIPL, the FIR/NFIR registers are set, forcing the PHB to re-enter fence after the ETU reset is done.

So to fix this issue the patch introduces new xscom writes to phb4_probe_stack() to reset the FIR/NFIR registers before performing ETU reset to enable mmio access to the PHB.

Since skiboot-5.10:

pci: Reduce log level of error message

If a link doesn’t train, we can end up with error messages like this:

[ 63.027261959,3] PHB#0032[8:2]: LINK: Timeout waiting for electrical link
[ 63.027265573,3] PHB#0032:00:00.0 Error -6 resetting

The first message is useful but the second message is just debug from the core PCI code and is confusing to print to the console.

This reduces the second print to debug level so it’s not seen by the console by default.
Revert “platforms/astbmc/slots.c: Allow comparison of bus numbers when matching slots”

This reverts commit bda7cc4d0354eb3f66629d410b2afc08c79f795f.

Ben says: It’s on purpose that we do NOT compare the bus numbers; they are always 0 in the slot table. We do a hierarchical walk of the tree, matching only the devfn’s along the way, because the bus numbering isn’t fixed. This breaks all slot naming etc… stuff on anything using the “skiboot” slot tables (P8 opp typically).
core/pci-dt-slot: Fix booting with no slot map

Currently if you don’t have a slot map in the device tree in /ibm,pcie-slots, you can crash with a back trace like this:

CPU 0034 Backtrace:
S: 0000000031cd3370 R: 000000003001362c .backtrace+0x48
S: 0000000031cd3410 R: 0000000030019e38 ._abort+0x4c
S: 0000000031cd3490 R: 000000003002760c .exception_entry+0x180
S: 0000000031cd3670 R: 0000000000001f10 *
S: 0000000031cd3850 R: 00000000300b4f3e * cpu_features_table+0x1d9e
S: 0000000031cd38e0 R: 000000003002682c .dt_node_is_compatible+0x20
S: 0000000031cd3960 R: 0000000030030e08 .map_pci_dev_to_slot+0x16c
S: 0000000031cd3a30 R: 0000000030091054 .dt_slot_get_slot_info+0x28
S: 0000000031cd3ac0 R: 000000003001e27c .pci_scan_one+0x2ac
S: 0000000031cd3ba0 R: 000000003001e588 .pci_scan_bus+0x70
S: 0000000031cd3cb0 R: 000000003001ee74 .pci_scan_phb+0x100
S: 0000000031cd3d40 R: 0000000030017ff0 .cpu_process_jobs+0xdc
S: 0000000031cd3e00 R: 0000000030014cb0 .__secondary_cpu_entry+0x44
S: 0000000031cd3e80 R: 0000000030014d04 .secondary_cpu_entry+0x34
S: 0000000031cd3f00 R: 0000000030002770 secondary_wait+0x8c
[ 73.016947149,3] Fatal MCE at 0000000030026054 .dt_find_property+0x30
[ 73.017073254,3] CFAR : 0000000030026040
[ 73.017138048,3] SRR0 : 0000000030026054 SRR1 : 9000000000201000
[ 73.017198375,3] HSRR0: 0000000000000000 HSRR1: 0000000000000000
[ 73.017263210,3] DSISR: 00000008 DAR : 7c7b1b7848002524
[ 73.017352517,3] LR : 000000003002602c CTR : 000000003009102c
[ 73.017419778,3] CR : 20004204 XER : 20040000
[ 73.017502425,3] GPR00: 000000003002682c GPR16: 0000000000000000
[ 73.017586924,3] GPR01: 0000000031c23670 GPR17: 0000000000000000
[ 73.017643873,3] GPR02: 00000000300fd500 GPR18: 0000000000000000
[ 73.017767091,3] GPR03: fffffffffffffff8 GPR19: 0000000000000000
[ 73.017855707,3] GPR04: 00000000300b3dc6 GPR20: 0000000000000000
[ 73.017943944,3] GPR05: 0000000000000000 GPR21: 00000000300bb6d2
[ 73.018024709,3] GPR06: 0000000031c23910 GPR22: 0000000000000000
[ 73.018117716,3] GPR07: 0000000031c23930 GPR23: 0000000000000000
[ 73.018195974,3] GPR08: 0000000000000000 GPR24: 0000000000000000
[ 73.018278350,3] GPR09: 0000000000000000 GPR25: 0000000000000000
[ 73.018353795,3] GPR10: 0000000000000028 GPR26: 00000000300be6fb
[ 73.018424362,3] GPR11: 0000000000000000 GPR27: 0000000000000000
[ 73.018533159,3] GPR12: 0000000020004208 GPR28: 0000000030767d38
[ 73.018642725,3] GPR13: 0000000031c20000 GPR29: 00000000300b3dc6
[ 73.018737925,3] GPR14: 0000000000000000 GPR30: 0000000000000010
[ 73.018794428,3] GPR15: 0000000000000000 GPR31: 7c7b1b7848002514

This has been seen in the lab on a witherspoon using the device tree entry point (ie. no HDAT).

This fixes the null pointer deref.

Bugs Fixed

Since skiboot-5.11-rc1:

cpufeatures: Fix setting DARN and SCV HWCAP feature bits

DARN and SCV have been assigned AT_HWCAP2 (32-63) bits:

#define PPC_FEATURE2_DARN 0x00200000 /* darn random number insn */
#define PPC_FEATURE2_SCV 0x00100000 /* scv syscall */

A cpufeatures-aware OS will not advertise these to userspace without this patch.
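To make the relationship between the 32-63 bit range and the masks above concrete, here is a small sketch. It assumes a convention in which bit numbers 0-31 select a bit in AT_HWCAP and 32-63 select a bit in AT_HWCAP2, counting from the least significant bit of each 32-bit word; struct hwcap_pair and set_hwcap_bit() are hypothetical names used only for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_FEATURE2_DARN	0x00200000	/* darn random number insn */
#define PPC_FEATURE2_SCV	0x00100000	/* scv syscall */

/* Hypothetical container for the two auxiliary-vector words. */
struct hwcap_pair {
	uint32_t hwcap;		/* bits 0-31 */
	uint32_t hwcap2;	/* bits 32-63 */
};

/* Set feature bit bit_nr (0-63) in the matching word. */
void set_hwcap_bit(struct hwcap_pair *h, unsigned int bit_nr)
{
	assert(h && bit_nr < 64);
	if (bit_nr < 32)
		h->hwcap |= UINT32_C(1) << bit_nr;
	else
		h->hwcap2 |= UINT32_C(1) << (bit_nr - 32);
}
```

Under that convention, 0x00200000 is bit 21 of the second word, so DARN corresponds to bit number 53, and 0x00100000 is bit 20, so SCV corresponds to bit number 52.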
xive: disable store EOI support

Hardware has limitations which would require putting a sync after each store EOI to make sure the MMIO operations that change the ESB state are ordered. This is a killer for performance and the PHBs do not support the sync. So remove store EOI for the moment, until hardware is improved.

Also, while we are changing the XIVE source flags, let’s fix the settings for the PHB4s, which should follow these rules:
- SHIFT_BUG for DD10
- STORE_EOI for DD20 and if enabled
- TRIGGER_PAGE for DDx0 and if not STORE_EOI

Since skiboot-5.10:

xive: fix opal_xive_set_vp_info() error path

In case of error, opal_xive_set_vp_info() will return without unlocking the xive object. This is most certainly a typo.
hw/imc: don’t access homer memory if it was not initialised

This can happen under mambo, at least.
nvram: run nvram_validate() after nvram_reformat()

nvram_reformat() sets nvram_valid = true, but it does not set skiboot_part_hdr. Call nvram_validate() instead, which sets everything up properly.
dts: Zero struct to avoid using uninitialised value
hw/imc: Don’t dereference possible NULL
libstb/create-container: munmap() signature file address
npu2-opencapi: Fix memory leak
npu2: Fix possible NULL dereference
occ-sensors: Remove NULL checks after dereference
core/ipmi-opal: Add interrupt-parent property for ipmi node on P9 and above.

dtc emits the warning below with newer (4.2+) kernels.

dts: Warning (interrupts_property): Missing interrupt-parent for /ibm,opal/ipmi

This fix adds interrupt-parent property under /ibm,opal/ipmi DT node on P9 and above, which allows ipmi-opal to properly use the OPAL irqchip.

Other fixes and improvements

core/cpu: discover stack region size before initialising memory regions

Stack allocation first allocates a memory region sized to hold stacks for all possible CPUs up to the maximum PIR of the architecture, zeros the region, then initialises all stacks. Max PIR is 32768 on POWER9, which is 512MB for stacks.

The stack region is then shrunk after CPUs are discovered, but this is a bit of a hack, and it leaves a hole in the memory allocation regions as it’s done after mem regions are initialised.

0x000000000000..00002fffffff : ibm,os-reserve - OS
0x000030000000..0000303fffff : ibm,firmware-code - OPAL
0x000030400000..000030ffffff : ibm,firmware-heap - OPAL
0x000031000000..000031bfffff : ibm,firmware-data - OPAL
0x000031c00000..000031c0ffff : ibm,firmware-stacks - OPAL
*** gap ***
0x000051c00000..000051d01fff : ibm,firmware-allocs-memory@0 - OPAL
0x000051d02000..00007fffffff : ibm,firmware-allocs-memory@0 - OS
0x000080000000..000080b3cdff : initramfs - OPAL
0x000080b3ce00..000080b7cdff : ibm,fake-nvram - OPAL
0x000080b7ce00..0000ffffffff : ibm,firmware-allocs-memory@0 - OS

This change moves zeroing into the per-cpu stack setup. The boot CPU stack is set up based on the current PIR. Then the size of the stack region is set, by discovering the maximum PIR of the system from the device tree, before mem regions are initialised.

This results in all memory being accounted within memory regions, and less memory fragmentation of OPAL allocations.
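The sizes quoted above can be checked with a little arithmetic. The per-CPU stack size of 16KB is an assumption inferred from the figures in the text (512MB divided by 32768 CPUs); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PIR_P9	32768u		/* maximum PIR on POWER9, from the text */
#define STACK_SIZE	(16u * 1024u)	/* assumption: 16KB of stack per CPU */

/* Bytes needed to hold one stack for each of nr_cpus possible CPUs. */
uint64_t stack_region_bytes(uint32_t nr_cpus)
{
	assert(nr_cpus > 0);
	return (uint64_t)nr_cpus * STACK_SIZE;
}
```

With these values, stack_region_bytes(MAX_PIR_P9) is 0x20000000 bytes (512MB), which is also the distance from 0x31c00000 to 0x51c00000 in the memory map, and the 0x10000-byte ibm,firmware-stacks region that remains corresponds to four 16KB stacks.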
Make gard display show that a record is cleared

When clearing gard records, Hostboot only modifies the record_id portion to be 0xFFFFFFFF. The remainder of the entry remains. Without this change it can be confusing to users to know that the record they are looking at is no longer valid.
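A display tool can detect this state by testing the record_id field alone. The sketch below uses a hypothetical, heavily reduced record structure; only the 0xFFFFFFFF marker value comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GARD_RECORD_ID_CLEARED 0xFFFFFFFFu	/* written by Hostboot on clear */

/* Hypothetical reduced record; the real layout has more fields. */
struct gard_record_sketch {
	uint32_t record_id;
	uint8_t error_type;
};

/* A record is treated as cleared when its id is all ones, whatever the
 * remaining fields still contain. */
bool gard_record_is_cleared(const struct gard_record_sketch *r)
{
	assert(r);
	return r->record_id == GARD_RECORD_ID_CLEARED;
}
```

An all-ones value reads the same in either byte order, so this particular comparison does not depend on how the field is stored.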
Reserve OPAL API number for opal_handle_hmi2 function.
dts: spl_wakeup: Remove all workarounds in the spl wakeup logic

We coded a few workarounds in the special wakeup logic to handle buggy firmware. Now that the firmware is fixed, remove them, as they break the special wakeup protocol. As per the spec we should not de-assert before assert is complete, so follow this protocol.
build: use thin archives rather than incremental linking

This changes the build system to use thin archives rather than incremental linking for built-in.o, similar to a recent change to Linux. built-in.o is renamed to built-in.a, and is created as a thin archive with no index, for speed and size. All built-in.a are aggregated into a skiboot.tmp.a, which is a thin archive built with an index, making it suitable for linking. This is input into the final link.

The advantages of build size and linker code placement flexibility are not as great with skiboot as with a bigger project like Linux, but it’s a conceptually better way to build, and is more compatible with link time optimisation in toolchains, which might be interesting for skiboot, particularly for size reductions.

Size of build tree before this patch is 34.4MB, afterwards 23.1MB.
core/init: Assert when kernel not found

If the kernel doesn’t load out of flash or there is nothing at KERNEL_LOAD_BASE, we end up with an esoteric message as we try to branch out of skiboot into nothing:

[ 0.007197688,3] INIT: ELF header not found. Assuming raw binary.
[ 0.014035267,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13029
[ 0.014042254,3] ***********************************************
[ 0.014069947,3] Fatal Exception 0xe40 at 0000000000000000
[ 0.014085574,3] CFAR : 00000000300051c4
[ 0.014090118,3] SRR0 : 0000000000000000 SRR1 : 0000000000000000
[ 0.014096243,3] HSRR0: 0000000000000000 HSRR1: 9000000000001000
[ 0.014102546,3] DSISR: 00000000 DAR : 0000000000000000
[ 0.014108538,3] LR : 00000000300144c8 CTR : 0000000000000000
[ 0.014114756,3] CR : 40002202 XER : 00000000
[ 0.014120301,3] GPR00: 000000003001447c GPR16: 0000000000000000

This improves the message and asserts in this case:

[ 0.014042685,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13049 bytes)
[ 0.014049556,0] FATAL: Kernel is zeros, can't execute!
[ 0.014054237,0] Assert fail: core/init.c:566:0
[ 0.014060472,0] Aborting!
core: Fix ‘opal-runtime-size’ property

We are populating ‘opal-runtime-size’ before calculating the actual stack size. Hence we end up with the wrong runtime size (e.g. on P9 it shows ~540MB while the actual size is around ~40MB). Note that only the device tree property shows the wrong value; reserved-memory reflects the correct size.

init_all_cpus() calculates and updates actual stack size. Hence move this function call before add_opal_node().
mambo: Add fw-feature flags for security related settings

Newer firmwares report some feature flags related to security settings via HDAT. On real hardware skiboot translates these into device tree properties. For testing purposes just create the properties manually in the tcl.

These values don’t exactly match any actual chip revision, but the code should not rely on any exact set of values anyway. We just define the most interesting flags, that if toggled to “disable” will change Linux behaviour. You can see the actual values in the hostboot source in src/usr/hdat/hdatiplparms.H.

Also add an environment variable for easily toggling the top-level “security on” setting.
direct-controls: mambo fix for multiple chips
libflash/blocklevel: Correct miscalculation in blocklevel_smart_erase()

If blocklevel_smart_erase() detects that the smart erase fits entirely in one erase block, it has an early bail path. In this path it miscalculates where in the buffer the backend needs to read from to perform the final write.
libstb/secureboot: Fix logging of secure verify messages.

Currently we are logging secure verify/enforce messages in PR_EMERG level even when there is no secureboot mode enabled. So reduce the log level to PR_ERR when secureboot mode is OFF.

Testing / Code coverage improvements

Improvements in gcov support include support for newer GCCs as well as easily exporting the area of memory you need to dump to feed to extract-gcov.

cpu_idle_job: relax a bit

This dramatically improves kernel boot time with GCOV builds, from ~3 minutes between loading the kernel and switching the HILE bit down to around 10 seconds.
gcov: Another GCC, another gcov tweak
Keep constructors with priorities

Fixes GCOV builds with gcc7, which uses this.
gcov: Add gcov data struct to sysfs

Extracting the skiboot gcov data is currently a tedious process which involves taking a mem dump of skiboot and searching for the gcov_info struct. This patch adds the gcov struct to sysfs under /opal/exports, allowing the data to be copied directly into userspace and processed.

v5.11-rc1

6 years ago

skiboot-5.11-rc1

skiboot v5.11-rc1 was released on Wednesday March 28th 2018. It is the first release candidate of skiboot 5.11, which will become the new stable release of skiboot following the 5.10 release, first released February 23rd 2018.

We do not expect to keep the 5.11 branch around for long, and instead expect to move quickly on to 6.0, which will mark the basis for op-build v2.0 and will be required for POWER9 systems.

skiboot v5.11-rc1 contains all bug fixes as of skiboot-5.10.3 and skiboot-5.4.9 (the currently maintained stable releases). There may be more 5.10.x stable releases, it will depend on demand.

For how the skiboot stable releases work, see Skiboot stable tree rules and releases.

The current plan is to cut the final 5.11 in March, with skiboot 5.11 being for all POWER8 and POWER9 platforms in op-build v1.22. This release is targeted to early POWER9 systems.

Over skiboot-5.10, we have the following changes:

New Platforms

Add VESNIN platform support

The Vesnin platform from YADRO is a 4 socket POWER8 system with up to 8TB of memory with 460GB/s of memory bandwidth in only 2U. Many kudos to the team from Yadro for submitting their code upstream!

New Features

fast-reboot: enable by default for POWER9
- Fast reboot is disabled if NPU2 is present or CAPI2/OpenCAPI is used
PCI tunneled operations on PHB4
- phb4: set PBCQ Tunnel BAR for tunneled operations
  
  P9 supports PCI tunneled operations (atomics and as_notify) that are initiated by devices.
  
  A subset of the tunneled operations require a response, that must be sent back from the host to the device. For example, an atomic compare and swap will return the compare status, as the swap will only be performed in case of success. Similarly, as_notify reports if the target thread has been woken up or not, because the operation may fail.
  
  To enable tunneled operations, a device driver must tell the host where it expects tunneled operation responses, by setting the PBCQ Tunnel BAR Response register with a specific value within the range of its BARs.
  
  This register is currently initialized by enable_capi_mode(). But, as tunneled operations may also operate in PCI mode, a new API is required to set the PBCQ Tunnel BAR Response register, without switching to CAPI mode.
  
  This patch provides two new OPAL calls to get/set the PBCQ Tunnel BAR Response register.
  
  Note: as there is only one PBCQ Tunnel BAR register, shared between all the devices connected to the same PHB, only one of these devices will be able to use tunneled operations, at any time.
- phb4: set PHB CMPM registers for tunneled operations
  
  P9 supports PCI tunneled operations (atomics and as_notify) that require setting the PHB ASN Compare/Mask register with a 16-bit indication.
  
  This register is currently initialized by enable_capi_mode(). But, as tunneled operations may also work in PCI mode, the ASN Compare/Mask register should rather be initialized in phb4_init_ioda3().
  
  This patch also adds “ibm,phb-indications” to the device tree, to tell Linux the values of CAPI, ASN, and NBW indications, when supported.
  
  Tunneled operations tested by IBM in CAPI mode, by Mellanox Technologies in PCI mode.
Tie tm-suspend fw-feature and opal_reinit_cpus() together

Currently opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED) always returns OPAL_UNSUPPORTED.

This ties the tm suspend fw-feature to the opal_reinit_cpus(OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED) so that when tm suspend is disabled, we correctly report it to the kernel. For backwards compatibility, it’s assumed tm suspend is available if the fw-feature is not present.

Currently hostboot will clear fw-feature(TM_SUSPEND_ENABLED) on P9N DD2.1. P9N DD2.2 will set fw-feature(TM_SUSPEND_ENABLED). DD2.0 and below have TM disabled completely (not just suspend).

We are using opal_reinit_cpus() to determine this setting (rather than the device tree/HDAT) as some future firmware may let us change this dynamically after boot. That is not the case currently though.

Power Management

SLW: Increase stop4-5 residency by 10x

Using the DGEMM benchmark we observed a throughput difference of 5-9% between runs with and without stop4/5. In this benchmark the GPU waits on the CPU to wake up and provide the subsequent data block to compute. The wakeup latency accumulates over the run and shows up as a performance drop.

Linux enters stop4/5 more aggressively for its wakeup latency. Increasing the residency from 1ms to 10ms makes the performance drop <1%.
occ: Set up OCC messaging even if we fail to setup pstates

This means that we no longer hit this bug if we fail to get valid pstates from the OCC.

[console-pexpect]#echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
[ 94.019971181,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 94.020098392,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
[ 10.318805] Disabling lock debugging due to kernel taint
[ 10.318808] Severe Machine check interrupt [Not recovered]
[ 10.318812] NIP [000000003003e434]: 0x3003e434
[ 10.318813] Initiator: CPU
[ 10.318815] Error type: Real address [Load/Store (foreign)]
[ 10.318817] opal: Hardware platform error: Unrecoverable Machine Check exception
[ 10.318821] CPU: 117 PID: 2745 Comm: sh Tainted: G M 4.15.9-openpower1 #3
[ 10.318823] NIP: 000000003003e434 LR: 000000003003025c CTR: 0000000030030240
[ 10.318825] REGS: c00000003fa7bd80 TRAP: 0200 Tainted: G M (4.15.9-openpower1)
[ 10.318826] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 48002888 XER: 20040000
[ 10.318831] CFAR: 0000000030030258 DAR: 394a00147d5a03a6 DSISR: 00000008 SOFTE: 1

mbox based platforms

mbox: Reduce default BMC timeouts

Rebooting a BMC can take 70 seconds. Skiboot cannot possibly spin for 70 seconds waiting for a BMC to come back. This also makes the current default of 30 seconds a bit pointless: it is far too short to be a worst case wait time, but too long to avoid hitting hardlockup detectors and wreaking havoc inside host Linux.

Just change it to three seconds so that host Linux will survive: reads and writes will fail, but at least the host stays up.

Also refactored the waiting loop just a bit so that it’s easier to read.
mbox: Harden against BMC daemon errors

Bugs present in the BMC daemon mean that skiboot gets presented with mbox windows of size zero. These windows cannot be valid and skiboot already detects these conditions.

Currently skiboot warns quite strongly about the occurrence of these problems. The problem for skiboot is that it doesn’t take any action. Initially I wanted to avoid putting policy like this into skiboot, but since these bugs aren’t going away and skiboot barfing is leading to lockups and ultimately the host going down, something needs to be done.

I propose that when we detect the problem we fail the mbox call and punt the problem back up to Linux. I don’t like it but at least it will cause errors to cascade and won’t bring the host down. I’m not sure how Linux is supposed to detect this or what it can even do but this is better than a crash.

Diagnosing a failure to boot if skiboot itself fails to read flash may be marginally more difficult with this patch. This is because skiboot will now only print one warning about the zero sized window rather than continuously spitting it out.

Fast Reboot Improvements

Around fast-reboot we have made several improvements to harden the fast reboot code paths and resort to a full IPL if something doesn’t look right.

core/fast-reboot: zero memory after fast reboot

This improves the security and predictability of the fast reboot environment.

There can not be a secure fence between fast reboots, because a malicious OS can modify the firmware itself. However a well-behaved OS can have a reasonable expectation that OS memory regions it has modified will be cleared upon fast reboot.

The memory is zeroed after all other CPUs come up from fast reboot, just before the new kernel is loaded and booted into. This allows image preloading to run concurrently, and will allow parallelisation of the clearing in future.
core/fast-reboot: verify mem regions before fast reboot

Run the mem_region sanity checkers before proceeding with fast reboot.

This is the beginning of proactive sanity checks on opal data for fast reboot (which complements the reactive disable_fast_reboot cases). Re-using and sharing any kind of debug code and unit test code for this is encouraged.
fast-reboot: occ: Only delete /ibm,opal/power-mgt nodes if they exist
core/fast-reboot: disable fast reboot upon fundamental entry/exit/locking errors

This disables fast reboot in several more cases where serious errors like lock corruption or call re-entrancy are detected.
capp: Disable fast-reboot whenever enable_capi_mode() is called

This patch updates phb4_set_capi_mode() to disable fast-reboot whenever enable_capi_mode() is called, irrespective of its return value. This guards against the possibility of fast-reboot not being disabled when some change to enable_capi_mode() causes it to return an error while leaving CAPP in enabled mode.
fast-reboot: occ: Delete OCC child nodes in /ibm,opal/power-mgt

Fast-reboot in P8 fails to re-init OCC data as there are chipwise OCC nodes which are already present in the /ibm,opal/power-mgt node. These per-chip nodes hold the voltage IDs for each pstate and these can be changed on OCC pstate table biasing. So delete these before calling the re-init code to re-parse and populate the pstate data.

Debugging/SRESET improvements

core/opal: allow some re-entrant calls

This allows a small number of OPAL calls to succeed despite re-entering the firmware, and rejects others rather than aborting.

This allows a system reset interrupt that interrupts OPAL to do something useful: sreset other CPUs, use the console (which allows xmon to work or stack traces to be printed), or reboot the system.

Use OPAL_INTERNAL_ERROR when rejecting, rather than OPAL_BUSY, which is used for many other things that do not mean a serious permanent error.
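
As a rough sketch of the idea, a re-entrant call can be checked against a small allow-list and rejected otherwise. The token names, the `SKETCH_*` return codes and the helper functions below are illustrative stand-ins chosen for this example; the real tokens and codes are defined in skiboot's OPAL API headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative tokens only; not the real OPAL token numbers. */
enum sketch_token {
	TOK_CONSOLE_WRITE,
	TOK_CEC_REBOOT,
	TOK_SIGNAL_SYSTEM_RESET,
	TOK_PCI_CONFIG_READ,
};

/* Stand-ins for OPAL_SUCCESS and OPAL_INTERNAL_ERROR. */
#define SKETCH_SUCCESS          0
#define SKETCH_INTERNAL_ERROR (-11)

/* Calls assumed safe while another OPAL call is in flight on this CPU. */
static const enum sketch_token reentrant_ok[] = {
	TOK_CONSOLE_WRITE, TOK_CEC_REBOOT, TOK_SIGNAL_SYSTEM_RESET,
};

bool token_may_reenter(enum sketch_token tok)
{
	size_t i;

	for (i = 0; i < sizeof(reentrant_ok) / sizeof(reentrant_ok[0]); i++)
		if (reentrant_ok[i] == tok)
			return true;
	return false;
}

/* in_opal: true if this CPU was already inside firmware when the call arrived. */
int sketch_opal_entry(bool in_opal, enum sketch_token tok)
{
	if (in_opal && !token_may_reenter(tok))
		return SKETCH_INTERNAL_ERROR;	/* reject, but do not abort */
	return SKETCH_SUCCESS;
}
```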
core/opal: abort in case of re-entrant OPAL call

The stack is already destroyed by the time we get here, so there is not much point continuing.
core/lock: Add lock timeout warnings

There are currently no timeout warnings for locks in skiboot. We assume that the lock will eventually become free, which may not always be the case.

This patch adds timeout warnings for locks. Any lock which spins for more than 5 seconds will throw a warning and stacktrace for that thread. This is useful for debugging situations where a thread hangs, waiting for a lock to be freed.
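
The check itself is simple. A minimal sketch, assuming a millisecond clock and a per-attempt state structure (both hypothetical, as is the choice to warn only once per acquisition attempt):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LOCK_TIMEOUT_MS 5000ULL	/* warn after more than 5 seconds */

struct spin_state {
	uint64_t start_ms;	/* when this thread started spinning */
	bool warned;		/* assumption: warn once per attempt */
};

/*
 * Called after each failed attempt to take the lock.
 * Returns true if a warning was emitted on this call.
 */
bool spin_check_timeout(struct spin_state *s, uint64_t now_ms)
{
	if (s->warned || now_ms - s->start_ms <= LOCK_TIMEOUT_MS)
		return false;

	fprintf(stderr, "WARNING: lock has been spinning for %llums\n",
		(unsigned long long)(now_ms - s->start_ms));
	/* A real implementation would also print a stack trace here. */
	s->warned = true;
	return true;
}
```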
core/lock: Add deadlock detection

This adds simple deadlock detection. The detection looks for circular dependencies in the lock requests. It will abort and display a stack trace when a deadlock occurs. The detection is enabled by DEBUG_LOCKS (enabled by default). While the detection may have a slight performance overhead, as there are not a huge number of locks in skiboot this overhead isn’t significant.
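
One way to look for a circular dependency is to follow the chain "owner of the requested lock, lock that owner is waiting on, owner of that lock" and see whether it leads back to the requester. The sketch below assumes fixed-size tables indexed by CPU and lock number; these tables and the function name are invented for the example and are not skiboot's data structures.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define NR_LOCKS  4
#define NONE     (-1)

/* lock_owner[l]: CPU currently holding lock l, or NONE. */
int lock_owner[NR_LOCKS] = { NONE, NONE, NONE, NONE };
/* cpu_waiting[c]: lock that CPU c is currently spinning on, or NONE. */
int cpu_waiting[NR_CPUS] = { NONE, NONE, NONE, NONE };

/*
 * Would CPU `cpu` requesting lock `lock` close a cycle?
 * The walk is bounded by NR_CPUS so it always terminates, even if
 * the tables already contain a cycle that does not involve `cpu`.
 */
bool request_would_deadlock(int cpu, int lock)
{
	int steps;

	for (steps = 0; steps < NR_CPUS; steps++) {
		int owner = lock_owner[lock];

		if (owner == NONE)
			return false;	/* lock is free: no dependency */
		if (owner == cpu)
			return true;	/* chain leads back to the requester */
		lock = cpu_waiting[owner];
		if (lock == NONE)
			return false;	/* owner is running, it will release */
	}
	return false;
}
```

For example, if CPU 0 holds lock 0 while CPU 1 holds lock 1 and spins on lock 0, a request by CPU 0 for lock 1 is reported as a deadlock.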
core/hmi: report processor recovery reason from core FIR bits on P9

When an error is encountered that causes processor recovery, HMI is generated if the recovery was successful. The reason is recorded in the core FIR, which gets copied into the WOF.

In this case dump the WOF register and an error string into the OPAL msglog.

A broken init setting led to HMIs reported in Linux as:

[ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 3.591648] Error detail: Processor Recovery done
[ 3.591714] HMER: 2040000000000000

This patch would have been useful because it tells us exactly that the problem is in the d-side ERAT:

[ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
[ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred.
[ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error:
[ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc)
[ 414.489702341,7] HMI: LSU - ERAT multi hit

In future it will be good to unify this reporting, so Linux could print something more useful. Until then, this gives some good data.
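
Decoding of this kind usually comes down to a table of bit positions and strings. The sketch below reproduces only the two reasons visible in the log above. The bit positions (29 and 35, in IBM numbering where bit 0 is the most significant bit) are inferred from the WOF value 0x0000000410000000 and the order of the two messages, so treat them as an assumption rather than as the table skiboot uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* IBM bit numbering: bit 0 is the most significant bit of the 64-bit word. */
#define PPC_BIT(n) (0x8000000000000000ULL >> (n))

struct wof_reason {
	int bit;
	const char *text;
};

/* Assumed mapping, inferred from the example log; not an exhaustive table. */
static const struct wof_reason wof_reasons[] = {
	{ 29, "LSU - SRAM (DCACHE parity, etc)" },
	{ 35, "LSU - ERAT multi hit" },
};

/* Print one line per recognised reason; returns how many bits matched. */
int report_wof(uint64_t wof)
{
	int hits = 0;
	size_t i;

	printf("HMI: Core WOF = 0x%016llx recovered error:\n",
	       (unsigned long long)wof);
	for (i = 0; i < sizeof(wof_reasons) / sizeof(wof_reasons[0]); i++) {
		if (wof & PPC_BIT(wof_reasons[i].bit)) {
			printf("HMI: %s\n", wof_reasons[i].text);
			hits++;
		}
	}
	return hits;
}
```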

NPU2/NVLink2 Fixes

npu2: Add performance tuning SCOM inits

Peer-to-peer GPU bandwidth latency testing has produced some tunable values that improve performance. Add them to our device initialization.

File these under things that need to be cleaned up with nice #defines for the register names and bitfields when we get time.

A few of the settings are dependent on the system’s particular NVLink topology, so introduce a helper to determine how many links go to a single GPU.
hw/npu2: Assign a unique LPARSHORTID per GPU

This gets used elsewhere to index items in the XTS tables.
NPU2: dump NPU2 registers on npu2 HMI

Due to the nature of debugging npu2 issues, folk are wanting the full list of NPU2 registers dumped when there’s a problem.
npu2: Remove DD1 support

Major changes in the NPU between DD1 and DD2 necessitated a fair bit of revision-specific code.

Now that all our lab machines are DD2, we no longer test anything on DD1 and it’s time to get rid of it.

Remove DD1-specific code and abort probe if we’re running on a DD1 machine.
npu2: Disable fast reboot

Fast reboot does not yet work right with the NPU. It’s been disabled on NVLink and OpenCAPI machines. Do the same for NVLink2.

This amounts to a port of 3e4577939bbf (“npu: Fix broken fast reset”) from the npu code to npu2.
npu2: Use unfiltered mode in XTS tables

The XTS_PID context table is limited to 256 possible pids/contexts. To relieve this limitation, make use of “unfiltered mode” instead.

If an entry in the XTS_BDF table has the bit for unfiltered mode set, we can just use one context for that entire bdf/lpar, regardless of pid. Instead of searching the XTS_PID table, the NMMU checkout request will simply use the entry indexed by lparshort id instead.

Change opal_npu_init_context() to create these lparshort-indexed wildcard entries (0-15) instead of allocating one for each pid. Check that multiple calls for the same bdf all specify the same msr value.

In opal_npu_destroy_context(), continue validating the bdf argument, ensuring that it actually maps to an lpar, but no longer remove anything from the XTS_PID table. If/when we start supporting virtualized GPUs, we might consider actually removing these wildcard entries by keeping a refcount, but keep things simple for now.
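
The bookkeeping described here can be sketched as a 16-entry table indexed by lparshort id. The structure, function names and return values below are hypothetical and only model the rules stated above: one wildcard entry per lparshort id, repeated init calls must agree on the msr value, and destroy validates but removes nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_LPARSHORT 16	/* wildcard entries 0-15 */

struct wildcard_entry {
	bool valid;
	uint64_t msr;
};

struct wildcard_entry xts_wildcards[NR_LPARSHORT];

/* Returns the entry index on success, -1 on a bad id or an msr mismatch. */
int sketch_init_context(unsigned int lparshort, uint64_t msr)
{
	struct wildcard_entry *e;

	if (lparshort >= NR_LPARSHORT)
		return -1;

	e = &xts_wildcards[lparshort];
	if (e->valid)
		return e->msr == msr ? (int)lparshort : -1;

	e->valid = true;
	e->msr = msr;
	return (int)lparshort;
}

/* Validates the argument only; the wildcard entry is left in place. */
int sketch_destroy_context(unsigned int lparshort)
{
	return lparshort < NR_LPARSHORT ? 0 : -1;
}
```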

CAPI/OpenCAPI

npu2-opencapi: Add OpenCAPI OPAL API calls

Add three OPAL API calls that are required by the ocxl driver.
- OPAL_NPU_SPA_SETUP
  
  The Shared Process Area (SPA) is a table containing one entry (a “Process Element”) per memory context which can be accessed by the OpenCAPI device.
- OPAL_NPU_SPA_CLEAR_CACHE
  
  The NPU keeps a cache of recently accessed memory contexts. When a Process Element is removed from the SPA, the cache for the link must be cleared.
- OPAL_NPU_TL_SET
  
  The Transaction Layer specification defines several templates for messages to be exchanged on the link. During link setup, the host and device must negotiate what templates are supported on both sides and at what rates those messages can be sent.
npu2-opencapi: Train OpenCAPI links and setup devices

Scan the OpenCAPI links under the NPU, and for each link, reset the card, set up a device, train the link and register a PHB.

Implement the necessary operations for the OpenCAPI PHB type.

For bringup, test and debug purposes, we allow an NVRAM setting, “opencapi-link-training” that can be set to either disable link training completely or to use the prbs31 test pattern.

To disable link training:

nvram -p ibm,skiboot --update-config opencapi-link-training=none

To use prbs31:

nvram -p ibm,skiboot --update-config opencapi-link-training=prbs31
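
A sketch of how the two values might be interpreted on the firmware side. The enum, the function name and the fallback to normal training for an unset or unrecognised value are assumptions made for the example:

```c
#include <assert.h>
#include <string.h>

enum link_training {
	TRAIN_DEFAULT,	/* normal link training */
	TRAIN_NONE,	/* opencapi-link-training=none */
	TRAIN_PRBS31,	/* opencapi-link-training=prbs31 */
};

/* value: the string stored in NVRAM, or NULL if the option is not set. */
enum link_training parse_link_training(const char *value)
{
	if (!value)
		return TRAIN_DEFAULT;
	if (strcmp(value, "none") == 0)
		return TRAIN_NONE;
	if (strcmp(value, "prbs31") == 0)
		return TRAIN_PRBS31;
	/* Assumption: unknown values fall back to normal training. */
	return TRAIN_DEFAULT;
}
```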
npu2-hw-procedures: Add support for OpenCAPI PHY link training

Unlike NVLink, which uses the pci-virt framework to fake a PCI configuration space for NVLink devices, the OpenCAPI device model presents us with a real configuration space handled by the device over the OpenCAPI link.

As a result, we have to train the OpenCAPI link in skiboot before we do PCI probing, so that config space can be accessed, rather than having link training being triggered by the Linux driver.
npu2-opencapi: Configure NPU for OpenCAPI

Scan the device tree for NPUs with OpenCAPI links and configure the NPU per the initialisation sequence in the NPU OpenCAPI workbook.
capp: Make error in capp timebase sync a non-fatal error

Presently when we encounter an error while synchronizing capp timebase with chip-tod at the end of enable_capi_mode() we return an error. This has two unintended consequences. First, this will prevent disabling of fast-reboot even though CAPP is already enabled by this point. Secondly, failure during timebase sync is a non-fatal error for capp initialization, as CAPP/PSL can continue working after this and an AFU will only see an error when it tries to read the timebase value from PSL.

So this patch updates enable_capi_mode() to not return an error in case the call to chiptod_capp_timebase_sync() fails. The function will now just log an error and continue further with the capp init sequence. This makes the current implementation align with the one in the kernel ‘cxl’ driver, which also treats PSL timebase sync errors as non-fatal init errors.
npu2-opencapi: Fix assert on link reset during init

We don’t support resetting an opencapi link yet.

Commit fe6d86b9 (“pci: Make fast reboot creset PHBs in parallel”) tries resetting any PHB whose slot defines a ‘run_sm’ callback. It raises an assert when applied to an opencapi PHB, as ‘run_sm’ calls the ‘freset’ callback, which is not yet defined for opencapi.

Fix it for now by removing the currently useless definition of ‘run_sm’ on the opencapi slot. It will print a message in the skiboot log because the PHB cannot be reset, which is correct. It will all go away when we add support for resetting an opencapi link.
capp: Add lid definition for P9 DD-2.2

Update fsp_lid_map to include CAPP ucode lid for phb4-chipid == 0x202d1 that corresponds to P9 DD-2.2 chip.
capp: Disable fast-reboot when capp is enabled

PCI

pci: Reduce log level of error message

If a link doesn’t train, we can end up with error messages like this:

[ 63.027261959,3] PHB#0032[8:2]: LINK: Timeout waiting for electrical link
[ 63.027265573,3] PHB#0032:00:00.0 Error -6 resetting

The first message is useful but the second message is just debug from the core PCI code and is confusing to print to the console.

This reduces the second print to debug level so it’s not seen by the console by default.
Revert “platforms/astbmc/slots.c: Allow comparison of bus numbers when matching slots”

This reverts commit bda7cc4d0354eb3f66629d410b2afc08c79f795f.

Ben says: It’s on purpose that we do NOT compare the bus numbers; they are always 0 in the slot table. We do a hierarchical walk of the tree, matching only the devfn’s along the way, because the bus numbering isn’t fixed. This breaks all slot naming etc… stuff on anything using the “skiboot” slot tables (P8 opp typically).
core/pci-dt-slot: Fix booting with no slot map

Currently if you don’t have a slot map in the device tree in /ibm,pcie-slots, you can crash with a back trace like this:

CPU 0034 Backtrace:
S: 0000000031cd3370 R: 000000003001362c .backtrace+0x48
S: 0000000031cd3410 R: 0000000030019e38 ._abort+0x4c
S: 0000000031cd3490 R: 000000003002760c .exception_entry+0x180
S: 0000000031cd3670 R: 0000000000001f10 *
S: 0000000031cd3850 R: 00000000300b4f3e * cpu_features_table+0x1d9e
S: 0000000031cd38e0 R: 000000003002682c .dt_node_is_compatible+0x20
S: 0000000031cd3960 R: 0000000030030e08 .map_pci_dev_to_slot+0x16c
S: 0000000031cd3a30 R: 0000000030091054 .dt_slot_get_slot_info+0x28
S: 0000000031cd3ac0 R: 000000003001e27c .pci_scan_one+0x2ac
S: 0000000031cd3ba0 R: 000000003001e588 .pci_scan_bus+0x70
S: 0000000031cd3cb0 R: 000000003001ee74 .pci_scan_phb+0x100
S: 0000000031cd3d40 R: 0000000030017ff0 .cpu_process_jobs+0xdc
S: 0000000031cd3e00 R: 0000000030014cb0 .__secondary_cpu_entry+0x44
S: 0000000031cd3e80 R: 0000000030014d04 .secondary_cpu_entry+0x34
S: 0000000031cd3f00 R: 0000000030002770 secondary_wait+0x8c
[ 73.016947149,3] Fatal MCE at 0000000030026054 .dt_find_property+0x30
[ 73.017073254,3] CFAR : 0000000030026040
[ 73.017138048,3] SRR0 : 0000000030026054 SRR1 : 9000000000201000
[ 73.017198375,3] HSRR0: 0000000000000000 HSRR1: 0000000000000000
[ 73.017263210,3] DSISR: 00000008 DAR : 7c7b1b7848002524
[ 73.017352517,3] LR : 000000003002602c CTR : 000000003009102c
[ 73.017419778,3] CR : 20004204 XER : 20040000
[ 73.017502425,3] GPR00: 000000003002682c GPR16: 0000000000000000
[ 73.017586924,3] GPR01: 0000000031c23670 GPR17: 0000000000000000
[ 73.017643873,3] GPR02: 00000000300fd500 GPR18: 0000000000000000
[ 73.017767091,3] GPR03: fffffffffffffff8 GPR19: 0000000000000000
[ 73.017855707,3] GPR04: 00000000300b3dc6 GPR20: 0000000000000000
[ 73.017943944,3] GPR05: 0000000000000000 GPR21: 00000000300bb6d2
[ 73.018024709,3] GPR06: 0000000031c23910 GPR22: 0000000000000000
[ 73.018117716,3] GPR07: 0000000031c23930 GPR23: 0000000000000000
[ 73.018195974,3] GPR08: 0000000000000000 GPR24: 0000000000000000
[ 73.018278350,3] GPR09: 0000000000000000 GPR25: 0000000000000000
[ 73.018353795,3] GPR10: 0000000000000028 GPR26: 00000000300be6fb
[ 73.018424362,3] GPR11: 0000000000000000 GPR27: 0000000000000000
[ 73.018533159,3] GPR12: 0000000020004208 GPR28: 0000000030767d38
[ 73.018642725,3] GPR13: 0000000031c20000 GPR29: 00000000300b3dc6
[ 73.018737925,3] GPR14: 0000000000000000 GPR30: 0000000000000010
[ 73.018794428,3] GPR15: 0000000000000000 GPR31: 7c7b1b7848002514

This has been seen in the lab on a witherspoon using the device tree entry point (ie. no HDAT).

This fixes the null pointer deref.

Bugs Fixed

xive: fix opal_xive_set_vp_info() error path

In case of error, opal_xive_set_vp_info() will return without unlocking the xive object. This is most certainly a typo.
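
The usual shape of this kind of fix is a single exit path that always drops the lock. The sketch below uses a counter as a stand-in for a real lock so that the balance of take and release is visible; all names in it are hypothetical and it is not the xive code.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a real lock: depth is 0 when the lock is free. */
struct sketch_lock {
	int depth;
};

void lock_take(struct sketch_lock *l)    { l->depth++; }
void lock_release(struct sketch_lock *l) { l->depth--; }

struct sketch_xive {
	struct sketch_lock lock;
	unsigned int vp_flags;
};

#define SKETCH_PARAMETER (-1)

int sketch_set_vp_info(struct sketch_xive *x, unsigned int flags, bool flags_valid)
{
	int rc = 0;

	lock_take(&x->lock);
	if (!flags_valid) {
		rc = SKETCH_PARAMETER;
		goto out;	/* a bare "return rc;" here would leave the lock held */
	}
	x->vp_flags = flags;
out:
	lock_release(&x->lock);
	return rc;
}
```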
hw/imc: don’t access homer memory if it was not initialised

This can happen under mambo, at least.
nvram: run nvram_validate() after nvram_reformat()

nvram_reformat() sets nvram_valid = true, but it does not set skiboot_part_hdr. Call nvram_validate() instead, which sets everything up properly.
dts: Zero struct to avoid using uninitialised value
hw/imc: Don’t dereference possible NULL
libstb/create-container: munmap() signature file address
npu2-opencapi: Fix memory leak
npu2: Fix possible NULL dereference
occ-sensors: Remove NULL checks after dereference
core/ipmi-opal: Add interrupt-parent property for ipmi node on P9 and above.

dtc complains with the warning below on newer 4.2+ kernels.

dts: Warning (interrupts_property): Missing interrupt-parent for /ibm,opal/ipmi

This fix adds interrupt-parent property under /ibm,opal/ipmi DT node on P9 and above, which allows ipmi-opal to properly use the OPAL irqchip.

Other fixes and improvements

core/cpu: discover stack region size before initialising memory regions

Stack allocation first allocates a memory region sized to hold stacks for all possible CPUs up to the maximum PIR of the architecture, zeros the region, then initialises all stacks. Max PIR is 32768 on POWER9, which is 512MB for stacks.

The stack region is then shrunk after CPUs are discovered, but this is a bit of a hack, and it leaves a hole in the memory allocation regions as it’s done after mem regions are initialised.

0x000000000000..00002fffffff : ibm,os-reserve - OS
0x000030000000..0000303fffff : ibm,firmware-code - OPAL
0x000030400000..000030ffffff : ibm,firmware-heap - OPAL
0x000031000000..000031bfffff : ibm,firmware-data - OPAL
0x000031c00000..000031c0ffff : ibm,firmware-stacks - OPAL
*** gap ***
0x000051c00000..000051d01fff : ibm,firmware-allocs-memory@0 - OPAL
0x000051d02000..00007fffffff : ibm,firmware-allocs-memory@0 - OS
0x000080000000..000080b3cdff : initramfs - OPAL
0x000080b3ce00..000080b7cdff : ibm,fake-nvram - OPAL
0x000080b7ce00..0000ffffffff : ibm,firmware-allocs-memory@0 - OS

This change moves zeroing into the per-cpu stack setup. The boot CPU stack is set up based on the current PIR. Then the size of the stack region is set, by discovering the maximum PIR of the system from the device tree, before mem regions are initialised.

This results in all memory being accounted within memory regions, and less memory fragmentation of OPAL allocations.
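
The arithmetic behind these figures is easy to check. The per-CPU stack size of 16KB used below is not stated in the text; it is inferred from 512MB divided by 32768 stacks, so treat it as an assumption. With it, the 64KB firmware-stacks region in the map above corresponds to 4 stacks, and the distance from the start of that region (0x31c00000) to the first address after the gap (0x51c00000) is 0x20000000, i.e. the original 512MB allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-CPU stack size: 512MB / 32768 stacks = 16KB. */
#define STACK_SIZE (16 * 1024ULL)

/* Size of a region holding one stack for each of nr_stacks CPUs. */
uint64_t stack_region_size(uint32_t nr_stacks)
{
	return (uint64_t)nr_stacks * STACK_SIZE;
}
```

Sizing the region from the largest PIR actually present in the device tree, instead of the architectural maximum, is what removes the gap.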
Make gard display show that a record is cleared

When clearing gard records, Hostboot only modifies the record_id portion to be 0xFFFFFFFF. The remainder of the entry remains. Without this change it can be confusing to users to know that the record they are looking at is no longer valid.
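
The test for a cleared record is then a comparison against the all-ones id. The record layout below is a trimmed-down, hypothetical one; only the record_id field and the 0xFFFFFFFF value come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLEARED_RECORD_ID 0xFFFFFFFFu

/* Hypothetical, reduced record: the real structure has more fields. */
struct sketch_gard_record {
	uint32_t record_id;
	uint32_t target;	/* stays populated after the record is cleared */
};

/* All-ones reads the same in either byte order, so no swap is needed here. */
bool gard_record_is_cleared(const struct sketch_gard_record *r)
{
	return r->record_id == CLEARED_RECORD_ID;
}
```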
Reserve OPAL API number for opal_handle_hmi2 function.
dts: spl_wakeup: Remove all workarounds in the spl wakeup logic

We coded a few workarounds in the special wakeup logic to handle buggy firmware. Now that the firmware is fixed, remove them, as they break the special wakeup protocol. As per the spec we should not de-assert before assert is complete, so follow this protocol.
build: use thin archives rather than incremental linking

This changes the build system to use thin archives rather than incremental linking for built-in.o, similar to a recent change to Linux. built-in.o is renamed to built-in.a, and is created as a thin archive with no index, for speed and size. All built-in.a are aggregated into a skiboot.tmp.a which is a thin archive built with an index, making it suitable for linking. This is input into the final link.

The advantages of build size and linker code placement flexibility are not as great with skiboot as with a bigger project like Linux, but it’s a conceptually better way to build, and is more compatible with link time optimisation in toolchains, which might be interesting for skiboot, particularly for size reductions.

Size of build tree before this patch is 34.4MB, afterwards 23.1MB.
core/init: Assert when kernel not found

If the kernel doesn’t load out of flash or there is nothing at KERNEL_LOAD_BASE, we end up with an esoteric message as we try to branch out of skiboot into nothing:

[ 0.007197688,3] INIT: ELF header not found. Assuming raw binary.
[ 0.014035267,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13029
[ 0.014042254,3] ***********************************************
[ 0.014069947,3] Fatal Exception 0xe40 at 0000000000000000
[ 0.014085574,3] CFAR : 00000000300051c4
[ 0.014090118,3] SRR0 : 0000000000000000 SRR1 : 0000000000000000
[ 0.014096243,3] HSRR0: 0000000000000000 HSRR1: 9000000000001000
[ 0.014102546,3] DSISR: 00000000 DAR : 0000000000000000
[ 0.014108538,3] LR : 00000000300144c8 CTR : 0000000000000000
[ 0.014114756,3] CR : 40002202 XER : 00000000
[ 0.014120301,3] GPR00: 000000003001447c GPR16: 0000000000000000

This improves the message and asserts in this case:

[ 0.014042685,5] INIT: Starting kernel at 0x0, fdt at 0x3044ad90 13049 bytes)
[ 0.014049556,0] FATAL: Kernel is zeros, can't execute!
[ 0.014054237,0] Assert fail: core/init.c:566:0
[ 0.014060472,0] Aborting!
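
A check of this kind can be as small as scanning the start of the image for non-zero bytes before branching to it. The helper below is a hypothetical sketch; how many bytes the real code inspects is not stated here, so the length is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the first len bytes at base are all zero. */
bool image_is_zeros(const uint8_t *base, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (base[i] != 0)
			return false;
	return true;
}
```

A caller would log the fatal message and assert when this returns true, instead of jumping into empty memory.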
core: Fix ‘opal-runtime-size’ property

We are populating ‘opal-runtime-size’ before calculating the actual stack size. Hence we end up with the wrong runtime size (e.g. on P9 it shows ~540MB while the actual size is around ~40MB). Note that only the device tree property shows the wrong value; reserved-memory reflects the correct size.

init_all_cpus() calculates and updates actual stack size. Hence move this function call before add_opal_node().
mambo: Add fw-feature flags for security related settings

Newer firmwares report some feature flags related to security settings via HDAT. On real hardware skiboot translates these into device tree properties. For testing purposes just create the properties manually in the tcl.

These values don’t exactly match any actual chip revision, but the code should not rely on any exact set of values anyway. We just define the most interesting flags, that if toggled to “disable” will change Linux behaviour. You can see the actual values in the hostboot source in src/usr/hdat/hdatiplparms.H.

Also add an environment variable for easily toggling the top-level “security on” setting.
direct-controls: mambo fix for multiple chips
libflash/blocklevel: Correct miscalculation in blocklevel_smart_erase()

If blocklevel_smart_erase() detects that the smart erase fits entirely in one erase block, it has an early bail path. In this path it miscalculates where in the buffer the backend needs to read from to perform the final write.
libstb/secureboot: Fix logging of secure verify messages.

Currently we are logging secure verify/enforce messages in PR_EMERG level even when there is no secureboot mode enabled. So reduce the log level to PR_ERR when secureboot mode is OFF.

Testing / Code coverage improvements

Improvements in gcov support include support for newer GCCs as well as easily exporting the area of memory you need to dump to feed to extract-gcov.

cpu_idle_job: relax a bit

This dramatically improves kernel boot time with GCOV builds, from ~3 minutes between loading the kernel and switching the HILE bit down to around 10 seconds.
gcov: Another GCC, another gcov tweak
Keep constructors with priorities

Fixes GCOV builds with gcc7, which uses this.
gcov: Add gcov data struct to sysfs

Extracting the skiboot gcov data is currently a tedious process which involves taking a mem dump of skiboot and searching for the gcov_info struct. This patch adds the gcov struct to sysfs under /opal/exports, allowing the data to be copied directly into userspace and processed.

skiboot-5.10.5

skiboot 5.10.5 was released on Tuesday April 24th, 2018. It replaces skiboot-5.10.4 as the current stable release in the 5.10.x series.

It is recommended that 5.10.5 be used instead of any previous 5.10.x version due to the bug fixes and debugging enhancements in it.

Over skiboot-5.10.4, we have four bug fixes:

npu2/hw-procedures: fence bricks on GPU reset

The NPU workbook defines a way of fencing a brick and getting the brick out of fence state. We do have an implementation of bringing the brick out of fenced/quiesced state. We do the latter in our procedures, but to support run time reset we need to do the former.

The fencing ensures that access to memory behind the links will not lead to HMI’s, but instead SUE’s will be populated in cache (in the case of speculation). The expectation is then that prior to and after reset, the operating system components will flush the cache for the region of memory behind the GPU.

This patch does the following:
1. Implements a npu2_dev_fence_brick() function to set/clear fence state
2. Clears FIR bits prior to clearing the fence status
3. Clears the fence status
4. Takes the powerbus out of CQ fence much later, in credits_check(), which is the last hardware procedure called after link training.
hdata/spira: parse vpd to add part-number and serial-number to xscom@ node

Expected by FWTS and associates our processor with the part/serial number, which is obviously a good thing for one’s own sanity.
hw/imc: Check for pause_microcode_at_boot() return status

pause_microcode_at_boot() loops through every chip’s ucode control block and pauses the ucode if it is in the running state. But it does not fail if any chip’s ucode is not initialised.

Add code to return a failure if ucode is not initialized in any of the chips. Since pause_microcode_at_boot() is called just before attaching the IMC device nodes in imc_init(), add code to check the function’s return value.
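
In outline, the change looks like the sketch below. The state enum, the chip structure and the function names are simplified stand-ins for illustration and do not mirror the actual IMC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ucode_state { UCODE_UNINIT, UCODE_RUNNING, UCODE_PAUSED };

struct sketch_chip {
	enum ucode_state ucode;
};

/* Pause running microcode on every chip; fail if any chip is uninitialised. */
int sketch_pause_ucode(struct sketch_chip *chips, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (chips[i].ucode == UCODE_UNINIT)
			return -1;
		if (chips[i].ucode == UCODE_RUNNING)
			chips[i].ucode = UCODE_PAUSED;
	}
	return 0;
}

/* The caller now checks the result before attaching any device nodes. */
bool sketch_imc_init(struct sketch_chip *chips, size_t n)
{
	if (sketch_pause_ucode(chips, n) != 0)
		return false;
	/* ... attach the IMC device nodes here ... */
	return true;
}
```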
core/cpufeatures: Fix setting DARN and SCV HWCAP feature bits

DARN and SCV have been assigned AT_HWCAP2 (32-63) bits:

#define PPC_FEATURE2_DARN 0x00200000 /* darn random number insn */
#define PPC_FEATURE2_SCV 0x00100000 /* scv syscall */

A cpufeatures-aware OS will not advertise these to userspace without this patch.
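
From the userspace side, testing for these features is a bit test on the AT_HWCAP2 word. The two helper functions below are illustrative names, not part of any library; the masks are the ones quoted above.

```c
#include <assert.h>
#include <stdbool.h>

#define PPC_FEATURE2_DARN 0x00200000 /* darn random number insn */
#define PPC_FEATURE2_SCV  0x00100000 /* scv syscall */

/* hwcap2: the AT_HWCAP2 value the kernel advertised to this process. */
bool has_darn(unsigned long hwcap2)
{
	return (hwcap2 & PPC_FEATURE2_DARN) != 0;
}

bool has_scv(unsigned long hwcap2)
{
	return (hwcap2 & PPC_FEATURE2_SCV) != 0;
}
```

On Linux the value would typically be obtained with getauxval(AT_HWCAP2) from <sys/auxv.h>.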