All information is provided for educational purposes only. Follow these instructions at your own risk. Neither the authors nor their employer are responsible for any direct or consequential damage or loss arising from any person or organization acting or failing to act on the basis of information contained in this page.
Introduction
The Structure and the Binary Format of Intel Atom Goldmont Microcode
Description of Some Important Microoperations
Text Labels For Microcode Addresses
Unresolved Questions
Content of the Publication
Usage
Research Team
License
Since Intel Atom CPUs are full-fledged, modern representatives of the x86 architecture that support most of its instruction extensions (Intel VMX, Intel MPX, Intel SGX), we consider the ability to view, understand and research their microcode a very important, game-changing opportunity in many areas of security, performance and functional analysis of x86 CPUs. Knowing how x86 is implemented in microcode, even for a single representative, can greatly empower researchers of CPU transient execution vulnerabilities, because they can now see much deeper into what is going on inside a given x86 instruction implementation and how it affects the microarchitecture (various buffers, registers and internal states). Performance engineers can finally estimate the true latency of Intel CPU instructions and compare it with the official documentation, and hypervisor developers can see the genuine reason leading to a VM exit without relying on numerous guesses. Unfortunately, the chip giant has kept this secret under seven seals for over 40 years, but now it seems to be emerging.
Last year we managed to extract the microcode of a current Intel Atom microprocessor codenamed Goldmont. We don’t intend to describe the process now; instead, we would like to share the results of our ongoing reverse engineering of the Atom’s microcode. Here we are publishing our microcode disassembler, which lets you view, in plain readable form, the binary microcode that we published last year (glm-ucode). The disassembler is written in Python 3 and prints the binary microoperations together with their text representation (mnemonic + operands). The text translation is based on our understanding and on the current progress of the reverse engineering, so we don’t claim absolute accuracy. There can be errors both in the naming of microoperation mnemonics and in the representation of arguments. Moreover, many microoperations still have unknown operation codes (opcodes), mostly the XMM-specific ones, but the basic control flow and ALU opcodes have been determined. We encourage all researchers interested in the topic to continue the research with us and to extend our disassembler by fixing errors and adding new opcodes. This is one of the goals of the current publication of the microcode disassembler intended for Intel Atom CPU microcode.
At first glance at the disassembler’s output, a researcher may be confused by the naming of some mnemonics, especially for microoperations working with physical memory (e.g. LDPPHYSTICKLE_DSZ64_ASZ64_SC1), and may ask where those weird names come from. For now, we can only say that those mnemonics were acquired directly from Intel: on one of its official internet resources, it published raw log files from a microcode simulation tool for a certain Big Core microarchitecture. The link is no longer available, but we kept the data and analyzed it in depth, which is where we got all those sophisticated mnemonics. Where we could not find a corresponding mnemonic in the list, we invented our own by analogy, using the logic and the existing mnemonics as a template. We are publishing the original list of opcode mnemonics in a separate file (misc/bigcore_opcodes.txt) so that researchers can judge independently whether our naming choices are correct and can use the list for new opcodes.
Next, we describe the structure of the Atom Goldmont microcode and the basic semantics of some of the most important microoperations. After that, we describe the remaining unresolved problems that we encountered during our research.
The microcode of Intel Atom CPUs consists of two large chunks of data: Microcode Triads and Sequence Words. These data are kept in the ROM area of a functional block inside the CPU core called the Microcode Sequencer (MS). We used the debug port of the MS exposed to CRBUS to extract the data.
A microcode triad is a set of three microoperations which are processed under the control of one sequence word. The addressing of each microoperation (uop) inside a triad is global, so the first uop of the second triad has address 0x4 (counting from 0x0), while address 0x3 belongs to a non-existent microoperation (an attempt to read that address via the debug port returns zero). Our disassembler preserves this addressing scheme because it is also used by uops that directly transfer the microcode execution flow. We simply skip every fourth microoperation (we don’t print the zero data) and leave a one-line gap to separate the triads.
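The addressing scheme described above can be sketched in a few lines of Python. This is only an illustration and is not taken from the disassembler; the function names split_uaddr and triad_uaddrs are hypothetical.

```python
def split_uaddr(uaddr: int) -> tuple[int, int]:
    """Split a global microcode address into (triad index, slot in triad)."""
    triad, slot = divmod(uaddr, 4)
    if slot == 3:
        # Every fourth address belongs to a non-existent uop.
        raise ValueError(f"U{uaddr:04x} is the unused fourth slot of a triad")
    return triad, slot


def triad_uaddrs(triad: int) -> list[int]:
    """Return the three valid uop addresses of the given triad."""
    base = triad * 4
    return [base, base + 1, base + 2]
```

With this mapping, the first uop of the second triad (triad index 1) is at address 0x4, and address 0x3 is rejected as a gap.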
Each microoperation of the Atom Goldmont microarchitecture has the following 48-bit binary format (fields are listed from the most significant bit down; widths are in bits):
Bits    Width  Field
47-46   2      ?? (unknown)
45      1      m2
44      1      m1
43-32   12     opcode
31-24   8      imm0
23      1      m0
22-18   5      imm1
17-12   6      dst
11-6    6      src1
5-0     6      src0
Where:
opcode – 12-bit numeric code representing the actual operation to perform (all opcodes that we have determined are listed in the separate file opcodes.txt of our disassembler package)
src0/src1/dst – three 6-bit fields which select the operands for the operation. You can find the meaning of all numeric selectors for these fields in the disassembler’s Python code. For some microoperations, the dst field is actually src2 (it represents a third source operand, e.g. for memory store uops).
m0/m1/m2 – three bits representing modes of the operation that alter its behavior; their meaning is specific to individual microoperations or groups of microoperations. E.g. for the TESTUSTATE uop (see the description below), bit m0 means NOT, and bits m1 and m2 select various sets of internal state bits to check. For ALU uops (ADD_DSZN, SUB_DSZN and so on), bit m0 selects various immediate values representing data of the macro-instruction (MACRO IMMS) for which the microcode is executed.
imm0/imm1 – hold bits #0-7 and #8-12 of immediate values embedded directly into uops. Bits #13-15 are taken from the value in the src0/src1 field (there is a set of selectors that represent immediate values and carry the top three bits of the value).
Bits #46 and #47 – present only in the ucode patch in the RAM area (they aren’t set in MSROM uops) and control some properties of uop substitution which we haven’t determined yet
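The field layout above translates directly into shifts and masks. The following is a minimal Python sketch written for this description, not the code of the disassembler; the names Uop, decode_uop and imm_low13 are hypothetical. It reassembles only bits #0-12 of the immediate, because bits #13-15 depend on the selector tables in the disassembler’s Python code.

```python
from typing import NamedTuple


class Uop(NamedTuple):
    unk: int     # bits 47-46, set only in patch RAM uops
    m2: int      # bit 45
    m1: int      # bit 44
    opcode: int  # bits 43-32
    imm0: int    # bits 31-24, immediate bits #0-7
    m0: int      # bit 23
    imm1: int    # bits 22-18, immediate bits #8-12
    dst: int     # bits 17-12
    src1: int    # bits 11-6
    src0: int    # bits 5-0


def decode_uop(raw: int) -> Uop:
    """Split a 48-bit uop into its fields."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("a uop must fit into 48 bits")
    return Uop(
        unk=(raw >> 46) & 0x3,
        m2=(raw >> 45) & 0x1,
        m1=(raw >> 44) & 0x1,
        opcode=(raw >> 32) & 0xFFF,
        imm0=(raw >> 24) & 0xFF,
        m0=(raw >> 23) & 0x1,
        imm1=(raw >> 18) & 0x1F,
        dst=(raw >> 12) & 0x3F,
        src1=(raw >> 6) & 0x3F,
        src0=raw & 0x3F,
    )


def imm_low13(uop: Uop) -> int:
    """Combine imm0 and imm1 into bits #0-12 of the immediate."""
    return (uop.imm1 << 8) | uop.imm0
```

For instance, decoding the uop 0x0005a407de08 yields opcode 0x005 and a low immediate of 0x1a4, and decoding 0x000a01000200 yields opcode 0x00a with m1 and m2 both zero.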
Each sequence word has the following 30-bit binary format (fields are listed from the most significant bit down; widths are in bits):
Bits    Width  Field
29-28   2      ?? (unknown)
27-25   3      sync
24-23   2      up2
22-8    15     uaddr
7-6     2      up1
5-2     4      eflow
1-0     2      up0
Where:
up0/up1/up2 – 2-bit pointers to a microoperation inside the triad. Values 0x0-0x2 point to one of the three uops; the value 0x3 has a special meaning (see below) for up1 and up2 (it is not allowed for up0)
eflow – 4-bit field that controls the execution flow for the triad. The bit layout of the field can be studied in the disassembler’s Python code, in the process_seqword function. Values other than 0x0 imply the use of the up0 field. The value 0x0 (and 0x8-0xb) of the eflow field specifies either sequential execution of the next triad (if up1 has the value 0x3) or a transfer to the triad at the microcode address given by the uaddr field (for up1 values 0x0-0x2). The up1 values 0x0-0x2 also point to the last uop in the triad to execute (so fewer than three uops may be executed in a triad)
uaddr – 15-bit field that specifies the address in microcode ROM (or in patch RAM if uaddr is greater than or equal to 0x7c00) of the next triad to receive the execution flow. This field is only applicable for certain values of the eflow field (see above)
sync – 3-bit field that controls two synchronization aspects of microoperation execution, which is performed out of order based on the dependency chains between microoperations. Some values specify Load Fences, others specify Synchronization Barriers. See the process_seqword function in the Python code for the exact values. The field is processed (and meaningful) only if the up2 field contains a value 0x0-0x2 pointing to a valid uop inside the triad. The value 0x3 in up2 specifies that no sync control is defined for the corresponding triad
Bits #28-29 – unknown bits defining some undetermined aspects of sequence word substitution via Patch RAM (their meaning is probably the same as for bits #46-47 of uops)
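The sequence word fields can be extracted the same way as the uop fields. The sketch below is an illustration based only on the field widths given above; the names SeqWord, decode_seqword, has_sync and targets_patch_ram are hypothetical, and the interpretation of the individual eflow and sync values is deliberately left to the process_seqword function of the disassembler.

```python
from typing import NamedTuple


class SeqWord(NamedTuple):
    unk: int    # bits 29-28, related to Patch RAM substitution
    sync: int   # bits 27-25
    up2: int    # bits 24-23, uop to which sync applies (0x3: none)
    uaddr: int  # bits 22-8, address of the next triad
    up1: int    # bits 7-6, last uop to execute (0x3: special)
    eflow: int  # bits 5-2, execution flow control
    up0: int    # bits 1-0


def decode_seqword(raw: int) -> SeqWord:
    """Split a 30-bit sequence word into its fields."""
    return SeqWord(
        unk=(raw >> 28) & 0x3,
        sync=(raw >> 25) & 0x7,
        up2=(raw >> 23) & 0x3,
        uaddr=(raw >> 8) & 0x7FFF,
        up1=(raw >> 6) & 0x3,
        eflow=(raw >> 2) & 0xF,
        up0=raw & 0x3,
    )


def has_sync(sw: SeqWord) -> bool:
    """The sync field is meaningful only if up2 points to a valid uop."""
    return sw.up2 != 0x3


def targets_patch_ram(sw: SeqWord) -> bool:
    """Addresses from 0x7c00 upwards refer to patch RAM."""
    return sw.uaddr >= 0x7C00
```

For example, the sequence word 0x018e5e40 decodes to eflow 0x0, up1 0x1 and uaddr 0x0e5e, with up2 equal to 0x3 (no sync control).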
There are two groups of the most important microoperations:
We found these mnemonics (SAVEUIP/READUIP/URET) in the original list of opcodes for the Big Core. While reverse engineering the Atom microcode, we understood that these uops access two internal microarchitectural (uarch) registers which allow a kind of procedure calling inside microcode. We named the registers UIP0 and UIP1.
Note that the procedure calling mechanism allows at most two nesting levels by default. However, by saving and restoring the UIP0/1 values with the READUIP_REGOVR/SAVEUIP uops, the microcode can arrange more nesting levels. Also note that some eflow control values inside Sequence Words duplicate the functionality of these uops (there are control values in the eflow field of sequence words that have the same effect on ucode execution as the SAVEUIP/SAVEUIP_REGOVR and URET uops).
One of the most sophisticated microoperations, which took a long time to understand, is the uop for conditional execution of sequence words depending on various internal microarchitectural states and on a set of bits which can be manipulated by the ucode itself. We named the uop with opcode 0x00a TESTUSTATE. This microoperation uses all three mode bits (m0/1/2) of the uop binary format. We found the companion UPDATEUSTATE uop, which can set/reset any bit of an internal 6-bit bitmask. This internal 6-bit state can be tested by the TESTUSTATE uop when the m1/m2 bits are both zero. Other combinations of the m1/m2 mode bits specify internal microarchitectural states to be tested. There are two sets of internal uarch states, which we named SYS and VMX states. Our disassembler prints the state set of each TESTUSTATE uop as the first operand. We marked the special case of the 6-bit bitmask manipulated by the UPDATEUSTATE uop as UCODE. So far we have investigated and named only the first nine SYS states, among them UST_USER_MODE, UST_SMM, UST_VMX_GUEST and others. The VMX internal states are yet to be determined. The TESTUSTATE microoperation operates as follows:
This very strange microoperation, UFLOWCTRL (a name we chose ourselves), can replace the functionality of several other uops dealing with execution flow control, in particular SAVEUIP, UPDATEUSTATE and some others, depending on its argument. For the full set of uops that it can replace, see the get_str_uop_uflow_ctrl_special_imms function. For one value of the uop’s argument we were not able to determine its purpose.
There exist many uops for conditional operations, such as conditional jumps to microcode addresses. They all operate the same way as far as the condition is concerned. The condition to test is part of the microoperation opcode (and mnemonic), but the state to test is not obvious. We determined that the execution of ALU uops doesn’t affect the global architectural Flags Register. There do exist special uops to manipulate the arithmetic flags of the Flags Register (e.g. MOVEINSERTFLGS_DSZN in a special mode), but where the conditional uops get the state to check was not clear. Eventually, we determined that each microarchitectural register (tmp0-15) has an associated set of arithmetic flags, which is assigned when the register is used as the destination of any ALU uop. The set of arithmetic flags is independent for each uarch register. There exist several uops to copy the flags between uarch registers (MOVEMERGEFLGS) and even to set the flags of any microarchitectural register from a numeric value (MOVEINSERTFLGS_DSZ32). Thus, the conditional uops operate on the arithmetic flags associated with the uarch register specified as the first source operand. The architectural registers (rax, rbx, rcx and so on) have no such association and aren’t used as the first operand of conditional uops.
For example, the UJMPCC_DIRECT_NOTTAKEN_CONDNZ(tmp0, UXXXX) uop tests the Zero Flag associated with the tmp0 register and transfers the Microcode Sequencer’s execution flow to the UXXXX address if the flag is not set, or to the uop at the next microcode address otherwise.
All conditional jumps have the NOTTAKEN attribute, so they aren’t considered as transferring control during speculative execution (behind unresolved branches, i.e. other conditional jumps with unresolved source operands). However, unconditional jumps, whether performed by UJMP and URET uops or by the corresponding sequence word processing, are always considered TAKEN.
Conditional selects operate as follows: if the flag selected by the condition in the opcode is set or clear (depending on the condition) in the flags associated with the first source operand, the result of the uop (the value written to the destination) is the second source operand; otherwise it is zero.
Conditional moves are similar to conditional selects, but their result is the first operand (not the second) if the condition is met and the second operand (not zero) if the condition is not met. Conditional moves, like selects, have a DSZX attribute specifying the size of the destination data.
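The semantics of the conditional uops described above can be summarized with a toy model. This is a simplified sketch under stated assumptions, not a simulation of the real hardware: only the Zero Flag and the NZ condition are modelled, the class and method names (TmpReg, ToyCore, move_condnz and so on) are hypothetical, and whether select and move uops update the flags of their destination is not modelled at all.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class TmpReg:
    """Toy uarch register: a value plus its own Zero Flag."""
    value: int = 0
    zf: bool = False


class ToyCore:
    """Hypothetical model of tmp0-tmp15 with per-register flags."""

    def __init__(self) -> None:
        self.tmp = [TmpReg() for _ in range(16)]

    def sub_dsz32(self, dst: int, a: int, b: int) -> None:
        # An ALU uop sets the value and the flags of its destination only.
        result = (a - b) & MASK32
        self.tmp[dst] = TmpReg(result, result == 0)

    def ujmpcc_condnz(self, src0: int, target: int, next_uaddr: int) -> int:
        # Jump if the Zero Flag attached to src0 is clear, else fall through.
        return next_uaddr if self.tmp[src0].zf else target

    def select_condnz(self, src0: int, src1_value: int) -> int:
        # Conditional select: second operand if the condition holds, else zero.
        return 0 if self.tmp[src0].zf else src1_value

    def move_condnz(self, src0: int, src1_value: int) -> int:
        # Conditional move: first operand if the condition holds, else second.
        return src1_value if self.tmp[src0].zf else self.tmp[src0].value
```

Because the flags live with each register, a subtraction into tmp13 does not change what a later conditional uop sees when it tests tmp0.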
The Control Register Bus (CRBUS) is a fundamental communication mechanism inside the CPU core by which all executive units (such as the Instruction Fetch Unit, Data Cache Units, Microcode Sequencer, Execution Core and others) exchange control data. Each executive unit is connected to the CRBUS and exposes its control registers in the bus’s address range. In our disassembler we use the following naming scheme for the control registers of the executive units (the same scheme is used in the internal XML files of the Intel DFx Abstraction Layer and Intel OpenIPC software packages):
<UNIT NAME>_CR_<REG NAME>
E.g. CORE_CR_CR0 is a control register of the unit that executes uops (the execution pipeline) and contains the current value of the architectural CR0 register; PMH_CR_CR3 contains the architectural page directory physical address in the Page Miss Handler unit. Our disassembler supports assigning text names to control registers via the cregs.txt file, where the user can specify, for an arbitrary CRBUS address, an arbitrary text name to be used everywhere in the disassembler’s listing where uops reference that control register. We determined a set of important CREGs and placed them into the cregs.txt file for use in the disassembler. The MOVEFROMCREG_DSZ64/MOVETOCREG_DSZ64 microoperations are simple uops to access the CRBUS. There also exists a set of MOVETOCREG_BITOPX_DSZ64 uops, which perform the specified bit operation on the first source operand and write the result to the specified CREG.
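When post-processing a listing, it can be handy to split a register name that follows the <UNIT NAME>_CR_<REG NAME> scheme back into its two parts. The helper below is a hypothetical sketch and is not part of the disassembler; it makes no assumption about the format of the cregs.txt file.

```python
def split_creg_name(name: str) -> tuple[str, str]:
    """Split '<UNIT NAME>_CR_<REG NAME>' into (unit name, register name)."""
    unit, sep, reg = name.partition("_CR_")
    if not sep or not unit or not reg:
        raise ValueError(f"{name!r} does not follow the <UNIT>_CR_<REG> scheme")
    return unit, reg
```

The split is made at the first occurrence of _CR_, so CORE_CR_CR0 gives the unit CORE and the register CR0.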
Inside the execution pipeline there is a special small random-access memory which is private to each CPU core instance. It has only 512 (0x200) 64-bit entries and is accessed by the READURAM/WRITEURAM uops. We call this memory URAM. It isn’t shared with other cores of the CPU complex. We are convinced that arbitrary data can be written to the memory and that its entries aren’t hardware registers, but it seems that the executive units of the CPU core can access the URAM independently of microcode. Studying the microcode simulation log files for a Big Core (see the Introduction), we saw that Big Cores also have a dedicated small private microarchitectural memory, which they call FSCP. We don’t know for certain what the abbreviation means, but we decided to name the URAM entries FSCP_CR_XXX as well. Accordingly, our disassembler package contains an fscp.txt file in which an arbitrary URAM address can be associated with a text name. There also exist uops that perform bit operations on their arguments (by analogy with the corresponding CRBUS uops) before writing to URAM, but we haven’t determined their mnemonics yet.
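For experiments with microcode listings, the URAM can be modelled as a plain array. The class below is a toy sketch built only from the numbers given above (0x200 entries of 64 bits each); the name ToyUram and its methods are hypothetical, and accesses by executive units outside of microcode are not modelled.

```python
class ToyUram:
    """Toy model of the per-core URAM: 0x200 entries of 64 bits each."""

    ENTRIES = 0x200
    MASK64 = (1 << 64) - 1

    def __init__(self) -> None:
        self._cells = [0] * self.ENTRIES

    def _check(self, addr: int) -> None:
        if not 0 <= addr < self.ENTRIES:
            raise IndexError(f"URAM address 0x{addr:04x} is out of range")

    def write(self, addr: int, value: int) -> None:
        # Counterpart of a WRITEURAM uop: the value is truncated to 64 bits.
        self._check(addr)
        self._cells[addr] = value & self.MASK64

    def read(self, addr: int) -> int:
        # Counterpart of a READURAM uop.
        self._check(addr)
        return self._cells[addr]
```

Each simulated core would own its own ToyUram instance, mirroring the fact that the memory is not shared between cores.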
Our disassembler can assign a text label to an arbitrary microcode address, so that in all control flow uops, conditional and direct, the text label is used instead of the UXXXX microcode address. The file is named labels.txt and is placed next to the main Python script. We have already filled the file with several labels that we assigned to different ucode procedures, such as those performing cryptographic operations. Note especially the labels ending with _xlat: they mark the entry points of x86 instructions that we determined. XLAT is an abbreviation of “Translate” and underlines that the x86 entry points in ucode are selected by a static table mechanism (we saw the same naming of x86 entry points for Big Cores in the ucode emulation log files). Using the ability to execute arbitrary ucode via the Match/Patch mechanism (not described in this write-up), we determined many entry points for x86 instructions and placed them into the labels.txt file for use by researchers. Many more x86 entries are yet to be determined. As you can see, each x86 instruction entry in the microcode has the following properties:
Our disassembler is far from complete. Here are the open issues (as we see them) that remain to be addressed:
Opcodes and semantics for most SSE uops
Although we found several uops processing MMX/XMM data and implemented support in our disassembler for mixed uops operating on both MMX/XMM and GP registers (the selectors for these registers in the src0/src1/dst fields overlap), we didn’t process all SSE microoperations: we added only the simple SSE uops that map one-to-one to corresponding x86 instructions and named them after those instructions (the actual uop mnemonics may differ). The microcode contains a procedure for a fast SHA256 implementation using vectored SSE data, which consists almost entirely of uops with unknown opcodes. That’s a good place to start researching SSE uops.
Two unknown bits for TESTUSTATE
Of all 48 possible state bits that can be used in the TESTUSTATE uop, there are only two for which we don’t know where they reside in the microarchitectural state (see the description of the TESTUSTATE uop above): bit #1 of the UCODE state and bit #13 of the SYS state. To understand their meaning, one must first find where these bits live in the microarchitecture (CRBUS, arch state, fuses, FSCP and so on).
Text names for state bits of TESTUSTATE
We assigned names to the eight most important SYS states of the TESTUSTATE uop. You can find the enumeration in the Python function that parses the arguments of the uop (get_str_uop_xxx_ustate_special_imms). For the remaining seven SYS states (one bit is undetermined) and for the VMX states, the purpose must be determined by reverse engineering the microcode that changes the states’ sources, and appropriate names must be assigned (the Python code has a dict for the names, ready to be extended).
Many CRBUS registers
Unfortunately, we don’t have a full list of CRBUS registers for the Atom Goldmont microarchitecture (we do have such a list for some Big Cores, acquired from the XML files of the Intel DAL software package). However, knowledge of the control registers and their bit layout is very important for complete reverse engineering of the microcode (you will see how much code in MSROM works with the CRBUS). We found some CRegs through their correlation with MSRs and added them to our disassembler, but they are only a small part of the full set.
SIGEVENT numeric argument
This uop is used to raise x86 architectural exceptions. We found (using pure logic) two very important places where #UD and #GP exceptions are generated in microcode using the SIGEVENT uop, but we are not able to map the SIGEVENT argument to an x86 exception vector. It seems that the numbers passed to SIGEVENT carry some other information that must be understood before more convenient support for the SIGEVENT uop can be added to our disassembler.
UFLOWCTRL first argument’s value 0x01
We didn’t determine the purpose of UFLOWCTRL with a first argument value of 0x01. It replaces some other uop, but which one is unknown.
Sequence Word’s UEND variations
Among the eflow field values of Sequence Words we detected four that request the end of microcode sequencing for the current macroinstruction. We marked them UEND0, UEND1, UEND2 and UEND3. Although we suppose they are intended to deal with out-of-order execution of uops during microcode sequencing, and perhaps beyond macroinstruction boundaries, the exact purpose of each UENDX is yet to be determined.
Find an unfixable bug in CPU initialization code
We have already found many interesting things using our disassembler, in particular two undocumented x86 instructions for microarchitectural access, but the main goal remains unachieved: to find a bug in the microcode that performs CPU initialization, from the Reset Entry Point in microcode (U4000) up to the call of the x86 Reset Vector. It is very probable that a bug in that code flow could not be fixed by a microcode patch, which would set a precedent of a truly unfixable microcode bug and change the industry’s approach to microcode implementation.
glm_ucode_disasm.py
Usage: glm_ucode_disasm <ms_array0_file_path>
Example:
glm_ucode_disasm.py ..\ucode\ms_array0.txt
Output listing can be found in
cat ..\ucode\ucode_glm.txt
U0000: 00626803f200 tmp15:= MOVEFROMCREG_DSZ64(CORE_CR_CUR_UIP)
U0001: 000801030008 tmp0:= ZEROEXT_DSZ32(0x00000001)
018e5e40 SEQW GOTO U0e5e
------------------------------------------------------------------------------------
U0002: 004800013000 tmp7:= ZEROEXT_DSZ64(0x00000000)
U0004: 05b900013000 mm7:= unk_5b9(0x00000000)
U0005: 000a01000200 TESTUSTATE(UCODE, UST_MSLOOPCTR_NONZERO)
0b000240 ? SEQW GOTO U0002
U0006: 014800000000 SYNCWAIT-> URET(0x00)
------------------------------------------------------------------------------------
U0008: 000c6c97e208 tmp14:= SAVEUIP(0x01, U056c)
01890900 SEQW GOTO U0909
------------------------------------------------------------------------------------
U0009: 0005a407de08 tmp13:= SUB_DSZ32(0x000001a4, tmp8)
U000a: 01310023d23d tmp13:= SELECTCC_DSZ32_CONDNZ(tmp13, 0x00000800)
U000c: 00470003dc7d tmp13:= NOTAND_DSZ64(tmp13, tmp1)
U000d: 0150015c027d LFNCEWTMRK-> UJMPCC_DIRECT_NOTTAKEN_CONDZ(tmp13, U3701)
U000e: 000000000000 NOP
06a71180 SEQW GOTO generate_#GP
------------------------------------------------------------------------------------
U0010: 000c6c97e208 tmp14:= SAVEUIP(0x01, U056c)
0187e100 SEQW GOTO U07e1
------------------------------------------------------------------------------------
sha256_ret:
U0011: 00638e03d200 tmp13:= READURAM(0x008e, 64)
U0012: 00652003e23d tmp14:= SHR_DSZ64(tmp13, 0x00000020)
U0014: 003d0003df7e tmp13:= MOVEINSERTFLGS_DSZ32(tmp14, tmp13)
U0015: 00638d03e200 tmp14:= READURAM(0x008d, 64)
U0016: 015d00000ec0 UJMP(tmp11)
Mark Ermolov (@_markel___)
Maxim Goryachy (@h0t_max)
Dmitry Sklyarov (@_Dmit)
Copyright (c) 2021 Mark Ermolov, Dmitry Sklyarov at Positive Technologies and Maxim Goryachy (Independent Researcher)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.