Using DDR Octal PSRAM with the NXP MCXN947
June 24, 2024
Introduction
The MCXN947 is the flagship part in the MCX N family of microcontrollers. The superset part includes 2MB of internal flash memory and 512KB of internal SRAM. Additional memory can be added to an MCX N system design using the FlexSPI controller. FlexSPI is a peripheral that enables access to SPI (Serial Peripheral Interface) memories via the internal AHB bus. Though the term SPI often implies a single-bit, synchronous serial data link, the concept has been extended to quad and octal data paths.
A key point about FlexSPI is that it enables the SPI-connected memories to appear in the system memory map. The details of the SPI transactions are managed by the FlexSPI controller. This feature enables Execute-in-Place (XIP) capability, where the CPU executes directly from the external memory.
The FlexSPI controller has two ports, each of which can be further subdivided into two separate interfaces, allowing a maximum of 4 devices if required. SPI-based memories are typically accessed in a burst mode, where a controller will read/write data in 512-byte blocks. The block access nature of these memories can present a challenge for typical MCU use cases. The MCX N integrates a 16KB cache with a 64-bit data path, known as CACHE64, in front of the FlexSPI controller.
The CACHE64 provides spatial and temporal locality of FlexSPI data to the system, smoothing the block transactional nature of the external SPI memories. The cache makes SPI-based memories suitable for direct execution or data storage without necessarily incurring a large performance penalty.
For the most timing-critical operations, low-latency internal memory is the preferred storage method. However, using external PSRAM on the FlexSPI port can enable a large degree of application flexibility. Possible use cases for PSRAM on the MCX N include:
- Large model storage for the eIQ® Neutron NPU
- Long delay lines for digital audio algorithms such as reverbs
- Implementing digital audio effects such as a looper
- Implementing large applications which can be loaded from commodity storage such as SD cards or a USB Drive
- Storage of dynamic graphics assets such as animations and bitmaps
- Large audio buffers for a digital sampler/synthesizer using the onboard 14-bit DAC with oversampling to 16-bit
- Long chains of image frame buffers for the SmartDMA/EZH camera interface
- Deep capture buffers with the onboard ADC
Connecting an 8MB DDR Octal PSRAM to the MCX N
The device used for this paper was the 8MB AP Memory APS6408L-3OBM-BA.
The APS6408L-3OBM-BA is packaged in a small 6mmx8mm, 24-ball, 1mm pitch BGA. This footprint is commonly used for Octal SPI memories and there are several vendors who produce pin compatible devices.
Octal SPI memories are DDR capable, supporting data transfers on both edges of the clock. Higher speed octal devices can support up to 400MB/s transfers with a 200MHz serial clock but generally use a lower voltage (1.8V) interface. 3.3V Quad/Octal devices are available but are generally limited to 133MHz serial clock rates. For this paper, we will use the APS6408L-3OBM-BA variant as it supports a 3.3V interface with a maximum clock speed of 133MHz, equating to a maximum data transfer rate of 266MB/s. Note that the MCX N supports 1.8V IO for designs that want to use the higher speed devices.
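The peak rates quoted above follow from simple arithmetic: clock rate, times the number of data lines, times two when data moves on both clock edges, divided by eight bits per byte. A minimal sketch in C is shown here; the function name spi_peak_mb_per_s is hypothetical and is not part of the MCUXpresso SDK.

```c
#include <stdint.h>

/* Peak (theoretical) transfer rate in MB/s for an SPI-style memory bus.
 * clock_mhz: serial clock in MHz
 * data_pins: number of data lines (1, 2, 4 or 8)
 * ddr:       nonzero if data is transferred on both clock edges
 * Hypothetical helper for illustration only. */
static uint32_t spi_peak_mb_per_s(uint32_t clock_mhz, uint32_t data_pins, int ddr)
{
    uint32_t edges_per_clock = ddr ? 2u : 1u;
    uint32_t bits_per_clock  = data_pins * edges_per_clock;

    return (clock_mhz * bits_per_clock) / 8u; /* 8 bits per byte */
}
```

For example, spi_peak_mb_per_s(133, 8, 1) gives 266 and spi_peak_mb_per_s(200, 8, 1) gives 400. These are bus limits only; command, address and latency cycles reduce the rate seen by the application.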
The FRDM-MCXN947 development board comes with a W25Q64 Quad SPI flash memory part installed. This part is packaged in a wide SOIC8 package.
However, the PCB design has pads for attaching either quad SPI devices in a wide SOIC8 form factor or Octal SPI devices in the 6mmx8mm BGA24 form factor.
I chose the 3.3V variant of the AP Memory APS6408L as it was the simplest configuration to use with the FRDM-MCXN947. Removing the W25Q64 is straightforward with a hot air rework tool and exposes the BGA24 pads underneath.
The APS6408L-3OBM-BA can be soldered with a hot air rework tool after adding some solder paste to the exposed pads.
Octal PSRAM Configuration
There is currently no sample in the MCUXpresso SDK for using octal PSRAM with the MCXN947. However, the FlexSPI controller is very similar to that in the MIMXRT685. Using a PSRAM sample from the i.MXRT685 SDK, I developed an example for the MCXN947.
https://github.com/wavenumber-eng/mcxn947_octal_psram
Inside the repository is a project named “bunny_octal_psram_test” which can be imported into MCUXpresso IDE v11.9.0 [Build 2144] [2024-01-05] or later. This sample performs a basic memory test of the entire array and implements some basic PSRAM transfer tests. The “bunny” naming convention comes from its origin in the bring-up code for the MCXN947 “BunnyBoard”, which also uses the APS6408L-3OBM-BA.
The FlexSPI memory controller was designed to be a future-proof interface that enables the MCX N to interface with virtually any external SPI-based memory. Using a programmable look-up table (LUT) approach, the FlexSPI controller can be adapted to single-bit, dual, quad, or octal/DDR memories as needed. As new memory devices become available, the FlexSPI LUT can be adjusted to new command and data access sequences.
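To make the LUT idea more concrete, the sketch below packs a single LUT instruction and pairs two of them into one 32-bit LUT word. It assumes the field layout I understand FlexSPI to use (6-bit opcode, 2-bit pad count code, 8-bit operand, two instructions per word); this layout and the values used should be checked against the MCX N reference manual and the SDK headers. The helper names lut_instr and lut_word are hypothetical.

```c
#include <stdint.h>

/* Pack one LUT instruction.
 * Assumed layout: [15:10] opcode, [9:8] pad count code, [7:0] operand.
 * Hypothetical helper; verify the layout against the reference manual. */
static uint16_t lut_instr(uint8_t opcode, uint8_t pads, uint8_t operand)
{
    return (uint16_t)(((uint32_t)(opcode & 0x3Fu) << 10) |
                      ((uint32_t)(pads & 0x03u) << 8) |
                      operand);
}

/* Two instructions share one 32-bit LUT word.
 * Assumption: the first instruction occupies the low half. */
static uint32_t lut_word(uint16_t first, uint16_t second)
{
    return ((uint32_t)second << 16) | first;
}
```

With placeholder values, lut_instr(0x21, 3, 0x20) packs to 0x8720. A full read or write sequence would chain several such instructions (command, address, dummy cycles, data), which is what the custom LUT in the sample project describes for the APS6408L.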
PSRAM devices typically require additional configuration for parameters such as read latencies and burst sizes. This sample provides some defaults that will work with the APS6408L-3OBM-BA. Control register access is performed with the FlexSPI SDK API and the custom LUT.
The APS6408L-3OBM-BA is specified for 3.3V operation and a maximum 133MHz clock. PLL1 was used to generate the 133MHz clock for the FlexSPI controller.
Once the PSRAM is configured, it can be accessed using normal memory access patterns, such as through a pointer.
volatile uint32_t *psram = (volatile uint32_t *)(BUNNY_FLEXSPI_BASE_ADDRESS);
psram[0] = 0xAA551122;
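Because the PSRAM is memory mapped, a basic memory test can be written against a plain pointer. The following is a minimal sketch of a write-then-verify pass, not the test from the repository; mem_test_words is a hypothetical name, and the pattern is just the constant from the snippet above mixed with the word index.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a pattern derived from each word index, then read it back.
 * Returns the number of mismatching words (0 means the range passed).
 * Hypothetical helper for illustration only. */
static size_t mem_test_words(volatile uint32_t *base, size_t num_words)
{
    size_t errors = 0;

    for (size_t i = 0; i < num_words; i++) {
        base[i] = (uint32_t)i ^ 0xAA551122u;
    }

    for (size_t i = 0; i < num_words; i++) {
        if (base[i] != ((uint32_t)i ^ 0xAA551122u)) {
            errors++;
        }
    }

    return errors;
}
```

On the target, the call would look something like mem_test_words((volatile uint32_t *)BUNNY_FLEXSPI_BASE_ADDRESS, (8u * 1024u * 1024u) / 4u) to cover the full 8MB array.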
It is possible to configure the build system such that the PSRAM can be used by the linker. Care must be taken to ensure that the PSRAM/FlexSPI is initialized before the standard C initialization and copy down routines. This is beyond the scope of this paper but will be a topic for a future paper.
PSRAM Performance Tests and Considerations
When using FlexSPI-connected PSRAM, it is important to understand how the cache and the burst access nature of the PSRAM impact system performance. Real-world scenarios often have a variety of memory access patterns, making it difficult to develop a single test to characterize performance. However, I typically run tests that operate the memories at various limits, with the understanding that real-world performance will fall somewhere in between.
For this work, there were two limiting cases that were evaluated:
- Large Block Transactions
- Random access transactions over a wide address range
For the block transaction test, the code would read/write block sizes from 1KB to 32KB using both DMA and memcpy. The intent of this test case was to show the limiting behavior of the CPU interacting with the CACHE64 (best case scenario).
To achieve good memcpy performance, I linked this application against the newlib library. The newlib build used with MCUXpresso has a hand tuned, assembly language implementation of memcpy that performs better than redlib or newlib-nano. You can learn more about the newlib implementation via memfault’s excellent analysis.
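The shape of the block transaction test can be sketched as a sweep over block sizes. The version below is only an illustration: it assumes the sizes double from 1KB to 32KB (the text gives only the range), it verifies each copy instead of timing it, and block_copy_sweep is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_BLOCK_BYTES (1u * 1024u)
#define MAX_BLOCK_BYTES (32u * 1024u)

/* Copy blocks of increasing size from src to dst and verify each copy.
 * Both buffers must hold at least MAX_BLOCK_BYTES.
 * Assumption: block sizes double at each step.
 * Returns the number of block sizes that copied correctly. */
static unsigned block_copy_sweep(uint8_t *dst, const uint8_t *src)
{
    unsigned passed = 0;

    for (size_t size = MIN_BLOCK_BYTES; size <= MAX_BLOCK_BYTES; size *= 2u) {
        memset(dst, 0, size);

        /* On the target, the cycle counter would be sampled around this call. */
        memcpy(dst, src, size);

        if (memcmp(dst, src, size) == 0) {
            passed++;
        }
    }

    return passed;
}
```

Placing src in internal SRAM and dst in PSRAM exercises the write path, and swapping them exercises the read path.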
The random access test performed 32-bit reads and writes to a predetermined range of randomized addresses. This test evaluates the limiting case where the transactions would almost always fall outside of the cache. The randomized access test cases included read-only, write-only and write-then-read transactions.
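A randomized access pass can be sketched with a precomputed table of word offsets and a loop that indexes memory through that table. This is an illustration under my own assumptions, not the code from the repository: the xorshift32 generator, the fixed seed handling and the function names are all hypothetical choices.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator; any fixed-seed PRNG would do here. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill idx[] with word offsets in [0, range_words). range_words must be nonzero. */
static void make_index_table(uint32_t *idx, size_t count, uint32_t range_words, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */

    for (size_t i = 0; i < count; i++) {
        idx[i] = xorshift32(&state) % range_words;
    }
}

/* Write-then-read pass; mem[idx[i]] is the double indexed access.
 * The value written depends only on the offset, so repeated offsets are harmless.
 * Returns the number of readback mismatches. */
static size_t random_write_then_read(volatile uint32_t *mem, const uint32_t *idx, size_t count)
{
    size_t errors = 0;

    for (size_t i = 0; i < count; i++) {
        mem[idx[i]] = idx[i];
    }
    for (size_t i = 0; i < count; i++) {
        if (mem[idx[i]] != idx[i]) {
            errors++;
        }
    }
    return errors;
}
```

Building the table ahead of time keeps the random number generation out of the timed loop, so the measurement reflects the memory accesses rather than the generator.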
The CPU cycle counts for the core code paths performing the transfers were gauged using the ARM SysTick timer. Tests were iterated 256 times and transfer rates were computed using an average cycle count value over the iterations. The clock cycle counting method was calibrated and the results are within a reasonable margin of error (+/-2 cycles). The transfer rates were computed using the CPU cycle count and the system clock rate.
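The conversion from cycle counts to transfer rates can be sketched as follows. SysTick is a 24-bit down counter, so the elapsed count is the start value minus the end value, masked to 24 bits. The sketch assumes the timer is clocked from the CPU clock with a reload value of 0xFFFFFF and that the timed section is shorter than one full wrap; the function names are hypothetical.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu /* SysTick is a 24-bit down counter */

/* Elapsed ticks between two SysTick samples.
 * Valid for a reload value of 0xFFFFFF and less than one full wrap. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* Transfer rate in bytes per second from an average cycle count.
 * cpu_hz is the clock rate driving the cycle counter. */
static uint64_t bytes_per_second(uint64_t bytes, uint64_t avg_cycles, uint64_t cpu_hz)
{
    if (avg_cycles == 0u) {
        return 0u; /* avoid dividing by zero on an empty measurement */
    }
    return (bytes * cpu_hz) / avg_cycles;
}
```

As an example with made-up numbers, a 4096-byte transfer that averages 2048 cycles at a 150MHz clock works out to 300,000,000 bytes per second.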
Before measuring timings on the PSRAM, a control test was executed using a block of internal memory as a reference. Since the code paths for operating the DMA transfer, performing memcpy and implementing the random read/writes have their own performance characteristics, the control test establishes a baseline for comparison.
As stated before, the memcpy implementation is the hand-tuned assembly version from the newlib library. The DMA transfer code has minimal overhead, as the timed code path consists of initiating a DMA transfer using the SDK API and polling for the result. The random read/write code is a straightforward C for loop with double indexed array access. The test code was compiled with the -O2 optimization flag. The code used for the memory transfers does not represent any particular optimization or use case but is typical of what might be found in a real-world application.
Test Results
The results shown are copies from a serial debug terminal. Data was recorded, formatted, and printed by the MCXN947 PSRAM test firmware.
The first test represents a control case: the source and destination buffers are both in internal SRAM, in RAM banks on different AHB ports. There are many interesting features in the control data, but for now we will consider it a baseline for how the test algorithms perform when using PSRAM.
There were a few notable features in test run #1, which has the CACHE64 disabled.
- The memcpy reads were in some cases faster than DMA. Some of this was to be expected from the control run data, but it was still surprising and warrants additional investigation into how the SDK uses the DMA controller.
- PSRAM reads were generally much faster than writes in the block transfers. Some of this was to be expected based upon the published timing diagrams in the APS6408L-3OBM-BA datasheet, but the difference was quite remarkable and would warrant further study.
- The random-access tests were quite slow, which was to be expected. Accessing random words will trigger frequent FlexSPI page transactions with the PSRAM.
Test run #2 was identical to #1 except that the FlexSPI clock rate was increased to 150MHz. This is overclocking the APS6408L-3OBM-BA PSRAM which is not recommended in a production use case over the published temperature range. As expected, there was a slight increase in performance due to the faster clock.
Test Run #3 returns to the 133MHz clock rate and enables the CACHE64 module.
A few notable features in this dataset:
- The DMA read/write characteristics match the control test. This is an indication that the CPU is primarily interacting with the cache, not the PSRAM.
- Once the block size is larger than 16KB, we can see the read and write rates fall significantly. This is to be expected as this is the size of the CACHE64. When the block access is larger than 16KB, the FlexSPI peripheral needs to perform external accesses to fetch 512-byte pages (cache miss).
- The random-access tests show that using the cache when reads/writes constantly miss incurs a strong performance penalty. When there is a cache miss, the FlexSPI fetches an entire 512-byte block from the PSRAM. It is important to consider the use case to avoid this penalty.
- Inside of the 16KB cache boundary, random access performance is improved.
Test run #4 is a repeat of #3 with the FlexSPI running at 150MHz.
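The size of the random-access penalty can be estimated from the figures above: every miss moves a full 512-byte block over the bus, while the code uses only 4 bytes of it. A rough sketch of that ratio is shown below. It is a simplified model that ignores command and latency overhead, and read_amplification is a hypothetical name.

```c
#include <stdint.h>

#define FETCH_BLOCK_BYTES 512u /* block fetched on a miss, as described above */

/* Ratio of bytes moved over the bus to bytes actually used by the code.
 * accesses:     total number of accesses
 * misses:       how many of those accesses missed the cache
 * access_bytes: size of each access in bytes (4 for 32-bit words)
 * Simplified model for illustration only. */
static uint32_t read_amplification(uint32_t accesses, uint32_t misses, uint32_t access_bytes)
{
    uint64_t useful = (uint64_t)accesses * access_bytes;
    uint64_t moved  = (uint64_t)misses * FETCH_BLOCK_BYTES;

    if (useful == 0u) {
        return 0u; /* nothing was accessed */
    }
    return (uint32_t)(moved / useful);
}
```

When every 32-bit read misses, read_amplification(1000, 1000, 4) gives 128, meaning the bus moves 128 times more data than the code consumes. If only a quarter of the reads miss, the ratio drops to 32.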
Final Thoughts
From this initial data we can observe the behavior of the FlexSPI controller coupled to an Octal PSRAM through the CACHE64 using some limiting test cases. Real world performance will vary, but this dataset and code can provide a starting point to assess suitability for a specific requirement. These test cases show some of the performance boundaries, so it is to be expected that real world performance will fall between these limits.
While it was out of scope for this paper, it is possible to execute code from FlexSPI/PSRAM. There is some precedent available with the LPC5536 microcontroller. It uses the same FlexSPI controller and a smaller 8KB CACHE64 module. NXP Application Note AN13591 provides data on XIP performance as compared to code executing from internal flash on the LPC5536:
https://www.nxp.com/docs/en/application-note/AN13591.pdf
Interestingly, CoreMark scores on the LPC5536 are nearly identical when running from internal flash, Octal SPI flash and Octal SPI HyperRAM/PSRAM.
For the most timing-critical operations, low-latency internal memory is the preferred storage method. However, using external PSRAM on the FlexSPI interface can enable a large degree of flexibility in potential applications. Adding a large amount of external RAM is simple from the PCB design point of view and does not add significantly to the system BOM.
Using the FRDM-MCXN947 is a simple way to evaluate a FlexSPI/PSRAM-based design at a low cost. You can find more information about the FRDM-MCXN947 and the MCXN947 microcontroller at the following links.
https://www.nxp.com/design/design-center/development-boards-and-designs/general-purpose-mcus/frdm-development-board-for-mcx-n94-n54-mcus:FRDM-MCXN947
https://www.nxp.com/products/processors-and-microcontrollers/arm-microcontrollers/general-purpose-mcus/mcx-arm-cortex-m/mcx-n-series-microcontrollers/mcx-n94x-54x-highly-integrated-multicore-mcus-with-on-chip-accelerators-intelligent-peripherals-and-advanced-security:MCX-N94X-N54X
References
Code reference used for tests in this paper
https://github.com/wavenumber-eng/mcxn947_octal_psram.git
Understanding newlib memcpy performance
https://interrupt.memfault.com/blog/memcpy-newlib-nano
FlexSPI CoreMark Performance on LPC553x/LPC55S3x
https://www.nxp.com/docs/en/application-note/AN13591.pdf
FRDM-MCXN947 Product Page
MCXN947 Product Page
AP Memory APS6408L PSRAM Datasheet