26 November 2013

Multi GPU Machines and PCIE In-Band vs. Out-of-Band Presence Detection

TLDR: 
To improve PCIE bus initialization during boot when trying to run x16 GPUs via various PCIE risers, short pin A1 to B17 on ALL PCIE x1 risers (in the unlikely event you are using x4/x8 to x16 risers, look up the proper x4/x8 PRSNT#2 pin and short that one to A1 instead).

I guess I should preface everything below by saying, I'd be happy to hear how PCIE bus initialization really works (in relation to the problem described below) from someone who knows PCIE buses.


Full Story

Last night I had a problem getting multiple Radeon 7970 GPUs running on a single motherboard. 4 cards were fine, but 5 cards was pure chaos, with lspci showing only 3 cards. Swapping slots, risers and cards showed that every individual part was good, just not in any combination of more than 4 cards. In fact, some combinations of 4 were no good either (maybe even some with 3).

7970 GPUs



Some quick googling says shorting PCIE pins A1 and B17 produces various "better" results.

PCIE x1 Slot DIY Presence Detect

The picture below shows how a PCIE card is supposed to short a specific x1, x4, x8 or x16 pin (indicating its size) to pin A1 (present on all PCIE cards).
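
The pin pairs can be written down as a small lookup table. This is only a sketch: the x1 pair (A1-B17) is the one discussed in this post, while the x4/x8/x16 pin numbers are assumed from the standard PCIE card pinout and were not measured on my hardware, so verify them before soldering anything. The function name `pins_to_short` is made up for illustration.

```python
# PRSNT#1 is pin A1 on every card; PRSNT#2 moves with the connector size.
# B17 (x1) is the pin discussed in this post. The x4/x8/x16 entries are
# assumed from the standard PCIE pinout: verify them before shorting.
PRSNT1_PIN = "A1"
PRSNT2_PIN = {1: "B17", 4: "B31", 8: "B48", 16: "B81"}


def pins_to_short(lanes):
    """Return the (PRSNT#1, PRSNT#2) pin pair for a given connector size."""
    if lanes not in PRSNT2_PIN:
        raise ValueError(f"unsupported link width: x{lanes}")
    return PRSNT1_PIN, PRSNT2_PIN[lanes]


print(pins_to_short(1))  # the pair for an x1 riser
```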

More Mysterious Problems

Reading more and more internet postings never really cleared up my question of which slots to short and why. In all the reports of success/failure I could see that different slots (x1, x16, etc.) were being shorted under different conditions (powered/non-powered risers, x1-x16, x1-x1, x16-x16) with different results: sometimes more cards worked, sometimes not.

Basically it seemed that the mysterious behavior of 4+ GPUs with x1-x16 risers in motherboards without shorting anything was still mysterious and now had exponentially more permutations with shorting all or some slots added into the mix.

Ohming Out My Hardware

Ohming out the presence detect pins on my GPUs and motherboard, plus reading the doc above, I was still confused. As I ohmed out my x1-x16 risers I figured out that the risers have no x1 presence detect pins shorted within the riser; the pins are simply passed up to the x16 connector. I then found that the GPU has no x1 presence detect pin shorted either, since it's an x16 board and has the x16 pin shorted (as the picture above indicates it's supposed to do).

I also saw that my motherboard did have the PRSNT#2 pins pulled up, and not grounded or left unconnected as mentioned in the pictures above. This should mean that my motherboard actually used them, or at least, could be using them. 

PCIE Bus Initialization Mayhem

So, with my x1-x16 risers, the motherboard has no way of knowing that a card is even inserted, much less what size it is?!? This can't be because the cards are working! So the PCIE chipset must be "watching" the PCIE links on each slot and trying to talk over it to see if there is a card there. Some sort of fallback method to support cards that don't implement presence detect pins.
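
What I measured can be captured in a toy model. This is a minimal sketch of my understanding, not a description of what any chipset really does: it assumes a card shorts A1 only to the PRSNT#2 pin for its own size, and that a riser plug only carries the PRSNT#2 pins that physically fit on it. The x4/x8/x16 pin numbers are assumed from the standard pinout, and the function name is made up.

```python
# Toy model of out-of-band presence detect through a riser.
# Pin names other than B17 are assumed from the standard PCIE pinout.
PRSNT2_PIN = {1: "B17", 4: "B31", 8: "B48", 16: "B81"}


def slot_sees_card(card_lanes, riser_lanes, riser_has_short=False):
    """Return True if the motherboard sees A1 tied to a PRSNT#2 pin.

    card_lanes:      physical size of the card edge (16 for a GPU)
    riser_lanes:     size of the riser's motherboard-side plug
    riser_has_short: True if A1 is wired to PRSNT#2 on the riser itself
    """
    if riser_has_short:
        return True
    # The card shorts A1 only to the PRSNT#2 pin for its own size...
    card_pin = PRSNT2_PIN[card_lanes]
    # ...and the riser plug only has PRSNT#2 pins up to its own size.
    pins_on_plug = {PRSNT2_PIN[w] for w in PRSNT2_PIN if w <= riser_lanes}
    return card_pin in pins_on_plug


print(slot_sees_card(16, 1))                        # x16 GPU on x1-x16 riser
print(slot_sees_card(16, 1, riser_has_short=True))  # same riser with the short
```

In this model the unmodified x1-x16 riser never presents a presence short to the slot, while the same riser with A1 wired to B17 always does.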

That detection method seems a little error prone, especially if you consider an x16 slot... the PCIE chipset would have to try talking on all 16 x1 links to see if there was any response and determine whether that response was valid, garbled, etc. Even if the PCIE card is able to tell the PCIE chipset its size after the first x1 link comes up, the card is going to say "I'm x16" and then proceed to fail to talk over 15 x1 links (due to the x1-x16 riser). I guess it's possible that if the first additional x1 link fails, the PCIE chipset will give up on the remaining 14 and say "you're now x1", which would speed up the process... if it does that.
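
The arithmetic behind that guess is easy to sketch. This is a toy model of my speculation, not of the real link training logic: it simply assumes the link ends up at the widest width that the slot, riser and card can all carry. Both function names are invented for illustration.

```python
def negotiated_width(slot_lanes, riser_lanes, card_lanes):
    """Widest link every element in the chain can carry (toy model)."""
    return min(slot_lanes, riser_lanes, card_lanes)


def lanes_that_fail(slot_lanes, riser_lanes, card_lanes):
    """Lanes that slot and card both have but the riser does not connect."""
    offered = min(slot_lanes, card_lanes)
    return offered - negotiated_width(slot_lanes, riser_lanes, card_lanes)


# x16 GPU in an x16 slot through an x1-x16 riser
print(negotiated_width(16, 1, 16), lanes_that_fail(16, 1, 16))
```

For an x16 GPU in an x16 slot through an x1 riser, the model gives a x1 link with 15 lanes that never come up, which is the case described above.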

Also add in things like a motherboard manual saying "if Slot x1_3 is used along with Slot x4_1, the bandwidth is shared", and x16 slots becoming only x8 if another x8 slot is used. With all that slot size reconfiguration and link availability/sharing on the motherboard side, there must be a lot of (failed) negotiation going on, especially when we are inserting nothing but x16 cards that will only be able to work at x1 (due to x1-x16 risers). This could explain hung boots and cards that disappear when additional cards (also without presence detect pins due to risers) are inserted in various slots.


Soldering 

I started to think that with the real physical presence detect pins wired as recommended by the standard, the PCIE chipset can immediately know what size the card is and skip trying to determine the size by looking for RX terminations or TX'ing on various channels/slots. This should in theory speed up and drastically reduce the complexity of BIOS boot up, or at least of the PCIE bus initialization.
As I looked for a way to apply this A1-B17 short, I decided to put the short on the x16 connector end of my x1-x16 risers. This makes my GPU cards on x1-x16 risers look like true x1 cards in any slot you put them in. It also will not confuse the PCIE chipset with a shorted presence detect pin on an empty slot if you end up swapping boards around, etc.

The green wire is the short:







Results

I only have a sample size of one, but the first boot with A1-B17 shorts on all non-x16 risers worked flawlessly with 5 GPUs.


Yeah the temp reporting is not working in this screenshot, things are still in flux.
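
To check after each boot that every card made it onto the bus, the lspci output can be counted instead of eyeballed. This is a minimal sketch, assuming a Linux box with pciutils installed. It matches the "VGA compatible controller" and "Display controller" class names that lspci prints, so it will also count an onboard GPU if the motherboard has one. The helper names are made up.

```python
import subprocess


def count_gpus(lspci_text):
    """Count display devices in the text output of plain `lspci`."""
    # Note: an integrated/onboard GPU matches these class names too.
    keywords = ("VGA compatible controller", "Display controller")
    return sum(
        1 for line in lspci_text.splitlines()
        if any(k in line for k in keywords)
    )


def read_lspci():
    """Run `lspci` and return its output (needs pciutils on Linux)."""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the rig itself, `count_gpus(read_lspci())` should equal the number of cards physically installed.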

Further Info

I did some quick googling on PCIE presence detection. Some PCI SIG presentations describe hotplug/insertion via In-Band and Out-of-Band methods of presence detection. Out-of-Band is the presence detect pins we are discussing shorting. In-Band sure sounds like it's done in the x1 links between the card and the PCIE chipset, which is what I believe leads to far too much confusion when loading in multiple cards in shared bandwidth slots etc.

In my quick searching I didn't turn up much on In-Band hot plug detection, but essentially in-band presence detection is an active mode in which the chipset has to look for RX terminations on the PCIE cards and for something called "beacon" signals. I think this supports the idea that leaving presence detection up to the motherboard causes all sorts of mayhem on the board, with things probably ending up garbled and/or timing out. Coupled with slots that share x1 links, etc., I think it's a recipe for unstable boot up and poor detection of the boards actually present.


x1 to x1 Riser Hack
You can see an example of an x1 connector with the end cut off here: http://blog.zorinaq.com/images/pcie-carved-1.jpg

Thinking this through, the situation is the same with x1-x1 risers where the end of the GPU-side slot/connector is cut out (assuming they are pin for pin straight through, with no PRSNT#2 short within the riser). The GPU card only has the x16 presence detect pin shorted, so even though the x1-x1 riser passes the x1 presence detect pin through from the GPU board, that pin is not used on the GPU board. The motherboard still does not receive definitive notification that the board is present and that its size is x1.


PCI Latency Fix
Another posting I saw last night reported success at 5+ cards when increasing PCIE latency in the BIOS... my BIOS has no adjustment for that though. I'm far from understanding PCI latency in detail, but it seems to me that it could conceivably allow more negotiation time or, in some manner, less mayhem during initialization of the PCIE bus(es) or links?

7 comments:

  1. So just to clarify, we are trying to run 6 GPUs on a MB with 3 x 1x and 3 x 16x slots,
    we are using 1x-16x powered risers. Are we to short all the risers, or only those to be used in the 1x slots?

  2. In your case I would short them all, so they appear as a proper x1 card to the x1 slots, and also to the x16 slots.

    You can see the presence detect short on your x16 GPU card edge as the two pins that are slightly shorter (farther from the edge) than all the other pins. When using the x1-x16 adapter, that short is not passed down to the motherboard (even for a x16 slot).

  3. My motherboard's x1 slot does not have pins A1 and B17? So is it impossible to use a GPU on x1? I tried a powered 1x-16x cable with and without shorting. I soldered shorter "pins" and connected them with a small wire on the smaller end, but no luck. Only the fans are working...

  4. Hi,
    Yesterday I noticed a weird thing. In the past, in order to connect 5 GPUs to my mobo I had to short 1 slot. So I had 5 GPUs running on an AsRock 970 extreme 4, which has 5 slots (3 slots 16x and 2 slots 1x), but pretty quickly after startup 1 card was getting SICK. I disconnected all of the cards and found this one slot was not being detected, so I shorted the pins on that slot as well, and then all of them were running. Each card on its own can pull a pretty good hashrate, but when I connected all of them they couldn't keep the same hashrate, and the last added card had to run on lower settings or it would get SICK in cgminer. I started suspecting PCIe lanes. I'm at work now, but your article just made it clear to me that maybe I ran out of PCIe lanes. I will try to short all of the pins. Please tell me, can I short all of the ports on the mobo rather than on the risers like you did, because I don't have the necessary equipment? Also, can I power on the mobo with all slots shorted when, for example, only 1 card is in? Will it damage the mobo? It's pretty annoying to take these tiny wires out each time I take a card out of a slot...

  5. Does anyone know where one can purchase shorted 1x to 16x powered risers? My soldering skills are not that good.

    Replies
    1. I agree - I would happily pay for short'ed risers!

  6. The BIOS doesn't have the resources allocated (reserved PCIe memory space) to initialize the 5th card.
