To improve PCIE bus initialization during boot when trying to run x16 GPUs via various PCIE risers, short pin A1 to B17 on ALL PCIE x1 risers (in the unlikely event you are using x4/x8 to x16 risers, look up the proper x4/x8 PRSNT#2 pin and short that one to A1 instead).
I guess I should preface everything below by saying that I'd be happy to hear how PCIE bus initialization really works (in relation to the problem described below) from someone who knows PCIE buses.
Full Story
Last night I had a problem getting multiple Radeon 7970 GPUs running on a single motherboard. 4 cards were good; 5 cards were pure chaos, with lspci showing only 3 cards. Swapping slots, risers, and cards showed that every part was good on its own, just not in any combination of more than 4, and in fact some combinations of 4 were no good either (maybe even some with 3).
Some quick googling says shorting PCIE pins A1 and B17 produces various "better" results.
|PCIE x1 Slot DIY Presence Detect|
The picture below shows how a PCIE card is supposed to short a specific x1, x4, x8, or x16 pin (indicating its size) to pin A1 (present on all PCIE cards).
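The pin pairing described above can be written down as a small lookup. This is only a sketch: the A1 and B17 positions come from this post, while the x4, x8, and x16 PRSNT#2 positions (B31, B48, B81) are taken from the commonly published PCIe connector pinout and should be checked against your own riser before soldering. The names `PRSNT2_PIN_BY_WIDTH` and `presence_detect_short` are made up for illustration.

```python
# PRSNT#1 sits at pin A1 on every PCIe card edge connector.
PRSNT1_PIN = "A1"

# PRSNT#2 position for each connector width. B17 (x1) is the pin discussed
# in this post; the others are from the commonly published pinout and are
# an assumption to verify against your own hardware.
PRSNT2_PIN_BY_WIDTH = {1: "B17", 4: "B31", 8: "B48", 16: "B81"}


def presence_detect_short(link_width):
    """Return the (PRSNT#1, PRSNT#2) pin pair a card of this width shorts."""
    if link_width not in PRSNT2_PIN_BY_WIDTH:
        raise ValueError(f"no standard connector of width x{link_width}")
    return PRSNT1_PIN, PRSNT2_PIN_BY_WIDTH[link_width]


if __name__ == "__main__":
    for width in sorted(PRSNT2_PIN_BY_WIDTH):
        a_pin, b_pin = presence_detect_short(width)
        print(f"x{width}: short {a_pin} to {b_pin}")
```

For an x1 riser this gives the A1 to B17 short; a width such as x2 has no entry here and raises an error rather than guessing.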
More Mysterious Problems
Reading more and more internet postings never really cleared up my question of which slots to short and why. In all the reports of success and failure I could see that different slots (x1, x16, etc.) were being shorted under different conditions (powered or non-powered risers, x1-x16, x1-x1, x16-x16) with different results: sometimes more cards working, sometimes not.
Basically it seemed that the mysterious behavior of 4+ GPUs with x1-x16 risers in motherboards without shorting anything was still mysterious and now had exponentially more permutations with shorting all or some slots added into the mix.
Ohming Out My Hardware
After ohming out the presence detect pins on my GPUs and motherboard, and reading the doc above, I was still confused. As I ohmed out my x1-x16 risers, I figured out that the risers have no x1 presence detect pin shorted within the riser; the pins are simply passed up to the x16 connector. I then found that the GPU has no x1 presence detect pin shorted either, since it's an x16 board and has the x16 pin shorted (as the picture above indicates it's supposed to do).
I also saw that my motherboard did have the PRSNT#2 pins pulled up, and not grounded or left unconnected as mentioned in the pictures above. This should mean that my motherboard actually used them, or at least, could be using them.
PCIE Bus Initialization Mayhem
So, with my x1-x16 risers, the motherboard has no way of knowing that a card is even inserted, much less what size it is?!? That can't be the whole story, because the cards are working! So the PCIE chipset must be "watching" the PCIE links on each slot and trying to talk over them to see if there is a card there: some sort of fallback method to support cards that don't implement presence detect pins.
That detection method seems a little error prone, especially if you consider an x16 slot... the PCIE chipset would have to try talking on all 16 x1 links to see if there was any response and determine whether that response was valid, garbled, etc. Even if the PCIE card is able to tell the PCIE chipset its size after the first x1 link comes up, the card is going to say "I'm x16" and then proceed to fail to talk over the other 15 x1 links (due to the x1-x16 riser). I guess it's possible that if the first additional x1 link fails, the PCIE chipset will give up on the remaining 14 and say "you're now x1", which would speed up the process... if it does that.
Also add in things like a motherboard manual saying "if Slot x1_3 is used along with Slot x4_1, the bandwidth is shared", and x16 slots becoming only x8 if another x8 slot is used. With all that slot size reconfiguration and link availability/sharing on the motherboard side, there must be a lot of (failed) negotiation going on, especially when we are inserting nothing but x16 cards that will only be able to work at x1 (due to x1-x16 risers). This could explain hung boots and cards that disappear when additional cards (also without presence detect pins, due to risers) are inserted in various slots.
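To make the expected end result of that negotiation concrete, here is a toy model. It is not how real link training works inside the chipset; it only encodes the outcome guessed at above, namely that the link ends up at the widest width the card, the slot, and the physically wired lanes can all carry. The function name, the parameters, and the list of widths are all assumptions for illustration.

```python
# Widths a link might settle on, widest first. Hypothetical list: a real
# port may only support a subset of these.
CANDIDATE_WIDTHS = (16, 8, 4, 2, 1)


def negotiated_width(card_width, slot_width, wired_lanes):
    """Toy model of the final link width.

    card_width  -- lanes the card advertises (16 for an x16 GPU)
    slot_width  -- lanes the motherboard slot currently offers
    wired_lanes -- lanes actually connected, counted from lane 0
                   (1 for an x1-x16 riser)
    Returns 0 when no lane is wired, i.e. no link comes up.
    """
    limit = min(card_width, slot_width, wired_lanes)
    for width in CANDIDATE_WIDTHS:
        if width <= limit:
            return width
    return 0


if __name__ == "__main__":
    # x16 GPU in an x16 slot, but only one lane wired through the riser.
    print(negotiated_width(card_width=16, slot_width=16, wired_lanes=1))
```

The model says nothing about how long the chipset takes to reach that answer or how many failed attempts happen along the way, which is exactly the part this post is speculating about.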
Soldering
I started to think that with the real physical presence detect pins wired as recommended by the standard, the PCIE chipset can immediately know what size the card is and skip trying to determine the size by looking for RX terminations or TX'ing on various channels/slots. This should, in theory, speed up BIOS boot and drastically reduce its complexity, or at least that of the PCIE bus initialization.
As I looked for a way to apply this A1-B17 short, I decided to put the short on the x16 connector end of my x16-x1 risers. This makes my GPU cards on x16-x1 risers look like true x1 cards in any slot I put them in. It also will not confuse the PCIE chipset with a shorted presence detect pin on an empty slot if you end up swapping boards around.
The green wire is the short:
Results
I only have a sample size of one, but the first boot with A1-B17 shorts on all non-x16 risers worked flawlessly with 5 GPUs.
Yeah the temp reporting is not working in this screenshot, things are still in flux.
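If you want to check what width each card actually ended up at after a boot, Linux exposes the `current_link_width` and `max_link_width` attributes for PCIe devices under `/sys/bus/pci/devices`. The sketch below reads them; the function name `read_link_widths` is made up, and I'm assuming a GPU behind an x1 riser would show up as current x1 with max x16.

```python
from pathlib import Path


def read_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI address to (current_link_width, max_link_width).

    Devices that do not expose both attributes are skipped, and a missing
    sysfs directory (e.g. on a non-Linux system) gives an empty dict.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    widths = {}
    for dev in sorted(root.iterdir()):
        try:
            current = int((dev / "current_link_width").read_text().strip())
            maximum = int((dev / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable for this device
        widths[dev.name] = (current, maximum)
    return widths


if __name__ == "__main__":
    for address, (current, maximum) in read_link_widths().items():
        print(f"{address}: x{current} (max x{maximum})")
```

A card that is missing from this listing altogether was not enumerated at all, which matches the lspci symptom described at the top of this post.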
Further Info
I did some quick googling on PCIE presence detection. Some PCI SIG presentations describe hotplug/insertion via In-Band and Out-of-Band methods of presence detection. Out-of-Band is the presence detect pins we are discussing shorting. In-Band sure sounds like it's done over the x1 links between the card and the PCIE chipset, which is what I believe leads to far too much confusion when loading multiple cards into shared-bandwidth slots, etc.
My quick searching didn't turn up much on In-Band hot plug detection, but essentially in-band presence detection is an active mode in which the chipset has to look for RX terminations on the PCIE cards and for something called "beacon" signals. I think this supports the idea that leaving presence detection up to the motherboard to figure out causes all sorts of mayhem on the board, with things probably ending up garbled and/or timing out. Coupled with slots that share x1 links, I think it's a recipe for unstable boot up and poor detection of the boards actually present.
x1 to x1 Riser Hack
You can see an example of a x1 connector with the end cut off here: http://blog.zorinaq.com/images/pcie-carved-1.jpg
Thinking this through, the situation is the same with x1-x1 risers where the end of the GPU-side slot/connector is cut out (assuming they are pin-for-pin straight through, with no PRSNT#2 short within the riser). The x1-x1 riser does pass the x1 presence detect pin through from the GPU board, but the GPU only has the x16 presence detect pin shorted, so the x1 pin is not used on the GPU board. The motherboard therefore still does not receive definitive notification that the board is present and that its size is x1.
PCI Latency Fix
Another posting I saw last night reported success at 5+ cards when increasing PCIE latency in the BIOS... my BIOS has no adjustment for that, though. I'm far from understanding PCI latency in detail, but it seems to me that it could conceivably allow more negotiation time or, in some manner, cause less mayhem during initialization of the PCIE bus(es) or links.