From: Jeff Garzik <jgarzik <at> redhat.com>
Subject: tg3 update available...
Newsgroups: gmane.linux.hardware.dell.poweredge
Date: 2003-02-14 22:58:08 GMT
Ladies and gentlemen,

For your weekend stress-testing pleasure, the latest release of tg3 is
available from

http://people.redhat.com/jgarzik/pub/legolas2-7.x/	(redhat 7.x)
http://people.redhat.com/jgarzik/pub/legolas2-8.0/	(redhat 8.0)

These kernel RPMs are based on the latest errata kernel (2.4.18-24), and
like the other RPMs I have posted, they are unofficial, not for
production use, have not passed Red Hat Q/A, etc., etc.

This version of tg3 adds several workarounds for hardware bugs in the
BroadCom chips, and all tg3 users are encouraged to upgrade to this
driver version (tg3 v1.4).

Addressing the recent furor on this list...

Bad things come in spades, it seems.  I had to disappear due to a
family medical emergency, and was on the left coast all last week.
And of course that's when the errata furor happens.  :/

[BIG DISCLAIMER:  My presence on this list, and the statements that
follow, are purely of my own choice.  This is my own opinion,
not my employer's, and not representative of any sales agreements,
contracts, etc.]

1) "arrrrrrgh!!!! why not bcm5700?????"

This is a big question, but it's critical you guys understand the
answer.  A small but not insignificant part of it is political:
BroadCom doesn't release hardware docs, and didn't work [past
tense] with the Linux community.  So from Red Hat's perspective,
we get no support, just "pray that each new release works."  That is
unacceptable.

On a technical level, there are several issues with all BroadCom drivers
that prevent us from shipping them.  The biggest problem is an
interrupt stack corruption bug that can be triggered remotely.
Read:  bad guys can crash your production box, with bcm5700.  Or it
might crash if you're simply unlucky enough to hit the same condition.
But there are other operational and portability bugs too, which bite us
on other Red Hat platforms such as IA64 [where bcm5700 simply doesn't
work at all].

2) "But bcm5700 works for me and tg3 doesn't!  Ship it!"

The tg3 that went out in the kernel errata was, at the time,
the result of successful stress testing on a bunch of systems.
Several non-tg3 issues that may have played a role in the problems
people were seeing were also identified and resolved.

BroadCom unexpectedly turned up the day after the errata kernel was
cut, with a list of NIC hardware bugs tg3 needed to work around.
We integrated these fixes, but the timing was clearly awful: these
fixes did not make it into the errata kernel.  Those fixes are in tg3
version 1.4, posted in the above RPMs.

Anyway, for the kernel errata, Red Hat had two drivers: one of which
was known to have serious bugs and simply failed to work on many of
Red Hat's systems [bcm5700], and one of which successfully passed
stress tests on our local systems [tg3].  The choice was obvious,
though in hindsight, judging from this list, more could have been done.

3) What's the deal with "noapic"?

That changes the way interrupts are delivered, from the "new way" (the
I/O APIC) back to the "old way" (the legacy 8259 PIC).  [more details
upon request]   Why does it solve some people's tg3 problems?  Because
they are hitting problems unrelated to tg3 that tg3 is simply fast
enough to trigger.

4) What are good temporary solutions?

Running a uniprocessor kernel instead of SMP will likely fix all
NIC-related problems.  "noapic" on an SMP kernel helps, but the deadlock
potential is still there, just reduced.

Using an e1000 NIC and driver is also very unlikely to trigger a freeze.

5) Is there hope, Luke?

Yes!  Just today, Red Hat has identified two non-tg3 problems that are
likely causes of several of the reported freezes.  These problems can
trigger on bcm5700, e1000, and other NICs, but are much more likely to
trigger on tg3 because it is very aggressive with its net stack usage.

Add those fixes to the BroadCom-requested tg3 fixes above, and the
future is looking very rosy.

Questions, comments and flames are all appreciated.  I'm only the
kernel hacker tasked to fix net drivers, not a Red Hat Support guy,
but I'll do the best I can to address people's concerns.

	Jeff

P.S.  The "RPMS" subdirectory of
http://people.redhat.com/jgarzik/tg3/tg3-1.4/ is simply a symlink to the
above "legolas2-8.0" kernel directory, and its contents are exactly the same.

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge <at> dell.com
http://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq or search the list archives at http://lists.us.dell.com/htdig/