From: Andrew Morton <akpm <at> linux-foundation.org>
Subject: Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
Newsgroups: gmane.linux.kernel
Date: Thursday 22nd December 2011 22:54:26 UTC
On Thu, 22 Dec 2011 14:41:17 -0800
Tejun Heo wrote:

> Hello, Andrew.
> On Thu, Dec 22, 2011 at 02:20:58PM -0800, Andrew Morton wrote:
> > Don't just consider my suggestions - please try to come up with your
> > alternatives too!  If all else fails then this patch is a last resort.
> Umm... this is my alternative.

We're beyond the point where any additional kernel complexity should
be considered a regression.

> > > but apparently those percpu stats reduced
> > > CPU overhead significantly.
> > 
> > Deleting them would save even more CPU.
> > 
> > Or make them runtime or compile-time configurable, so only the
> > developers see the impact.
> > 
> > Some specifics on which counters are causing the problems would help
> These stats are userland visible and quite useful ones if blkcg is in
> use.  I don't really see how these can be removed.

What stats?

And why are we doing percpu *allocation* so deep in the code?  You mean
we're *creating* stats counters on an IO path?  Sounds odd.  Where is
this code?

> > > > Or how about we fix the percpu memory allocation code so that it
> > > > propagates the gfp flags, then delete this patchset?
> > > 
> > > Oh, no, this is gonna make things *way* more complex.  I tried.
> > 
> > But there's a difference between fixing a problem and working around
> Yeah, that was my first direction too.  The reason why percpu can't do
> NOIO is the same one why vmalloc can't do it.  It reaches pretty deep
> into page table code and I don't think doing all that churning is
> worthwhile or even desirable.  An alternative approach would be
> implementing transparent front buffer to percpu allocator, which I
> *might* do if there really are more of these users, but I think
> keeping percpu allocator painful to use from reclaim context isn't
> such a bad idea.
> There have been multiple requests for atomic allocation and they all
> have been successfully pushed back, but IMHO this is a valid one and I
> don't see a better way around the problem, so while I agree using
> mempool for this is a workaround, I think it is a right choice, for
> now, anyway.

For starters, doing pagetable allocation on the I/O path sounds nutty.

Secondly, GFP_NOIO is a *weaker* allocation mode than GFP_KERNEL.  By
permitting it with this patchset, we have a kernel which is more likely
to get oom failures.  Fixing the kernel to not perform GFP_NOIO
allocations for these counters will result in a more robust kernel. 
This is a good thing, which improves the kernel while avoiding adding
more complexity elsewhere.
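
To make "weaker" concrete, here is a minimal userspace sketch, not kernel
code.  It assumes the usual composition of the masks (GFP_KERNEL allows
waiting, starting I/O and calling into the filesystem; GFP_NOIO allows
only waiting), which is how I recall gfp.h of this era defining them.
The SK_* names, the bit values and both helpers are made up for
illustration.

```c
typedef unsigned int gfp_t;

/* Illustrative bit values only; the composition is what matters here. */
#define SK_GFP_WAIT   0x1u  /* allocator may sleep and reclaim */
#define SK_GFP_IO     0x2u  /* reclaim may start physical I/O */
#define SK_GFP_FS     0x4u  /* reclaim may call into the filesystem */

#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)
#define SK_GFP_NOIO   (SK_GFP_WAIT)

/* True if every action permitted by 'a' is also permitted by 'b'. */
int sk_gfp_is_subset(gfp_t a, gfp_t b)
{
	return (a & ~b) == 0;
}

/* True if an allocation with these flags may start I/O to free memory. */
int sk_gfp_may_start_io(gfp_t gfp)
{
	return (gfp & SK_GFP_IO) != 0;
}
```

Under that assumption SK_GFP_NOIO is a strict subset of SK_GFP_KERNEL:
the allocator has fewer ways to free memory, so it is more likely to
give up.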

This patchset is the worst option and we should try much harder to avoid
applying it!

> > > If we're gonna have many more NOIO percpu users, which I don't
> > > think we would or should, that might make sense but, for fringe
> > > cases, extending mempool to cover percpu is a much better sized
> > > solution.
> > 
> > I've long felt that we goofed with the gfp_flags thing and that it
> > should be a field in the task_struct.  Now *that* would be a large
> > patch!
> Yeah, some of PF_* flags already carry related role information.  I'm
> not too sure how much pushing the whole thing into task_struct would
> change tho.  We would need push/popping.  It could be simpler in some
> cases but in essence wouldn't we have just relocated the position of
> parameter?

The code would get considerably simpler.  The big benefit comes when
you have deep call stacks - we're presently passing a gfp_t down five
layers of function call while none of the intermediate functions even
use the thing - they just pass it on to the next guy.  Pass it via the
task_struct and all that goes away.  It would make maintenance a lot
easier - at present if you want to add a new kmalloc() to a leaf
function you need to edit all five layers of caller functions.
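
A rough userspace sketch of the two styles, to show what goes away.
Nothing here is existing kernel API: struct sketch_task stands in for
task_struct, sketch_alloc() for kmalloc(), and the push/pop helpers are
the hypothetical save/restore pair mentioned above.  Three layers are
shown instead of five to keep it short.

```c
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gfp_t;

/* Illustrative flag values, not the kernel's. */
#define SKETCH_GFP_WAIT   0x1u
#define SKETCH_GFP_IO     0x2u
#define SKETCH_GFP_FS     0x4u
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)
#define SKETCH_GFP_NOIO   (SKETCH_GFP_WAIT)

/* Hypothetical stand-in for task_struct carrying the allocation mode. */
struct sketch_task {
	gfp_t gfp_mask;
};

static struct sketch_task sketch_current = { SKETCH_GFP_KERNEL };

/* Records the flags the allocator last saw, so callers can inspect them. */
static gfp_t last_gfp_seen;

/* Stand-in for kmalloc(): notes the flags and defers to malloc(). */
void *sketch_alloc(size_t size, gfp_t gfp)
{
	last_gfp_seen = gfp;
	return malloc(size);
}

/* Style A: every layer takes a gfp_t only to hand it to the next one. */
void *leaf_a(size_t n, gfp_t gfp) { return sketch_alloc(n, gfp); }
void *mid_a(size_t n, gfp_t gfp)  { return leaf_a(n, gfp); }
void *top_a(size_t n, gfp_t gfp)  { return mid_a(n, gfp); }

/* Style B: the mode lives in the task; the caller pushes and pops it. */
gfp_t sketch_gfp_push(gfp_t mask)
{
	gfp_t old = sketch_current.gfp_mask;

	sketch_current.gfp_mask = mask;
	return old;
}

void sketch_gfp_pop(gfp_t old)
{
	sketch_current.gfp_mask = old;
}

/* Only the leaf looks at the mode; intermediate layers never see it. */
void *leaf_b(size_t n) { return sketch_alloc(n, sketch_current.gfp_mask); }
void *mid_b(size_t n)  { return leaf_b(n); }
void *top_b(size_t n)  { return mid_b(n); }
```

In style B a caller on the I/O path would bracket the call with
sketch_gfp_push(SKETCH_GFP_NOIO) and sketch_gfp_pop(), and adding an
allocation to leaf_b() touches no other function signature.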