IdleTimeoutFilter.queue leaking

IdleTimeoutFilter.queue leaking

Marc Arens
Hello,

we're currently seeing a problem for a single customer where
org.glassfish.grizzly.utils.IdleTimeoutFilter.queue is leaking/not getting cleaned
up properly, so the server runs out of memory within ~48 hours. As far as I
understand, Connections aren't removed from the queue until
org.glassfish.grizzly.utils.DelayedExecutor.DelayedRunnable instructs
org.glassfish.grizzly.utils.IdleTimeoutFilter.DefaultWorker.doWork(Connection)
to close the connection via closeSilently(), which results in
org.glassfish.grizzly.utils.IdleTimeoutFilter.handleClose(FilterChainContext)
being called.

Are/were there known bugs I missed, or is there an easy explanation for the
behaviour we're seeing? We don't see this on other installations, and I could
reproduce a similar behaviour by making a Servlet's service() method hang endlessly.
While the connections from the Apache in front of Grizzly were closed after an
Apache timeout of 100 seconds, the queue didn't shrink at all.
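
For illustration, a minimal sketch of the kind of hanging servlet described above
(purely illustrative; names and details are made up, not the actual code):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical repro servlet: service() never returns, so the request's
// FilterChainContext stays active indefinitely.
public class HangingServlet extends HttpServlet {

    private final CountDownLatch never = new CountDownLatch(1);

    @Override
    protected void service(final HttpServletRequest req, final HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            never.await(); // blocks forever, simulating a stuck request
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}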

I found org.glassfish.grizzly.utils.ActivityCheckFilter, which sounds useful
for this kind of situation, but how does it work in combination with suspended
requests, e.g. Comet?

We are using Grizzly 2.2.x.

Best regards,
Marc

Re: IdleTimeoutFilter.queue leaking

oleksiys
Administrator
Hi Marc,

On 13.01.14 08:37, Marc Arens wrote:

> Hello,
>
> we're currently seeing a problem for a single customer where
> org.glassfish.grizzly.utils.IdleTimeoutFilter.queue is leaking/not getting cleaned
> up properly, so the server runs out of memory within ~48 hours. As far as I
> understand, Connections aren't removed from the queue until
> org.glassfish.grizzly.utils.DelayedExecutor.DelayedRunnable instructs
> org.glassfish.grizzly.utils.IdleTimeoutFilter.DefaultWorker.doWork(Connection)
> to close the connection via closeSilently(), which results in
> org.glassfish.grizzly.utils.IdleTimeoutFilter.handleClose(FilterChainContext)
> being called.
>
> Are/were there known bugs I missed, or is there an easy explanation for the
> behaviour we're seeing? We don't see this on other installations, and I could
> reproduce a similar behaviour by making a Servlet's service() method hang endlessly.
> While the connections from the Apache in front of Grizzly were closed after an
> Apache timeout of 100 seconds, the queue didn't shrink at all.
IdleTimeoutFilter keeps track of all FilterChainContexts associated with
a Connection and only allows the Connection to be closed when it no longer
has any active FilterChainContext associated with it.
In your case, the endless Servlet.service() method keeps a
FilterChainContext active, so IdleTimeoutFilter doesn't treat the
Connection as idle.
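
For reference, a minimal sketch of how IdleTimeoutFilter is typically wired into
a FilterChain (the 30-second timeout and the class/method names of this example
are just for illustration):

import java.util.concurrent.TimeUnit;
import org.glassfish.grizzly.filterchain.FilterChain;
import org.glassfish.grizzly.filterchain.FilterChainBuilder;
import org.glassfish.grizzly.filterchain.TransportFilter;
import org.glassfish.grizzly.utils.DelayedExecutor;
import org.glassfish.grizzly.utils.IdleTimeoutFilter;

public class IdleTimeoutChainExample {

    public static FilterChain buildChain() {
        // Shared executor that drives the periodic idle checks.
        final DelayedExecutor idleExecutor =
                IdleTimeoutFilter.createDefaultIdleDelayedExecutor();
        idleExecutor.start();

        final FilterChainBuilder builder = FilterChainBuilder.stateless();
        builder.add(new TransportFilter());
        // Close a Connection once it has had no active FilterChainContext
        // for 30 seconds.
        builder.add(new IdleTimeoutFilter(idleExecutor, 30, TimeUnit.SECONDS));
        // ... protocol filters (e.g. the HTTP server filters) go here ...
        return builder.build();
    }
}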

I'm not aware of any bugs related to IdleTimeoutFilter, but if you have
a test case, we'll definitely take a look and fix the problem.

>
> I found org.glassfish.grizzly.utils.ActivityCheckFilter, which sounds useful
> for this kind of situation, but how does it work in combination with suspended
> requests, e.g. Comet?
ActivityCheckFilter is more lax: it doesn't track live
FilterChainContexts but monitors Connection activity (reads, writes,
other events), and if no activity has been registered for a given
period of time, it closes the Connection.

Both filters can work with Comet; both are meant to detect idle
Connections but use different approaches to do so. So it's up to you to
decide which one is better for a specific use case.
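
If you want to try the activity-based approach, the wiring would look much like
the IdleTimeoutFilter sketch above. This assumes ActivityCheckFilter takes the
same (executor, timeout, unit) constructor arguments as IdleTimeoutFilter, so
please double-check against the 2.2.x sources:

import java.util.concurrent.TimeUnit;
import org.glassfish.grizzly.filterchain.FilterChain;
import org.glassfish.grizzly.filterchain.FilterChainBuilder;
import org.glassfish.grizzly.filterchain.TransportFilter;
import org.glassfish.grizzly.utils.ActivityCheckFilter;
import org.glassfish.grizzly.utils.DelayedExecutor;
import org.glassfish.grizzly.utils.IdleTimeoutFilter;

public class ActivityCheckChainExample {

    public static FilterChain buildChain() {
        // Reuse IdleTimeoutFilter's helper just to obtain a DelayedExecutor
        // for the periodic checks.
        final DelayedExecutor checkExecutor =
                IdleTimeoutFilter.createDefaultIdleDelayedExecutor();
        checkExecutor.start();

        final FilterChainBuilder builder = FilterChainBuilder.stateless();
        builder.add(new TransportFilter());
        // Close a Connection after 100 seconds without any read/write/other
        // activity, regardless of suspended requests (constructor assumed,
        // mirroring IdleTimeoutFilter).
        builder.add(new ActivityCheckFilter(checkExecutor, 100, TimeUnit.SECONDS));
        return builder.build();
    }
}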

Thanks.

WBR,
Alexey.

Re: IdleTimeoutFilter.queue leaking

Marc Arens
Hello Alexey,

> On 13 January 2014 at 22:28 Oleksiy Stashok <[hidden email]> wrote:
>
> On 13.01.14 08:37, Marc Arens wrote:
> > [...]
> IdleTimeoutFilter keeps track of all FilterChainContexts associated with
> a Connection and only allows the Connection to be closed when it no longer
> has any active FilterChainContext associated with it.
> In your case, the endless Servlet.service() method keeps a
> FilterChainContext active, so IdleTimeoutFilter doesn't treat the
> Connection as idle.
>
> I'm not aware of any bugs related to IdleTimeoutFilter, but if you have
> a test case, we'll definitely take a look and fix the problem.

No test case, but we finally managed to reproduce/watch it via remote debugging on
a special Debian Squeeze derived distribution some customers are using. I assumed
it might have been caused by an old OpenJDK 6 runtime, but it could be seen with
Sun JDK 6 and even Oracle JDK 7, though only on this system.

We had configured an idle timeout of FOREVER during Grizzly startup:

final KeepAlive keepAlive = networkListener.getKeepAlive();
keepAlive.setIdleTimeoutInSeconds(-1); // -1: never time out idle keep-alive connections
keepAlive.setMaxRequestsCount(-1);     // -1: no limit on requests per connection

Now, watching a request get handled showed the following.

During handleClose, the IdleTimeoutFilter sets timeoutMillis to 0, as observed in
the debugger:

Daemon Thread [OXWorker-0000036] (Suspended (breakpoint at line 325 in
IdleTimeoutFilter$Resolver))
        IdleTimeoutFilter$Resolver.removeTimeout(Connection) line: 325
        IdleTimeoutFilter$Resolver.removeTimeout(Object) line: 321
        DelayedExecutor$DelayQueue<E>.remove(E) line: 196
        IdleTimeoutFilter.handleClose(FilterChainContext) line: 18

public boolean removeTimeout(final Connection connection) {
    IDLE_ATTR.get(connection).timeoutMillis.set(0);
    return true;
}

resulting in

(org.glassfish.grizzly.nio.transport.TCPNIOConnection)
TCPNIOConnection{localSocketAddress=/127.0.0.1:8009,
peerSocketAddress=/127.0.0.1:57573}
IDLE_ATTR.get(connection)
         (org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord)
org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord@6ce13334
IDLE_ATTR.get(connection).timeoutMillis
         (java.util.concurrent.atomic.AtomicLong) 0

This should normally cause the next run of DelayedExecutor.DelayedRunnable to
remove the element from the DelayQueue, since currentTimeMillis - 0 >= 0:

final Object element = it.next();
final Long timeoutMillis = resolver.getTimeoutMillis(element);
...
  } else if (currentTimeMillis - timeoutMillis >= 0) {
        it.remove();
...

But when examining this part in the debugger again, we see that timeoutMillis was
reset from 0 back to FOREVER (9223372036854775807):

(org.glassfish.grizzly.nio.transport.TCPNIOConnection)
TCPNIOConnection{localSocketAddress=/127.0.0.1:8009,
peerSocketAddress=/127.0.0.1:57573}
IDLE_ATTR.get(connection)
         (org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord)
org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord@6ce13334
IDLE_ATTR.get(connection).timeoutMillis
         (java.util.concurrent.atomic.AtomicLong) 9223372036854775807

This seems to be a concurrency issue, as it doesn't always happen when I add an
additional breakpoint in
IdleTimeoutFilter.ContextCompletionListener.onComplete().

The same code/config just works on a default Debian. Any thoughts on this one?
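
To make the suspected interleaving concrete, here is a standalone sketch of the
pattern (this is not Grizzly's code; the guarded variant at the end is only a
hypothetical illustration of one way to avoid the lost update, not necessarily
how an actual fix would look):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Standalone illustration of the suspected lost-update race.
public class TimeoutRaceSketch {

    static final long FOREVER = Long.MAX_VALUE;

    final AtomicLong timeoutMillis = new AtomicLong(FOREVER);
    final AtomicBoolean closed = new AtomicBoolean(false);

    // handleClose() path: mark the record as immediately expirable.
    void onClose() {
        closed.set(true);
        timeoutMillis.set(0);
    }

    // Completion-listener path that re-arms the timeout unconditionally.
    // If this runs after onClose(), the 0 written above is overwritten with
    // FOREVER and the DelayQueue entry never qualifies for removal again.
    void onCompleteRacy() {
        timeoutMillis.set(FOREVER);
    }

    // Hypothetical guarded variant: skip the re-arm once the connection is
    // known to be closed. A check-then-act window remains, so a real fix
    // would need an atomic update (e.g. compareAndSet on the record state).
    void onCompleteGuarded() {
        if (!closed.get()) {
            timeoutMillis.set(FOREVER);
        }
    }
}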


Best regards,
Marc

Re: IdleTimeoutFilter.queue leaking

oleksiys
Administrator
Hi Marc,

Makes sense. I committed the fix on the 2.2.x branch and released a new
2.2.22-SNAPSHOT; can you please give it a try?

Thanks.

WBR,
Alexey.

On 16.01.14 14:06, Marc Arens wrote:

> [...]


Re: IdleTimeoutFilter.queue leaking

Marc Arens
Hello Alexey,

> On 17 January 2014 at 08:17 Oleksiy Stashok <[hidden email]> wrote:
>
> Makes sense. I committed the fix on the 2.2.x branch and released a new
> 2.2.22-SNAPSHOT; can you please give it a try?

I picked the fix and ran a two-hour test on the affected system. The heap dump
after the test didn't show any DelayedExecutor leaks, whereas before your fix it
would have shown a huge leak within just 10 minutes of testing. Great work,
thanks for the very fast fix, and have a nice weekend!

> [...]
Best regards,
Marc