Fixing FireMonkey Heisenbugs
Every once in a while, every developer encounters random bugs that happen only in production
and cannot be reproduced at will. If you cannot reproduce it, you can hardly fix it. In such
situations, recording exceptions with various error loggers can help us find the culprit and
fix the error. However, sometimes the information collected simply does not contain enough data to do so.
This post is inspired by the Stack Overflow question "How to know the exact line number that produce an exception", where the logger recorded an exception and its call stack:
Argument out of range
At address: $002CDD4B (Generics.Collections.TListHelper.CheckItemRange(Integer) + 62)
Call stack:
MyApp $00BB153D Grijjy.Errorreporting.backtrace(Pointer*, Integer) + 8
MyApp $00BB1427 Grijjy.Errorreporting.TgoExceptionReporter.GlobalGetExceptionStackInfo(TExceptionRecord*) + 74
MyApp $001C4D83 Sysutils.Exception.RaisingException(TExceptionRecord*) + 38
MyApp $001E903D Sysutils.RaiseExceptObject(TExceptionRecord*) + 44
MyApp $001B0D9D _RaiseAtExcept(TObject*, Pointer) + 164
MyApp $001B1007 _RaiseExcept(TObject*) + 14
MyApp $002CDD4B Generics.Collections.TListHelper.CheckItemRange(Integer) + 62
MyApp $0059D4B3 Fmx.Controls.TControl.PaintChildren() + 222
MyApp $005BB987 Fmx.Controls.TControl.PaintInternal().DoPaintInternal(Pointer) + 1162
MyApp $005BC165 Fmx.Controls.TControl.PaintInternal().PaintAndClipChild(Pointer) + 500
MyApp $005B8F09 Fmx.Controls.TControl.PaintInternal() + 376
MyApp $007569D5 Fmx.Forms.TCustomForm.PaintRects(Types.TRectF const*, Integer) + 1008
MyApp $0074A001 __stub_in660v62__ZN3Fmx5Forms17TCommonCustomForm10PaintRectsEPKN6System5Types6TRectFEi + 24
MyApp $0068257D Fmx.Platform.Ios.TFMXView3D.drawRect(Iosapi.Foundation.NSRect) + 204
MyApp $00C2BA57 DispatchToDelphi + 82
MyApp $00C2B927 dispatch_first_stage_intercept + 18
QuartzCore $246A9F63 <redacted> + 106
QuartzCore $2468E551 <redacted> + 204
QuartzCore $2468E211 <redacted> + 24
QuartzCore $2468D6D1 <redacted> + 368
QuartzCore $2468D3A5 <redacted> + 520
QuartzCore $24686B2B <redacted> + 138
CoreFoundation $220456C9 <redacted> + 20
CoreFoundation $220439CD <redacted> + 280
CoreFoundation $22043DFF <redacted> + 958
CoreFoundation $21F93229 CFRunLoopRunSpecific + 520
CoreFoundation $21F93015 CFRunLoopRunInMode + 108
GraphicsServices $23583AC9 GSEventRunModal + 160
UIKit $26667189 UIApplicationMain + 144
MyApp $003CBF15 Iosapi.Uikit.UIApplicationMain(Integer, Byte**, Pointer, Pointer) + 8
MyApp $00676843 Fmx.Platform.Ios.TPlatformCocoaTouch.Run() + 70
MyApp $006767FB __stub_in92s__ZN3Fmx8Platform3Ios19TPlatformCocoaTouch3RunEv + 10
MyApp $0074628F Fmx.Forms.TApplication.Run() + 182
MyApp $00C2B893 main + 246
$1FE2EF0F
Asking the right question
So, the question asked is how to find the exact line of code where the exception happened. That is a valid question on its own. However, in this particular case, knowing the answer to that question will not provide a solution to the real problem - preventing the application crash.
The real question that should have been asked is "How to prevent or stop an application from crashing?"
Finding the answer to the wrong question
So, let's walk down the call stack and see what happened:
The actual place where the exception was raised is in Generics.Collections.TListHelper.CheckItemRange:
procedure TListHelper.CheckItemRange(AIndex: Integer);
begin
  if Cardinal(AIndex) >= Cardinal(FCount) then
    ErrorArgumentOutOfRange;
end;
Here it is fairly obvious where the exception happened and why: the list of items was accessed at an index outside its bounds (negative, or greater than or equal to the item count) - hence Argument out of range. But that method is called quite often, and it is not specific enough to locate the real source of trouble.
The next is Fmx.Controls.TControl.PaintChildren:
procedure TControl.PaintChildren;
var
  I, J: Integer;
  R: TRectF;
  AllowPaint: Boolean;
  Control: TControl;
begin
  if (FScene <> nil) and (ControlsCount > 0) then
    for I := GetFirstVisibleObjectIndex to GetLastVisibleObjectIndex - 1 do
      if FControls[I].Visible then
      begin
        Control := FControls[I];
        if Control.FScene = nil then
          Continue;
        if not Control.FInPaintTo and Control.UpdateRect.IsEmpty then
          Continue;
        if (ClipChildren or SmallSizeControl) and not IntersectRect(Self.UpdateRect, Control.UpdateRect) then
          Continue;
        AllowPaint := False;
        if Control.FInPaintTo then
          AllowPaint := True;
        if not AllowPaint then
        begin
          if Assigned(Control.CustomSceneAddRect) then
            AllowPaint := True
          else
          begin
            R := UnionRect(Control.GetChildrenRect, Control.UpdateRect);
            for J := 0 to FScene.GetUpdateRectsCount - 1 do
              if IntersectRect(FScene.GetUpdateRect(J), R) then
              begin
                AllowPaint := True;
                Break;
              end;
          end;
        end;
        if AllowPaint then
          Control.PaintInternal;
      end;
end;
A bit better, but still very vague. And this is the method that prompted the question - how to find the exact line where an exception was raised in the above code.
In this case, only one TList<T> access is likely to be the direct caller of the TListHelper.CheckItemRange method - the first one, on the third line of the method body:
if FControls[I].Visible then
So, the answer to the original question - which line triggered the exception - is right here. But are we any closer to solving the real problem?
No. Not even close.
Why?
Just like CheckItemRange, the PaintChildren method is also called often and is not specific enough.
No problem... there are still many lines in the call stack... but... if we take a look at the call's origin - it came from the message loop handler while processing a paint request - and we have no clue where that request originated.
Finding the answer to the right question
If we have additional logs, where we logged users' activity and from which we could tell what exactly was used just before the paint request was triggered, maybe we could locate the piece of code that brought up the issue. But even with that, it may be hard to reproduce and fix the issue.
Let's go back to the PaintChildren method and how iteration through the controls tried to access an out-of-range index. This is a UI operation, and as we all know those must run in the context of the main UI thread because they are not thread safe. (Well, there are some bits and pieces of UI code here and there that are thread safe, but this is not one of them.)
So we have several options that could mess up the indices:
- Touching the UI from a background thread - particularly removing some of the controls from the list
- Errors in GetFirstVisibleObjectIndex or GetLastVisibleObjectIndex, as they are virtual and their implementations can potentially return the wrong index
- Changing the list of controls within any code called during the iteration - for instance Control.PaintInternal
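The last option is easy to underestimate, because a Delphi for loop evaluates its bounds only once, before the first iteration. The following console sketch is not FMX code: it uses a plain TList<string> as a stand-in for the controls list, and the program name, the item values and the removal at index 1 are all made up for illustration. Under those assumptions, it shows how removing an item while iterating ends in the same out-of-range exception as the one in the log.

```pascal
program MutateDuringIteration;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

var
  Items: TList<string>; // hypothetical stand-in for the controls list
  I: Integer;
begin
  Items := TList<string>.Create;
  try
    Items.Add('A');
    Items.Add('B');
    Items.Add('C');
    Items.Add('D');
    try
      // The upper bound (3) is computed once, before the loop starts,
      // so the loop keeps going to index 3 even after the list shrinks.
      for I := 0 to Items.Count - 1 do
      begin
        Writeln('Visiting ', Items[I]);
        // Stand-in for code called during the iteration that removes an item.
        if I = 1 then
          Items.Delete(Items.Count - 1);
      end;
    except
      // Caught here only to report it; the list access at the stale index raised it.
      on E: EArgumentOutOfRangeException do
        Writeln('Caught ', E.ClassName, ': ', E.Message);
    end;
  finally
    Items.Free;
  end;
end.
```

The same stale-bound effect applies when the removal happens on another thread, only with far less predictable timing.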
Desperate times call for desperate measures and a bit of creative thinking
While finding the real issue and fixing it is always the preferable solution, when you run out of options there is always another thing you can do.
The ultimate goal of this bug chasing endeavour is preventing application crashes. If you cannot locate the piece of code where the issue originates, maybe you can change the piece of code where you know the exception occurs. Of course, in this case that means making changes in the FMX framework, but since it is not an interface-breaking change, we can just put a changed copy of the FMX.Controls unit in our project folder and it will be picked up and used instead of the original one. Of course, this will not work if your application uses the FMX framework through runtime packages.
The original code accesses the list twice. The first thing to do is to limit that to a single access point.
      if FControls[I].Visible then
      begin
        Control := FControls[I];
can be replaced with
      Control := FControls[I];
      if Control.Visible then
      begin
The above change does not solve the problem, but it is a step closer.
The original exception is caused by accessing an out-of-range index. What would happen if we added an index check before accessing the list and, in the event of an invalid index, did nothing?
Well, this is painting code. The worst thing that could happen is that some control wouldn't get painted. Since that control is also no longer visible - it is not in the controls list - nothing bad would happen. If, by any remote chance, there is a more serious painting problem behind this, we would get a visual cue of where the error lies - some part of the user interface would not be painted correctly - which is still better than crashing.
      if I < FControls.Count then
      begin
        Control := FControls[I];
        if Control.Visible then
        begin
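To see the effect of such a guard outside of FMX, here is a minimal console sketch. It assumes a plain TList<string> in place of FControls; the program name, the item values and the removal at index 1 are hypothetical. The loop bound goes stale once the list shrinks, but the per-iteration check simply skips the index that no longer exists.

```pascal
program GuardedIteration;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

var
  Items: TList<string>; // hypothetical stand-in for FControls
  Item: string;
  I: Integer;
begin
  Items := TList<string>.Create;
  try
    Items.Add('A');
    Items.Add('B');
    Items.Add('C');
    Items.Add('D');
    // The bounds are still computed only once, as in the original loop.
    for I := 0 to Items.Count - 1 do
      // Re-check the index on every pass, because the list may have shrunk.
      if I < Items.Count then
      begin
        Item := Items[I]; // single access point
        Writeln('Painting ', Item);
        // Stand-in for code that removes an item during the iteration.
        if I = 1 then
          Items.Delete(Items.Count - 1);
      end;
  finally
    Items.Free;
  end;
end.
```

The guard only helps when the list changes on the same thread between two iterations. It does nothing for a change that lands between the check and the access.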
Problem solved.
Well, not really.
Background thread touching UI
If the real culprit is the code executing in a background thread, then you are out of luck. Protecting the UI from background threads can only be solved in the code that executes in the context of the background thread, by synchronizing the parts that access the UI. Or by changing the logic completely to prevent UI interaction in the first place.
Even if a background thread is the cause, the thing with threading issues is that slight variations in code, like changing the original FMX code to prevent Argument out of range, can have an impact on how often threads collide. You can make things worse, but you can also make them better, reducing the number of crashes - even to the point that you don't experience them at all. That does not mean that the threading issue is fixed, but it is the next best thing you can get - it will be less prominent.
Really desperate measures?
If you are seriously out of options, you can always just wrap the entire PaintChildren method in a try..except block. But, seriously... don't do that. At some point, you just have to give up.
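For the background-thread case discussed earlier, the usual way to synchronize UI access is to hand the work over to the main thread with TThread.Queue or TThread.Synchronize. What follows is only a console sketch of that idea, not code from the application in question: the list stands in for main-thread-only state such as a controls list, and the names Items, Worker and Done are made up. A console program has no message loop, so the sketch pumps queued calls with CheckSynchronize; in an FMX application the framework does that part.

```pascal
program QueueToMainThread;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.Generics.Collections;

var
  Items: TList<string>; // hypothetical main-thread-only state
  Worker: TThread;
  Done: Boolean;
begin
  Items := TList<string>.Create;
  try
    Items.Add('A');
    Items.Add('B');
    Done := False;

    Worker := TThread.CreateAnonymousThread(
      procedure
      begin
        // Background work goes here; the list is never touched directly.
        Sleep(50);
        // Hand the mutation over to the main thread.
        TThread.Queue(nil,
          procedure
          begin
            Items.Delete(Items.Count - 1);
            Done := True;
          end);
      end);
    Worker.FreeOnTerminate := False; // we wait for and free the thread ourselves
    Worker.Start;

    // Pump queued calls on the main thread until the mutation has run.
    while not Done do
      CheckSynchronize(10);

    Worker.WaitFor;
    Worker.Free;
    Writeln('Items left: ', Items.Count);
  finally
    Items.Free;
  end;
end.
```

Queue returns immediately and the mutation runs later on the main thread. Synchronize would block the worker until the main thread has executed it, which is simpler to reason about but easier to deadlock.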
With a lot of FireMonkey experience under my belt there is one thing that comes to mind.
It is common in VCL programming to dynamically create controls, place them, then remove them when needed. In FireMonkey this practice can lead to problems.
Instead of using Free or FreeAndNil on the control, use Release. Over the iterations from 10.0 - 10.3 it has been a rocky ride. Sometimes the Free or FreeAndNil approach worked and sometimes it was best to use Release. With 10.3 Release seems to be pretty solid.
Seriously, you never surround an exception throwing code with try...except just to eat the exception.
A real programmer never gives up on a problem. Background thread touching GUI problem is usually fixed by moving GUI touching code into synchronized or queued functions (or methods).