Fixing FireMonkey Heisenbugs

Every once in a while, every developer encounters random bugs that happen only in production and cannot be reproduced at will. If you cannot reproduce a bug, you can hardly fix it. In such situations, recording exceptions with an error logger can help us find the culprit and fix the error. However, sometimes the information collected simply does not contain enough data to do so.

This post is inspired by the Stack Overflow question "How to know the exact line number that produce an exception", where the logger has recorded the following exception and its call stack:

Argument out of range
At address: $002CDD4B (Generics.Collections.TListHelper.CheckItemRange(Integer) + 62)

Call stack:
MyApp $00BB153D Grijjy.Errorreporting.backtrace(Pointer*, Integer) + 8
MyApp $00BB1427 Grijjy.Errorreporting.TgoExceptionReporter.GlobalGetExceptionStackInfo(TExceptionRecord*) + 74
MyApp $001C4D83 Sysutils.Exception.RaisingException(TExceptionRecord*) + 38
MyApp $001E903D Sysutils.RaiseExceptObject(TExceptionRecord*) + 44
MyApp $001B0D9D _RaiseAtExcept(TObject*, Pointer) + 164
MyApp $001B1007 _RaiseExcept(TObject*) + 14
MyApp $002CDD4B Generics.Collections.TListHelper.CheckItemRange(Integer) + 62
MyApp $0059D4B3 Fmx.Controls.TControl.PaintChildren() + 222
MyApp $005BB987 Fmx.Controls.TControl.PaintInternal().DoPaintInternal(Pointer) + 1162
MyApp $005BC165 Fmx.Controls.TControl.PaintInternal().PaintAndClipChild(Pointer) + 500
MyApp $005B8F09 Fmx.Controls.TControl.PaintInternal() + 376
MyApp $007569D5 Fmx.Forms.TCustomForm.PaintRects(Types.TRectF const*, Integer) + 1008
MyApp $0074A001 __stub_in660v62__ZN3Fmx5Forms17TCommonCustomForm10PaintRectsEPKN6System5Types6TRectFEi + 24
MyApp $0068257D Fmx.Platform.Ios.TFMXView3D.drawRect(Iosapi.Foundation.NSRect) + 204
MyApp $00C2BA57 DispatchToDelphi + 82
MyApp $00C2B927 dispatch_first_stage_intercept + 18
QuartzCore $246A9F63 <redacted> + 106
QuartzCore $2468E551 <redacted> + 204
QuartzCore $2468E211 <redacted> + 24
QuartzCore $2468D6D1 <redacted> + 368
QuartzCore $2468D3A5 <redacted> + 520
QuartzCore $24686B2B <redacted> + 138
CoreFoundation $220456C9 <redacted> + 20
CoreFoundation $220439CD <redacted> + 280
CoreFoundation $22043DFF <redacted> + 958
CoreFoundation $21F93229 CFRunLoopRunSpecific + 520
CoreFoundation $21F93015 CFRunLoopRunInMode + 108
GraphicsServices $23583AC9 GSEventRunModal + 160
UIKit $26667189 UIApplicationMain + 144
MyApp $003CBF15 Iosapi.Uikit.UIApplicationMain(Integer, Byte**, Pointer, Pointer) + 8
MyApp $00676843 Fmx.Platform.Ios.TPlatformCocoaTouch.Run() + 70
MyApp $006767FB __stub_in92s__ZN3Fmx8Platform3Ios19TPlatformCocoaTouch3RunEv + 10
MyApp $0074628F Fmx.Forms.TApplication.Run() + 182
MyApp $00C2B893 main + 246
$1FE2EF0F

Asking the right question

So, the question asked is how to find the exact line of code where the exception happened. That is a valid question on its own. However, in this particular case, knowing the answer to that question will not provide a solution to the real problem - preventing the application crash.

The real question that should have been asked is "How to prevent or stop an application from crashing?"

Finding the answer to the wrong question

So, let's walk down the call stack and see what happened:

The actual place where the exception was raised is in Generics.Collections.TListHelper.CheckItemRange:

    procedure TListHelper.CheckItemRange(AIndex: Integer);
    begin
      if Cardinal(AIndex) >= Cardinal(FCount) then
        ErrorArgumentOutOfRange;
    end;
Here it is fairly obvious where the exception happened and why: the list of items was accessed at an index that is negative or not smaller than the item count, hence Argument out of range. But that method is called quite often, and it is not specific enough to locate the real source of trouble.
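As a side note, the single comparison in CheckItemRange rejects both kinds of bad index at once, because a negative Integer reinterpreted as Cardinal becomes a very large number. The console sketch below only illustrates that comparison; IsValidIndex is a hypothetical helper written for this post, not part of the RTL, and range checking is switched off on the assumption that this matches how the RTL itself is compiled.

```pascal
program RangeCheckDemo;

{$APPTYPE CONSOLE}
{$R-,Q-} // assumption: no range/overflow checking, as in the shipped RTL

uses
  System.SysUtils;

// Hypothetical helper mirroring the comparison used by CheckItemRange.
// Returns True when AIndex can be used to access a list of ACount items.
function IsValidIndex(AIndex, ACount: Integer): Boolean;
begin
  // A negative AIndex turns into a huge Cardinal, so one unsigned
  // comparison covers both "below zero" and "at or past the end".
  Result := Cardinal(AIndex) < Cardinal(ACount);
end;

begin
  Writeln('Index  0 of 3: ', IsValidIndex(0, 3));  // first item, in range
  Writeln('Index  2 of 3: ', IsValidIndex(2, 3));  // last item, in range
  Writeln('Index  3 of 3: ', IsValidIndex(3, 3));  // one past the end
  Writeln('Index -1 of 3: ', IsValidIndex(-1, 3)); // negative index
end.
```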

The next is Fmx.Controls.TControl.PaintChildren:

    procedure TControl.PaintChildren;
    var
      I, J: Integer;
      R: TRectF;
      AllowPaint: Boolean;
      Control: TControl;
    begin
      if (FScene <> nil) and (ControlsCount > 0) then
        for I := GetFirstVisibleObjectIndex to GetLastVisibleObjectIndex - 1 do
          if FControls[I].Visible then
          begin
            Control := FControls[I];
            if Control.FScene = nil then
              Continue;
            if not Control.FInPaintTo and Control.UpdateRect.IsEmpty then
              Continue;
            if (ClipChildren or SmallSizeControl) and not IntersectRect(Self.UpdateRect, Control.UpdateRect) then
              Continue;
            AllowPaint := False;
            if Control.FInPaintTo then
              AllowPaint := True;
            if not AllowPaint then
            begin
              if Assigned(Control.CustomSceneAddRect) then
                AllowPaint := True
              else
              begin
                R := UnionRect(Control.GetChildrenRect, Control.UpdateRect);
                for J := 0 to FScene.GetUpdateRectsCount - 1 do
                  if IntersectRect(FScene.GetUpdateRect(J), R) then
                  begin
                    AllowPaint := True;
                    Break;
                  end;
              end;
            end;
            if AllowPaint then
              Control.PaintInternal;
          end;
    end;
A bit better, but still very vague. And this is the method that prompted the question - how to find the exact line where an exception was raised in the above code.

In this case, the first TList<T> access in each iteration, and so the first one that can call the TListHelper.CheckItemRange method, is on the third line of the method body:
if FControls[I].Visible then
So, the answer to the original question - which line triggered the exception - is right here. But are we any closer to solving the real problem?

No. Not even close.

Why?

Just like CheckItemRange, the PaintChildren method is also called often and is not specific enough.

No problem... there are still many lines in the call stack... but... if we take a look at the call's origin, it came from the message loop handler while processing a paint request, and we have no clue where that request originated.

Finding the answer to the right question

If we have additional logs that record user activity, from which we could tell exactly what the user did before the paint request was triggered, maybe we could locate the piece of code that caused the issue. But even with that, it may be hard to reproduce and fix the issue.

Let's go back to the PaintChildren method and how iteration through the controls tried to access an out-of-range index. This is a UI operation and, as we all know, UI operations must run in the context of the main UI thread because they are not thread-safe. (Well, there are some bits and pieces of UI code here and there that are thread-safe, but this is not one of them.)

So we have several options that could mess up the indices:
  1. Touching the UI from a background thread - particularly removing some of the controls from the list
  2. Errors in GetFirstVisibleObjectIndex or GetLastVisibleObjectIndex, as they are virtual and their implementations can potentially return the wrong index
  3. Changing the list of controls within any code called during the iteration - for instance Control.PaintInternal
Now, if you have the previously mentioned activity log then maybe, just maybe, you could inspect the code involved and spot any of the mentioned errors. If you find them, great - problem solved - but what if you cannot? You are still stuck with a crashing application and no solution in sight.
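The third option is easy to reproduce outside FMX, because a Delphi for loop evaluates its bounds only once, before the first iteration. The console sketch below is not FMX code; it uses a plain TList<string> as a stand-in for the controls list and a Delete call as a stand-in for whatever removes a control during painting. It assumes an RTL that range-checks list access, as the one in the logged call stack does.

```pascal
program StaleBoundsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

var
  Items: TList<string>;
  I: Integer;
begin
  Items := TList<string>.Create;
  try
    Items.AddRange(['A', 'B', 'C']);
    try
      // The upper bound (Count - 1) is computed once, while Count is still 3.
      for I := 0 to Items.Count - 1 do
      begin
        Writeln('Visiting ', Items[I]);
        // Stand-in for code that removes a control while the loop is running.
        if I = 0 then
          Items.Delete(Items.Count - 1);
      end;
    except
      // The loop still runs up to the stale bound and reaches a removed slot.
      on E: EArgumentOutOfRangeException do
        Writeln('Caught: ', E.Message);
    end;
  finally
    Items.Free;
  end;
end.
```

The same pattern applies when the removal happens on another thread; the only difference is that the timing then decides whether the loop reaches the missing slot.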

Desperate times call for desperate measures and a bit of creative thinking

While finding the real issue and fixing it is always the preferable solution, when you run out of options there is always another thing you can do.

The ultimate goal of this bug-chasing endeavour is preventing application crashes. If you cannot locate the piece of code where the issue originates, maybe you can change the piece of code where you know the exception occurs. In this case that means making changes in the FMX framework, but since it is not an interface-breaking change, we can put a modified copy of the FMX.Controls unit in our project folder and it will be picked up and used instead of the original one. This will not work if your application uses the FMX framework as a runtime package.

The original code accesses the list twice. The first thing to do is to limit that to a single access point.

    if FControls[I].Visible then
    begin
      Control := FControls[I];

can be replaced with

    Control := FControls[I];
    if Control.Visible then
    begin
The above change does not solve the problem, but it is a step closer.

The original exception is caused by accessing an out-of-range index. What would happen if we added an index check before accessing the list and, in the event of an invalid index, did nothing?

Well, this is painting code. The worst thing that could happen is that some control wouldn't get painted. Since that is the control that is also no longer visible - not in the controls list - nothing bad would happen. If, by any remote chance, there is a more serious painting problem behind this, we would get a visual cue of where the error lies - some part of the user interface would not be painted correctly - which is still better than crashing.
    if I < FControls.Count then
    begin
      Control := FControls[I];
      if Control.Visible then
      begin
Problem solved.

Well, not really.

Background thread touching UI

If the real culprit is code executing in a background thread, then you are out of luck. The UI can only be protected from background threads in the code that runs in those threads, by synchronizing the parts that access the UI with the main thread, or by changing the logic completely to prevent UI interaction in the first place.
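Synchronizing the UI-touching part usually means handing it over to the main thread with TThread.Synchronize or TThread.Queue. The console sketch below shows the shape of that hand-over under a few assumptions: the Writeln stands in for the real UI update, Sleep stands in for the background work, and the CheckSynchronize loop is only needed because a console program has no message loop (an FMX application processes queued calls on its own).

```pascal
program QueueToMainThreadDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes;

var
  Done: Boolean;
begin
  Done := False;

  // The anonymous thread frees itself when it finishes.
  TThread.CreateAnonymousThread(
    procedure
    begin
      Sleep(100); // stand-in for the actual background work

      // Do not touch the UI here. Queue the update to the main thread instead.
      TThread.Queue(nil,
        procedure
        begin
          // Stand-in for a UI update, e.g. adding or removing a control.
          Writeln('On main thread: ',
            TThread.CurrentThread.ThreadID = MainThreadID);
          Done := True;
        end);
    end).Start;

  // Console-only: pump queued calls until the update has run.
  while not Done do
    CheckSynchronize(10);
end.
```

TThread.Queue returns immediately and the worker carries on, while TThread.Synchronize would block the worker until the main thread has run the update. Which of the two fits depends on whether the background code needs the result of the UI call.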

Even if a background thread is the cause, the thing with threading issues is that slight variations in code, like changing the original FMX code to prevent Argument out of range, can have an impact on how often threads collide. You can make things worse, but you can also make them better, reducing the number of crashes - even to the point that you don't experience them at all. That does not mean that the threading issue is fixed, but it is the next best thing you can get - it will be less prominent.

Really desperate measures?

If you are seriously out of options, you can always just wrap the entire PaintChildren method in a try..except block. But, seriously... don't do that. At some point, you just have to give up.

Comments

  1. With a lot of FireMonkey experience under my belt there is one thing that comes to mind.

    It is common in VCL programming to dynamically create controls, place them, then remove them when needed. In FireMonkey this practice can lead to problems.

    Instead of using Free or FreeAndNil on the control, use Release. Over the iterations from 10.0 - 10.3 it has been a rocky ride. Sometimes the Free or FreeAndNil approach worked and sometimes it was best to use Release. With 10.3 Release seems to be pretty solid.

  2. Seriously, you never surround an exception throwing code with try...except just to eat the exception.
    A real programmer never gives up on a problem. Background thread touching GUI problem is usually fixed by moving GUI touching code into synchronized or queued functions (or methods).

