Solving ANRs 101:
Diving into the Android framework

embrace.io

Executive summary
We recently released an eBook about why the complex mobile ecosystem – and
the Android mobile ecosystem in particular – makes it so difficult to identify,
prioritize, and solve Application Not Responding (ANR) errors. We explored the
causes of ANRs and their impact on your app and business. In this eBook, we’ll dive
deeper into the technical causes of ANRs so your team can rapidly identify and solve
them.

While Android’s documentation might read as all-encompassing, our engineers
– driven by curiosity and innovation – have dug deep into the source code itself
to unlock important insights about these critical errors. In fact, what they’ve
found has thoroughly busted the most common myths about ANRs, and we’re
excited to share those insights with you here.

You likely know that an ANR is “officially” triggered when the main thread has
been blocked for at least 5 seconds. But, do you know what the source code
actually says about ANRs for specific Android components such as activities,
services, and broadcast receivers? (Hint: It’s not always 5 seconds!) Do you
know exactly how the Google Play Console collects and reports ANR data
compared to Firebase Crashlytics, and why it matters during troubleshooting?

In this eBook, we’ll help you better understand the relationship between ANRs and
the Android framework, including:

• How the Android framework truly measures and triggers ANRs for Android components,
including activities, services, and broadcast receivers.

• The limitations of Google Play Console and Crashlytics when monitoring, identifying,
and solving ANR errors.

• How Embrace can better help you identify and eliminate ANRs.

Let’s start with a quick recap.

How the Android framework measures ANRs
Quick recap: Why ANRs matter
The Google Play Store is extremely particular about user experience. It sets a strict threshold requiring that a
maximum of only 0.47% of daily active users can experience an ANR. (They’ve also recently added a second bad
behavior threshold, where only 8% of daily active users on a single device model can experience an ANR.) If you
exceed either of these thresholds, you can expect:

• A lower Play Store ranking, meaning less visibility in search results.

• Negative reviews, which impact your business by making user acquisition increasingly difficult.

• User churn, resulting from frustration associated with a frozen app experience.
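The two thresholds above reduce to simple percentages. The sketch below is only an illustration: the class and method names are hypothetical, the limits are the 0.47% and 8% figures quoted above, and treating "exceeds" as a strict greater-than comparison is our assumption rather than something Google documents here.

```java
// Hypothetical helper; the limits are the figures quoted in the text above.
final class AnrRateCheck {
    static final double OVERALL_LIMIT_PERCENT = 0.47;
    static final double PER_DEVICE_LIMIT_PERCENT = 8.0;

    /** Percentage of daily active users who experienced at least one ANR. */
    static double ratePercent(long usersWithAnr, long dailyActiveUsers) {
        if (dailyActiveUsers <= 0) {
            throw new IllegalArgumentException("dailyActiveUsers must be positive");
        }
        return 100.0 * usersWithAnr / dailyActiveUsers;
    }

    /** True if the app-wide rate is above the overall threshold. */
    static boolean exceedsOverall(long usersWithAnr, long dailyActiveUsers) {
        return ratePercent(usersWithAnr, dailyActiveUsers) > OVERALL_LIMIT_PERCENT;
    }

    /** True if the rate on a single device model is above the per-device threshold. */
    static boolean exceedsPerDevice(long usersWithAnrOnModel, long dailyActiveUsersOnModel) {
        return ratePercent(usersWithAnrOnModel, dailyActiveUsersOnModel) > PER_DEVICE_LIMIT_PERCENT;
    }

    public static void main(String[] args) {
        // 600 of 100,000 daily users saw an ANR: 0.6%, above the overall limit.
        System.out.println(exceedsOverall(600, 100_000));
        // 30 of 500 users on one device model: 6%, below the per-device limit.
        System.out.println(exceedsPerDevice(30, 500));
    }
}
```

Note that an app can sit comfortably under the overall limit and still trip the per-device limit on one popular model, which is why both checks are kept separate.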

The consequences of exceeding Google’s ANR threshold can cascade throughout your whole organization.
When you have negative reviews, don’t show up in search results, or provide a poor user experience, your user
engagement and revenue are impacted. You’ll have fewer purchases (both in-app and at checkout, in the case of
e-commerce apps). Your users will engage less as they experience frustration. And you’ll have fewer installs, which
can taint the rest of your brand: a compounding issue that can slow momentum for other apps in your company’s
portfolio.

Here’s a quick snapshot of how high ANR rates can impact your business:

IMPACT OF ANR | IMPACT OF ANR ON REVENUE
Lower Google Play Store ranking | Lower visibility and fewer organic installs
Poor user experience | Less engagement and decrease in user retention
User flow interruptions | Fewer in-app purchases
Negative app store reviews | Poor brand perception and impact on your other apps

Reminder: ANRs are not crashes


It’s easy to confuse crashes with ANRs. This is especially true because Google provides data as if ANRs were
crashes. While we’ll get into that detail later in this eBook, for now, it’s critical to remember that ANRs and crashes
are not the same. Treating them as the same thing can waste a great deal of time chasing down root
causes that are totally unrelated. As a quick refresher:

• A crash is a coding error that kills your app immediately.

• An ANR is a frozen experience for your user.

According to the Android documentation, Google will report an ANR with an associated stack trace if the main
thread is blocked for 5 seconds*. We know it can be helpful to see examples from the real world, so here’s an
example of what an ANR and crash look like in an e-commerce app.

E-commerce app crash flow
A user goes to the checkout screen and the app crashes.

[Figure: Product Page → Checkout Screen → Crash]

THE CAUSE: A bug in the checkout code causes the crash due to an uncaught exception or a C signal.

THE RESULT: Google provides a stack trace, which makes it easy to identify the cause of the crash and
address the issue.

E-commerce app ANR flow


A user goes to the checkout screen and the screen is unresponsive.

[Figure: Product Page → Checkout Screen → ANR]

THE CAUSE: The ANR could result from many possible root causes, including:

• A heavy UI layout blocking the main thread for too long.

• A misbehaving third-party SDK.

• Background services blocking the main thread.

THE RESULT: At the end of 5 seconds of non-responsiveness, Google will provide you with a stack trace
to indicate an ANR. Unfortunately, that stack trace alone is rarely sufficient to get at the root cause of the
ANR.

* While Android documentation states an ANR is triggered after 5 seconds, Embrace engineers have found exceptions to this “rule” within Google’s own
source code. We’ll explain those exceptions and bust the myth around the 5-second ANR trigger in the following section.

How Google reports ANRs
The Android operating system monitors your main thread from a background thread. Once it detects that the main
thread is blocked, it starts an ANR timer on the background thread. According to the Android documentation, if that
timer exceeds a particular time threshold – 5 seconds – the Android OS triggers an ANR.
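To make the timer idea concrete, here is a minimal model of it in plain Java. It is a sketch under stated assumptions, not Android framework code: the `AnrTimer` class and its methods are hypothetical, time is passed in explicitly so the behavior is deterministic, and firing when the elapsed time reaches the threshold (rather than strictly exceeds it) is our simplification.

```java
// Hypothetical model of the "ANR timer" described above; not Android framework code.
final class AnrTimer {
    private final long timeoutMs;
    private long startedAtMs = -1;   // -1 means nothing is being timed

    AnrTimer(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called by the monitor when it starts timing work on the main thread. */
    void start(long nowMs) {
        startedAtMs = nowMs;
    }

    /** Called when the main thread finishes the work in time. */
    void cancel() {
        startedAtMs = -1;
    }

    /** Polled from the monitor thread; true once the threshold has been reached. */
    boolean expired(long nowMs) {
        return startedAtMs >= 0 && nowMs - startedAtMs >= timeoutMs;
    }

    public static void main(String[] args) {
        AnrTimer timer = new AnrTimer(5_000);
        timer.start(0);
        System.out.println(timer.expired(4_999));  // still within the threshold
        System.out.println(timer.expired(5_000));  // threshold reached
    }
}
```

The important property is that the timer belongs to the monitor, not to the main thread: a blocked main thread cannot cancel it, so the deadline still fires.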

Because the Android OS treats ANRs the same way it treats crashes, it generates a single stack trace at the time
the ANR is triggered. Unfortunately, this means you don’t have insight into what has been happening from the
moment the ANR begins. We’ll dive further into why this is an issue later in this eBook.

Here are some more insights into the differences between the two main tools Google uses to provide ANR data:
Firebase Crashlytics and the Google Play Console.

Type of data collected
• Firebase Crashlytics: Records the “exit reason” for a process on the device each time an ANR occurs
• Google Play Console: Utilizes Android OS’s built-in trace file mechanism and records on the device’s file system

When the data is collected
• Firebase Crashlytics: Collected from Firebase for delivery when the user relaunches the app after an ANR
• Google Play Console: Up to 24-hour delay between ANR occurring and diagnostic data showing in Play Console

Android versions supported
• Firebase Crashlytics: Android 11+
• Google Play Console: All Android versions

The truth about ANRs and the limitations of Google Play Console and Crashlytics
What really triggers ANRs in Android and why it matters for you
If we gave you a pop quiz and asked, “what triggers ANRs?”, you’d technically be correct if you answered, “when the
main thread has been blocked for 5 seconds.” After all, it’s what the Android documentation says.

But at Embrace, we believe in a healthy skepticism. We are driven by a deep curiosity and a desire to help our
customers optimize their user experience. So, our engineers went beyond the documentation and closely studied the
Android source code. They found that what triggers ANRs in the Android environment is far more nuanced than what
the documentation says.

In fact, there are specific components and situations that are quite different from the “universal” 5-second rule.
That’s why we created this eBook – so you’ll have an easy reference manual and can understand the important
nuances.

As a quick refresher, you know there are four main components in Android: activities, services, broadcast receivers,
and content providers. Every time one of these components performs work on the main thread, the Android
framework creates a timer on a monitor thread. Instead of a universal 5-second ANR trigger, there are separate
checks and separate timing thresholds for each component. Knowing the real timing will help you optimize how you
code your apps, and how best to understand what might really be causing an ANR.
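As a quick reference for the component-by-component breakdown that follows, the thresholds this eBook reports can be kept in one lookup table. The enum below is a hypothetical summary, not a framework type: its names are ours, its values are the ones stated in the sections below, and the actual constants in the Android source may differ between Android versions. Content providers are left out because this eBook does not give a figure for them.

```java
// Values are the ones reported in this eBook; real constants can differ by Android version.
enum ComponentTimeout {
    ACTIVITY_INPUT_DISPATCH(5),
    FOREGROUND_SERVICE_START(10),
    BACKGROUND_SERVICE_START_OR_BIND(20),
    BROADCAST_RECEIVER(10);

    final int seconds;

    ComponentTimeout(int seconds) {
        this.seconds = seconds;
    }

    /** True if work that has taken elapsedSeconds is past this component's threshold. */
    boolean isExceededBy(double elapsedSeconds) {
        return elapsedSeconds > seconds;
    }

    public static void main(String[] args) {
        for (ComponentTimeout timeout : values()) {
            System.out.println(timeout + " = " + timeout.seconds + "s");
        }
    }
}
```

Only one of the four entries uses the familiar 5-second figure, which is the point of the myth-busting sections below.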

To help, let’s dive into a few of these components to see how the Android source code helped us bust the myth of
the 5-second ANR trigger.

Busting the myths about activities & ANRs


Recap: An activity is a component that hosts some sort of user interface for user interaction. For example, an email
app might have one activity that shows a list of new emails, another activity to compose an email, and yet another
activity for reading emails.

When do activities trigger an ANR? Activities trigger an ANR if input dispatching takes more than 5 seconds. The
Android OS interprets actions like touches, entries, or taps on the screen or keyboard as an “input event.” The OS
places these input events onto an input dispatch queue and processes them on the main thread, where a watchdog
thread checks whether the processing takes more than 5 seconds. If the main thread is blocked or busy, the input
dispatch queue will not empty within 5 seconds. At this point, the watchdog thread will trigger an ANR.

MYTH
There is a 5-second threshold for an ANR to be
triggered for activities.

REALITY
There is actually a 5-second threshold for ANRs to be
triggered, but the timer only starts after an input event
has been dispatched!
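The difference between "blocked for 5 seconds" and "5 seconds after an input event was dispatched" is easy to miss, so here is a small model of it. This is a sketch, not the framework's input dispatcher: `InputDispatchModel` and its parameters are hypothetical, it considers a single blocking period and a single input event, and it assumes an event that arrives while the main thread is free is handled immediately.

```java
// Hypothetical model: the 5-second clock starts at dispatch, not when the main thread blocks.
final class InputDispatchModel {
    static final long TIMEOUT_MS = 5_000;

    /**
     * @param blockedFromMs  when the main thread became busy
     * @param blockedUntilMs when the main thread became free again
     * @param inputAtMs      when an input event was dispatched, or -1 for no input
     * @return the time at which an ANR would be raised, or -1 if none
     */
    static long anrTimeMs(long blockedFromMs, long blockedUntilMs, long inputAtMs) {
        if (inputAtMs < 0) {
            return -1;   // no input event, so no timer was ever started
        }
        if (inputAtMs < blockedFromMs || inputAtMs >= blockedUntilMs) {
            return -1;   // main thread was free, the event is handled right away
        }
        long deadline = inputAtMs + TIMEOUT_MS;
        // The event waits until blockedUntilMs; an ANR needs the wait to pass the deadline.
        return deadline < blockedUntilMs ? deadline : -1;
    }

    public static void main(String[] args) {
        System.out.println(anrTimeMs(0, 8_000, -1));      // blocked 8s, no input
        System.out.println(anrTimeMs(0, 8_000, 6_000));   // input arrives late in the block
        System.out.println(anrTimeMs(0, 12_000, 6_000));  // input waits more than 5s
    }
}
```

In this model a main thread that is blocked for 8 seconds produces no ANR if the user never touches the screen, and none if the touch arrives only 2 seconds before the thread frees up. The ANR appears only when a dispatched event waits longer than 5 seconds.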

Busting the myths about services & ANRs

Recap: A service is a component on Android which does not provide a user interface, and typically performs
long-running operations in the background. This can include a service that plays music in the background while the
user is in a different app, tracks location or performance via a consistency check on the database at regular
intervals, or fetches data over the network without blocking user interaction with an activity.

When do services trigger an ANR? There are two conditions when an ANR can be triggered in a service. If a
foreground service does not call startForeground in under 10 seconds, the Android OS will trigger an ANR. As a
reminder, a foreground service is something which is typically performing work in the foreground, like music
playback. If the service is in the background and doesn’t start or bind in under 20 seconds, it will trigger an ANR.
This longer threshold is because it doesn’t have as high a priority on the CPU. We believe the thresholds are higher
than 5 seconds for services because users aren’t interacting with services the same way they are with activities.
Unlike the activities component, there’s no user input event required for the services component.

MYTH
There is a 5-second threshold for an ANR to be triggered for services.

REALITY
There is actually a 10-second threshold for foreground services and a 20-second threshold for background
services!

What happens when the Android framework detects an ANR in any component?

• The Android framework schedules work in the component, and then starts a timer to continually check whether
the work is completed within the allotted time period.

• If the ANR threshold is exceeded, the timer finishes its countdown, and the Android framework calls into the ANR
helper from the monitor thread.

• The Android OS records a trace file, including a stack trace, and shows an ANR dialogue to your user.

• Your user can decide to kill your app (in which case your process is terminated) or they can allow your app to
continue waiting and see if it catches up.

Busting the myths about broadcast receivers & ANRs

Recap: A broadcast receiver is a component that enables the system to deliver events to the app outside of a
regular user flow, allowing the app to respond to system-wide broadcast announcements. This can encompass a
broad array of functionality including state changes or events on the device. Examples include messaging apps that
hook into SMS events and perform functionality in the application, changes in network connection, or locale
changes that prompt language changes.

When do broadcast receivers trigger an ANR? Broadcast receivers trigger an ANR if they take longer than 10 seconds
to process a message. However, there is an interesting exception to this: when the Android OS is booting, a lot of
CPU work is underway so early broadcasts could be false positives.

MYTH
There is a 5-second threshold for an ANR to be triggered by a broadcast receiver.

REALITY
There is actually a 10-second threshold for a broadcast receiver!

Why Google ANR reporting is insufficient for each Android component

We’ve busted the myths that ANRs are universally triggered for each component when the main thread is blocked
for 5 seconds. As you know, all Android components typically run on the main thread – and often simultaneously.
This can make it difficult to unmask the culprit of which component is specifically responsible for an ANR.

For example, any Android component can starve the main thread so that an activity has no time to respond to input
events. This can trigger ANRs within an activity which indicate a problem in a different component.

When you receive a stack trace from Google, it can be impossible to determine which component is responsible – it
depends which component is running at the time the stack trace is generated.

For example, if a broadcast receiver blocks for 4.8 seconds and an activity then blocks for 0.3 seconds, the activity
will be the culprit in the ANR stack trace.

This can make ANR error reports highly misleading when viewed in isolation. Typical Android apps can have dozens
of components running at the same time. This means debugging the true cause of ANRs in the real world is often
even more difficult than in our relatively simple example above. Stack traces provided by Google aren’t helpful
because they don’t let you see what is happening from the moment the ANR begins and throughout the duration of
the ANR.

Easy tips for minimizing ANRs

• Minimize network operations on the main thread – they can take a long time to complete!

• Minimize file operations on the main thread. Offload them wherever possible to the background thread, and
asynchronously wait on the results.

• Minimize synchronization on the main thread. The goal is to minimize the amount of waiting on locks and mutexes
by confining it to the smallest space possible. This can be challenging depending on how you use concurrency and
synchronization.

• Don’t forget about your Native Development Kit (NDK) code! If the NDK layer blocks the main thread, you’ll
trigger an ANR if input dispatching times out.
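The tip about offloading file operations to a background thread and waiting asynchronously on the result could look something like the following. This is a plain-Java sketch using `CompletableFuture` from the JDK, with a hypothetical `OffloadSketch` class and a temporary file standing in for real app data. An Android app would typically use its own threading tools and deliver the result back to the main thread, which this sketch does not attempt to show.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the read happens on a background executor, not on the caller's thread.
final class OffloadSketch {
    /** Starts reading the file on the given executor and returns immediately. */
    static CompletableFuture<String> readAsync(Path file, ExecutorService io) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, io);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        Path file = Files.createTempFile("anr-sketch", ".txt");
        Files.write(file, "cart contents".getBytes(StandardCharsets.UTF_8));

        // The caller registers a callback instead of waiting for the read to finish.
        readAsync(file, io)
            .thenAccept(text -> System.out.println("loaded: " + text))
            .join();   // join() only keeps this demo alive; a UI thread would not block here

        io.shutdown();
        Files.delete(file);
    }
}
```

The design choice that matters is the return type: because `readAsync` hands back a future, the calling thread is free to keep processing input events while the disk read is in flight.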

[Figure: timeline from 00:00 to 00:06 marking the user input event and the point at which the ANR trace file is generated]

As a visual example, picture the components running on the main thread as a series of cars driving on a road.
Google’s ANR stack trace would highlight the last blue car as the cause of the ANR. In reality, the freeze started
during the first red car. Collecting stack traces as soon as the app freezes is crucial for getting to the actual root
cause.
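The earlier example of a broadcast receiver blocking for 4.8 seconds followed by an activity blocking for 0.3 seconds can be replayed in a few lines. The sketch below is a hypothetical model, not how the Android OS attributes ANRs: it assumes the tasks run back to back on the main thread, that time zero is the moment the input event was dispatched, and that the single stack trace is captured at the 5-second mark.

```java
// Hypothetical replay: whoever holds the main thread when the timer fires gets the blame.
final class BlameReplay {
    /** Name of the task occupying the main thread at elapsed time atMs, or "idle". */
    static String runningAt(String[] names, long[] durationsMs, long atMs) {
        long start = 0;
        for (int i = 0; i < names.length; i++) {
            long end = start + durationsMs[i];
            if (atMs >= start && atMs < end) {
                return names[i];
            }
            start = end;
        }
        return "idle";
    }

    /** Name of the task that held the main thread the longest (expects at least one task). */
    static String longestRunning(String[] names, long[] durationsMs) {
        int best = 0;
        for (int i = 1; i < names.length; i++) {
            if (durationsMs[i] > durationsMs[best]) {
                best = i;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        String[] names = {"BroadcastReceiver", "Activity"};
        long[] durationsMs = {4_800, 300};
        System.out.println("blamed at 5s: " + runningAt(names, durationsMs, 5_000));
        System.out.println("longest:      " + longestRunning(names, durationsMs));
    }
}
```

With these inputs the trace taken at 5 seconds lands in the activity, even though the broadcast receiver consumed 4.8 of those 5 seconds. A trace sampled at any earlier moment would have pointed at the receiver instead.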

How Embrace makes solving ANRs faster and easier
An alternative solution for ANRs
Embrace is a data-driven toolset to help mobile engineers build better experiences. Because of this mobile-centric
approach, we have a very different method of data collection.

Unlike event-based monitoring solutions which limit data collection and can only help you solve known issues,
Embrace collects 100% of the data from every user session to provide capabilities that were previously impossible.

With full visibility across every user experience (including both foreground and background sessions), Embrace
enables you to see complete technical and behavioral data, giving you the context to solve both known and
unknown issues.

5 ways Embrace enables rapid prioritization and resolution of ANRs

01 Embrace accurately detects ANR stack traces with superior data capture
When it comes to capturing ANR sample stack traces, the Google Play Console and other point solutions only show
a portion of the picture by capturing a stack trace 5 seconds after the ANR has occurred. This means you’re seeing
sample stack traces long after the ANR was triggered, which can lead to incorrect diagnosis. The sampling for the
Google Play Console may even be delayed beyond 5 seconds due to the load on the device during the ANR, which
means you may be missing valuable ANR stack trace data. Lastly, this method only captures data for fatal ANRs and
neglects non-fatal ANRs, which are equally vital for enhancing your user’s experience.

To accurately triage and resolve ANRs for good, it’s important to understand what code was running from the
moment the ANR is triggered to the end of the ANR interval. With Embrace intelligent ANR Reporting, teams can
auto-capture and surface a stack trace as soon as the main thread is blocked for 1 second, followed by auto-
collecting main thread stack traces every 100ms until the app recovers, force quits, or the ANR dialog appears. By
capturing these additional stack traces engineers can gain deep insights into the code execution and its evolution
throughout the ANR interval to get to the true root cause, all without introducing any unnecessary overhead. This
level of detail empowers engineers to quickly spot and resolve both fatal and non-fatal ANRs and ultimately drive
better user experiences.
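To get a feel for how much more data this sampling cadence yields than a single trace, the arithmetic can be written out. The sketch below is not Embrace's implementation: `SampleSchedule` is a hypothetical helper that simply assumes a first sample at the 1-second mark, further samples at exact 100 ms intervals, and no cap on the number of samples.

```java
// Hypothetical arithmetic only; assumes samples at exactly 1000 ms, 1100 ms, 1200 ms, ...
final class SampleSchedule {
    /** Number of stack samples taken while the main thread stays blocked for blockedMs. */
    static int sampleCount(long blockedMs, long firstSampleMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        if (blockedMs < firstSampleMs) {
            return 0;   // the app recovered before the first sample was due
        }
        return (int) ((blockedMs - firstSampleMs) / intervalMs) + 1;
    }

    public static void main(String[] args) {
        // A 5-second freeze: one sample at 1s, then one every 100 ms up to 5s.
        System.out.println(sampleCount(5_000, 1_000, 100));
        // An 800 ms hiccup never reaches the first sample.
        System.out.println(sampleCount(800, 1_000, 100));
    }
}
```

Under these assumptions a 5-second freeze produces 41 samples spread across the whole interval, compared with the single trace captured at the end.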

02 Embrace filters out noisy ANRs with intelligent grouping
Finding which stack traces to focus on can also be challenging when you’re basing your decision solely on the
final stack trace. Oftentimes, the stack traces exhibit enough differences that identifying the root cause becomes
challenging. Because Embrace auto-collects stack traces throughout the entire ANR interval, teams gain access
to a broader range of stack trace samples, providing deeper insights into the underlying cause. Teams can filter
sample stack traces by “Most Representative”, “First Sample” or “Ad SDK” to help visualize the data in different
ways and quickly surface the right ANRs. When selecting “Most Representative” Embrace will scan and analyze
sample stack traces for you and group them by the most relevant method to identify the code sections likely
contributing to the ANR. Embrace then ranks them by volume, sessions impacted, or users impacted and maps
issues to a category (Ads or Concurrency) to help you cut out the noise and focus on what matters most for you and
your team.

03 Embrace’s powerful flame graphs let you drill down to the line of code
Even with advanced grouping it may still be overwhelming to sift through stack traces. Using flame graph
visualizations, we make it easy to surface critical stack traces that contain known problematic methods. Using
the flame graph, each span represents a method. The length of the span represents how
many sample stack traces the method appears in. The wider and deeper the span, the more likely the code path
contributed to an ANR. Simply select a method (shown in red) to pinpoint which first- or third-party code may have
contributed to the ANR.
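The counting behind a span's length can be sketched in a few lines. This is an illustration of the general idea, not Embrace's flame graph implementation: `FlameCounts` and the method names in the sample data are hypothetical, and each stack is reduced to a flat list of method names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: span length = number of sampled stacks a method appears in.
final class FlameCounts {
    /** Counts, for each method, how many stacks contain it (at most once per stack). */
    static Map<String, Integer> samplesPerMethod(List<List<String>> stacks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> stack : stacks) {
            // De-duplicate so a recursive method is counted once per sample.
            for (String method : new LinkedHashSet<>(stack)) {
                counts.merge(method, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> stacks = Arrays.asList(
            Arrays.asList("Checkout.onCreate", "AdSdk.init", "Lock.await"),
            Arrays.asList("Checkout.onCreate", "AdSdk.init", "Lock.await"),
            Arrays.asList("Checkout.onCreate", "Layout.inflate"));
        samplesPerMethod(stacks).forEach(
            (method, count) -> System.out.println(method + ": " + count));
    }
}
```

A method that shows up in most samples, such as the hypothetical `AdSdk.init` here, is a stronger candidate for the root cause than one that appears only in the final trace.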

Selecting a method will take you to the ANR Method Troubleshooting graph. This allows you to see all the code
paths that lead to the selected problematic method (“callers”) and the code paths following the method (“callees”)
to help you nail down the line of code that may be contributing to the ANR.

04 Embrace highlights what contributed to the ANR with complete out-of-the-box user session context
Capturing multiple stack traces may not be enough to pinpoint every ANR root cause. ANRs can also stem
from unpredictable factors in the device and from the user that you may need to take into account when
troubleshooting. Factors like failed network calls, low connectivity, heavy view rendering and more can lead to
an ANR. Embrace gives you the ability to easily pivot from a high priority sample stack trace in the flame graph or
summary view directly into sample affected user sessions (‘Sample Sessions’) to help you understand the ANR in
more depth.

With User Session Insights, Embrace collects all behavioral and technical user activity leading up to the ANR,
completely out-of-the-box, so engineers can quickly and accurately reproduce the ANR and understand how other
factors may be contributing to it. All this without wasting time cobbling together log, ANR, and product analytics
data. If you find that low connectivity, heavy view rendering, and bad code all contributed to the ANR, you can
quickly share session details with the right teams using out-of-the-box Jira and Slack integrations.

05 Embrace identifies patterns and trends in real-time with high-level ANR overview and proactive alerting
Embrace continuously analyzes millions of data points across your applications to help you proactively spot critical
ANRs before your users do. Our real-time alerting can help you separate important ANRs from the noise with
context-rich alerting that identifies spikes and drops for critical ANR indicators. You can set up alerts for important
user flows, like payment flows, add to cart, or during paid advertisements to optimize user experiences for
revenue-generating moments.

With ANR Summary, anyone can get an out-of-the-box overview of critical ANR metrics like ANR-Free Sessions and
ANR-Free Users, Total ANRs, Affected Users and more in a single view. Teams can surface patterns and anomalies
by version and deployment across your user base.

Closing thoughts
ANRs aren’t just annoying for users to encounter and for developers to debug.
They’re critical errors that have an outsized effect on user engagement,
acquisition, and retention, and can ultimately drag down your bottom line.

In this eBook, we’ve taken you beyond the Android documentation, busted the
myth of the universal 5-second ANR trigger, and provided key insights to help
you stay above Google’s stringent bad behavior thresholds.

By underlining the relationship between ANRs and the Android framework, we
further explained issues with event-based monitoring tools and the need for a
complete data-driven approach to building the best mobile experiences.

While we hope this eBook serves as a valuable reference guide for future ANR
debugging, Embrace can provide even more support through the use of our
platform.

From providing flame graphs that help you quickly get at the root cause of an
ANR, to the ability to stitch multiple sessions together for deeper insight and
analysis, Embrace is the best option for Android developers who care about
providing superior mobile experiences.

Learn how Embrace can help you get a handle on ANRs and optimize your app
for greater visibility today.

Get started today with 1 million free user sessions.

Try Embrace free


Embrace is a data-driven toolset to help engineers manage the complexity of mobile.
Using automated data collection and a unified digital platform, Embrace reduces the toil
of mining for insight across disparate tools. Engineers can identify, prioritize, and resolve
problems in their apps, while also surfacing opportunities to perfect app performance
and delight their end users. Learn more at embrace.io or follow Embrace on LinkedIn,
Facebook, or Twitter.

Contact
8569 Higuera St, Culver City, CA 90232
(424)-326-9004
contact@embrace.io
embrace.io
