Closed Bug 519423 Opened 15 years ago Closed 14 years ago

add tracking and alerts for "explosive" crash signatures.

Categories

(Socorro :: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chofmann, Assigned: ryansnyder)

Attachments

(2 files, 1 obsolete file)

In the past month we have had 2 incidents where non-existent or low volume crashes have exploded and zoomed to the top of the crash list within hours or days.  

For details see 

[Bug 519039] CoolIris Top Crasher [@ cooliris19.dll@0x351f2 ] and [@ cooliris19.dll@0x351a2 ] and [@ libcooliris19.dylib@0x31ea2 ]

and

Bug 512122  KB article: Possible Adware.DoubleD related Crash [@ NPFFAddOn.dll@0x11867][@ NPFFAddOn.dll@0xceb8][@ NPFFAddOn.dll@0x11657][@ NPFFAddOn.dll@0xe707][@ NPFFAddOn.dll@0xe590]

These were caught only because people with eyes on the top crash report just happened to be around to see events starting to unfold.

We need to find ways to detect these events earlier and notify a few investigators to start looking at the problems sooner.

https://bug519039.bugzilla.mozilla.org/attachment.cgi?id=403459 shows that we might have caught that bug a day earlier, when it went from 2 or 3 crashes per minute to over 10 or 15 crashes per minute.

We will need to figure out at what rate we can monitor all crashes and what thresholds make the most sense for existing and new crashes. Hourly counts and some percentage increase over recent similar daytime time slices might work. We will need to examine more closely how much crash volume stretches with the daily ebb and flow of general browser use so we don't kick off a steady stream of alerts.

This might be a case where we want to start real simple and then expand.
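
A minimal sketch of the "start simple" idea, assuming we already have hourly counts per signature. The function name, the 100% increase threshold, and the minimum count are illustrative assumptions, not anything Socorro provides today.

```python
from statistics import mean


def is_spike(current_count, same_slot_history, pct_increase=100.0, min_count=20):
    """Return True if current_count is pct_increase percent above its baseline.

    same_slot_history holds counts for the same signature in the same
    hour-of-day slot on recent days (a hypothetical input). Comparing like
    with like keeps the normal daily ebb and flow from triggering alerts.
    """
    if current_count < min_count:
        return False  # ignore very low volumes to limit noise
    if not same_slot_history:
        return True  # no history at all: treat as a noteworthy new signature
    baseline = mean(same_slot_history)
    if baseline == 0:
        return True
    increase = (current_count - baseline) / baseline * 100.0
    return increase >= pct_increase
```

The thresholds would need tuning against real data before any alert is wired up.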

I suspect we will see more, not fewer of these kinds of events as the user base continues to grow.
Related is bug 411397, "Need to add changes in rank to top crash reports".
I added a comment over there. That bug seems to be about trying to do a better job of measuring the relative behavior and ranking of crashes against other crashes.

I think this is more about trying to measure the current behavior of a single signature against its own past performance or behavior.

We could also factor in overall past performance of all crashes entering the system to help set the thresholds.

The attached chart shows we are running at an overall mean rate of 125 crashes per minute entering the system. If we get about 65% above that mark, or drop below 75% of that figure, there might be an interesting event going on that needs some attention. In this case, going below 65 crashes per minute might mean network or system problems that inhibit the flow of incoming crash reports.

If we get above that, some crashing plugin, website, or new release of software might be driving the numbers higher. Producing and watching this kind of overall report might give us a better idea of the dynamics of incoming crashes.
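
As a rough illustration of that band check, here is a sketch that uses the percentages from the comment above (65% over the mean, 75% of the mean). The comment also mentions 65 crashes per minute as a floor, and it is not clear which figure is intended, so both bounds are left as parameters. All names are hypothetical.

```python
def classify_rate(crashes_per_minute, mean_rate=125.0, high_pct=65.0, low_pct=75.0):
    """Classify the overall incoming crash rate against a band around the mean.

    high_pct: percent above the mean that counts as a surge.
    low_pct: percent of the mean below which intake looks suspiciously low.
    """
    upper = mean_rate * (1 + high_pct / 100.0)  # 206.25 with the defaults
    lower = mean_rate * (low_pct / 100.0)       # 93.75 with the defaults
    if crashes_per_minute > upper:
        return "high"    # crashing plugin, website, or new release?
    if crashes_per_minute < lower:
        return "low"     # network or system problems blocking submissions?
    return "normal"
```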
This is the type of thing that predictive monitoring software does very well. 

If we instrument the number of crashes per minute (by product and version) and then pick this up in our standard IT monitoring (cactus or whatever) then we could page or email when the number is above a threshold.

I don't know if we can do predictive monitoring with our current monitoring software, but this would analyze historical patterns and predict a high and low band. If the crash rate per minute goes out of band then an alert would fire.
We use nagios for monitoring and cacti for trending.  Neither of those two do any kind of predictive alerts/checks.  If you want this as a part of the monitoring system, you'd have to write the corresponding plugins, figure out a way to store and analyze historic data and alert on it.  That kind of capability isn't present in the systems we use now.  Seems like this would be an excellent candidate for the metrics super-cluster.
So.. I was wrong.. there is a way to do this in nagios, however it will involve a lot of setup etc, to the point that this should be a quarterly goal sort of thing.

http://cricket.sourceforge.net/aberrant/rrd_hw.htm

That doc talks about the stuff involved.  First, we'd need a way to create rrd databases of any interesting crash signatures, and automate that system to pick up new signatures over time.  Then we need a nagios check to examine these rrd databases and use the method described in that link to alert us.

This would be something worth pursuing, but is a huge time sink for whoever is doing the work.
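
To give a feel for what such a check computes, here is a deliberately simplified sketch: plain exponential smoothing of the level and of the absolute deviation, with no seasonal term. It is a stand-in for illustration, not the Holt-Winters method described in the linked doc and not an RRDtool or nagios API. The parameter values are assumptions.

```python
def out_of_band(series, alpha=0.3, width=3.0, warmup=5):
    """Return (index, value) pairs that fall outside a smoothed band.

    The band is the smoothed level +/- width * smoothed absolute deviation.
    A floor of 1.0 on the deviation avoids alerting on a perfectly flat series.
    """
    if not series:
        return []
    level = float(series[0])
    deviation = 0.0
    alerts = []
    for i, value in enumerate(series[1:], start=1):
        error = abs(value - level)
        # Check before updating, so a spike cannot widen its own band.
        if i >= warmup and error > width * max(deviation, 1.0):
            alerts.append((i, value))
        deviation = alpha * error + (1 - alpha) * deviation
        level = alpha * value + (1 - alpha) * level
    return alerts
```

Note that after a spike the band stays wide for a while, so a second spike right behind the first may be missed. A real deployment would also need the seasonal component to follow daily and weekly cycles.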
We can do this with max thresholds. The predictive thing is just much nicer
from a maintenance perspective.

Okay I can provide a cacti [1] script that prints out the total number of crashes in the
last 5 minutes.

So if this script had been run at 2009-10-14 11:34:59 it would have output

Firefox_3.0.1:4 Firefox_3.0.10:2 Firefox_3.0.11:5 Firefox_3.0.14:38
Firefox_3.0.2:1 Firefox_3.0.3:1 Firefox_3.0.4:1 Firefox_3.0.7:2 Firefox_3.0.8:3
Firefox_3.0.9:1 Firefox_3.5.2:5 Firefox_3.5.3:77 Thunderbird_3.0b1pre:1

Option 1:
The field names would change through time as they only report product/versions
that had crashes during the last 5 minutes.

You could add and remove cacti outputs on this data source as needed w/o any
changes to the script. 

Option 2:
The script could take a list of inputs so that you get back some expected
output. So given the input:
Firefox_2.0.0.18,Firefox_3.5.2,Firefox_3.5.3,Thunderbird_3.0b1pre
it would output:
Firefox_2.0.0.18:0 Firefox_3.5.2:5 Firefox_3.5.3:77 Thunderbird_3.0b1pre:1
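
A sketch of how Option 2 could format its output, assuming the counts have already been fetched into a dict keyed by product_version. The function name and inputs are hypothetical. The output is the space-separated name:value form that cacti data input scripts print.

```python
def format_cacti_output(counts, expected_fields):
    """Render counts as space-separated name:value pairs in a fixed field order.

    Fields with no crashes in the window are reported as 0, so the set of
    fields stays stable from run to run.
    """
    return " ".join(f"{name}:{counts.get(name, 0)}" for name in expected_fields)


if __name__ == "__main__":
    counts = {"Firefox_3.0.14": 38, "Firefox_3.5.2": 5,
              "Firefox_3.5.3": 77, "Thunderbird_3.0b1pre": 1}
    fields = ["Firefox_2.0.0.18", "Firefox_3.5.2",
              "Firefox_3.5.3", "Thunderbird_3.0b1pre"]
    print(format_cacti_output(counts, fields))
```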

Which Option would work/be easiest to manage with cacti?

Do we have an alert/alarm plugin installed on cacti?

[1] http://docs.cacti.net/manual:087:3a_advanced_topics.1_data_input_methods#data_input_methods
(In reply to comment #6)
These data points are to detect a surge in overall crashes, but it doesn't detect a burst of a specific crash signature.

We can also set up a query that looks at the total number of unique signatures for a 1 hour time period and finds ones that are over a certain threshold for % of total crashes.

Example:
10/14 10am
All Crashes:
Firefox	3.5.3	5834

Top Crashers:
Firefox	3.5.3	74	nsCycleCollectingAutoRefCnt::decr(nsISupports*)
Firefox	3.5.3	74	UserCallWinProcCheckWow
Firefox	3.5.3	65	nsEventListenerManager::Release()
Firefox	3.5.3	64	nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*)
Firefox	3.5.3	55	GraphWalker::DoWalk(nsDeque&)
Firefox	3.5.3	54	RtlpWaitOnCriticalSection

Highest % for a unique signature 74/5834 -> 1.3 %

Running this on 9/26 at 10am (cool iris day)
All Crashes:
Firefox	3.5.3	6206

Top Crashers:
Firefox	3.5.3	458	cooliris19.dll@0x351f2
Firefox	3.5.3	103	RtlpWaitOnCriticalSection
Firefox	3.5.3	73	nsEventListenerManager::Release()
Firefox	3.5.3	73	nsGlobalWindow::cycleCollection...
Firefox	3.5.3	73	libcooliris19.dylib@0x31ea2
Firefox	3.5.3	71	nsCycleCollectingAutoRefCnt::de...

Highest % for a unique signature 458/6206 -> 7.4%

I don't know how to set that up in a cacti friendly way... but we could make an email alert or some other mechanism for the occasion where this % goes over a threshold.
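
For illustration, the percentage check could look something like this once the two counts are in hand. The 5% threshold and the function name are assumptions. The sample numbers are the ones from the two examples above.

```python
def signatures_over_share(signature_counts, total_crashes, threshold_pct=5.0):
    """Return (signature, share_pct) pairs whose share of all crashes exceeds the threshold."""
    if total_crashes <= 0:
        return []
    flagged = []
    for signature, count in signature_counts.items():
        share = 100.0 * count / total_crashes
        if share > threshold_pct:
            flagged.append((signature, round(share, 1)))
    # Largest share first, so the worst offender leads the alert.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    cooliris_day = {"cooliris19.dll@0x351f2": 458, "RtlpWaitOnCriticalSection": 103}
    print(signatures_over_share(cooliris_day, 6206))
```

On the 10/14 numbers nothing is flagged (74 of 5834 is about 1.3%), while the 9/26 numbers flag the cooliris signature.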

I switched from a 5 minute time slice to 1 hour, because there were only tens of crashes per 5 minutes per signature.
Here is the SQL for comment #7
--
-- signature bursts
SELECT product, version, COUNT(date_processed), signature
FROM reports
WHERE
    date_processed > '2009-09-26 10:34:59' AND
    date_processed <= '2009-09-26 11:34:59' AND
    signature IS NOT NULL
GROUP BY product, version, signature
HAVING COUNT(date_processed) > 10
ORDER BY product, version, COUNT(date_processed) DESC;

--- Taking product and versions from above query... get total # of crashes
SELECT product, version, COUNT(date_processed)
FROM reports
WHERE
    date_processed > '2009-09-26 10:34:59' AND
    date_processed <= '2009-09-26 11:34:59' AND
    signature IS NOT NULL AND
    ((product = 'Firefox' AND version = '3.0.14') OR
     (product = 'Firefox' AND version = '3.5.3'))
GROUP BY product, version;
I'm not sure about the 5 minute or even the 1 hour time slice. The granularity of those periods might be too small and deliver too many false positives. We need to model in the ebb and flow of intra-day browser traffic and weekday/weekend effects.

From comment 0, this is a profile of the kind of thing we are trying to monitor:

https://bug519039.bugzilla.mozilla.org/attachment.cgi?id=403459

We want the system to tell us something was up sometime before that Friday morning crash peak. A 12 hour or even 24 hour cycle might be good enough for the kind of warning we need without generating too many false positives.

If we are only looking at the top 10 or top 10% of crashes for performance reasons, that might be useful, but that top 10% needs to be dynamically updated so we are looking mostly at new entries into that list.

Another valuable aspect of this is to alert on the introduction of new signatures we have never seen before. They might even be low volume signatures. Here is an example of that.

https://bugzilla.mozilla.org/show_bug.cgi?id=523529

Adobe ships a new Acrobat Reader on Oct 13th and we start seeing one or more new crashes the next day or within hours of the release.
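
A sketch of the never-seen-before check, assuming we keep a set of every signature observed so far. The function and the sample signature names are hypothetical.

```python
def new_signatures(known_signatures, observed_counts):
    """Return signatures seen in the current window that were never seen before.

    known_signatures: set of all signatures recorded in earlier windows.
    observed_counts: dict of signature -> count for the current window.
    Volume is deliberately ignored, since new signatures may be low volume.
    """
    return sorted(sig for sig in observed_counts if sig not in known_signatures)


if __name__ == "__main__":
    known = {"RtlpWaitOnCriticalSection", "UserCallWinProcCheckWow"}
    window = {"RtlpWaitOnCriticalSection": 54, "example_plugin.dll@0x1234": 3}
    print(new_signatures(known, window))
```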

It's the old QA premise: finding bugs as close to their introduction as possible makes them easier and faster to diagnose and fix. Most of the time the top 10 list or the top 10% list is pretty stable and uninteresting. The cooliris example is more of a rare case where the signature made it to the top of the crash list. We actually would prefer to get alerted well before it gets near the top 10%.
(In reply to comment #9)
Thanks, I'm still digesting this...

Any comment on monitoring bursts in the total number of crashes with cacti? Would that be useful? (This is comments 2-6.) It will be very quick and easy to build.

I'll keep working through comment #9.
Wearing my "I'm picky" hat about comment #8: The standard behavior for a range of items is:

floor <= item AND item < ceiling // Start at floor, never quite reach ceiling

All the materialized views now follow that standard. 

FYI: I once had to deal with legacy code similar to that in comment #8, and it caused endless subtle problems, so I'm gun shy about it.

There's another small problem with the code in comment #8: date_processed is a timestamp with very small granularity. You would want to use a count of bins, not a count of individual date_processed, which will very seldom be equal.
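
Both points can be sketched in a few lines, assuming the timestamps are already loaded as Python datetime objects. The helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def in_window(ts, floor, ceiling):
    """Half-open range test: start at floor, never quite reach ceiling."""
    return floor <= ts < ceiling


def count_by_hour(timestamps):
    """Count reports per hour bin rather than per distinct timestamp."""
    bins = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)
    return dict(bins)
```

With half-open ranges, adjacent windows never double count a report that lands exactly on a boundary.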
Lars had some ideas on this, and maybe it could be done for the next Socorro release.

Here is another place where getting some alerts sometime after 1pm yesterday might have been helpful in starting the analysis sooner.

https://bugzilla.mozilla.org/show_bug.cgi?id=538998#c2

One idea, as a start, would be to just watch the rank changes in the 3 day report:

http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6/3

If the rank changes by more than some threshold, say 50 or 100 ranking slots, then send e-mail.
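
A sketch of that rank-change check, assuming we have two ordered lists of signatures (the previous and the current report, most frequent first). The names and the default threshold of 50 are illustrative.

```python
def rank_changes(previous_ranking, current_ranking, threshold=50):
    """Return (signature, current_rank, slots_gained) for big upward movers.

    Ranks are 1-based. Signatures absent from the previous ranking are
    skipped here and would need separate handling as brand new signatures.
    """
    previous_rank = {sig: i + 1 for i, sig in enumerate(previous_ranking)}
    movers = []
    for i, sig in enumerate(current_ranking):
        current = i + 1
        old = previous_rank.get(sig)
        if old is None:
            continue
        gained = old - current  # positive means the signature moved up
        if gained >= threshold:
            movers.append((sig, current, gained))
    return movers
```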

We could also hook this up to a report for all versions of Firefox, instead of specific versions. That would not have the noise around specific releases of Firefox and would be more attuned to catching the spread of malware and external crashes that we aren't watching for as closely.

Then maybe throttling can be turned down to 1 day, 12 hour, or 6 hour rank changes.
OS: Mac OS X → All
Target Milestone: --- → 1.4
Target Milestone: 1.4 → 1.5
Assignee: nobody → ryan
Target Milestone: 1.5 → 1.6
Still awaiting feedback on this.  

In 1.5 we released a new UI that contained top moving top crashers on each product and version dashboard.

The primary question is whether or not the dashboards provide enough information on explosive crash signatures, or if other information or communication mechanisms are necessary.  If more is needed, please explain in detail.

Pushing to 1.7 to allow time for proper feedback / specs / implementation.
Target Milestone: 1.6 → 1.7
One possible quick fix would be to change the "top changer" report at
http://crash-stats.mozilla.com/products/Firefox to cut it down to a 2 or 3 day window, and also only show the red (trending upward) signatures. I think it does the latter, but right now I get "Top changers currently unavailable." when trying to view the page.

The reduced window would allow quicker spotting of upward trending signatures for people that happened to visit the page.

Next would be to add notifications (e-mail and/or possibly an RSS feed) so that interested trackers of this stuff don't have to actually visit the page to learn of spiking crashes.

as we get these foundations in place we could
Has 3.6.2 been released?  I don't see it on the Firefox download page.  As such, it shouldn't be showing up in the Firefox dashboard, and the reason it is showing up is because the dates for 3.6.2 are incorrect in the admin panel: 
https://crash-stats.mozilla.com/admin/branch_data_sources

Here is what top crashers look like on the 3.6 dashboard:
http://crash-stats.mozilla.com/products/Firefox/versions/3.6

We can add a 3 day window to each of the dashboards.

I like the RSS feed idea, because that would be the easiest/quickest solution to implement.
(In reply to comment #16)
> Has 3.6.2 been released?  I don't see it on the Firefox download page.  As
> such, it shouldn't be showing up in the Firefox dashboard, and the reason it is
> showing up is because the dates for 3.6.2 are incorrect in the admin panel: 
> https://crash-stats.mozilla.com/admin/branch_data_sources
> 

OK, I see the problem here. Going to http://crash-stats.mozilla.com redirects to
a page that ends up with a blank top changer section. Maybe that's what we need to fix.

> Here is what top crashers look like on the 3.6 dashboard:
> http://crash-stats.mozilla.com/products/Firefox/versions/3.6
> 
> We can add a 3 day window to each of the dashboards.
> 
> I like the RSS feed idea, because that would be the easiest/quickest solution
> to implement.

http://crash-stats.mozilla.com/products/Firefox/versions/3.6 looks pretty good. One thing to add in addition to the trending info would be the current ranking.

That would help to provide some context for where the movement is happening. If a signature is up 500 slots to become the #1000 top crash, we might give it a few more hours or days to establish the trend and keep an eye on it, more so than if it has jumped 500 slots into the top 100.
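
As a sketch of how the two numbers might be combined for triage, using the 500-slot and top-100 figures from the example above as default cutoffs. The function and its return labels are hypothetical.

```python
def triage(change_in_rank, current_rank, min_change=500, urgent_rank=100):
    """Decide how to treat a trending signature from its movement and position.

    change_in_rank: slots gained (positive means moving up the list).
    current_rank: 1-based position in the current top crash list.
    """
    if change_in_rank < min_change:
        return "ignore"
    if current_rank <= urgent_rank:
        return "alert"   # big jump into the top of the list
    return "watch"       # big jump, but still far down: let the trend establish
```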

I think doing these couple of small things might yield some good improvements, and then we could evaluate again, looking closer at each of the use cases in the "explosive" bug list to determine what might have been done to detect and notify people sooner.
Trending bugs -> 1.9
Target Milestone: 1.7 → 1.9
I have this in progress at the moment.  

I am ensuring that the changeInRank and currentRank values for each trending top crasher are available on the dashboard, so that the severity of the trend will be readily apparent.  I am also creating a separate trending top crasher page, which will have the data available via RSS and CSV.  

The last piece to put in place will be to add a 3 day date range to the already existing values of 7, 14 and 28 days.

All other notifications for these trends will take place in #525316.
Status: NEW → ASSIGNED
Target Milestone: 1.9 → 1.8
Attached patch Patch 1 for 519423 (obsolete) — Splinter Review
See comment 19 for the changes this patch encompasses.

To see this in my sandbox, please visit the dashboard for a product / version:
http://rsnyder.khan.mozilla.org/reporter/products/Firefox/versions/3.6.7

Or the trending top crashes page for a product / version:
http://rsnyder.khan.mozilla.org/reporter/products/Firefox/versions/3.6.7/topchangers

To apply this patch, in application/config/products.php, you will need to replace $config['topchangers_count'] with:

/**
 * The number of topchangers to feature on the product dashboard.
 */
$config['topchangers_count_dashboard'] = 15;

/**
 * The number of topchangers to feature on the trending top crashers page.
 */
$config['topchangers_count_page'] = 50;
Attachment #459313 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: feedback?
Submitting an updated patch.  The rss and csv links for the trending top crashers did not contain the duration variable in the url.
Attachment #459313 - Attachment is obsolete: true
Attachment #459944 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: feedback?
Comment on attachment 459944 [details] [diff] [review]
Patch 2 for 5159423

Wow, thanks for the quick turnaround... Lots of code!
Thanks for fixing those docstrings.

Looks great!
Attachment #459944 - Flags: review?(ozten.bugs) → review+
Thanks Austin.  Filed https://bugzilla.mozilla.org/show_bug.cgi?id=581679 to get the config file updated on stage.  Added documentation to the rollout procedure for 1.8 at http://code.google.com/p/socorro/wiki/SocorroUpgrade#Socorro_1.8 .

==

Sending        webapp-php/application/config/products.php-dist
Sending        webapp-php/application/config/routes.php
Sending        webapp-php/application/controllers/products.php
Sending        webapp-php/application/views/common/dashboard_product.php
Adding         webapp-php/application/views/common/product_topchangers.php
Sending        webapp-php/application/views/layout.php
Sending        webapp-php/application/views/moz_pagination/nav.php
Adding         webapp-php/application/views/products/product_topchangers.php
Sending        webapp-php/css/screen.css
Sending        webapp-php/js/socorro/daily.js
Sending        webapp-php/js/socorro/dashboard.js
Transmitting file data ...........
Committed revision 2247.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Review for possible inclusion in 1.7.6.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 1.8 → 1.7.6
Updated upgrade docs at https://code.google.com/p/socorro/wiki/SocorroUpgrade

Updated Bug 612981 to include config change.

Working on integrating remaining code changes.
This will resolve Bug 603561 as well.

Committing.

==

Sending        webapp-php/application/config/products.php-dist
Sending        webapp-php/application/config/routes.php
Sending        webapp-php/application/views/common/dashboard_product.php
Adding         webapp-php/application/views/common/product_topchangers.php
Sending        webapp-php/application/views/layout.php
Adding         webapp-php/application/views/products/product_topchangers.php
Sending        webapp-php/js/socorro/daily.js
Sending        webapp-php/js/socorro/dashboard.js
Transmitting file data ........
Committed revision 2752.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro