Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on upgrade from 8.20.0 -> 8.25.0 via Span.finish #3977

Open
nebsta opened this issue May 14, 2024 · 37 comments
Open

Crash on upgrade from 8.20.0 -> 8.25.0 via Span.finish #3977

nebsta opened this issue May 14, 2024 · 37 comments

Comments

@nebsta
Copy link

nebsta commented May 14, 2024

Platform

iOS

Environment

Develop

Installed

Swift Package Manager

Version

8.25.0

Did it work on previous versions?

8.20.0

Steps to Reproduce

  1. Install a version of the app that is using version 8.20.0 of the Sentry iOS SDK
  2. Update to a version of the app that is using version 8.25.0 of the Sentry iOS SDK
  3. Crash occurs on startup

Expected Result

We expect the crash not to occur

Actual Result

330344319-4c996226-fbde-4d92-b44f-eaeee4a9839f

Screenshot 2024-05-14 at 10 52 38

Screenshot 2024-05-14 at 10 52 48

Few other notes:

  • The span being finished is one that captures the boot time for all our internal systems at startup
  • After the crash, if you restart the app on the 8.25.0 version there won't be any crash after that

Are you willing to submit a PR?

No response

@philipphofmann
Copy link
Member

Hey @nebsta,

I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.

It seems like the culprit is #3892, which moved capturing the transaction to a background thread.

@philipphofmann philipphofmann moved this from Needs Discussion to Needs More Information in Mobile & Cross Platform SDK May 22, 2024
@getsantry getsantry bot moved this to Waiting for: Community in GitHub Issues with 👀 3 May 22, 2024
@nebsta
Copy link
Author

nebsta commented May 28, 2024

Hey @nebsta,

I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.

It seems like the culprit is #3892, which moved capturing the transaction to a background thread.

@philipphofmann sorry for the late reply. Didn't get a notification about you responding. Here is a snippet of what the wrapper looks like. We mainly have this so that we can swap out reporting libraries in the future if we want to. As you can see it's pretty simple, nothing too fancy.

// Copyright © 2023 Trustpilot. All rights reserved.

import Foundation
import Sentry

class SentrySpan: SentrySpanProtocol {
  private let sentrySpan: Span
  
  init(span: Span) {
    self.sentrySpan = span
  }
  
  func startChild(operation: String) -> SentrySpanProtocol {
    let sentrySpan = self.sentrySpan.startChild(operation: operation)
    return SentrySpan(span: sentrySpan)
  }
  
  func finish(status: TrustpilotSpanStatus) {
    sentrySpan.finish(status: status.toSentrySpanStatus())
  }
  
  func finish() {
    sentrySpan.finish()
  }
}

extension TrustpilotSpanStatus {
  func toSentrySpanStatus() -> SentrySpanStatus {
    switch self {
    case .internalError: return .internalError
    }
  }
}

// Copyright © 2023 Trustpilot. All rights reserved.

import Foundation

protocol SentrySpanProtocol {
  func startChild(operation: String) -> SentrySpanProtocol
  func finish(status: TrustpilotSpanStatus)
  func finish()
}

enum TrustpilotSpanStatus {
  case internalError
}

I'll see if I can get a full crash report from the sentry dashboard and post it here soon.

@getsantry getsantry bot moved this from Waiting for: Community to Waiting for: Product Owner in GitHub Issues with 👀 3 May 28, 2024
@philipphofmann
Copy link
Member

Thanks for the update, @nebsta. The full crash report would be helpful. I think I see a couple of occurrences of this in our internal SDK crash detection.

@nebsta
Copy link
Author

nebsta commented May 29, 2024

@philipphofmann classically I can't seem to replicate the issue anymore, so I'm unable to get the crash log for you. But by the sounds of it, it seems like it's a known issue if you have seen occurrences of it.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 May 29, 2024
@philipphofmann philipphofmann moved this from Needs More Information to Needs Discussion in Mobile & Cross Platform SDK May 29, 2024
@philipphofmann
Copy link
Member

Yes, I think we should have enough information to fix it. Thanks for the update.

@nebsta
Copy link
Author

nebsta commented Jun 12, 2024

@philipphofmann any update on a fix?

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Jun 12, 2024
@brustolin
Copy link
Contributor

The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?

Finishing or refinishing a span should not matter; we have safeguards for these scenarios.

If you could provide us with a reproduction of your flow and how you use transactions/spans, it would help us a lot in identifying whether we have a bug in the said safeguards.

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

@brustolin I've outlined an explanation of how we're using them below, including some code examples as well. Hopefully that gives you a better picture.

The main class where we use the spans and where this exception is happening is called the Bootstrapper. This is an actor, which goes through a list of BootstrapItem instances (which are each themselves actor instances as well), and executes some setup work that we need to perform on the startup of our application.

You can see a full gist of what the Boostrapper looks like here, and what one of the BootstrapItem implementations look like here

If you look within the Bootstrapper.loadAll method, you'll see where we are creating/finishing all the spans for the bootstrap process. We start 1 span for the entire bootstrap process, and then spin off child spans for each item so that we can get a better breakdown of where our bottlenecks are.

    let transaction = dependencies.sentry.startTransaction(name: "Bootstrap", operation: "Start") // HERE
    for item in items {
      let itemName = String(describing: type(of: item))
      TLog.bootstrap.info("Loading item: \(itemName)")
      
      dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.started(name: itemName))
      
      var span: SentrySpanProtocol?
      do {
        span = transaction?.startChild(operation: itemName) // HERE
        try await item.load()
        try Task.checkCancellation()
        
        dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.finished(name: itemName))
        span?.finish()
      } catch {
        TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
        span?.finish(status: .internalError) // HERE
        transaction?.finish() // HERE
        dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.failed(name: itemName))
        
        /// Some bootstrap items may not need to raise errors up to Sentry, so we log them here instead.
        guard item.shouldRaiseFailureToSentry else {
          TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
          return
        }
        throw BootstrapError.itemFailed(itemName: String(itemName), error: error)
      }
    }
    transaction?.finish() // HERE

As I mentioned, there is a bug in this implementation, as if there is an exception in any of the bootstrap items, then we don't properly finish the span for the entire bootstrap process, which is something we're going to fix on our end. This is why I was wondering if perhaps this could be the reason for our crash, as I mentioned that we do sometimes see some instances in the 99th percentile that are reported to be going for as long as a couple of days, so I wonder if there is an issue in the mechanism for ensure that spans are always cleaned up.

Image

Hopefully that's enough to give you a better picture.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Sep 27, 2024
@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

I should also mention that the spans we're using the Bootstrapper are actually our own span class (which you can see here, which is just a light wrapper around the Sentry Span class. We do this so that our reporting manager is SDK agnostic, and so we can more easily write unit testing.

@philipphofmann
Copy link
Member

I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.

@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with options.debug = true and share them with us here please? That could help us identify the problem.

Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.

There is a small chance that #4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.

@nebsta
Copy link
Author

nebsta commented Oct 1, 2024

I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.

@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with options.debug = true and share them with us here please? That could help us identify the problem.

Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.

There is a small chance that #4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.

@philipphofmann We have been attempting to upgrade to 8.36.0 instead since it came out, but we're still getting the same issue. I'll try enabling the logs and see how that goes

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Oct 1, 2024
@nebsta
Copy link
Author

nebsta commented Oct 1, 2024

@philipphofmann I've attached the logs that appear when enabling debug leading up to the crash. One thing I also noticed actually looking at the code, which you'll also see in the logs, is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.

sentrylogs.log

@philipphofmann
Copy link
Member

The last log message is

[Sentry] [debug] [SentrySDK:224] SDK main thread init started...

The logs don't show more log messages for installing all the integrations. Do you start your transaction leading to a crash directly after initializing the SDK a second time? I tried to reproduce your problem by starting the SDK, stopping it, starting it again and then directly capturing a transaction, but it works flawlessly.

is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.

Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.

@nebsta
Copy link
Author

nebsta commented Oct 2, 2024

@philipphofmann

Do you start your transaction leading to a crash directly after initialising the SDK a second time?

The finish on the transaction that triggers the crash is before we stop and start the system.

Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.

I'll give this a go today and get back to you

@P-Foley-Trustpilot
Copy link

It looks like this has been resolved when updating from 8.20.0 to 8.39.0!
Was there a change in 8.39.0 that was expected to fix this?

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Oct 31, 2024
@philipphofmann
Copy link
Member

@P-Foley-Trustpilot, we made so many improvements between these versions for that part of the SDK that it's hard to pin it down. It's great that it is fixed for you. Maybe others could retry with 8.39.0 and tell us if it's fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Status: Backlog
Development

No branches or pull requests

6 participants