resource-get gets hung on charm store

Bug #1627127 reported by Cory Johns
This bug affects 5 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Anastasia

Bug Description

The latest edge version of apache-zeppelin (https://jujucharms.com/apache-zeppelin/15) contains a 0-byte placeholder resource in the store and logic to fall back to the current, non-2.0 blob fetching if a resource is not provided during deployment. However, the charm hangs and never reaches the fallback logic because resource-get waits indefinitely for the store.

This doesn't happen every time, but it happens consistently enough to be a significant problem, as it blocks the charm indefinitely.

Either this is unrelated to the fix for https://bugs.launchpad.net/juju/+bug/1577415, or it is a regression of that fix. I'm fairly certain it is related to the charm store issues reported in https://github.com/CanonicalLtd/jujucharms.com/issues/332

For reference, I have this strace:

    ubuntu@plugin-0:~$ pgrep -alf resource-get
    24068 resource-get zeppelin
    ubuntu@plugin-0:~$ sudo strace -p 24068
    Process 24068 attached
    epoll_wait(5, ^CProcess 24068 detached
     <detached ...>
    ubuntu@plugin-0:~$ time sudo strace -p 24068
    Process 24068 attached
    epoll_wait(5,

And I have the following process time information:

    ubuntu@plugin-0:~$ ps -p "24068" -o etime=
          01:01:58

Cory Johns (johnsca) wrote :

Additional information showing the size of the resource file:

    $ charm show ~bigdata-charmers/apache-zeppelin-15 id resources
    id:
      Id: cs:~bigdata-charmers/apache-zeppelin-15
      Name: apache-zeppelin
      Revision: 15
      User: bigdata-charmers
    resources:
    - Description: The Apache Zeppelin distribution
      Fingerprint: null
      Name: zeppelin
      Path: zeppelin.tgz
      Revision: -1
      Size: 0
      Type: file

Cory Johns (johnsca) wrote :

It was just pointed out to me that the rev in the above comment is -1 with a null fingerprint, so that may be the cause. Specifying the edge channel shows the right resource rev:

    $ charm show ~bigdata-charmers/apache-zeppelin-15 --channel=edge id resources
    id:
      Id: cs:~bigdata-charmers/apache-zeppelin-15
      Name: apache-zeppelin
      Revision: 15
      User: bigdata-charmers
    resources:
    - Description: The Apache Zeppelin distribution
      Fingerprint: OLBgp1GsljhM2TJ+sbHjaiH9txEUvgdDTAzHv2P24donTt6/529l+9Ua0vFImLlb
      Name: zeppelin
      Path: zeppelin.tgz
      Revision: 2
      Size: 0
      Type: file

I'm going to try releasing to stable to try to get the resource rev fixed and see if that helps.

Cory Johns (johnsca) wrote :

Ok, it definitely seems to be caused by the invalid resource revision. The resource revision was caused by a conflict between cs:trusty/apache-zeppelin and cs:apache-zeppelin (the latter being multi-series and including resource support).

This seems to be caused in the charm store by reverting from a resource-enabled revision of a charm to a non-resource-enabled revision on a given channel and then deploying the resource-enabled revision from another channel (or by explicit revision number). It's more of an issue with the charm store, but it may be worth handling the case in Juju to avoid this on the off chance it comes up.

Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Richard Harding (rharding)
milestone: none → 2.0.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-rc3 → 2.0.0
Changed in juju:
milestone: 2.0.0 → 2.0.1
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Changed in juju:
assignee: Richard Harding (rharding) → nobody
milestone: none → 2.2.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta1 → 2.2-beta2
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta2 → 2.2-beta3
Nicholas Skaggs (nskaggs) wrote :

We're seeing this in our kubernetes deploys. At the very least, it would be useful to ensure that if the action can't complete, it fails in a timely manner. Juju 2 should then retry the hook and all would be well.
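
One way a caller could bound the wait, independent of any fix in Juju itself, is to run the hook tool under a deadline. The following is a minimal Go sketch under stated assumptions: `runWithTimeout`, `errTimedOut`, and `demoTimeout` are hypothetical names, and `sleep 5` stands in for a hung `resource-get zeppelin` so that the example can run outside a hook context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// errTimedOut is a hypothetical sentinel returned when the deadline passes.
var errTimedOut = errors.New("command timed out")

// demoTimeout is deliberately short so the demonstration finishes quickly.
const demoTimeout = 200 * time.Millisecond

// runWithTimeout runs a command and kills it if it has not finished
// within timeout. It returns the trimmed standard output on success.
func runWithTimeout(timeout time.Duration, name string, args ...string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).Output()
	if ctx.Err() == context.DeadlineExceeded {
		// The process was killed because the deadline expired.
		return "", errTimedOut
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// Inside a hook this might instead be:
	//   runWithTimeout(5*time.Minute, "resource-get", "zeppelin")
	_, err := runWithTimeout(demoTimeout, "sleep", "5")
	fmt.Println("error:", err)
}
```

With a bound like this, the caller gets an error back instead of blocking, so a charm could proceed to its fallback logic or fail the hook and let it be retried.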

tags: added: cdo-qa cdo-qa-blocker
Changed in juju:
milestone: 2.2-beta3 → 2.2-beta4
Changed in juju:
importance: High → Critical
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1
Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
Anastasia (anastasia-macmood) wrote :

From https://github.com/juju/charm/blob/v6-unstable/resource/resource.go#L57, having a negative revision should definitely throw a validation error.

Tracing the code through layers to client and command output, the error seems to be propagated correctly and the command should have erred out.
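
To illustrate the kind of check being described, here is a minimal sketch of a revision validation. It is not the code at the link above: `resourceInfo`, `validate`, and `errNotValid` are hypothetical names, and only the fields relevant to this report are modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotValid is a hypothetical sentinel for validation failures.
var errNotValid = errors.New("resource not valid")

// resourceInfo is a trimmed-down, hypothetical stand-in for the
// resource metadata shown in the `charm show` output in this report.
type resourceInfo struct {
	Name     string
	Revision int
	Size     int64
}

// validate rejects metadata that can never describe a downloadable
// resource, such as a negative revision.
func (r resourceInfo) validate() error {
	if r.Name == "" {
		return fmt.Errorf("%w: missing name", errNotValid)
	}
	if r.Revision < 0 {
		return fmt.Errorf("%w: revision %d must be non-negative", errNotValid, r.Revision)
	}
	return nil
}

func main() {
	// Values taken from the two `charm show` outputs in this report.
	withoutChannel := resourceInfo{Name: "zeppelin", Revision: -1}
	edge := resourceInfo{Name: "zeppelin", Revision: 2}

	fmt.Println("without channel:", withoutChannel.validate())
	fmt.Println("edge:", edge.validate())
}
```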

@Nicholas, @Cory,
Do you have a log that will help us trace why the error is not bubbled up?

I'll try to reproduce the scenario but given the intermittency of the failure, it may not fail easily for me.

Changed in juju:
status: Triaged → Incomplete
Anastasia (anastasia-macmood) wrote :

I suspect that the cause of all this pain may not be the revision number but the fingerprint verification, which constructs a hash from the contents of the file reader. For a placeholder, of course, the contents would be empty. Hash construction is defined in Go code, although we have a means of providing a custom hash calculation algorithm...

Consequently, I also wonder if we're seeing the same issue with go 1.8...
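
For reference, hashing empty content is well defined. The sketch below assumes the fingerprint is a SHA-384 digest encoded with standard base64; the helper name `fingerprint` is hypothetical. Under that assumption, hashing zero bytes yields exactly the fingerprint listed for the edge channel earlier in this report.

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
)

// fingerprint returns the base64-encoded SHA-384 digest of content.
// The choice of SHA-384 and standard base64 is an assumption here.
func fingerprint(content []byte) string {
	sum := sha512.Sum384(content)
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// A 0-byte placeholder has no content at all.
	fmt.Println(fingerprint(nil))
}
```

This suggests that the edge resource metadata is at least internally consistent with a 0-byte file, which points away from the empty content itself being unhashable.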

John A Meinel (jameinel) wrote :

I tracked into the code a little bit, and came across this comment that seems like it might be relevant:

    func ContextDownload(deps ContextDownloadDeps) (path string, err error) {
        // TODO(katco): Potential race-condition: two commands running at
        // once. Solve via collision using os.Mkdir() with a uniform
        // temp dir name (e.g. "<datadir>/.<res name>.download")?
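
The idea in that TODO can be sketched as follows. This is an illustration only, not Juju's implementation: `claimDownloadDir` and `runDemo` are hypothetical names, and the directory layout simply follows the example in the comment. Because `os.Mkdir` fails when the directory already exists, the first command to create it holds the claim and a concurrent second command sees a collision.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// claimDownloadDir tries to create "<dataDir>/.<resName>.download".
// Success means the caller owns the download; an error for which
// os.IsExist is true means another command got there first.
func claimDownloadDir(dataDir, resName string) (string, error) {
	dir := filepath.Join(dataDir, "."+resName+".download")
	if err := os.Mkdir(dir, 0o755); err != nil {
		return "", err
	}
	return dir, nil
}

// runDemo makes two claims for the same resource in a scratch
// directory and reports whether the first succeeded and whether the
// second collided.
func runDemo() (firstOK, secondCollided bool) {
	dataDir, err := os.MkdirTemp("", "resources")
	if err != nil {
		return false, false
	}
	defer os.RemoveAll(dataDir)

	_, first := claimDownloadDir(dataDir, "zeppelin")
	_, second := claimDownloadDir(dataDir, "zeppelin")
	return first == nil, os.IsExist(second)
}

func main() {
	firstOK, secondCollided := runDemo()
	fmt.Println("first claim succeeded:", firstOK)
	fmt.Println("second claim collided:", secondCollided)
}
```

A real implementation would also need to remove the directory when the download finishes or fails, otherwise a crashed command would block later ones.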

Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Greg Lutostanski (lutostag) wrote :

@Anastasia we don't have logs, but we reliably recreate this on every run with k8s in our CI. Tomorrow we will leave a run up, and I'll see if I can find someone to actively poke at it while it is in this state.

Anastasia (anastasia-macmood) wrote :

@Greg Lutostanski (lutostag),
It'd be very helpful - thank you. Could you please also clarify what steps I should use to reproduce this?

Simply deploying kubernetes-core does not seem to trip for me. I have tried several times.

Also, based on our IRC conversation, I'd like to eliminate possibilities related to setup - you have mentioned that you suspect your proxies may need to be configured properly.

@Cory Johns (johnsca),
Was your deployment using a proxy too?

Chris Gregan (cgregan) wrote :

Kubernetes Core on top of Autopilot-deployed OpenStack is the environment, but I doubt that makes a difference for this particular bug unless it is race based.

Jonathan Marsaud (zic) wrote :

Hi,

I can reproduce this one by deploying the old canonical-kubernetes revision *21* (the goal is to replicate a production cluster which is still on Kubernetes 1.5.3, then upgrade it to the latest version).

Regards.

Anastasia (anastasia-macmood) wrote :

Thank you, Jonathan Marsaud (zic)!
I'll try to reproduce it tomorrow :D

Changed in juju:
status: Incomplete → Triaged
status: Triaged → In Progress
assignee: nobody → Anastasia (anastasia-macmood)
Anastasia (anastasia-macmood) wrote :

I can reproduce this and have tracked down the root cause \o/

I am working on a solution and will propose a fix as soon as it's ready!

Anastasia (anastasia-macmood) wrote :

The cause is that we keep trying to download resources despite the errors. The exception to the rule is "not found" errors: if we encounter one of these, we stop trying.

The immediate solution is to ensure that we also stop trying to download a resource if we get validation errors. As Cory mentioned, one such scenario is when the resource revision is -1. There is no way that the resource revision between charms will get fixed, so there is no need to persist with downloading.

I also want to limit the number of times we try to download a resource. The current approach of retrying forever feels weak.
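
The behaviour proposed above could look roughly like the sketch below. It is an illustration under assumptions, not the change that was eventually merged: `downloadWithRetry`, `errNotFound`, `errNotValid`, and `errTransient` are hypothetical names, and a real loop would normally wait between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the error kinds
// discussed in this report.
var (
	errNotFound  = errors.New("not found")
	errNotValid  = errors.New("not valid")
	errTransient = errors.New("transient failure")
)

// downloadWithRetry calls fetch until it succeeds, until it returns an
// error that retrying cannot fix (not found or not valid), or until
// maxAttempts calls have been made. It returns the number of calls.
func downloadWithRetry(maxAttempts int, fetch func() error) (attempts int, err error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = fetch()
		if err == nil {
			return attempts, nil
		}
		if errors.Is(err, errNotFound) || errors.Is(err, errNotValid) {
			// Retrying will not change the outcome, so stop now.
			return attempts, err
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// A validation failure stops after the first attempt.
	n, err := downloadWithRetry(5, func() error { return errNotValid })
	fmt.Println(n, err)

	// A transient failure is retried until the attempt limit is reached.
	n, err = downloadWithRetry(3, func() error { return errTransient })
	fmt.Println(n, err)
}
```

Either way the caller gets an error back in bounded time, which lets the hook fail visibly instead of hanging.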

Anastasia (anastasia-macmood) wrote :

Correction to above comment:
There is no way that the resource revision between download attempts will get fixed...

John A Meinel (jameinel) wrote : Re: [Bug 1627127] Re: resource-get gets hung on charm store

To put it another way, that version of the charm will never correctly
deploy because there is invalid data in the resource information for that
revision.
However, we should report it as failed, rather than getting hung, so that
users can move to another version of the charm.

On Wed, May 24, 2017 at 6:17 PM, Anastasia <email address hidden>
wrote:

> The cause is that we will keep trying to download resources despite the
> errors. The exception to the rule are "not found" errors: if we
> encounter one of these, we'd stop trying.
>
> Immediate solution is to also ensure that we do not try to download
> resource if we get validation errors. As Cory mentioned, one such
> scenario is when resource revision is -1. There is no way that the
> resource revision between charms will get fixed, so there is no need to
> persist with downloading.
>
> I also want to limit the number of times we try to download a resource.
> Current approach of keep trying for ever feels weak.

Anastasia (anastasia-macmood) wrote :

PR against develop (2.2): https://github.com/juju/juju/pull/7398

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released