I have noticed that most people don't know how to deal with errors in Archipel. This post tries to help you manage problems in Archipel, track down the origin of errors, and fix them, or at least report a useful issue.
First, let's see how I organize my desktop. This is my personal preference, but I think it's a good starting point. I have two screens, and that is honestly the bare minimum for me.
As you can see, on the first screen I have my browser with the debugger open and the source code (obviously optional if you are just a user). On the second screen I have two terminals open: one connected to the hypervisors through SSH, displaying archipel.log with tail (I use one tab per hypervisor), and another one on my local computer to send updates.
As you can see, to update the Archipel agent code, a simple scp followed by an Archipel restart is sufficient, because I have the source installed with buildAgent -d. This keeps the code-and-try loop short. I also usually have another terminal SSH'ed to the hypervisors to send simple commands or fix things quickly. If you use Mac OS X, you can create a Terminal profile to open this workspace with one click.
There are basically two main errors users complain about:
They get 501 errors
Their hypervisors are offline
Usually, only a few things can cause these issues. It is almost always due to:
a missing library such as python-libvirt or xmpppy
a misconfiguration in archipel.conf
a problem with the ejabberd configuration
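The first cause can be checked before even starting the agent. Here is a minimal sketch, assuming that python-libvirt installs a module named libvirt and xmpppy a module named xmpp, and that you run it with the same Python interpreter as the agent. The helper name missing_modules is mine and is not part of Archipel.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed module names: python-libvirt -> libvirt, xmpppy -> xmpp.
    for name in missing_modules(["libvirt", "xmpp"]):
        print("cannot import %s" % name)
```

If it prints nothing, both modules are importable and the problem is more likely in archipel.conf or in the ejabberd configuration.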
When you are debugging Archipel, you need to know that most errors are logged in archipel.log. If an error is not logged, there is a good chance you have run into a new issue we are not aware of. To be sure you see every kind of error, do not start Archipel with the init script! It redirects all output to /dev/null. You should start and restart Archipel like this:
# killall runarchipel; runarchipel; tailf /var/log/archipel/archipel.log
Do not hesitate to grep! For example:
# killall runarchipel; runarchipel; tailf /var/log/archipel/archipel.log | grep -i error # only displays error lines
# killall runarchipel; runarchipel; tailf /var/log/archipel/archipel.log | grep -i "uuid@fqdn" # only displays logs from the VM with the given UUID
Be creative, and always try to filter the log, because it can be very verbose. Most functions prefix their log lines with an identifier. For example, if you only want the migration-related logs, grep on "MIGRATION"; for the VMParking feature only, grep on "VMPARKING".
This will help you track down where your problem comes from.
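When you need to combine several filters, for instance only the error lines of one VM, a few lines of Python do the same job as chained greps. This is only a sketch: the function name filter_log and the sample lines are hypothetical, and the real archipel.log layout may differ.

```python
def filter_log(lines, *needles):
    """Keep the lines that contain every needle, ignoring case (like chained `grep -i`)."""
    lowered = [needle.lower() for needle in needles]
    return [line for line in lines if all(n in line.lower() for n in lowered)]


# Hypothetical log lines, only here to show the filtering.
SAMPLE = [
    "INFO uuid-1@fqdn MIGRATION: starting",
    "ERROR uuid-2@fqdn VMPARKING: ticket not found",
    "error uuid-1@fqdn MIGRATION: failed",
]
```

Calling filter_log(SAMPLE, "error", "MIGRATION") keeps only the last sample line. To use it on the real log, pass the open file object for /var/log/archipel/archipel.log instead of SAMPLE.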
Note that 501 errors are not logged in archipel.log! A 501 error means the agent has not reacted to the command, so obviously nothing is logged on the agent side. It is, however, logged on the client:
and in the browser console:
The browser console is an essential tool. It contains the very XMPP stanzas that triggered the errors. For instance, if we look at the second error message, we see that the stanza is:
<iq xmlns='jabber:client' from='hypervisor@ramucho/ramucho' to='admin@ramucho/ArchipelController' id='4771' type='error'>
    <query xmlns='archipel:hypervisor:health'>
        <archipel xmlns='archipel:hypervisor:health' action='logs' limit='50'/>
    </query>
    <error code='501' type='cancel'>
        <feature-not-implemented xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>
        <text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>
            The feature requested is not implemented by the recipient or server and therefore cannot be processed.
        </text>
    </error>
</iq>
This means that the hypervisor has not handled the command archipel:hypervisor:health->logs. In this case it's easy to understand, because I have disabled the health module, so nothing was there to handle the XMPP stanza.
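If you collect many of these stanzas from the console, it can help to pull out just the parts that matter: who answered, which namespace and action were requested, and the error code. A minimal sketch using only the Python standard library follows; the function name summarize_error is hypothetical, and the sample is the stanza above with the text element left out for brevity.

```python
import xml.etree.ElementTree as ET

STANZA = """<iq xmlns='jabber:client' from='hypervisor@ramucho/ramucho' to='admin@ramucho/ArchipelController' id='4771' type='error'>
<query xmlns='archipel:hypervisor:health'>
<archipel xmlns='archipel:hypervisor:health' action='logs' limit='50'/>
</query>
<error code='501' type='cancel'>
<feature-not-implemented xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>
</error>
</iq>"""


def summarize_error(stanza_xml):
    """Extract sender, requested namespace/action and error code from an error stanza."""
    iq = ET.fromstring(stanza_xml)
    summary = {"from": iq.get("from"), "code": None, "namespace": None, "action": None}
    for element in iq.iter():
        # ElementTree names tags as "{namespace}localname".
        namespace, _, local = element.tag.rpartition("}")
        if local == "error":
            summary["code"] = element.get("code")
        elif local == "archipel":
            summary["namespace"] = namespace.lstrip("{")
            summary["action"] = element.get("action")
    return summary
```

For the stanza above, the summary says that hypervisor@ramucho/ramucho answered with code 501 to the action logs in the namespace archipel:hypervisor:health, which points directly at the health module.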
When this happens unintentionally, it means that an error occurred during startup. Let's manually remove a token like "vnc_certificate_file", which is required by the VNC module. When I start the agent and grep on "ERROR", I get this in archipel.log:
This message is pretty explicit. I have two virtual machines (653b...@ramucho and 2590...@ramucho) that complain they cannot load the VNC plugin because the token "vnc_certificate_file" is missing. This does not prevent the agent from starting, but if you try to access the VNC screen from the UI, you will get 501 errors. That's pretty simple.
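You can also check for such a token without restarting the agent. The sketch below assumes that archipel.conf is an INI-style file readable by Python's configparser; the function name sections_with_token, the section name VNC, and the paths in the sample are hypothetical and only there for illustration.

```python
import configparser


def sections_with_token(conf_text, token):
    """Return the names of the sections in which `token` is defined.

    A token set in [DEFAULT] counts as defined for every section.
    """
    # Interpolation is disabled so that %(...)s references are not expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return [name for name in parser.sections() if parser.has_option(name, token)]


# Hypothetical configuration excerpt.
SAMPLE_CONF = """
[DEFAULT]
archipel_folder_lib = /var/lib/archipel

[VNC]
vnc_certificate_file = /etc/archipel/vnc.pem
"""
```

An empty result for "vnc_certificate_file" would match the situation described above: the token is missing, and the plugin that needs it will not load. To check the real file, read its content and pass it as conf_text.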
The hypervisor offline error
The first thing to do is to check if the hypervisor is really offline. To do so, ask the ejabberd server:
# ejabberdctl connected_users your.fqdn.com
It lists the accounts that are actually connected. This often helps to track down the root of the error. If you see your hypervisor online, you almost certainly made a typo in the JID when adding it to your roster. If it's offline, the log will tell you why! There are plenty of possible causes, and most of them are logged. If it's a more critical issue, you will certainly see an exception trace printed on STDERR (remember that we started Archipel manually).
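With many accounts connected, the list is easier to check with a small script. This sketch assumes the command prints one full JID (user@host/resource) per line; the function name is_connected is hypothetical, and the sample output reuses the JIDs from the stanza shown earlier.

```python
def is_connected(connected_users_output, bare_jid):
    """Return True if any connected session belongs to `bare_jid` (user@host)."""
    for line in connected_users_output.splitlines():
        session = line.strip()
        # A full JID is user@host/resource; drop the resource to compare.
        if session.split("/", 1)[0].lower() == bare_jid.lower():
            return True
    return False


# Hypothetical output, one full JID per line.
SAMPLE_OUTPUT = """admin@ramucho/ArchipelController
hypervisor@ramucho/ramucho
"""
```

If is_connected returns True for the hypervisor's JID but the UI still shows it offline, compare that JID character by character with the one in your roster.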
It's important to track down your problems before reporting an issue or asking for help on the IRC channel. Most of the time, these errors are extremely easy to fix, but users report them as "it doesn't work". That doesn't help much, and we lose a large amount of time explaining this. So before reporting any problem:
Check your network connectivity
Be sure you have not made any typo! (seriously, this happens quite often)
Try simple tests to rule out some cases
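For the network check, a quick TCP test from the hypervisor to the XMPP server already rules out a lot. A minimal sketch: the function name can_connect is hypothetical, and the port in the usage note, 5222, is the standard XMPP client port, which your ejabberd setup may or may not use.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, can_connect("your.fqdn.com", 5222) run on the hypervisor tells you whether the agent can reach the server at all. A False result points to DNS, firewall, or a stopped ejabberd rather than to Archipel itself.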
As a final note, we greatly appreciate users who send crash reports. But please consider adding a very short description of your last action before the crash. Otherwise, depending on the browser, the trace may not be usable.