Downtime & Recovery: Bludgeoning DigitalOcean Arch into Working Again
This blog had some downtime yesterday.
I have a DigitalOcean Droplet that’s an Arch x86-64 system, from back before they dropped Arch.
When I went to poke certbot into renewing my LetsEncrypt cert, I found the system was down. Power cycle, and…it’s still down.
What follows are my notes from getting things working again. I ran into a couple dark corners where I found no search hits, so maybe this’ll help the next poor sap.
Goal: Blog Accessible from Phone
The main thing the VPS runs is my blog. The other main thing is ZNC (an IRC bouncer). I really just care about getting the blog back up immediately, though. As far as I’m concerned, that means my phone connected to the cellular network can resolve the domain and see the latest version of the blog content.
Culprit: Systemd Is Hosed
My systemd got hosed somehow when the system tried to boot back up. I reached this conclusion because all the console errors came from systemd, and when systemd freezes, the whole boot freezes with it.
This looks to be the same issue described by a DigitalOcean Community user.
Bad news: No-one had a solution beyond “nuke & pave”. Hooray.
For added fun, I’m traveling and away from my backups. All I have is an ancient snapshot from 2014.
Maybe that’s all I need?
Nuke & Pave? Ancient Droplet Strikes Again
Bah, enough with an unsupported platform! I don’t get any of the new DO goodies with Arch.
What about just switching to FreeBSD?
I tried to rebuild it with FreeBSD, but it turns out you can’t rebuild a Droplet with a modern image unless it was seeded with an SSH key to begin with. You get a fun flash message at a weird place on the screen (or eventually, off the screen, if you keep trying long enough):
“Data image requires at least one SSH key”
I sure as heck have an SSH key on file with them, and I confirmed the (MD5? not SHA-256? Huh) fingerprint matches the one I still have on disk here.
But that wasn’t quite what was meant. As DigitalOcean support says:
Sadly unless the Droplet was originally created using the SSH key in your Cloud panel, you won’t be able to rebuild it in place with the FreeBSD image. Instead, you’ll have to create a new Droplet.
Guess the key got baked in somewhere deep. Or it didn’t, in my case. Lame sauce.
So back to Arch for now, because I don’t want to have to repoint my DNS and do a bunch more clicky-clicky to rig up an entirely new host.
Recovery: Successfully Time-Traveled to 2014
Restore from the ancient snapshot of 2014-03-19 using the DigitalOcean web UI. (Yeah, probably should have taken a snapshot a bit more recently. Oh well.)
ssh in. I can’t install certbot.
Get Certbot Installed
Let’s try to update to a modern toolchain so I can follow those recommended steps.
System Update Fails: GPG Key Import Fails
The keys are out of date, so a system update with pacman -Syyu fails after restoring from the 4-year-old snapshot.
Let’s refresh those keys:
pacman-key --refresh-keys
pacman -S archlinux-keyring
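For completeness, the usual keyring-reset dance also includes re-initializing and re-populating the bundled Arch keys. I’m noting these for reference, not claiming they would have saved me here:
pacman-key --init
pacman-key --populate archlinux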
Now try to update the system again. It continues failing out when I tell it, “Sure, import the key.” I can’t find anything helpful searching the Web.
Yank the pacman source, ripgrep for the error message, and wind up staring at libalpm/signing.c:460:
/**
 * Import a key defined by a fingerprint into the local keyring.
 * @param handle the context handle
 * @param fpr the fingerprint key ID to import
 * @return 0 on success, -1 on error
 */
int _alpm_key_import(alpm_handle_t *handle, const char *fpr)
/* SNIP */
    if(key_import(handle, &fetch_key) == 0) {
        ret = 0;
    } else {
        _alpm_log(handle, ALPM_LOG_ERROR,
                _("key \"%s\" could not be imported\n"), fetch_key.uid);
    }
/* SNIP */
Maybe it requires a newer version of GPG or something to import the key that I need in order to install that newer version of GPG. Hard to say: the error message says nothing beyond “import failed,” with no hint of why or how it failed.
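For the record, a manual import through pacman’s own keyring would look roughly like this, with <FINGERPRINT> standing in for whatever key pacman choked on. I’m not claiming it would have fared any better:
pacman-key --recv-keys <FINGERPRINT>
pacman-key --lsign-key <FINGERPRINT>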
Disable Signature Verification
That’s a dead-end: There’s no clue what’s wrong, so there’s no clue how to move past it.
Bypass the failing step entirely with:
vi /etc/pacman.conf
/SigLevel      (jump to the SigLevel line)
f=             (move the cursor to the “=”)
d$             (delete from there through end of line)
a= Never       (append “= Never”; the leading “a” enters insert mode)
^[             (Escape back to normal mode)
:wq            (write and quit)
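The net effect is this line in /etc/pacman.conf (assuming the stock layout, where SigLevel lives in the [options] section):
SigLevel = Never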
System Update Fails: File Conflicts
Now we hit conflicting files:
error: failed to commit transaction (conflicting files)
ca-certificates-utils: /etc/ssl/certs/ca-certificates.crt exists in filesystem
lzo: /usr/include/lzo/lzo1.h exists in filesystem
lzo: /usr/include/lzo/lzo1a.h exists in filesystem
lzo: /usr/include/lzo/lzo1b.h exists in filesystem
lzo: /usr/include/lzo/lzo1c.h exists in filesystem
lzo: /usr/include/lzo/lzo1f.h exists in filesystem
lzo: /usr/include/lzo/lzo1x.h exists in filesystem
lzo: /usr/include/lzo/lzo1y.h exists in filesystem
lzo: /usr/include/lzo/lzo1z.h exists in filesystem
lzo: /usr/include/lzo/lzo2a.h exists in filesystem
lzo: /usr/include/lzo/lzo_asm.h exists in filesystem
lzo: /usr/include/lzo/lzoconf.h exists in filesystem
lzo: /usr/include/lzo/lzodefs.h exists in filesystem
lzo: /usr/include/lzo/lzoutil.h exists in filesystem
lzo: /usr/include/lzo/minilzo.h exists in filesystem
lzo: /usr/lib/liblzo2.so exists in filesystem
lzo: /usr/lib/liblzo2.so.2 exists in filesystem
lzo: /usr/lib/liblzo2.so.2.0.0 exists in filesystem
lzo: /usr/lib/libminilzo.so exists in filesystem
lzo: /usr/lib/libminilzo.so.0 exists in filesystem
lzo: /usr/share/doc/lzo/AUTHORS exists in filesystem
lzo: /usr/share/doc/lzo/COPYING exists in filesystem
lzo: /usr/share/doc/lzo/LZO.FAQ exists in filesystem
lzo: /usr/share/doc/lzo/LZO.TXT exists in filesystem
lzo: /usr/share/doc/lzo/LZOAPI.TXT exists in filesystem
lzo: /usr/share/doc/lzo/NEWS exists in filesystem
lzo: /usr/share/doc/lzo/THANKS exists in filesystem
Errors occurred, no packages were upgraded.
Let’s see what owns those:
[root@gateway-arch jeremy]# pacman -Qo /usr/include/lzo/lzo1.h
/usr/include/lzo/lzo1.h is owned by lzo2 2.06-3
[root@gateway-arch jeremy]# pacman -Qo /etc/ssl/certs/ca-certificates.crt
error: No package owns /etc/ssl/certs/ca-certificates.crt
But when I try pacman -U lzo2, it acts like no such thing exists.
Sigh.
OK, the cert one is a known issue. Gotta love rolling releases. The fix is to delete the conflicting file before upgrading.
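Concretely, that amounts to something like this before rerunning the update (the upgraded package should then regenerate the bundle itself, as I understand it):
rm /etc/ssl/certs/ca-certificates.crt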
For lzo, I ultimately did pacman -S --force core/lzo. Forcing runs a risk of clobbering the wrong thing and hosing all the things, but it seemed a calculated risk, since it’s basically “clobber lzo as installed under the older package name with lzo as installed under the new package name”. The risk paid off, so.
OK, Now Really, Update
Then I could pacman -Su. Finally. And all the things updated.
Rebuilding the man-db database took a century. I worried something had gone wrong, but it hadn’t.
Install Certbot
[root@gateway-arch jeremy]# pacman -S certbot-nginx
error: failed to initialize alpm library
(database is incorrect version: /var/lib/pacman/)
try running pacman-db-upgrade
[root@gateway-arch jeremy]# pacman-db-upgrade
==> Pre-4.2 database format detected - upgrading...
[root@gateway-arch jeremy]# pacman -S certbot-nginx
OK, now we can follow the docs on LetsEncrypt.
Run Certbot
Unicode Issues: Set LC_ALL=en_US.utf8
Except that Unicode is fun fun fun:
[root@gateway-arch jeremy]# sudo certbot --nginx
Saving debug log to /var/log/letsencrypt/letsencrypt.log
An unexpected error occurred:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10453: ordinal not in range(128)
Please see the logfiles in /var/log/letsencrypt for more details.
Matching issue: https://github.com/certbot/certbot/issues/5236
Looks like there are smart quotes in the default template for nginx.conf, perhaps? Not in my config that I can see.
And that byte is at a gibberish offset.
And I confirmed that Python is happy to read in my nginx.conf as ascii.
OK, whatever. Mucking around with PYTHONIOENCODING=utf8 didn’t help.
So instead, play with locale. locale reports we’re in the C locale. locale -a gives us a UTF-8 option. Set that.
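In concrete terms, that was just a matter of exporting the locale that locale -a reported (the exact spelling may differ on your box):
export LC_ALL=en_US.utf8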
Now it runs.
HTTPS Certificate Trust Anchor: Missing
Can’t get an HTTPS cert because of HTTPS certs. Delicious.
[root@gateway-arch jeremy]# certbot --nginx
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator nginx, Installer nginx
Enter email address (used for urgent renewal and security notices) (Enter 'c' to
cancel): ********
An unexpected error occurred:
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /etc/ssl/certs/ca-certificates.crt
Please see the logfiles in /var/log/letsencrypt for more details.
That path definitely exists. It is a symlink, though, to be fair.
It’s spewed out when requests is trying to check the cert on the connection to the ACME backend, per that debug logfile:
2018-05-01 03:46:44,110:DEBUG:certbot.plugins.selection:Selected authenticator <certbot_nginx.configurator.NginxConfigurator ob
ject at 0x7f11271de5c0> and installer <certbot_nginx.configurator.NginxConfigurator object at 0x7f11271de5c0>
2018-05-01 03:46:44,110:INFO:certbot.plugins.selection:Plugins selected: Authenticator nginx, Installer nginx
2018-05-01 03:48:40,618:DEBUG:acme.client:Sending GET request to https://acme-v01.api.letsencrypt.org/directory.
Oh, huh. The file pointed to by the symlink? That one doesn’t exist.
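A quick way to check, for what it’s worth; not the exact commands I ran, but the idea:
# where does the symlink point?
readlink /etc/ssl/certs/ca-certificates.crt
# readlink -e only succeeds if the target actually exists
readlink -e /etc/ssl/certs/ca-certificates.crt || echo "target missing"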
And pacman -Ql ca-certificates, which should list all the files installed by that package, shows zip. Nada. Nothing: ca-certificates is an empty package.
Install More Certificates
Luckily, it’s not the only ca-certificates package available:
pacman -S ca-certificates-cacert ca-certificates-mozilla
[root@gateway-arch jeremy]# pacman -Ss ca-cert
core/ca-certificates 20170307-1 [installed]
Common CA certificates (default providers)
core/ca-certificates-cacert 20140824-4 [installed]
CAcert.org root certificates
core/ca-certificates-mozilla 3.36.1-1 [installed]
Mozilla's set of trusted CA certificates
core/ca-certificates-utils 20170307-1 [installed]
Common CA certificates (utilities)
Now let’s try again.
Success!
Modulo the fact this is an old nginx.conf that wasn’t updated for http2. I’ll just fish those updates out of my backup later. (I don’t recall it being difficult at all to update from spdy to http2, but I’m out of steam at this point.)
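(For reference, the nginx side of the spdy-to-http2 move is basically a one-word change in the listen directive, roughly like this; it’s what triggers the "invalid parameter spdy" warnings in the output below:)
# old
listen 443 ssl spdy;
# new
listen 443 ssl http2;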
[root@gateway-arch jeremy]# certbot --nginx
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator nginx, Installer nginx
Enter email address (used for urgent renewal and security notices) (Enter 'c' to
cancel): *******
/usr/lib/python3.6/site-packages/josepy/jwa.py:107: CryptographyDeprecationWarning: signer and verifier have been deprecated. Please use sign and verify instead.
signer = key.signer(self.padding, self.hash)
-------------------------------------------------------------------------------
Please read the Terms of Service at
https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf. You must
agree in order to register with the ACME server at
https://acme-v01.api.letsencrypt.org/directory
-------------------------------------------------------------------------------
(A)gree/(C)ancel: a
-------------------------------------------------------------------------------
Would you be willing to share your email address with the Electronic Frontier
Foundation, a founding partner of the Let's Encrypt project and the non-profit
organization that develops Certbot? We'd like to send you email about EFF and
our work to encrypt the web, protect its users and defend digital rights.
-------------------------------------------------------------------------------
(Y)es/(N)o: n
Which names would you like to activate HTTPS for?
-------------------------------------------------------------------------------
1: jeremywsherman.com
2: www.jeremywsherman.com
-------------------------------------------------------------------------------
Select the appropriate numbers separated by commas and/or spaces, or leave input
blank to select all options shown (Enter 'c' to cancel):
Obtaining a new certificate
Performing the following challenges:
http-01 challenge for jeremywsherman.com
http-01 challenge for www.jeremywsherman.com
2018/05/01 03:59:59 [warn] 18919#18919: invalid parameter "spdy": ngx_http_spdy_module was superseded by ngx_http_v2_module in /etc/nginx/nginx.conf:65
2018/05/01 03:59:59 [warn] 18919#18919: could not build optimal types_hash, you should increase either types_hash_max_size: 1024 or types_hash_bucket_size: 64; ignoring types_hash_bucket_size
2018/05/01 03:59:59 [notice] 18919#18919: signal process started
Waiting for verification...
/usr/lib/python3.6/site-packages/josepy/jwa.py:107: CryptographyDeprecationWarning: signer and verifier have been deprecated. Please use sign and verify instead.
signer = key.signer(self.padding, self.hash)
Cleaning up challenges
2018/05/01 04:00:06 [warn] 18922#18922: invalid parameter "spdy": ngx_http_spdy_module was superseded by ngx_http_v2_module in /etc/nginx/nginx.conf:55
2018/05/01 04:00:06 [warn] 18922#18922: could not build optimal types_hash, you should increase either types_hash_max_size: 1024 or types_hash_bucket_size: 64; ignoring types_hash_bucket_size
2018/05/01 04:00:06 [notice] 18922#18922: signal process started
/usr/lib/python3.6/site-packages/josepy/jwa.py:107: CryptographyDeprecationWarning: signer and verifier have been deprecated. Please use sign and verify instead.
signer = key.signer(self.padding, self.hash)
Deploying Certificate to VirtualHost /etc/nginx/nginx.conf
Deploying Certificate to VirtualHost /etc/nginx/nginx.conf
2018/05/01 04:00:11 [warn] 18924#18924: invalid parameter "spdy": ngx_http_spdy_module was superseded by ngx_http_v2_module in /etc/nginx/nginx.conf:55
2018/05/01 04:00:11 [warn] 18924#18924: could not build optimal types_hash, you should increase either types_hash_max_size: 1024 or types_hash_bucket_size: 64; ignoring types_hash_bucket_size
2018/05/01 04:00:11 [notice] 18924#18924: signal process started
Please choose whether or not to redirect HTTP traffic to HTTPS, removing HTTP access.
-------------------------------------------------------------------------------
1: No redirect - Make no further changes to the webserver configuration.
2: Redirect - Make all requests redirect to secure HTTPS access. Choose this for
new sites, or if you're confident your site works on HTTPS. You can undo this
change by editing your web server's configuration.
-------------------------------------------------------------------------------
Select the appropriate number [1-2] then [enter] (press 'c' to cancel): 1
-------------------------------------------------------------------------------
Congratulations! You have successfully enabled https://jeremywsherman.com and
https://www.jeremywsherman.com
You should test your configuration at:
https://www.ssllabs.com/ssltest/analyze.html?d=jeremywsherman.com
https://www.ssllabs.com/ssltest/analyze.html?d=www.jeremywsherman.com
-------------------------------------------------------------------------------
IMPORTANT NOTES:
- Congratulations! Your certificate and chain have been saved at:
/etc/letsencrypt/live/jeremywsherman.com/fullchain.pem
Your key file has been saved at:
/etc/letsencrypt/live/jeremywsherman.com/privkey.pem
Your cert will expire on 2018-07-30. To obtain a new or tweaked
version of this certificate in the future, simply run certbot again
with the "certonly" option. To non-interactively renew *all* of
your certificates, run "certbot renew"
- Your account credentials have been saved in your Certbot
configuration directory at /etc/letsencrypt. You should make a
secure backup of this folder now. This configuration directory will
also contain certificates and private keys obtained by Certbot so
making regular backups of this folder is ideal.
- If you like Certbot, please consider supporting our work by:
Donating to ISRG / Let's Encrypt: https://letsencrypt.org/donate
Donating to EFF: https://eff.org/donate-le
I declined the redirect because I already have that in place.
The new lines added to the server stanza are:
ssl_certificate /etc/letsencrypt/live/jeremywsherman.com/fullchain.pem; # managed by Certbot
ssl_certificate_key /etc/letsencrypt/live/jeremywsherman.com/privkey.pem; # managed by Certbot
Restart Nginx
systemctl restart nginx
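(If you’re feeling more cautious than I was at this point, a config syntax check first doesn’t hurt:)
nginx -t && systemctl restart nginx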
And now my blog is up and running again.
Total time cost: 2 hours-ish. Guess that coulda been worse.
Addendum: Keeping Cron Running
I kept seeing cron failing out on me, and my LetsEncrypt certs not auto-renewing. (On the bright side, that’s how I found out everything was hosed this time.)
Let’s see if we can fix that.
Which Cron?
What cron are we running?
[root@gateway-arch jeremy]# pacman -Qs cron
local/cronie 1.5.1-1
Daemon that runs specified programs at scheduled times and related tools
Cronie.
Who’s That to Systemd?
And how’s it wired into systemd?
[root@gateway-arch jeremy]# pacman -Ql cronie | grep systemd
cronie /usr/lib/systemd/
cronie /usr/lib/systemd/system/
cronie /usr/lib/systemd/system/cronie.service
As unit cronie.service.
Status Shows Errors
Let’s flip it on and check its status.
[root@gateway-arch jeremy]# systemctl enable cronie
[root@gateway-arch jeremy]# systemctl status cronie
* cronie.service - Periodic Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/cronie.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-05-01 02:57:13 UTC; 11h ago
Main PID: 169 (crond)
CGroup: /system.slice/cronie.service
`-169 /usr/bin/crond -n
May 01 13:01:01 gateway-arch crond[22224]: PAM unable to dlopen(/usr/lib/security/pam_unix.so): /usr/lib/libpam.so.0: version >
May 01 13:01:01 gateway-arch crond[22224]: PAM adding faulty module: /usr/lib/security/pam_unix.so
May 01 13:05:01 gateway-arch crond[22250]: PAM unable to dlopen(/usr/lib/security/pam_unix.so): /usr/lib/libpam.so.0: version >
May 01 13:05:01 gateway-arch crond[22250]: PAM adding faulty module: /usr/lib/security/pam_unix.so
May 01 14:01:01 gateway-arch crond[22590]: PAM unable to dlopen(/usr/lib/security/pam_unix.so): /usr/lib/libpam.so.0: version >
May 01 14:01:01 gateway-arch crond[22590]: PAM adding faulty module: /usr/lib/security/pam_unix.so
May 01 14:05:01 gateway-arch crond[22616]: PAM unable to dlopen(/usr/lib/security/pam_unix.so): /usr/lib/libpam.so.0: version >
May 01 14:05:01 gateway-arch crond[22616]: PAM adding faulty module: /usr/lib/security/pam_unix.so
May 01 14:51:01 gateway-arch crond[169]: (root) CAN'T OPEN (/etc/crontab): No such file or directory
May 01 14:51:01 gateway-arch crond[169]: (root) RELOAD (/var/spool/cron/root)
Point one, vendor preset is disabled. So that explains why I kept seeing crond fizzle out on me. I wanted to use cron but then never enabled it with systemd. (And I’m sure it’s disabled because I ought to be writing a systemd unit file. But I’m not going to just now.)
Point two, there are a couple errors.
The /etc/crontab error is weird: looking at the full -Ql output shows stuff in /etc/anacrontab, not /etc/crontab. I’m betting this is an older cron process that started before I updated all my packages from their 2014-vintage flavors.
The PAM error looks grungy and noisy.
A quick search through man systemctl didn’t show me how to change the character width so I could see the rest of the PAM module error (in hindsight, systemctl status --full or journalctl -u cronie would probably have done it), but searching the Web found a good hit for the dlopen failure of pam_unix.so.
Consensus is: you had a glibc update, and a long-running process that’s still holding the old libraries is now trying to dlopen a module built against the newer ones.
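If you want to spot other long-running daemons in the same boat, one rough check (my own aside, not something from that thread) is to look for processes still mapping files the upgrade deleted out from under them:
# print the PID and command of any process still mapping a deleted file (e.g. the old glibc)
for pid in /proc/[0-9]*; do
    grep -q '(deleted)' "$pid/maps" 2>/dev/null && echo "${pid##*/} $(cat "$pid/comm")"
done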
Fix: systemctl restart cronie
Restarting the Service Fixes All Errors
And indeed, that does it:
[root@gateway-arch jeremy]# systemctl status cronie
* cronie.service - Periodic Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/cronie.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-05-01 14:54:35 UTC; 17s ago
Main PID: 22985 (crond)
Memory: 612.0K
CGroup: /system.slice/cronie.service
`-22985 /usr/bin/crond -n
May 01 14:54:35 gateway-arch systemd[1]: Stopped Periodic Command Scheduler.
May 01 14:54:35 gateway-arch systemd[1]: Started Periodic Command Scheduler.
May 01 14:54:35 gateway-arch crond[22985]: (CRON) INFO (RANDOM_DELAY will be scaled with factor 96% if used.)
May 01 14:54:35 gateway-arch crond[22985]: (CRON) INFO (running with inotify support)
May 01 14:54:35 gateway-arch crond[22985]: (CRON) INFO (@reboot jobs will be run at computer's startup.)
No more errors. Good to go!
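With cron healthy again, the renewal job can actually fire. For reference, the sort of entry that drives it is just a periodic certbot renew; this is illustrative, not a copy of my actual crontab:
# run daily; certbot renew only touches certs that are close to expiring
0 3 * * * certbot renew --quiet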
Total time cost: 20 minutes.