[08:50:17] gerrit down?
[08:51:55] yes (at least for me), I mentioned it in -ops a couple of minutes ago
[08:52:59] yeah, the alert just arrived on -operations
[08:53:11] and now the recovery
[09:00:48] my assorted ssh connections have all gone very slow
[09:01:38] (to both codfw and eqiad hosts)
[09:08:56] Emperor: still ongoing? I am checking client-side metrics trends and I see nothing weird
[09:19:14] seems to have resolved now
[09:27:46] there is an etcd config complaint on codfw
[09:28:06] ah, gone now
[09:28:07] Emperor: Same errors as yesterday, some internal error on upload-pack from Repository[/srv/gerrit/git/operations/puppet.git]
[10:31:25] Can I get a review for https://gerrit.wikimedia.org/r/c/operations/dns/+/883551 and a double check on https://gerrit.wikimedia.org/r/c/operations/puppet/+/883552/ to see if I'm not going to break stuff renaming that service?
[10:34:07] I'll take a look
[10:34:41] thanks <3
[10:35:41] looks good! godspeed
[10:38:28] Thanks. About things left behind, I imagine it's like when removing a service (since it functionally is removing and adding this one). We'll see
[10:39:10] agreed, yeah I think most things will do the right thing
[10:39:21] hopefully everything
[10:39:23] If not I'll make notes of it
[10:40:06] XioNoX, slyng, about to merge the above changes, just fyi
[10:41:18] rgr
[10:48:45] I have netbox changes for an-worker1148, an-worker1084 and an-worker1080 going from failed to active
[10:48:54] Are they right?
[10:49:49] btullis? ^
[10:54:55] The status is right in netbox, servers seem up, icinga looks good, proceeding.
[10:59:42] claime: Yes, they were set to failed because the RAID controller batteries needed replacing. It was probably a bit of overkill, but it said to do so on this form: https://phabricator.wikimedia.org/maniphest/task/edit/form/55/
[11:00:25] btullis: Yeah, no problem, just wanted to make sure ;)
[11:59:32] Emperor: swift@eqiad seems to be misbehaving: https://grafana.wikimedia.org/goto/OYU5ywoVz?orgId=1
[12:04:26] argh
[12:09:49] hum, that coincides with thumbor k8s being pooled
[12:10:00] let me try depooling it and seeing if it subsides
[12:10:33] * Emperor is doing a rolling-restart of eqiad frontends
[12:10:47] hnowlan: sorry
[12:11:19] np, I'll repool after the restart and see what happens
[12:13:10] oh FCOL
[12:13:15] ?
[12:13:26] restart of ms-fe1009 has failed because of some prometheus failure
[12:13:43] Jan 26 12:13:07 ms-fe1009 confd-prometheus-metrics[180948]: log.warning(f"{template_dest} not found")
[12:14:00] weird
[12:14:24] File "/usr/local/bin/confd-prometheus-metrics", line 55 is producing a SyntaxError exception
[12:15:35] OK, yes, we've broken confd-prometheus-metrics for stretch nodes
[12:15:41] :-(
[12:15:47] I think f-strings in Python came in in Python 3.6
[12:15:51] stretch has Python 3.5
[12:16:13] and I can't upgrade these stretch nodes because I'm too busy fighting fires to actually get to the point of being sure we can do away with swiftrepl
[12:16:30] in any case, log.warning should preferably use %s syntax
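
For context on the SyntaxError above, here is a minimal sketch of the pattern jynus is suggesting, assuming a module-level logger like the one in the traceback; the template path is a placeholder for illustration, not the real confd file:

import logging
import os.path

log = logging.getLogger("confd_prometheus_metrics")

# Placeholder path; the real script walks over confd template destinations.
template_dest = "/var/run/confd-templates/example.toml"

if not os.path.exists(template_dest):
    # f-strings only exist from Python 3.6 onwards, so this line makes the whole
    # file a SyntaxError on stretch's Python 3.5:
    #     log.warning(f"{template_dest} not found")
    # %-style lazy formatting works on 3.5 and defers interpolation to the logger:
    log.warning("%s not found", template_dest)
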
[12:16:52] and now I can't rolling-restart the eqiad frontends to fix the problems with swift being on fire again, because the stretch node is broken, systemd is unhappy, and the restart is going to fail
[12:17:12] can I help somehow, e.g. manual restart?
[12:19:46] I don't understand, modules/confd/files/confd_prometheus_metrics.py has seemingly not changed since October 2022
[12:19:51] How did it not break earlier? The change is old
[12:19:54] Oh
[12:20:03] claime: I've no idea
[12:20:27] -r-xr-xr-x 1 root root 4324 Jan 26 12:07 /usr/local/bin/confd-prometheus-metrics
[12:20:33] it has, however, changed on disk very recently
[12:20:42] jbond: could it be related to using etcd ferm rules?
[12:20:44] does it come from puppet or a package?
[12:20:51] jynus: puppet
[12:20:53] puppet
[12:22:00] Yeah, definitely linked to jbond's change
[12:22:28] sorry, reading back the log
[12:22:33] It got dropped on your machine through a dependency of the ferm etcd config
[12:23:05] [the other eqiad frontends are bullseye, so I'm continuing with the rolling-restart]
[12:23:43] re-running puppet does not remove the file
[12:24:00] Emperor: where are you seeing that issue? I have reverted my change so everything should have been cleaned up
[12:24:23] jbond: ms-fe1009
[12:24:29] (re-running puppet there now)
[12:24:32] I just did
[12:25:02] It removed the check_confd_templates and nagios stuff
[12:25:09] But not confd-prometheus-metrics
[12:25:45] But I think if we remove it and re-run puppet it should not come back. Then we should probably fix that script so that it doesn't use f-strings
[12:26:02] +1
[12:26:39] probably want to actually remove the service file /lib/systemd/system/confd_prometheus_metrics.service ?
[12:27:14] * Emperor will do so, systemctl daemon-reload, rerun puppet
[12:27:17] meanwhile, I am generating a 3.5-compatible version of the file
[12:27:19] Probably yeah, I can try to chase the dependency chain
[12:27:26] jynus: Oh, I was about to, thanks.
[12:29:03] OK, ms-fe1009 now has happy systemd again.
[12:29:27] ok, I have fixed the other stretch hosts
[12:29:34] rolling-restart of eqiad frontends done, error rate down again.
[12:29:49] jbond: thanks, sorry, I keep hoping I can get rid of them, then something else crawls out of the woodwork :(
[12:30:04] no problem
[12:30:38] * Emperor lunch
[12:39:34] jbond: if you know more about what the script does and can test it, please have a look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883926 - that was a completely blind change
[12:39:50] jynus: thanks, looking
[12:47:45] should I reset-fail thumbor1002 and thumbor2003?
[13:01:20] jynus: I'll have a look
[13:01:59] my guess is same issue as above: remove the file/timer and run puppet / reset stuff
[13:02:19] going for lunch
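
The cleanup Emperor and jynus walk through above (and that jynus suggests repeating for the thumbor hosts) amounts to a few manual steps. A rough sketch of that sequence, assuming root on the affected host and WMF's run-puppet-agent wrapper (plain puppet agent runs elsewhere); the ordering is my own:

import subprocess

# Stale artefacts named in the conversation above.
STALE_FILES = [
    "/usr/local/bin/confd-prometheus-metrics",
    "/lib/systemd/system/confd_prometheus_metrics.service",
]

def run(cmd):
    """Echo and run a command, failing loudly if it exits non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for path in STALE_FILES:
    run(["rm", "-f", path])
run(["systemctl", "daemon-reload"])   # make systemd forget the removed unit
run(["systemctl", "reset-failed"])    # clear any failed state left behind
run(["run-puppet-agent"])             # let puppet reconverge; it should not recreate the files
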
[14:15:13] Emperor: is it okay if I repool thumbor? Will keep an eye on it to see if it causes swift issues again
[14:18:06] hnowlan: go ahead
[14:26:03] Emperor: looks like it's spiking back up :( depooling again
[14:26:53] I am absolutely at a loss as to what could be causing it
[14:28:06] thumbor itself doesn't appear to be logging any 503 responses
[14:29:53] hnowlan: Hm, now we wait and see if swift recovers or if I need to roll-restart it again :)
[14:31:16] hnowlan: do you have any theories as to why it might be only eqiad that's suffering?
[14:31:48] I'm only pooling thumbor in eqiad at the moment
[14:32:00] thumbor on k8s, that is
[14:32:34] ah, OK, I think that supports the theory that thumbor is the cause
[14:33:03] sadly, looks like the swift unhappiness persists
[14:33:35] server.log is full of "Client disconnected on read of [file]"
[14:33:46] Is it just request volume? In theory traffic from the on-metal thumbor instances should go down as traffic ramps up from the k8s instances
[14:35:31] not _obviously_, but I am starting to trust the swift dashboards less :(
[14:35:59] OK, it's not recovering quickly enough, I'm going to restart
[14:38:02] sorry for the trouble
[14:38:56] It makes me sad that i) we don't know why pooling thumbor-on-k8s breaks the frontends and ii) they don't recover when thumbor-on-k8s is depooled
[14:39:08] FTAOD, I don't think either of those things is your fault :)
[14:51:18] how does balancing happen at that layer?
[14:51:39] (I came to report the spike again but I've seen that you are already on it)
[14:52:20] hnowlan: BTW, in terms of testing, using codfw should impact fewer users than eqiad
[14:52:43] eqiad impacts eqiad+esams+drmrs and that's usually higher traffic than codfw+ulsfo+eqsin
[14:53:02] * Emperor starting to get concerned that the spike hasn't gone away after a restart as rapidly as this morning's spike did :(
[14:53:47] vgutierrez: true, but in this case I want to get prod traffic
[14:54:00] hnowlan: it's still prod traffic
[14:54:17] just a "smaller" portion of the cake
[14:55:37] (I suspect better^W some visibility of what memcached is doing might be valuable)
[14:55:46] Emperor: ugh ffs, my confctl command didn't complete - thumbor was still serving.
[14:55:49] Sorry about that
[14:56:06] so probably a false alarm on the restarts not working
[14:56:07] 503s seem to be gone now
[14:56:34] hnowlan: oops :)
[14:56:43] grafana should reflect that soon
[14:57:00] (checking in RT with vgutierrez@cp6008:~$ sudo -i atslog-backend OriginStatus:503)
[14:57:46] yeah, grafana moving in a good direction now
[14:57:48] one potential culprit is that the swiftclient library has jumped a major version between thumbor on metal and thumbor-k8s
[14:58:02] but I would still expect to see errors in thumbor logs if that were the case
[14:58:08] [which is nicer, 'cos it means that swift does recover OK by itself when the thumbor change is undone]
[15:05:24] logged https://phabricator.wikimedia.org/T328033 for investigation
[15:07:57] are there any logs for swift that might indicate what kind of problem thumbor was causing?
[15:10:07] *hollow laugh* each server has proxy-access.log (access logs) and server.log (errors); also /var/log/nginx/unified.error.log and error.log
[15:10:18] are those 500s impacting thumbor requests only, or are they causing 500s for all requests, I wonder
[15:10:19] we get nothing at all from/about memcached
[15:24:36] [and I've found trying to tie nginx & swift logs together very frustrating and almost never useful]
[15:27:05] /var/log/swift/server.log seems to say that the errors around that time are more or less exclusively timeouts - are those masking/being caused by the 500s or are they the source of them?
[15:27:18] To pick an example - I found an entry from nginx unified.error.log during the spike in errors - 2023/01/26 14:43:49 [error] 3950277#3950277: *957 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.20.0.61, server: ms-fe.svc.eqiad.wmnet, request: "GET /wikipedia/commons/thumb/4/46/Sigourney_Weaver_1989_cropped.jpg/225px-Sigourney_Weaver_1989_cropped.jpg HTTP/1.1", upstream:
[15:27:18] "http://10.64.130.2:80/wikipedia/commons/thumb/4/46/Sigourney_Weaver_1989_cropped.jpg/225px-Sigourney_Weaver_1989_cropped.jpg", host: "upload.wikimedia.org" but there's no server.log entry for that timestamp, and no proxy-access.log entries from today that match either "Sigourney" or the IP 10.20.0.61.
[15:28:19] (in the swift logs for ms-fe1012, which is 10.64.130.2)
[15:30:34] heh
[15:31:35] it's infuriating, I waste hours of my life chasing shadows in the swift logs :(
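
There is no ready-made correlation tool in play here; purely as an illustration of the cross-referencing Emperor is attempting, a sketch that pulls request paths out of nginx's unified.error.log and counts matching lines in swift's proxy-access.log. The proxy-access.log location and the exact log formats are assumptions on my part:

import re

NGINX_ERROR_LOG = "/var/log/nginx/unified.error.log"
SWIFT_ACCESS_LOG = "/var/log/swift/proxy-access.log"  # assumed location of proxy-access.log

# Matches the request field of an nginx error-log entry, e.g.
#   request: "GET /wikipedia/commons/thumb/... HTTP/1.1"
REQUEST_RE = re.compile(r'request: "(?:GET|HEAD) (?P<path>\S+) HTTP/1\.[01]"')

def upstream_timeout_paths(error_log=NGINX_ERROR_LOG):
    """Yield request paths from error-log lines mentioning an upstream timeout."""
    with open(error_log, errors="replace") as fh:
        for line in fh:
            if "upstream timed out" not in line:
                continue
            match = REQUEST_RE.search(line)
            if match:
                yield match.group("path")

def count_access_hits(path, access_log=SWIFT_ACCESS_LOG):
    """Count proxy-access.log lines mentioning the same request path."""
    with open(access_log, errors="replace") as fh:
        return sum(1 for line in fh if path in line)

if __name__ == "__main__":
    for path in upstream_timeout_paths():
        print(f"{path} -> {count_access_hits(path)} proxy-access.log hits")
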
[15:31:41] hnowlan: timeouts on thumbor-on-k8s are the same as in the legacy deployment?
[15:33:18] vgutierrez: nope :( the legacy deployment is still running and there are a *lot* more workers running on the legacy instances
[15:33:45] hnowlan: I meant timeout values in the configuration of the HTTP stack
[15:36:41] vgutierrez: ah, good question - nothing has changed intentionally, but upgraded dependencies might have changed stuff unintentionally
[15:40:17] we've gone from stretch to buster, Python 2 to Python 3 and a bunch of related things all at once. Not much of a choice in the matter though
[15:41:40] hnowlan: sure, just mentioning the possibility of having unaligned timeouts given the log line that Emperor pasted
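
On vgutierrez's unaligned-timeouts point: one way to rule that out is to pin the client-side values explicitly rather than inheriting whichever defaults the new swiftclient major version ships. A hedged sketch, not the actual thumbor code; the auth endpoint, credentials, container name and numeric values are all placeholders:

from swiftclient.client import Connection

conn = Connection(
    authurl="https://ms-fe.svc.eqiad.wmnet/auth/v1.0",  # placeholder auth endpoint
    user="thumbor:placeholder",
    key="not-a-real-key",
    retries=3,    # pinned explicitly instead of relying on the library default
    timeout=10,   # seconds; ideally kept below the nginx/ATS upstream timeouts in front of swift
)

# Fetch one object so the pinned values are actually exercised.
headers, body = conn.get_object("example-container", "example/thumb.jpg")
print(headers.get("content-length"), "bytes")
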
[15:47:24] hi folks. running the reimaging cookbook and running into:
[15:47:26] spicerack.dhcp.DHCPError: target file ttyS1-115200/cp2027.conf failed to be created.
[15:47:29] on looking at the detailed logs,
[15:47:33] 2023-01-26 15:30:42,883 sukhe 673961 [ERROR] 50.0% (1/2) of nodes failed to execute command '/bin/echo 'Cmhvc...5200/cp2027.conf': apt2001.wikimedia.org
[15:47:54] I tried manually removing the file as well (at least on install4003), no effect, and it doesn't exist on apt2001 (which rsyncs from apt1001 anyway?)
[15:48:07] any ideas on how I can resolve this? thanks! host is cp2027.codfw.wmnet
[16:02:01] sukhe: there is a install2003:~$ cat /etc/dhcp/automation/ttyS1-115200/cp2027.conf
[16:02:13] XioNoX: yeah, removing it didn't help
[16:02:23] this got created after it failed again
[16:02:43] sukhe: I vaguely remember having similar problems trying to run the reimage cookbook immediately after a failure. I thought removing it was enough but maybe there were other steps, let me check and see if I have notes
[16:02:45] so I removed it twice and that's why I think I am missing something else here :P
[16:02:49] inflatador: thakns!
[16:02:54] with the right spelling
[16:03:15] it's weird that it mentions apt2001
[16:03:23] it doesn't play any part in there
[16:04:04] yeah... the only reason I saw that was because I went into the logs
[16:04:18] previous failures would simply resolve from deleting the file from the install server hosts
[16:04:56] trying to figure out how the cookbook decides which hosts to configure DHCP on
[16:08:11] sukhe: yeah, there is something weird
[16:08:27] cumin1001:~$ sudo cumin 'A:installserver and A:codfw' 'grep "^- " /etc/wikimedia/contacts.yaml' -i
[16:08:27] 2 hosts will be targeted:
[16:08:27] apt2001.wikimedia.org,install2003.wikimedia.org
[16:08:40] apt2001 shouldn't be in there
[16:08:43] only install2003
[16:09:51] sukhe: https://github.com/wikimedia/puppet/commit/baa51c624e4413ad4616f280da7953da7935a6e9
[16:09:55] XioNoX: yeah, I am not sure what is happening but this is the first time I have gone so deep, so who knows :)
[16:10:29] I think the change above broke it?
[16:11:13] how exactly, I'm not sure
[16:11:54] yeah, I am not sure if that's it because for 2003 it says "NOT APT REPO"
[16:11:58] and that's what the motd says too
[16:12:01] (cc moritzm if he's still around)
[16:12:10] sorry, getting pulled into a mtg, might be a while
[16:12:36] np, not urgent for anyone reading
[16:12:43] just trying to figure out what is happening
[16:12:53] I have to step away soon too
[16:12:57] XioNoX: <3
[16:13:56] sukhe: this line should return only install2003: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L163
[16:14:21] but with the commit above it returns both apt2001 and install2003
[16:14:41] indeed, we might need to split up the Cumin alias
[16:15:25] the alias which includes installserver and apt is still needed for some cases, but we should retain the previous logic for the reimage; making a patch now
[16:15:36] thanks moritzm!
[16:15:37] maybe something like that: https://www.irccloud.com/pastebin/mSvUPeqc/
[16:15:44] ah, cool
[16:15:54] may not be relevant but some NICs need a FW update to PXE boot, ref https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Debian_Installer_Can't_Detect_NICs
[16:16:10] inflatador: yeah thanks, ran into that
[16:16:13] XioNoX: nice catch!
[16:18:16] could I get a quick sanity check for https://gerrit.wikimedia.org/r/883983 ?
[16:19:00] moritzm: all the hosts with a dhcp server will match the P{O:installserver}?
[16:21:00] exactly
[16:21:23] and for the corner cases where we previously used A:installserver, there's now A:installserver-full
[16:21:23] cool, +1
[16:21:45] merging and forcing puppet runs
[16:21:56] thanks!
[16:23:22] A:{self.netbox_data["site"]["slug"]} in https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L163 is the site?
[16:23:29] nvm, it is
[16:23:41] it has to be, and also it's down below
[16:25:09] deployed the alias fix
[16:25:13] sorry for breaking things, I didn't foresee the subtle ways this alias was/is used
[16:25:18] eh, np at all
[16:25:26] thanks for fixing it
[17:02:15] sukhe: I didn't have a chance to test the firmware upgrade cookbook today, am I ok to test on the same machine tomorrow?
[17:03:34] jbond: np! I have downtimed it for a day and it's depooled, so feel free to go ahead tomorrow
[17:03:37] thanks
[17:03:56] great, thanks
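
Coming back to the alias breakage above: the reimage cookbook effectively expects 'A:installserver and A:<site>' to resolve to only the site's install server, which is the expectation the widened alias violated. A hedged sketch of that expectation - hypothetical, not the actual code around reimage.py#L163; the function name and error message are illustrative only:

# Hypothetical illustration of the one-install-server-per-site assumption.
def pick_install_server(remote, site):
    """Return the single install server for a site.

    `remote` is assumed to behave like spicerack's Remote (query() returning a
    host set supporting len()). After the puppet alias change the query below
    matched both install2003 and apt2001 in codfw, breaking this assumption.
    """
    hosts = remote.query(f"A:installserver and A:{site}")
    if len(hosts) != 1:
        raise RuntimeError(
            f"expected exactly one install server in {site}, got {len(hosts)}: {hosts}"
        )
    return hosts
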