[00:01:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:53] PROBLEM - dump of es5 in eqiad on alert1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than 8 days ago: Most recent backup 2022-04-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:07:17] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) a:05RobH→03ssingh This host has had its ram replaced and booted into the OS successfully, detecting all memory without errors. When we were replacing the memory, it forgot its... [00:07:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:07:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [00:08:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: T307525', diff saved to https://phabricator.wikimedia.org/P27364 and previous config saved to /var/cache/conftool/dbconfig/20220504-001205-ladsgroup.json [00:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:10] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, 
user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:13:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:13:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27365 and previous config saved to /var/cache/conftool/dbconfig/20220504-001326-ladsgroup.json [00:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:35] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-04-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:17:57] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than 8 days ago: Most recent backup 2022-04-26 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27366 and previous config saved to /var/cache/conftool/dbconfig/20220504-001944-ladsgroup.json [00:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:49] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [00:19:59] PROBLEM - dump of es5 in codfw on alert1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than 8 days ago: Most recent backup 2022-04-26 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27367 and previous config saved to /var/cache/conftool/dbconfig/20220504-003449-ladsgroup.json [00:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:11] PROBLEM - Confd vcl based reload on cp5002 is CRITICAL: reload-vcl failed to run since 0h, 35 minutes. https://wikitech.wikimedia.org/wiki/Varnish [00:45:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=ats-be [00:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:35] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=varnish-fe [00:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=ats-tls [00:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:59] RECOVERY - Confd vcl based reload on cp5002 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [00:46:44] ^ this was due to cp5002 being marked as inactive due to a faulty DIMM. 
it is now pooled [00:48:30] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10ssingh) 05Open→03Resolved >>! In T305423#7901638, @RobH wrote: > This host has had its ram replaced and booted into the OS successfully, detecting all memory without errors. > > When... [00:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27368 and previous config saved to /var/cache/conftool/dbconfig/20220504-004954-ladsgroup.json [00:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1100.eqiad.wmnet with reason: Maintenance [00:50:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1100.eqiad.wmnet with reason: Maintenance [00:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:38] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10Papaul) @ssingh the host is still has failed as status in netbox https://netbox.wikimedia.org/dcim/devices/1611/ [00:57:51] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-trafficserver-tls-exporter.service,wmf_auto_restart_purged.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:03] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:00:09] RECOVERY - Check systemd state on cp5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:59] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27369 and previous config saved to /var/cache/conftool/dbconfig/20220504-010459-ladsgroup.json [01:05:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [01:05:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [01:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:04] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27370 and previous config saved to /var/cache/conftool/dbconfig/20220504-010507-ladsgroup.json [01:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet [01:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:48] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10ssingh) >>! In T305423#7901674, @Papaul wrote: > @ssingh the host is still has failed as status in netbox > https://netbox.wikimedia.org/dcim/devices/1611/ Thanks for letting me know @PP... 
[01:08:41] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:09:51] here [01:10:10] this is weird, I am here as well [01:10:24] the only thing that's happening is cp5002 is being rebooted, nothing else should have been affected [01:10:39] ok [01:11:10] this error is certainly unexpected [01:11:22] sukhe@puppetmaster1001:~$ confctl select name='^cp[0-9].*' get|grep '"no"'|sort -n [01:11:25] {"cp5002.eqsin.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqsin,cluster=cache_upload,service=ats-be"} [01:11:28] {"cp5002.eqsin.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=eqsin,cluster=cache_upload,service=ats-tls"} [01:11:31] {"cp5002.eqsin.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=eqsin,cluster=cache_upload,service=varnish-fe"} [01:11:34] the host is depooled [01:11:45] I don't think this is related, let me see [01:11:54] hmm it is eqsin [01:13:09] gere [01:13:14] *here [01:13:19] at this point I am more suspicious of a monitoring issue than anything else [01:13:21] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:13:25] ok :P [01:13:34] phew, I was like, one host down that was depooled should not have caused this at all [01:14:10] coool [01:14:19] afk then [01:14:21] we still need to find out what happened so I will discuss with v.alentin then [01:14:32] sorry for the disturbance folks, please carry on with your evening/morning/nights :) [01:14:59] yeah I'm a bit baffled, probably won't be able to figure out what's going on without digging into a bunch of the underlying prometheus metrics (underneath the recording rule that is used in the alert) [01:15:11] anyway I'm also going back to my evening [01:15:16] please do! [01:15:26] * sukhe hopes the rest of the evening is quiet [01:17:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 38.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:17:58] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:18:57] I don't understand why this is happening at all [01:19:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:19:37] oh so now it's eqiad?
which is clearly not related to the reboot then [01:19:45] so something else is up [01:19:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 58.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:21:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 98.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:22:05] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:22:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:22:26] and recovery sigh [01:22:47] hi :) [01:22:52] hello bblack! [01:23:08] the traffic drop recovered, the reduced avail, not yet? [01:23:31] no :( [01:23:39] it's in eqsin, where I was rebooting cp5002 [01:23:43] cache_upload @ eqsin according to the linked graph, I think [01:23:45] I still can't figure out what went wrong though [01:23:54] isn't cp5002 faulty anyways? [01:24:03] yep [01:24:15] it was fixed, and hence I pooled it but I think there are some issues still [01:24:21] https://puppetboard.wikimedia.org/report/cp5002.eqsin.wmnet/8d7bd6801a0778eb7bb9260ead85b42a3396466f [01:24:41] but should this one host have caused this problem? [01:24:47] it is depooled anyway [01:24:49] (now) [01:25:11] maybe [01:25:19] it was "inactive" before, right? 
[01:25:21] yep [01:26:07] I'm not seeing a real, present, problem yet [01:26:14] still digging around a bit [01:26:27] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:26:43] ^ well at least this is expected [01:26:44] yeah clearly that host is not-well :) [01:26:47] yeah [01:26:47] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:27:01] was it puppet-disabled as well, before all this? [01:27:19] PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [01:27:21] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:27:29] I think given that it was inactive, it must have been [01:27:55] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:28:05] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:28:21] maybe, maybe not [01:28:21] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:28:30] just thinking about theories about how the software could've been out of whack [01:28:41] running agent now there just to see [01:28:48] agent fails, I tried [01:28:57] I am looking at mgmt to see if there 
are any persisting memory issues [01:29:03] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:29:11] I guess we should at least downtime the host? [01:29:27] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:29:28] well, eventually, maybe not right at this moment. the spam is informative :) [01:29:39] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp5002 is CRITICAL: connect to address 10.132.0.102 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [01:30:17] but I suspect the availability alert is some kind of false alarm at present (although it might've been real when it was pooled) [01:30:39] yeah the puppet agent doesn't even get a clean run right now [01:30:43] yeah it's pretty bad [01:31:28] I suspect this is all because we were in the midst of some software changes back when it first failed, and/or since [01:31:52] it probably needs a fresh reimage, or at least a reboot + successful agent run, before trying to pool again [01:32:11] so because the time was all wonky, I did a reboot with the cookbook [01:32:19] but that's where it got stuck [01:32:27] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: varnish-frontend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:45] there's a reboot cookbook? 
:) [01:33:03] yeah cookbook sre.hosts.reboot-single [01:33:21] that's what we used for example for the kernel updates recently, so we know it works :) [01:33:48] getsel from mgmt tells me that there are no persisting RAM issues [01:33:48] trying a manual reboot just to see if it clears up all the service dependency mess for puppet or not [01:33:54] ok [01:34:08] the last log from getsel is: [01:34:08] Severity: Ok [01:34:08] Description: The chassis is closed while the power is off. [01:34:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp5002.eqsin.wmnet [01:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:32] heh cookbook was still running? [01:35:49] eh it's OK, it would have never completed given the puppet run was failing :) [01:36:04] did it ever do an actual reboot, or was it failing on some pre-reboot step? [01:36:09] PROBLEM - Host cp5002 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:20] it did do an actual reboot [01:36:52] but the time was all weird, because: [01:36:56] https://phabricator.wikimedia.org/T305423#7901638 [01:37:01] > When we were replacing the memory, it forgot its bios time/date as the CR battery on the mainboard has discharged. This is only an issue if power is entirely lost, requiring the date/time to be set again. I'm not sure that its worth the downtime and such to swap out mainboard batteries when these are due for replacement in Q3 of next fiscal. [01:37:09] RECOVERY - Host cp5002 is UP: PING OK - Packet loss = 0%, RTA = 222.15 ms [01:37:34] yeah, was the bios time at least re-set correctly?
[01:38:28] yep I did check it but there was something else up with the time, with some services reported running since ~a week when the system uptime was just a few minutes [01:38:47] which means the time changed *after* it booted [01:38:51] PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [01:39:03] and our usual Puppet fun with the dependencies failing [01:39:05] (as opposed to having BIOS ~correct *before* the kernel+ntp start up) [01:39:25] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-trafficserver-tls-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:39:33] yeah but it's getting further I think [01:39:47] oh? I assumed that since the time would only be incorrect after a power failure (and which was not the case since it was rebooted), it was something else [01:40:34] I don't think that's true (that it only matters on powerfail) [01:40:44] when the kernel first boots, the time comes from the BIOS [01:40:56] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:41:02] it's not until some time later (kernel up, some services up) that ntp-related boot things try to correct it at the software level [01:41:14] things are much saner if BIOS is at least close enough for ntp to skew/slew it into correction [01:41:23] (instead of needing a big jump) [01:41:43] RECOVERY - Check systemd state on cp5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:45] (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:54] I see [01:42:27] so yeah, the services are still borked [01:43:11] mostly same as the previous case, before I did the cookbook [01:43:12] there's layers of issues here, but I don't think any of it amounts to a real prod problem, other than perhaps during the brief time it was pooled (but even then, possibly pybal healthchecks prevented real traffic trying to reach it?) [01:43:20] varnish-frontend-restart didn't help [01:43:32] yeah it's deeper stuff [01:44:02] but also, I think this host was never converted to haproxy [01:44:18] oooooooooo ha [01:44:20] I think that's at the root of the metrics confusion and alerting, too, and why the graph is showing that one odd entry just for eqsin [01:44:24] it's using the "old" stuff [01:44:31] (which is already borked in general I think) [01:44:32] node 'cp5002.eqsin.wmnet' { role(cache::upload) [01:44:32] } [01:44:33] ha [01:45:07] the root lesson here is: if a host has been out of service a long time, you can't trust it easily :) [01:45:11] I wonder if Occam is telling us to just reimage to haproxy! 
[01:45:15] 21:45:07 < bblack> the root lesson here is: if a host has been out of service a long time, you can't trust it easily :) [01:45:26] would like to add: don't pool it at night <-- for me [01:45:33] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:45:44] now the alert is back, it's still fake [01:46:09] it's because it's on ATS-TLS, etc (well, and the bork services) [01:46:26] (03PS1) 10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T307532) [01:46:42] is the alert actually specific to ATS-TLS? do we use a different alert and metric on the haproxy nodes? [01:46:45] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:46:45] (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:13] anyways [01:47:19] seems to be so, since we have haproxy/* for haproxy [01:47:20] and cache_upload/eqsin [01:47:20] yes, reimage is the answer [01:47:22] and cache_upload/eqsin [01:47:24] global ^ [01:47:27] with the same values [01:47:27] right [01:47:48] basically when it came online, it became the only global node left with the ats-tls setup and a different/metric alert :) [01:47:55] yep that probably explains it [01:48:06] so options: 1) downtime host, continue leaving it depooled. does it take care of this alert? 
unsure [01:48:13] and it's not working right either, probably because the puppetization is broke for this thought-dead config [01:48:14] 2) reimage to haproxy [01:48:18] yeah [01:48:33] if we leave it just downtimed+depooled, we'll keep getting alerts now [01:48:37] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: varnish-frontend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:48:41] better to fire off a reimage to the haproxy-based role [01:48:45] cool [01:48:48] I will do that then [01:48:56] (and then still leave it depooled afterwards just to avoid further issues till the daytime) [01:49:30] in case there's some other things I didn't think of or whatever [01:49:35] yeah [01:49:45] (03PS2) 10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T307532) [01:49:52] I will do the reimage [01:49:57] thanks for stopping by and sorry for the noise! [01:50:08] np! 
[01:50:33] daniel called me, so I came to look :) I saw a quick ack, so I figured someone already knew what it was, until then [01:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27371 and previous config saved to /var/cache/conftool/dbconfig/20220504-015044-ladsgroup.json [01:50:48] ah [01:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:50] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [01:54:56] (03PS1) 10Ssingh: site: Reimage cp5002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/788893 (https://phabricator.wikimedia.org/T290005) [01:55:46] (03CR) 10Ssingh: [C: 03+2] site: Reimage cp5002 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/788893 (https://phabricator.wikimedia.org/T290005) (owner: 10Ssingh) [01:58:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5002.eqsin.wmnet with OS buster [01:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster [02:00:17] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:33] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:01:38] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [02:03:46] !log dpifke@deploy1002 Started deploy [performance/coal@2a20d5d]: (no justification provided) [02:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:52] !log dpifke@deploy1002 Finished deploy [performance/coal@2a20d5d]: (no justification provided) (duration: 00m 06s) [02:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:16] ^ Hit enter too soon. Verifying scap can deploy to new hosts webperf[12]003. [02:04:23] (Should be a no-op) [02:05:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27372 and previous config saved to /var/cache/conftool/dbconfig/20220504-020549-ladsgroup.json [02:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:27] PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-production-elk7-codfw-gc-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27373 and previous config saved to /var/cache/conftool/dbconfig/20220504-022055-ladsgroup.json [02:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:12] (03PS2) 10Stang: ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) 
[02:28:19] (03PS3) 10Stang: id_internalwikimedia: Enable extension UploadWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T304291) [02:28:22] (03PS2) 10Stang: labswiki: Enable extension SubPageList3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) [02:32:53] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5002.eqsin.wmnet with OS buster [02:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut... [02:36:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27374 and previous config saved to /var/cache/conftool/dbconfig/20220504-023600-ladsgroup.json [02:36:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [02:36:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [02:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [02:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27375 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-023608-ladsgroup.json [02:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:41:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5002.eqsin.wmnet with OS buster [02:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster [02:44:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27376 and previous config saved to /var/cache/conftool/dbconfig/20220504-024449-ladsgroup.json [02:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:54] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [02:51:38] PROBLEM - Disk space on ms-be1041 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdj1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1041&var-datasource=eqiad+prometheus/ops [02:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to 
https://phabricator.wikimedia.org/P27377 and previous config saved to /var/cache/conftool/dbconfig/20220504-025954-ladsgroup.json [02:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:03:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:08:54] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_navtiming.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:10:00] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:11:01] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5002.eqsin.wmnet with OS buster [03:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut... [03:11:23] the cookbook is failing and I will debug tomorrow. 
for now, the host is inactive so we shouldn't get any alerts [03:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27378 and previous config saved to /var/cache/conftool/dbconfig/20220504-031459-ladsgroup.json [03:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27379 and previous config saved to /var/cache/conftool/dbconfig/20220504-033004-ladsgroup.json [03:30:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [03:30:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [03:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:09] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [03:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27380 and previous config saved to /var/cache/conftool/dbconfig/20220504-033012-ladsgroup.json [03:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27381 and previous config saved to /var/cache/conftool/dbconfig/20220504-033737-ladsgroup.json [03:37:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:42] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [03:41:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:46:02] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27382 and previous config saved to /var/cache/conftool/dbconfig/20220504-035242-ladsgroup.json [03:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:47] (03PS1) 10KartikMistry: Update cxserver to 2022-05-04-034605-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/788897 (https://phabricator.wikimedia.org/T304828) [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:00:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:00:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:12] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27383 and previous config saved to /var/cache/conftool/dbconfig/20220504-040011-ladsgroup.json [04:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:16] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:07:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27384 and previous config saved to /var/cache/conftool/dbconfig/20220504-040747-ladsgroup.json [04:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:03] !log killed refresh job of rowiki (T299021) [04:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:07] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [04:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:12:24] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:14:10] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27385 and previous config saved to /var/cache/conftool/dbconfig/20220504-042253-ladsgroup.json [04:22:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 
on db1150.eqiad.wmnet with reason: Maintenance [04:22:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:27:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [04:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [04:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:29:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:37] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [04:33:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [04:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P27386 and previous config saved to /var/cache/conftool/dbconfig/20220504-043343-ladsgroup.json [04:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:48] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [04:36:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P27387 and previous config saved to /var/cache/conftool/dbconfig/20220504-043618-ladsgroup.json [04:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27388 and previous config saved to /var/cache/conftool/dbconfig/20220504-045158-ladsgroup.json [04:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:03] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27389 and previous config saved to /var/cache/conftool/dbconfig/20220504-050703-ladsgroup.json [05:07:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:00] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:21:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27390 and previous config saved to /var/cache/conftool/dbconfig/20220504-052140-ladsgroup.json [05:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:44] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:22:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27391 and previous config saved to /var/cache/conftool/dbconfig/20220504-052208-ladsgroup.json [05:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27392 and previous config saved to /var/cache/conftool/dbconfig/20220504-052831-ladsgroup.json [05:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:36] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T307525 [05:30:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T307525)', diff saved to https://phabricator.wikimedia.org/P27393 and previous config saved to /var/cache/conftool/dbconfig/20220504-053713-ladsgroup.json [05:37:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:37:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:19] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [05:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27394 and previous config saved to /var/cache/conftool/dbconfig/20220504-053721-ladsgroup.json [05:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27395 and previous config saved to /var/cache/conftool/dbconfig/20220504-054336-ladsgroup.json [05:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:15] (03PS1) 10Marostegui: parsercache.pp: Monitor event_scheduler on parsercache [puppet] - 10https://gerrit.wikimedia.org/r/789037 
(https://phabricator.wikimedia.org/T254738) [05:47:00] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:17] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35062/" [puppet] - 10https://gerrit.wikimedia.org/r/789037 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:48:29] (03CR) 10Marostegui: [C: 03+2] parsercache.pp: Monitor event_scheduler on parsercache [puppet] - 10https://gerrit.wikimedia.org/r/789037 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:54:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 106 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27396 and previous config saved to /var/cache/conftool/dbconfig/20220504-055841-ladsgroup.json [05:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:54] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10Joe) >>! In T303857#7896843, @dancy wrote: > @Joe Pinging on this t... 
[06:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27397 and previous config saved to /var/cache/conftool/dbconfig/20220504-061346-ladsgroup.json [06:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [06:14:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:14:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27398 and previous config saved to /var/cache/conftool/dbconfig/20220504-061441-ladsgroup.json [06:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:18:14] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:20:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27399 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-062123-ladsgroup.json [06:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:28] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [06:22:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.410 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:25:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27400 and previous config saved to /var/cache/conftool/dbconfig/20220504-062512-ladsgroup.json [06:25:14] (03CR) 10Cwhite: [C: 03+1] team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [06:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:59] (03CR) 10Cwhite: [C: 03+1] thanos: aggregate varnish requests availability [puppet] - 10https://gerrit.wikimedia.org/r/788751 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:28:11] (03CR) 10Cwhite: [C: 03+1] sre: port NEL alert to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:33:20] (03PS1) 10Muehlenhoff: Failover performance discovery records to the new Bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/789079 [06:36:27] (03PS2) 10Muehlenhoff: Failover performance discovery records to the new Bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/789079 [06:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27401 and previous config saved to /var/cache/conftool/dbconfig/20220504-063628-ladsgroup.json 
[06:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:46] (03CR) 10Muehlenhoff: [C: 03+2] Failover performance discovery records to the new Bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/789079 (owner: 10Muehlenhoff) [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27402 and previous config saved to /var/cache/conftool/dbconfig/20220504-064017-ladsgroup.json [06:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:14] (03PS3) 10Muehlenhoff: Switch webperf1001/1003 for eventual removal [puppet] - 10https://gerrit.wikimedia.org/r/785116 (https://phabricator.wikimedia.org/T205460) [06:46:15] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:51:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27403 and previous config saved to /var/cache/conftool/dbconfig/20220504-065133-ladsgroup.json [06:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:41] (03CR) 10Muehlenhoff: [C: 03+2] Adding myself to OPS group [puppet] - 10https://gerrit.wikimedia.org/r/788681 (owner: 10Slyngshede) [06:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27404 and previous config saved to /var/cache/conftool/dbconfig/20220504-065522-ladsgroup.json [06:55:24] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:59] (03PS2) 10Slyngshede: Adding myself to OPS group [puppet] - 10https://gerrit.wikimedia.org/r/788681 [07:00:04] Amir1, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T0700). [07:00:04] koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:28] (03CR) 10Slyngshede: [C: 03+2] Adding myself to OPS group [puppet] - 10https://gerrit.wikimedia.org/r/788681 (owner: 10Slyngshede) [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:40] o/ looking at the patches [07:03:38] koi: hey, around? 
[07:03:55] mutante: ack thanks (re: arclamp compress) I guess there weren't enough retries [07:04:47] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: aggregate varnish requests availability [puppet] - 10https://gerrit.wikimedia.org/r/788751 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:06:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27405 and previous config saved to /var/cache/conftool/dbconfig/20220504-070639-ladsgroup.json [07:06:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:06:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:44] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [07:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P27406 and previous config saved to /var/cache/conftool/dbconfig/20220504-070647-ladsgroup.json [07:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'll let Andrea merge tho" [puppet] - 10https://gerrit.wikimedia.org/r/788817 (owner: 10Andrea Denisse) [07:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P27407 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-070920-ladsgroup.json [07:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27408 and previous config saved to /var/cache/conftool/dbconfig/20220504-071027-ladsgroup.json [07:10:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:10:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27409 and previous config saved to /var/cache/conftool/dbconfig/20220504-071035-ladsgroup.json [07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:59] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_navtiming.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:25] oh hi [07:16:37] tavvi ^ [07:16:47] *taavi sorry [07:17:42] hello! [07:19:10] do I need to amend the commit message of the first patch? 
[07:19:22] I see it is closed as dup [07:20:38] preferrably yes, but more importantly please collect a +1 from someone familiar with the site logos before scheduling that for deployment [07:22:03] ok, I think we could move to the rest patches [07:22:27] looking at them right now [07:23:13] for the mediasearch one, I'm confused as ptwikinews only seems to have a few local media files uploaded (https://pt.wikinews.org/wiki/Especial:Lista_de_ficheiros?uselang=en) [07:23:30] (03CR) 10Muehlenhoff: [C: 03+2] Switch webperf1001/1003 for eventual removal [puppet] - 10https://gerrit.wikimedia.org/r/785116 (https://phabricator.wikimedia.org/T205460) (owner: 10Muehlenhoff) [07:25:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27410 and previous config saved to /var/cache/conftool/dbconfig/20220504-072500-ladsgroup.json [07:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [07:25:09] mediasearch has an option to search a remote wiki such as commons, but it's not clear to me if that was intended or not [07:28:30] (03CR) 10Majavah: [C: 04-1] "id_internalwikimedia is a private wiki, so blocking until someone confirms UploadWizard (and especially UploadStash) are safe to enable th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T304291) (owner: 10Stang) [07:29:21] for mediasearch, you mean $wgMediaSearchExternalEntitySearchBaseUri? Not familiar with those variable.. 
[07:29:37] no, that seems to be related to wikidata integration [07:29:53] https://www.mediawiki.org/wiki/Extension:MediaSearch#Configuration mentions a separate variable [07:32:23] seems there is no local file there, so it make sense to use commons api [07:32:25] (03CR) 10Majavah: "Hello! Since I can't see the Slack thread mentioned in Phabricator, cc'ing you two to confirm this is fine and to confirm someone will upd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) (owner: 10Stang) [07:33:00] (if that variable is commons.wikimedia.org/w/api.php by default [07:34:01] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:01] it's not that by default, and I want explicit confirmation from the requestors if that is what's actually wanted before deploying [07:37:41] left a message to them on ticket [07:37:53] sounds good, thanks [07:38:11] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:27] I think se have done - though nothing deployed in this window :( [07:39:31] *we [07:39:49] sorry :(( [07:40:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27411 and previous config saved to /var/cache/conftool/dbconfig/20220504-074005-ladsgroup.json [07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:42:45] 10SRE, 10LDAP, 10Python3-Porting: Port 
prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10Majavah) >>! In T266147#6676043, @MoritzMuehlenhoff wrote: > https://github.com/tomcz/openldap_exporter seems like a promising alternative. Indeed! I took a stab at packaging that a... [07:46:45] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:47:23] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:47:56] the train is blocked so no deploy this morning [07:48:34] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [07:49:24] hashar: T307518 has a patch to backport, so that only leaves T307513 as a blocker [07:49:25] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. 
- https://phabricator.wikimedia.org/T307513 [07:49:25] T307518: Beta cluster internal error on changing user's global rights (CentralAuthGroupMembershipProxy::getWikiId() undefined) - https://phabricator.wikimedia.org/T307518 [07:49:44] (03PS3) 10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T307532) [07:50:50] (03PS1) 10Muehlenhoff: Remove webperf1001/webperf2001 from Kafka Ferm config [puppet] - 10https://gerrit.wikimedia.org/r/789084 (https://phabricator.wikimedia.org/T305460) [07:51:45] (JobUnavailable) resolved: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:15] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:56] taavi: ahh good [07:53:11] will check it in a few [07:53:39] 10SRE, 10LDAP, 10Python3-Porting: Port prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10MoritzMuehlenhoff) >>! In T266147#7902079, @Majavah wrote: > Indeed! I took a stab at packaging that and pushed the result to https://gitlab.wikimedia.org/taavi/prometheus-openldap-e... 
[07:54:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27412 and previous config saved to /var/cache/conftool/dbconfig/20220504-075510-ladsgroup.json [07:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:57:00] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:58:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:58:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [07:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [07:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] hashar and brennen: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T0800). [08:00:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:00:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:00] (JobUnavailable) resolved: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:02:04] so TitleBlacklist cache issue got backports [08:02:38] then there is the mysterious https://phabricator.wikimedia.org/T307518 [08:02:54] which got merged in master a minute ago \o/ [08:05:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:05:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27413 and previous config saved to /var/cache/conftool/dbconfig/20220504-080509-ladsgroup.json [08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:13] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:05:33] (03PS4) 10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T307532) [08:05:49] RECOVERY - Disk space on backup1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops [08:07:35] (03PS1) 10Hashar: SpecialGlobalGroupMembership: do not call core hooks [extensions/CentralAuth] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788854 (https://phabricator.wikimedia.org/T307518) [08:08:14] (03CR) 10Hashar: [C: 03+2] "Issue caused by Echo change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/787920 which is in wmf.10 and require CentralAuth" [extensions/CentralAuth] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788854 (https://phabricator.wikimedia.org/T307518) (owner: 10Hashar) [08:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:08:35] taavi: I will deploy your CentralAuth / Echo hook fix :) [08:09:30] thank you! 
[08:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27414 and previous config saved to /var/cache/conftool/dbconfig/20220504-081015-ladsgroup.json [08:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:10:21] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:10:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:10:23] (03Merged) 10jenkins-bot: SpecialGlobalGroupMembership: do not call core hooks [extensions/CentralAuth] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788854 (https://phabricator.wikimedia.org/T307518) (owner: 10Hashar) [08:10:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [08:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [08:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27415 and previous config saved to /var/cache/conftool/dbconfig/20220504-081131-ladsgroup.json [08:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:24] (03PS5) 
10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) [08:15:29] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/CentralAuth/includes/Special/SpecialGlobalGroupMembership.php: SpecialGlobalGroupMembership: do not call core hooks - T307518 (duration: 01m 09s) [08:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:33] T307518: Beta cluster internal error on changing user's global rights (CentralAuthGroupMembershipProxy::getWikiId() undefined) - https://phabricator.wikimedia.org/T307518 [08:16:02] (03PS6) 10Stang: Update Wikipedia icons to SVG format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) [08:18:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:43] blocker solved [08:18:47] I am going to promote group 0 [08:20:30] (03PS1) 10Hashar: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789088 [08:20:32] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789088 (owner: 10Hashar) [08:21:12] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789088 (owner: 10Hashar) [08:22:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:22:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:36] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.10 
refs T305216 [08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:40] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:23:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:20] fatal error: [08:24:22] MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. Please ensure that the class definition "TitleBlacklistEntry" of the object you are trying to operate on was loaded _before_ unserialize() gets called or provide an autoloader to load the class definition in TitleBlacklist.php [08:24:23] ... [08:24:43] bah [08:24:51] (03PS1) 10Ayounsi: Interface automation: fail on duplicate cable ID [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789089 [08:25:02] yes, the TitleBlacklist error wasn't fixed yet, we already noticed that yesterday evening [08:26:01] https://github.com/wikimedia/mediawiki-extensions-TitleBlacklist/commit/d92965b663fcd0bd5b987db9801711d95661e53e added an alias for TitleBlacklist, but not TitleBlacklistEntry, intentional? [08:26:23] (03PS1) 10Hashar: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789091 (https://phabricator.wikimedia.org/T307513) [08:26:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27416 and previous config saved to /var/cache/conftool/dbconfig/20220504-082637-ladsgroup.json [08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:00] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/788747 (https://phabricator.wikimedia.org/T301766) (owner: 10AikoChou) [08:27:02] (03CR) 10Ayounsi: "Tested in Netbox next:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789089 (owner: 10Ayounsi) [08:27:25] zabe: looks like TitleBlacklist's VERSION constant isn't used to split the cache entries like most extensions, instead it's compared directly in application layer code [08:27:26] (03CR) 10Hashar: [C: 03+2] "rollback is already in progress" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789091 (https://phabricator.wikimedia.org/T307513) (owner: 10Hashar) [08:27:46] that means that the old entries with TitleBlacklistEntry class will still be deserialized [08:27:47] zabe: taavi: I guess I wasn't paying attention :\ [08:27:56] so the simplest solution would be to add a class_alias for that too [08:28:02] thoughts? [08:28:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:14] it is a backward compatibility issue when unserializing that class from cache, isn't it? [08:28:29] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.10 - T305216 T307513 [08:28:30] taavi, yeah let's do that [08:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:34] hashar, yes [08:28:34] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. 
- https://phabricator.wikimedia.org/T307513 [08:28:34] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:28:45] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789091 (https://phabricator.wikimedia.org/T307513) (owner: 10Hashar) [08:28:49] or can we rollback the change that alters the TitleBlacklistEntry class instead? [08:29:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:29:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:15] (03CR) 10AGueyte: [C: 03+1] "Confirming the QuickSurveys shows up on form submit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [08:30:40] oh that is adding a namespace bah [08:30:41] zabe: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TitleBlacklist/+/789093 [08:31:15] (03PS1) 10Filippo Giunchedi: traffic: port LVS traffic/cpu alerts to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/789094 (https://phabricator.wikimedia.org/T305847) [08:32:07] (03PS1) 10Majavah: wmf.9 HACK: add forward class alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/789095 (https://phabricator.wikimedia.org/T307513) [08:32:11] and that [08:32:19] rollback worked [08:32:26] (03CR) 10Zabe: [C: 03+1] wmf.9 HACK: add forward class alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/789095 (https://phabricator.wikimedia.org/T307513) (owner: 10Majavah) [08:32:44] (03PS1) 10Majavah: Add a class_alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788855 (https://phabricator.wikimedia.org/T307513) [08:33:12] (03CR) 10Majavah: [C: 03+2] wmf.9 HACK: add forward class alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/789095 (https://phabricator.wikimedia.org/T307513) (owner: 10Majavah) [08:33:17] (03CR) 10Majavah: [C: 03+2] Add a class_alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788855 (https://phabricator.wikimedia.org/T307513) (owner: 10Majavah) [08:33:54] hashar: do you want to deploy those or should I? [08:34:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:35:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:57] (03Merged) 10jenkins-bot: wmf.9 HACK: add forward class alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/789095 (https://phabricator.wikimedia.org/T307513) (owner: 10Majavah) [08:36:01] (03Merged) 10jenkins-bot: Add a class_alias for TitleBlacklistEntry too [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788855 (https://phabricator.wikimedia.org/T307513) (owner: 10Majavah) [08:36:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [08:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:07] taavi: please do :) [08:37:18] sure [08:37:34] I am juggling between the spam monitoring and a sick kid at home :D [08:37:54] which I call "dad multiplexing" [08:38:02] :( [08:39:10] (03PS9) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [08:39:52] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/TitleBlacklist: Backport: [[gerrit:788855|Add a class_alias for TitleBlacklistEntry too (T307513)]] (duration: 00m 50s) [08:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:56] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. - https://phabricator.wikimedia.org/T307513 [08:41:21] taavi: I am promoting group0 again [08:41:27] don't yet [08:41:28] (03PS1) 10Hashar: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789097 [08:41:30] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789097 (owner: 10Hashar) [08:41:34] hashar: !!! 
[08:41:35] !log hashar@deploy1002 deploy-promote aborted: (duration: 00m 11s) [08:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27417 and previous config saved to /var/cache/conftool/dbconfig/20220504-084142-ladsgroup.json [08:41:43] still working on the wmf.9 backport [08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:54] (03CR) 10Hashar: [C: 04-2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789097 (owner: 10Hashar) [08:42:11] cancelled [08:42:13] thanks [08:42:28] (03PS1) 10Btullis: Increase the heap size of the HDFS namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/789098 (https://phabricator.wikimedia.org/T307549) [08:43:02] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.9/extensions/TitleBlacklist/includes/TitleBlacklistEntry.php: Backport: [[gerrit:789095|wmf.9 HACK: add forward class alias for TitleBlacklistEntry too (T307513)]] (duration: 00m 50s) [08:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:57] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.9/extensions/TitleBlacklist/extension.json: Backport: [[gerrit:789095|wmf.9 HACK: add forward class alias for TitleBlacklistEntry too (T307513)]] (duration: 00m 49s) [08:44:00] ok, done now [08:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:02] hashar: you can continue [08:44:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35063/console" [puppet] - 10https://gerrit.wikimedia.org/r/789098 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [08:45:08] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1002.eqiad.wmnet 
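The cache problem discussed above (objects serialized under the old class name TitleBlacklistEntry can no longer be rebuilt once the class lives under a new namespaced name, until an alias for the old name is added) can be sketched outside PHP. The following is a minimal Python analogue using pickle, not the actual TitleBlacklist patch: the module names old_ns and new_ns and the pattern field are hypothetical, and the plain attribute assignment stands in for PHP's class_alias().

```python
import pickle
import sys
import types


def register_module(name):
    """Create an empty module and register it so pickle can import it by name."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    return module


class TitleBlacklistEntry:
    """Stand-in for a cached value object; the 'pattern' field is hypothetical."""

    def __init__(self, pattern):
        self.pattern = pattern


# Pin the qualified name so pickle records exactly "TitleBlacklistEntry".
TitleBlacklistEntry.__qualname__ = "TitleBlacklistEntry"

# 1. Old release: the class is reachable as old_ns.TitleBlacklistEntry.
old_ns = register_module("old_ns")
TitleBlacklistEntry.__module__ = "old_ns"
old_ns.TitleBlacklistEntry = TitleBlacklistEntry
cached_blob = pickle.dumps(TitleBlacklistEntry("Foo.*"))  # what sits in the cache

# 2. New release: the class moves to new_ns and the old name disappears.
new_ns = register_module("new_ns")
TitleBlacklistEntry.__module__ = "new_ns"
new_ns.TitleBlacklistEntry = TitleBlacklistEntry
del old_ns.TitleBlacklistEntry

try:
    pickle.loads(cached_blob)
    failed_before_alias = False
except (AttributeError, pickle.UnpicklingError):
    # The cached entry still names old_ns.TitleBlacklistEntry, which is gone.
    failed_before_alias = True

# 3. Fix: expose the new class under the old name too (the class_alias analogue).
old_ns.TitleBlacklistEntry = new_ns.TitleBlacklistEntry
restored = pickle.loads(cached_blob)

print(failed_before_alias, restored.pattern)
```

The wmf.9 "forward class alias" mentioned in the log is the mirror image of step 3: the old release learns the new name, so entries written by the new release stay readable after a rollback.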
[08:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:50:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1002.eqiad.wmnet [08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-coord1002.eqiad.wmnet [08:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:30] (03CR) 10AGueyte: [C: 03+1] "Good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480) (owner: 10STran) [08:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27420 and previous config saved to /var/cache/conftool/dbconfig/20220504-085647-ladsgroup.json [08:56:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:56:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [08:56:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27421 and previous config saved to /var/cache/conftool/dbconfig/20220504-085655-ladsgroup.json [08:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1002.eqiad.wmnet [08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-coord1001.eqiad.wmnet [08:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:31] o/ [08:57:56] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789097 (owner: 10Hashar) [08:58:31] taavi: rolling :) [08:58:44] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789097 (owner: 10Hashar) [08:58:48] sick kid got picked up for a trip to the doctor :] [09:00:51] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.10 - T305216 T307513 [09:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:56] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. 
- https://phabricator.wikimedia.org/T307513 [09:00:56] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [09:02:01] taavi: zabe: the TitleBlacklist issue no longer happens after promoting group 0. Success! [09:02:09] \o/ [09:02:50] will let it flow a bit then do group 1 [09:03:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27422 and previous config saved to /var/cache/conftool/dbconfig/20220504-090337-ladsgroup.json [09:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:42] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:03:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:03:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:15] (03CR) 10David Caro: [C: 03+2] openstack: update tools-redis to a 'new' style name [puppet] - 10https://gerrit.wikimedia.org/r/788675 (https://phabricator.wikimedia.org/T278541) (owner: 10Majavah) [09:04:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1001.eqiad.wmnet [09:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:43] looks quiet [09:08:48] going to promote group 1 [09:09:01] (03PS1) 10Hashar: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789101 [09:09:03] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789101 (owner: 10Hashar) [09:09:45] (03Merged) 
10jenkins-bot: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789101 (owner: 10Hashar) [09:09:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:27] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:35] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.10 refs T305216 [09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:39] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [09:12:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet [09:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:05] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:13:35] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.10 refs T305216 (duration: 02m 00s) [09:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:47] [167f19a2-afac-4356-b21d-90ac0b117869] /w/index.php?title=Special:NewsFeed&feed=atom&categories=Published%7CNigeria&notcategories=No%20publish%7Cdisputed&namespace=0&count=15&ordermethod=categoryadd&stablepages=only Error: Class 'GoogleNewsSitemap' not found [09:15:58] well rolling back again [09:16:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:16:31] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:41] 10SRE, 10All-and-every-Wikisource, 10Wikimedia-Interwiki-links: Interwiki language links to non-existent wikisources should redirect to the multilingual wikisource - https://phabricator.wikimedia.org/T38033 (10Aklapper) [09:17:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27423 and previous config saved to /var/cache/conftool/dbconfig/20220504-091842-ladsgroup.json [09:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet [09:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:41] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.39.0-wmf.10 refs T305216 [09:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:45] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [09:20:19] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2044.codfw.wmnet with OS bullseye [09:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:24] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host 
ms-be2044.codfw.wmnet with OS bullseye [09:21:26] (03PS1) 10Vgutierrez: mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 [09:22:26] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [09:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:00] (03PS1) 10Gehel: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789104 (https://phabricator.wikimedia.org/T289135) [09:27:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:28:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [09:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27424 and previous config saved to /var/cache/conftool/dbconfig/20220504-092833-ladsgroup.json [09:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:37] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:32:37] RECOVERY - Disk space on ms-be1041 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1041&var-datasource=eqiad+prometheus/ops [09:33:48] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27425 and previous config saved to /var/cache/conftool/dbconfig/20220504-093347-ladsgroup.json [09:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:05] (03PS1) 10Slyngshede: Adding Slyngshede to Icinga ACLs [puppet] - 10https://gerrit.wikimedia.org/r/789105 [09:34:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:34:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:09] (03PS2) 10Jcrespo: Revert "backup: Ignore cloudcontrol2003-dev backup monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/783922 [09:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/789105 (owner: 10Slyngshede) [09:36:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2044.codfw.wmnet with reason: host reimage [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:22] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2044.codfw.wmnet with reason: host reimage [09:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:50] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789104 (https://phabricator.wikimedia.org/T289135) (owner: 10Gehel) [09:42:55] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: 
swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:36] (03CR) 10Slyngshede: [C: 03+2] Adding Slyngshede to Icinga ACLs [puppet] - 10https://gerrit.wikimedia.org/r/789105 (owner: 10Slyngshede) [09:47:57] (03CR) 10Jcrespo: [C: 03+2] Revert "backup: Ignore cloudcontrol2003-dev backup monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/783922 (owner: 10Jcrespo) [09:48:26] slyngs: should I merge? [09:48:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27426 and previous config saved to /var/cache/conftool/dbconfig/20220504-094852-ladsgroup.json [09:48:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:48:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:58] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [09:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27427 and previous config saved to /var/cache/conftool/dbconfig/20220504-094900-ladsgroup.json [09:49:02] Yes, otherwise I was just making my way onto puppetmaster [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1008.eqiad.wmnet [09:55:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor typo, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:00:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:06] (03PS1) 10Vgutierrez: mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) [10:02:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1008.eqiad.wmnet [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:28] (03PS1) 10David Caro: openstack: avoid starting/running puppet using cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/789116 (https://phabricator.wikimedia.org/T305909) [10:06:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:06:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:13] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2044.codfw.wmnet with OS bullseye [10:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:17] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2044.codfw.wmnet with OS bullseye completed: - ms-be2044 (**PASS**) - Downtim... 
[10:08:26] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:06] I am out for lunch [10:16:21] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) Thanks @JMeybohm - Those docs are really useful. I will proceed to make the changes required. There's one part that I'm not clear on from the docs. (T... [10:17:06] (03PS1) 10David Caro: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) [10:17:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1001.eqiad.wmnet [10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:58] (03CR) 10jerkins-bot: [V: 04-1] openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [10:17:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:13] (03CR) 10David Caro: "Mostly to avoid having to do so on a new patch every time I touch a file to fix something." 
[puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [10:23:02] (03PS1) 10Jbond: P:base: ensure we contain apt before moving on to installing packages [puppet] - 10https://gerrit.wikimedia.org/r/789118 [10:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27429 and previous config saved to /var/cache/conftool/dbconfig/20220504-102303-ladsgroup.json [10:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:13] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [10:24:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:24:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:59] (03PS2) 10David Caro: openstack,admin_script: ran black and isort [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) [10:27:28] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-test-master1001.eqiad.wmnet [10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:06] (03CR) 10Jbond: [C: 03+2] P:base: ensure we contain apt before moving on to installing packages [puppet] - 10https://gerrit.wikimedia.org/r/789118 (owner: 10Jbond) [10:28:56] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service 
Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7902555, @BTullis wrote: > I understand that it's something to do with [[https://wikitech.wikimedia.org/wiki/DNS/Discovery#Read-only_an... [10:32:13] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:34:30] (03PS3) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27430 and previous config saved to /var/cache/conftool/dbconfig/20220504-103808-ladsgroup.json [10:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:24] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [10:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27431 and previous config saved to /var/cache/conftool/dbconfig/20220504-104914-ladsgroup.json [10:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:19] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [10:52:37] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27432 and previous config saved to /var/cache/conftool/dbconfig/20220504-105313-ladsgroup.json [10:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:49] (03CR) 10JMeybohm: [C: 03+1] service: add image-suggestion ingress service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:55:03] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Prepare puppet master infrastructure for bullseye - https://phabricator.wikimedia.org/T285086 (10jbond) puppetdb package is not currently available in bullseye [10:58:06] (03PS2) 10Vgutierrez: mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 [10:58:08] (03PS2) 10Vgutierrez: mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) [11:00:00] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 (owner: 10Vgutierrez) [11:00:25] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) (owner: 10Vgutierrez) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:03:36] (03PS3) 10Vgutierrez: mtail::cache_haproxy: Handle requests without 
cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 [11:03:38] (03PS3) 10Vgutierrez: mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) [11:04:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27434 and previous config saved to /var/cache/conftool/dbconfig/20220504-110419-ladsgroup.json [11:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:49] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS buster [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27435 and previous config saved to /var/cache/conftool/dbconfig/20220504-110818-ladsgroup.json [11:08:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:08:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:24] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:01] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:30] * kart_ updating cxserver 
[11:13:50] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-05-04-034605-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/788897 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [11:14:38] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) >>! In T303049#7902578, @JMeybohm wrote: > I was under the impression that datahub should only run/be used in the active datacenter because it relies o... [11:14:53] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:18:12] (03Merged) 10jenkins-bot: Update cxserver to 2022-05-04-034605-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/788897 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [11:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27436 and previous config saved to /var/cache/conftool/dbconfig/20220504-111924-ladsgroup.json [11:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:35] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:07] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[11:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:25] (03PS1) 10Jbond: P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 [11:25:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35064/console" [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:25:38] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:26] (03CR) 10jerkins-bot: [V: 04-1] P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:26:32] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:31] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:26] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:32:28] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:57] (03PS2) 10Jbond: P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 [11:33:40] !log Updated cxserver to 2022-05-04-034605-production (T304828, T304858, T201491) [11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:46] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828 [11:33:47] T201491: Fix common 
typos in code - https://phabricator.wikimedia.org/T201491 [11:33:47] T304858: Enable Content and Section Translation for Serbian Wikipedia - https://phabricator.wikimedia.org/T304858 [11:33:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35066/console" [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27437 and previous config saved to /var/cache/conftool/dbconfig/20220504-113429-ladsgroup.json [11:34:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:34:34] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:34:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27438 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-113443-ladsgroup.json [11:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:28] (03CR) 10jerkins-bot: [V: 04-1] P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:41:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27439 and previous config saved to /var/cache/conftool/dbconfig/20220504-114100-ladsgroup.json [11:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:41:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:45:59] (03PS3) 10Jbond: P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 [11:47:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35067/console" [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:47:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:47:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27440 and previous config saved to /var/cache/conftool/dbconfig/20220504-114749-ladsgroup.json [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [11:48:26] (03CR) 10jerkins-bot: [V: 04-1] P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:50:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2045.codfw.wmnet with OS bullseye [11:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:13] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2045.codfw.wmnet with OS bullseye [11:52:35] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:54:19] (03PS4) 10Jbond: P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 [11:55:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35068/console" [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27441 and previous config saved to /var/cache/conftool/dbconfig/20220504-115605-ladsgroup.json [11:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:33] (03CR) 10jerkins-bot: [V: 04-1] P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [11:58:50] (03PS5) 10Jbond: P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 [12:01:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27442 and previous config saved to /var/cache/conftool/dbconfig/20220504-120122-ladsgroup.json [12:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/788752 (https://phabricator.wikimedia.org/T307471) (owner: 10Jelto) [12:01:26] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:03:01] (03CR) 10Jbond: [C: 03+2] P:base: ensure apt is realised before standard packages [puppet] - 10https://gerrit.wikimedia.org/r/789130 (owner: 10Jbond) [12:07:11] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [12:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:11:11] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27443 and previous config saved to /var/cache/conftool/dbconfig/20220504-121110-ladsgroup.json [12:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:52] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2045.codfw.wmnet with reason: host reimage [12:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2045.codfw.wmnet with reason: host reimage [12:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27444 and previous config saved to /var/cache/conftool/dbconfig/20220504-121627-ladsgroup.json [12:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:20:15] (03PS1) 10Jforrester: FlaggedRevsHooks: Update use of GoogleNewsSitemap constants [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788861 (https://phabricator.wikimedia.org/T307552) [12:21:13] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:49] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - 
degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:12] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox-dev2002.codfw.wmnet [12:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:16] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [12:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27445 and previous config saved to /var/cache/conftool/dbconfig/20220504-122615-ladsgroup.json [12:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:20] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:27:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:27:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:16] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27446 and previous config saved to /var/cache/conftool/dbconfig/20220504-122715-ladsgroup.json [12:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:41] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:39] 10SRE-tools, 10Spicerack: sre.hosts.reimage cookbook dosn't like different LC_ALL environments - https://phabricator.wikimedia.org/T307565 (10jbond) p:05Triage→03Medium [12:31:13] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: sre.hosts.reimage cookbook dosn't like different LC_ALL environments - https://phabricator.wikimedia.org/T307565 (10jbond) [12:31:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27447 and previous config saved to /var/cache/conftool/dbconfig/20220504-123132-ladsgroup.json [12:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:39] (03CR) 10Hashar: [C: 03+2] "Will deploy thank you!" 
[extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788861 (https://phabricator.wikimedia.org/T307552) (owner: 10Jforrester) [12:32:08] jouncebot: NotASpy [12:32:11] grr [12:32:13] sorry [12:32:15] jouncebot: now [12:32:16] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [12:32:22] I am going to push a hotfix [12:33:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [12:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:31] (03Merged) 10jenkins-bot: FlaggedRevsHooks: Update use of GoogleNewsSitemap constants [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788861 (https://phabricator.wikimedia.org/T307552) (owner: 10Jforrester) [12:35:56] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 (owner: 10Vgutierrez) [12:38:22] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/FlaggedRevs/: FlaggedRevsHooks: Update use of GoogleNewsSitemap constants - T307552 (duration: 00m 51s) [12:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:26] T307552: enwikinews: FlaggedRev hook triggers: Error: Class 'GoogleNewsSitemap' not found - https://phabricator.wikimedia.org/T307552 [12:39:38] (03PS3) 10Stang: ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) [12:41:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27448 and previous config saved to /var/cache/conftool/dbconfig/20220504-124140-ladsgroup.json [12:41:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:41:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:45] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2045.codfw.wmnet with OS bullseye [12:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:59] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2045.codfw.wmnet with OS bullseye completed: - ms-be2045 (**PASS**) - Downtim... [12:45:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:45:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27449 and previous config saved to /var/cache/conftool/dbconfig/20220504-124637-ladsgroup.json [12:46:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:46:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with 
reason: Maintenance [12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27450 and previous config saved to /var/cache/conftool/dbconfig/20220504-124650-ladsgroup.json [12:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [12:47:30] (03CR) 10Stang: "ref: https://gerrit.wikimedia.org/r/c/662792" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) (owner: 10Stang) [12:49:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] (03CR) 10Vgutierrez: [C: 03+1] traffic: port LVS traffic/cpu alerts to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/789094 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:51:24] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2046.codfw.wmnet with OS bullseye [12:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:28] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2046.codfw.wmnet with OS bullseye [12:51:29] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:51:32] (03PS3) 10Filippo Giunchedi: team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) [12:55:59] (03PS1) 10Filippo Giunchedi: lvs: remove rx/cpu alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/789144 (https://phabricator.wikimedia.org/T305847) [12:56:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27451 and previous config saved to /var/cache/conftool/dbconfig/20220504-125645-ladsgroup.json [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T1300). [13:00:05] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] I'm here/ [13:00:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27452 and previous config saved to /var/cache/conftool/dbconfig/20220504-130024-ladsgroup.json [13:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:29] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:01:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35070/console" [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [13:01:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35069/console" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [13:01:35] koi: hi, I guess I will process it [13:01:48] oh thanks! 
[13:01:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35071/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [13:02:09] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1034.eqiad.wmnet [13:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:26] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2035.codfw.wmnet [13:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:36] I wondering what is that `$wgMediaSearchExternalSearchUri` though [13:02:51] https://www.mediawiki.org/wiki/Extension:MediaSearch#Configuration [13:03:05] given we have `$wgMediaSearchExternalEntitySearchBaseUri` set to https://www.wikidata.org/w/api.php [13:03:42] taavi said this morning that this value need to be explicitly set [13:03:54] (don't know [13:04:13] :] [13:04:43] (03CR) 10Hashar: [C: 03+2] ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) (owner: 10Stang) [13:05:26] once it merges, it is all about following the instructions at https://deploy-commands.toolforge.org/bacc/788803 :] [13:05:28] (03Merged) 10jenkins-bot: ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) (owner: 10Stang) [13:05:34] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) >>! In T303049#7902665, @BTullis wrote: >>>! In T303049#7902578, @JMeybohm wrote: >> I was under the impression that datahub should only run/be used... [13:06:03] pulled on mwdebug1001 [13:06:11] koi: do you know how to test it on mwdebug1001 host ? 
[13:06:17] yeah, looking [13:07:41] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2035.codfw.wmnet [13:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1034.eqiad.wmnet [13:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2046.codfw.wmnet with reason: host reimage [13:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:37] (03CR) 10Bking: [C: 03+1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789104 (https://phabricator.wikimedia.org/T289135) (owner: 10Gehel) [13:09:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:19] koi: looks like I have bunch of things from https://pt.wikinews.org/wiki/Especial:MediaSearch?type=image&search=brothers+rugby :) [13:09:53] yeah I see, I just wonder if this is what this task want [13:09:59] (IMO it is [13:10:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:10:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:04] the only trouble I see is that when you use the search input from the main page, it searches media [13:11:07] !log 
mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2046.codfw.wmnet with reason: host reimage [13:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:11] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox-dev2002.codfw.wmnet [13:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:14] when I would guess might want to search through the news article instead [13:11:44] or Special:MediaSearch becomes the default instead of Special:Search [13:11:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27453 and previous config saved to /var/cache/conftool/dbconfig/20220504-131150-ladsgroup.json [13:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] oh I have a global preference override that, didn't notice that :) [13:12:31] 10SRE-Access-Requests, 10Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10WDoranWMF) [13:13:01] (03PS1) 10Ayounsi: Puppet: Add netbox-dev2002 [puppet] - 10https://gerrit.wikimedia.org/r/789145 (https://phabricator.wikimedia.org/T296452) [13:13:06] is there a option to use normal search by default [13:13:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:43] I am crawling through `extension.json` [13:13:47] (03PS1) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [13:13:49] (03PS1) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [13:13:51] 
(03PS1) 10Jaime Nuche: scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) [13:14:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:25] (03CR) 10jerkins-bot: [V: 04-1] scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:14:50] (03CR) 10Ayounsi: [C: 03+2] Puppet: Add netbox-dev2002 [puppet] - 10https://gerrit.wikimedia.org/r/789145 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:15:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27454 and previous config saved to /var/cache/conftool/dbconfig/20220504-131529-ladsgroup.json [13:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:04] the user pref is search-special-page [13:16:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:56] (03PS2) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [13:18:01] koi: so I don't know what to do. 
We can revert and ask users to clarify which behavior should be the default [13:18:05] hashar, I prefer to revert this patch, MediaSearch will not solve the problem address in this task [13:18:09] and probably would have to involve the developers of MediaSearch [13:18:11] ok [13:18:15] I will leave a message on task [13:18:30] I am doing the revert in git and will deploy [13:18:38] (03CR) 10jerkins-bot: [V: 04-1] scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:19:27] (03PS1) 10Hashar: Revert "ptwikinews: Enable extension MediaSearch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788863 (https://phabricator.wikimedia.org/T299872) [13:19:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[1016-1019].eqiad.wmnet with reason: reboot [13:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[1016-1019].eqiad.wmnet with reason: reboot [13:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:39] koi: if you can +1 the revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/788863 :) [13:19:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:03] (03CR) 10Stang: [C: 03+1] "Per discuss on IRC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788863 (https://phabricator.wikimedia.org/T299872) (owner: 10Hashar) [13:20:10] (03PS3) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap 
targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [13:20:12] done [13:20:12] (03PS2) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [13:20:17] (03CR) 10Hashar: [C: 03+2] Revert "ptwikinews: Enable extension MediaSearch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788863 (https://phabricator.wikimedia.org/T299872) (owner: 10Hashar) [13:20:19] thx [13:20:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:51] (03CR) 10jenkins-bot: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:20:57] and I have left a patch on the deployment server bah [13:20:59] will fix it [13:21:17] (03PS2) 10Jaime Nuche: scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) [13:21:22] (03Merged) 10jenkins-bot: Revert "ptwikinews: Enable extension MediaSearch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788863 (https://phabricator.wikimedia.org/T299872) (owner: 10Hashar) [13:21:42] (03PS1) 10Hashar: Revert "group1 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789149 (https://phabricator.wikimedia.org/T305216) [13:22:07] (03CR) 10Hashar: [C: 03+2] "This was left on the deployment server cause I forgot to push to gerrit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789149 (https://phabricator.wikimedia.org/T305216) (owner: 10Hashar) [13:22:27] I have restored mwdebug1001 [13:22:51] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789149 
(https://phabricator.wikimedia.org/T305216) (owner: 10Hashar) [13:23:16] !log UTC afternoon backport window done [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:41] I am going to do the promotion of group1 wikis to 1.39.0-wmf.10 [13:25:18] (03PS1) 10Hashar: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789150 [13:25:20] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789150 (owner: 10Hashar) [13:25:25] (03PS4) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [13:25:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:11] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789150 (owner: 10Hashar) [13:26:49] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1035.eqiad.wmnet [13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27455 and previous config saved to /var/cache/conftool/dbconfig/20220504-132655-ladsgroup.json [13:26:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:26:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:00] T307525: Fix mismatching field type of user table for columns 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27456 and previous config saved to /var/cache/conftool/dbconfig/20220504-132703-ladsgroup.json [13:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:38] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.10 refs T305216 [13:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:42] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [13:27:53] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [13:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:05] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:15] (03CR) 10Jaime Nuche: "User will be added to Keyholder once the request for the new identity is approved and completed: https://phabricator.wikimedia.org/T307351" [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:28:28] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.10 refs T305216 (duration: 00m 49s) [13:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:57] 10SRE, 
10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10Ottomata) FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'anal... [13:29:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:15] (03CR) 10Bking: [C: 03+2] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789104 (https://phabricator.wikimedia.org/T289135) (owner: 10Gehel) [13:30:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27457 and previous config saved to /var/cache/conftool/dbconfig/20220504-133035-ladsgroup.json [13:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:13] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2046.codfw.wmnet with OS bullseye [13:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:16] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2046.codfw.wmnet with OS bullseye completed: - ms-be2046 (**WARN**) - Downtim... [13:31:21] (03CR) 10Andrew Bogott: [C: 03+1] "Unfortunately I don't think that the start_service feature is available in the version of cloud-init that we're using. 
(I added the flag -" [puppet] - 10https://gerrit.wikimedia.org/r/789116 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [13:32:08] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1035.eqiad.wmnet [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:47] looks like group 1 is a success [13:32:49] at least immediately [13:33:18] (03CR) 10Muehlenhoff: scap: add system package requirements for scap (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:33:28] (03CR) 10CDanis: [C: 03+1] sre: port NEL alert to Alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:33:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27458 and previous config saved to /var/cache/conftool/dbconfig/20220504-133344-ladsgroup.json [13:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:49] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:36:00] PROBLEM - Check unit status of acme-chief #page on acmechief1001 is CRITICAL: CRITICAL: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [13:36:07] uh [13:36:09] 👀 [13:36:10] * vgutierrez looking [13:36:30] * Emperor here [13:36:33] vgutierrez: that might be me [13:36:34] * jynus learning what acme-chief is [13:36:44] it's the thing that auto-issues LE certs for lots of stuff 
[13:36:48] hey [13:36:52] (03CR) 10MMandere: [C: 03+1] mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 (owner: 10Vgutierrez) [13:36:55] but vg wrote it and is already looking :) [13:36:56] vgutierrez: could it be related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/789145/1/hieradata/role/common/acme_chief.yaml ? [13:37:11] ohh yes [13:37:19] heh [13:37:20] XioNoX: yeah, you can't request a cert for a .wmnet name [13:37:24] can you confirm no immediate user impact on tls traffic? [13:37:29] I don't know if that's the cause, but yeah, you can't get LE to issue private-dns certs :P [13:37:40] so that patch is "wrong" regardless [13:37:42] errr [13:37:52] how that got merged? [13:38:16] vgutierrez: I duplicated the existing [13:38:27] but for an internal host [13:38:31] (03PS1) 10Vgutierrez: Revert "Puppet: Add netbox-dev2002" [puppet] - 10https://gerrit.wikimedia.org/r/788864 [13:38:32] but the existing entry is for a public host :-) [13:38:43] yeah but LE is a public CA, so they don't deal in private tlds like "wmnet" [13:38:45] XioNoX: that's the thing... we cannot use acme-chief for internal domains :) [13:38:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:58] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Ariel Gutman - https://phabricator.wikimedia.org/T307582 (10AGutman-WMF) [13:39:08] jynus: yes, I can confirm that :) [13:39:13] vgutierrez: thanks [13:39:28] I'm reverting that CR [13:39:33] might be worth a CI check for hieradata/role/common/acme_chief.yaml, given that it makes the acme_chief unit immediately fail [13:39:34] shall I ack the alert in VO? 
sounds like vgutierrez and XioNoX are on top of this [13:39:37] letting traffic figure the details on how to go back to a health state [13:39:45] Emperor: I did that AFAIK [13:39:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:57] vgutierrez: ta [13:40:01] hmm that CR is kinda big [13:40:12] gonna create a new one just getting rid of the codfw SNI [13:40:24] vgutierrez: I can do it if you prefer [13:40:37] go ahead please :) [13:40:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:40:48] (03Abandoned) 10Vgutierrez: Revert "Puppet: Add netbox-dev2002" [puppet] - 10https://gerrit.wikimedia.org/r/788864 (owner: 10Vgutierrez) [13:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:53] vgutierrez: on it [13:42:35] (03PS1) 10Ayounsi: ACME chief: remove SNI for private hostname [puppet] - 10https://gerrit.wikimedia.org/r/789151 [13:42:46] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/789151 [13:43:03] (03CR) 10Vgutierrez: [C: 03+1] ACME chief: remove SNI for private hostname [puppet] - 10https://gerrit.wikimedia.org/r/789151 (owner: 10Ayounsi) [13:43:15] (03CR) 10Ayounsi: [C: 03+2] ACME chief: remove SNI for private hostname [puppet] - 10https://gerrit.wikimedia.org/r/789151 (owner: 10Ayounsi) [13:43:30] running puppet on acmechief1001 [13:43:36] vgutierrez: merged [13:43:49] (after XioNoX confirmed it was merged) [13:44:15] (03PS2) 10Samtar: changeprop: Remove RESTBase page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) [13:44:22] (03PS2) 10Filippo Giunchedi: sre: port NEL alert to 
Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) [13:44:24] (03PS2) 10Filippo Giunchedi: traffic: port LVS traffic/cpu alerts to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/789094 (https://phabricator.wikimedia.org/T305847) [13:45:13] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-05-03 14:34:47 (3004 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:45:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27459 and previous config saved to /var/cache/conftool/dbconfig/20220504-134540-ladsgroup.json [13:45:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:45:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:44] (03CR) 10Filippo Giunchedi: "Now with a tweaked expression to skip double-counting (and related tests) as per IRC" [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:45:45] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27460 and previous config saved to /var/cache/conftool/dbconfig/20220504-134548-ladsgroup.json [13:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:53] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:55] (03CR) 10Ottomata: [C: 03+1] Increase the heap size of the HDFS namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/789098 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [13:46:41] (03CR) 10Andrew Bogott: [C: 03+1] openstack: avoid starting/running puppet using cloud-init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789116 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [13:47:08] RECOVERY - Check unit status of acme-chief #page on acmechief1001 is OK: OK: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [13:47:13] great [13:47:24] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2047.codfw.wmnet with OS bullseye [13:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:27] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2047.codfw.wmnet with OS bullseye [13:47:31] (03PS1) 10Filippo Giunchedi: prometheus: remove high NEL alert, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/789152 (https://phabricator.wikimedia.org/T305847) [13:48:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27461 and previous config saved to /var/cache/conftool/dbconfig/20220504-134849-ladsgroup.json [13:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:35] (03PS8) 10Krinkle: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) [13:54:45] (03PS1) 10Giuseppe Lavagetto: reqestctl: add unit tests for grammar 
parsing [software/conftool] - 10https://gerrit.wikimedia.org/r/789153 (https://phabricator.wikimedia.org/T305607) [13:54:47] (03PS1) 10Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) [13:55:24] (03CR) 10Krinkle: [C: 03+2] static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [13:56:08] (03Merged) 10jenkins-bot: static.php: Restore short cache for temporary 'mismatch' response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777901 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [13:57:04] * Krinkle testing on mwdebug1002 [13:57:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5002.eqsin.wmnet with OS buster [13:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster [13:59:11] hashar: I think T307586 warrants a train rollback for wikidatawiki, but unfortunately I won’t be very available for the next hour :/ [13:59:12] T307586: wbsearchentities produces no results on 1.39.0-wmf.10 - https://phabricator.wikimedia.org/T307586 [14:00:31] Lucas_WMDE: well I guess I can just rollback, though I am in an interview loop right now and have to focus [14:00:38] ok [14:00:48] then rollback is easy [14:00:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:20] (03CR) 10Jaime Nuche: [V: 04-1] "Puupet 
catalog compiler failing. I need to investigate before I can merge." [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [14:01:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:01:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:51] (03PS1) 10Hashar: Revert "group1 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789160 (https://phabricator.wikimedia.org/T307586) [14:02:57] Lucas_WMDE: reverting [14:03:47] Krinkle: argh sorry lock conflict [14:03:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27464 and previous config saved to /var/cache/conftool/dbconfig/20220504-140354-ladsgroup.json [14:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2047.codfw.wmnet with reason: host reimage [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:14] once Timo is done I will push the revert of group1 [14:04:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:38] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:05:06] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:05:24] hashar: thanks [14:05:24] hashar: syncing now [14:05:30] (internet is being bad too so I’m half-following through the wm-bot logs ^^) [14:05:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] my mistake, I thought timo had finished deploying his change ;) [14:06:01] ... and done [14:06:09] !log krinkle@deploy1002 Synchronized w/static.php: Ic21c18b591c5 (duration: 00m 50s) [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:35] (03CR) 10David Caro: openstack: avoid starting/running puppet using cloud-init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789116 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [14:06:44] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:07:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2047.codfw.wmnet with reason: host reimage [14:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:16] syncing the revert [14:08:45] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789160 (https://phabricator.wikimedia.org/T307586) (owner: 10Hashar) [14:09:14] hashar: are you reverting all wikis? 
[14:09:25] I thought wikidatawiki would be enough [14:09:26] group1 which includes wikidata.org [14:09:31] (but maybe there are other reasons for keeping them in sync) [14:09:42] I think we stopped doing partial rollbacks of the groups [14:09:57] ok [14:10:00] cause that leads to confusion as to what is running on which wiki [14:10:10] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: wbsearchentities produces no result T307586 [14:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] T307586: wbsearchentities produces no results on 1.39.0-wmf.10 - https://phabricator.wikimedia.org/T307586 [14:10:23] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789160 (https://phabricator.wikimedia.org/T307586) (owner: 10Hashar) [14:10:40] Lucas_WMDE__: thanks for the ping [14:10:58] we seem to have search results again, so looks like wmf.10 was indeed the culprit [14:11:01] thanks a lot for the revert [14:11:29] great! [14:12:14] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:13:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase1017.eqiad.wmnet [14:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:29] wikis looks good still [14:16:39] (03PS1) 10Jbond: P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 [14:17:16] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5002.eqsin.wmnet with OS buster [14:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:25] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut... [14:18:08] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:44] (03CR) 10jerkins-bot: [V: 04-1] P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [14:18:58] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I had the beginnings of a theory, based on some reading around varnish, but now I don't think that it's va... 
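Editorial note on the acme-chief incident earlier in this log: at 13:39:33 someone suggested a CI check for hieradata/role/common/acme_chief.yaml, since a certificate entry naming a private .wmnet host made the acme-chief unit fail immediately. A minimal sketch of such a check might look like the following. It assumes the file has already been parsed into a dict mapping a certificate id to a "CN" string and an "SNI" list; that layout, the function name find_private_names, and the sample hostnames are all illustrative assumptions, not the confirmed schema or an existing tool.

```python
# Hypothetical pre-merge check: flag names a public CA cannot issue for.
# Only ".wmnet" is taken from the discussion; extend the tuple as needed.
PRIVATE_SUFFIXES = (".wmnet",)


def find_private_names(certificates):
    """Return (cert_id, hostname) pairs whose CN or SNI uses a private suffix.

    `certificates` is assumed to map a certificate id to a dict with an
    optional "CN" string and an optional "SNI" list. This layout is an
    assumption, not the confirmed schema of acme_chief.yaml.
    """
    offenders = []
    for cert_id, spec in certificates.items():
        names = [spec.get("CN")] + list(spec.get("SNI", []))
        for name in names:
            # Normalise case and a trailing dot before comparing suffixes.
            if name and name.lower().rstrip(".").endswith(PRIVATE_SUFFIXES):
                offenders.append((cert_id, name))
    return offenders


if __name__ == "__main__":
    # Made-up sample data; the hostnames are placeholders.
    sample = {
        "public-service": {
            "CN": "service.example.org",
            "SNI": ["service.example.org", "internal-host.example.wmnet"],
        },
    }
    for cert_id, name in find_private_names(sample):
        print(f"{cert_id}: {name} uses a private suffix")
```

A CI job could run a check like this against the parsed hieradata and fail when the returned list is non-empty, so a private hostname is rejected at review time instead of after merge.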
[14:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27466 and previous config saved to /var/cache/conftool/dbconfig/20220504-141859-ladsgroup.json [14:19:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:19:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:19:07] (03PS1) 10Roman Stolar: Migrate tests from nose to pytest [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) [14:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27467 and previous config saved to /var/cache/conftool/dbconfig/20220504-141907-ladsgroup.json [14:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:20:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:21:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:19] (03PS2) 10Jbond: P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 [14:23:07] (03CR) 10jerkins-bot: [V: 04-1] P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [14:23:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2047.codfw.wmnet with OS bullseye [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:30] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2047.codfw.wmnet with OS bullseye completed: - ms-be2047 (**PASS**) - Downtim... [14:24:43] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10CDanis) Hi, haven't deeply read or understood this issue (sorry!) but I wanted to point out T264021 as potentially... 
[14:24:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 T301879', diff saved to https://phabricator.wikimedia.org/P27468 and previous config saved to /var/cache/conftool/dbconfig/20220504-142449-marostegui.json [14:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:55] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [14:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27469 and previous config saved to /var/cache/conftool/dbconfig/20220504-142533-ladsgroup.json [14:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:37] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:26:24] !log powercycling restbase1017 [14:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:06] (03CR) 10David Caro: [C: 03+2] openstack::cinder: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/785840 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [14:28:55] (03PS1) 10Jbond: P:installserver::proxy: move rsync port to ssl_ports as it is not http [puppet] - 10https://gerrit.wikimedia.org/r/789164 [14:29:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10akosiaris) [14:29:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10akosiaris) [14:30:33] (03CR) 10Ayounsi: [C: 03+1] P:installserver::proxy: move rsync port to ssl_ports as it is not http [puppet] - 
10https://gerrit.wikimedia.org/r/789164 (owner: 10Jbond) [14:30:38] (03CR) 10David Caro: [C: 03+1] prometheus-node-cloudvirt-libvirt-stats.py: handle newer VM xml data [puppet] - 10https://gerrit.wikimedia.org/r/788781 (owner: 10Andrew Bogott) [14:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27470 and previous config saved to /var/cache/conftool/dbconfig/20220504-143050-ladsgroup.json [14:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:54] (03PS3) 10Jbond: P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 [14:30:55] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:31:29] (03CR) 10Jbond: [C: 03+2] P:installserver::proxy: move rsync port to ssl_ports as it is not http [puppet] - 10https://gerrit.wikimedia.org/r/789164 (owner: 10Jbond) [14:31:33] (03CR) 10MMandere: [C: 03+1] mtail::cache_haproxy: Track termination state per request [puppet] - 10https://gerrit.wikimedia.org/r/789108 (https://phabricator.wikimedia.org/T307444) (owner: 10Vgutierrez) [14:32:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1017.eqiad.wmnet [14:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:58] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:35:36] (03PS2) 10Tchanders: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) [14:37:15] (03PS2) 10EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:51] (03CR) 10Dzahn: [C: 03+1] "it seems you have "denisse@" in the Icinga contact, missing the a at the beginning, so nagios tried to send mail to you but couldn't deliv" [puppet] - 10https://gerrit.wikimedia.org/r/788817 (owner: 10Andrea Denisse) [14:40:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27471 and previous config saved to /var/cache/conftool/dbconfig/20220504-144038-ladsgroup.json [14:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:00] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10JMeybohm) [14:43:01] (RoutinatorRsyncErrors) resolved: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:43:27] (03CR) 10EllenR: "Hi all, looking for a review - thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: 10EllenR) [14:44:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5002.eqsin.wmnet with OS buster [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster [14:45:10] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) @dcaro not just yet. I believe the one change we will need to test here is adding a route on the cloud-storage interfaces. What... [14:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27472 and previous config saved to /var/cache/conftool/dbconfig/20220504-144555-ladsgroup.json [14:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:42] (03PS1) 10Ottomata: Update changeprop beta kafka broker hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/789187 (https://phabricator.wikimedia.org/T304433) [14:47:45] !log installing Linux 5.10.113 on Bullseye hosts [14:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] (03PS3) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [14:48:21] (03PS5) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) 
[14:48:53] (03CR) 10jerkins-bot: [V: 04-1] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [14:50:41] (03CR) 10Jelto: icinga: increase retries and delay for icinga status check (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [14:51:53] (03PS1) 10Jbond: WIP - P:acme_chief: Improve type checking for certificates [puppet] - 10https://gerrit.wikimedia.org/r/789188 [14:52:26] (03CR) 10jerkins-bot: [V: 04-1] WIP - P:acme_chief: Improve type checking for certificates [puppet] - 10https://gerrit.wikimedia.org/r/789188 (owner: 10Jbond) [14:53:20] (03PS4) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [14:53:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2048.codfw.wmnet with OS bullseye [14:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:29] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2048.codfw.wmnet with OS bullseye [14:53:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:53:53] (03CR) 10jerkins-bot: [V: 04-1] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [14:53:56] (03CR) 10Hnowlan: [C: 03+1] Update changeprop beta kafka broker hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/789187 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [14:55:01] (03CR) 
10Hnowlan: [C: 03+2] Update changeprop beta kafka broker hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/789187 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [14:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27473 and previous config saved to /var/cache/conftool/dbconfig/20220504-145543-ladsgroup.json [14:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:19] (03CR) 10Muehlenhoff: "We already have a few other roles which have worked around this (at least maps and swift from a quick glance), I think it would be better " [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [14:58:38] (03PS5) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [14:58:52] (03CR) 10Hnowlan: [C: 03+2] changeprop: use helm3 semantics for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/787526 (https://phabricator.wikimedia.org/T295578) (owner: 10Hnowlan) [15:00:13] (03Merged) 10jenkins-bot: Update changeprop beta kafka broker hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/789187 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [15:01:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27474 and previous config saved to /var/cache/conftool/dbconfig/20220504-150100-ladsgroup.json [15:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:04:09] (03Merged) 10jenkins-bot: changeprop: use helm3 semantics for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/787526 (https://phabricator.wikimedia.org/T295578) (owner: 10Hnowlan) [15:05:01] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:06:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [15:06:37] (03CR) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [15:08:04] (03CR) 10Vgutierrez: [C: 03+2] mtail::cache_haproxy: Handle requests without cache-status [puppet] - 10https://gerrit.wikimedia.org/r/789102 (owner: 10Vgutierrez) [15:09:19] (03PS2) 10Tchanders: Enable IPInfo instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480) (owner: 10STran) [15:10:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2048.codfw.wmnet with reason: host reimage [15:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:15] (03CR) 10Tchanders: [C: 03+1] "Just moved the stream config closer to other more similar streams" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480) (owner: 10STran) [15:10:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T307525)', diff saved to https://phabricator.wikimedia.org/P27475 and previous config saved to /var/cache/conftool/dbconfig/20220504-151048-ladsgroup.json [15:10:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1150.eqiad.wmnet with reason: Maintenance [15:10:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5002.eqsin.wmnet with reason: host reimage [15:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2048.codfw.wmnet with reason: host reimage [15:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:27] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10akosiaris) [15:15:32] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10akosiaris) [15:16:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5002.eqsin.wmnet with reason: host reimage [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27476 and previous config saved to /var/cache/conftool/dbconfig/20220504-151606-ladsgroup.json [15:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:10] T307525: Fix 
mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:16:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:16:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27477 and previous config saved to /var/cache/conftool/dbconfig/20220504-151630-ladsgroup.json [15:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:32] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Thanks @CDanis - Yes that looks very likely. Also I think that the latency ticket {T294911} is also probab... 
[15:25:18] (03PS6) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [15:26:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [15:32:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27478 and previous config saved to /var/cache/conftool/dbconfig/20220504-153247-root.json [15:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2048.codfw.wmnet with OS bullseye [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:20] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2048.codfw.wmnet with OS bullseye completed: - ms-be2048 (**PASS**) - Downtim... 
[15:36:44] (03PS7) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [15:38:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [15:41:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:42:22] (03PS3) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [15:46:28] (03PS3) 10Jaime Nuche: scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) [15:47:49] (03CR) 10CDanis: requestctl: set an X-Requestctl header for matching rules (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [15:47:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27479 and previous config saved to /var/cache/conftool/dbconfig/20220504-154751-root.json [15:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:43] (03CR) 10CDanis: requestctl: Allow detecting matching rules that are disabled (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [15:53:28] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:55] (NodeTextfileStale) 
firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:56:24] (03CR) 10Jsn.sherman: [C: 04-1] "This looks good to me other than the change to the EventRelayer. So far as I know, that's not our config and isn't directly related to our" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: 10EllenR) [15:57:31] (03CR) 10CDanis: "Clever." [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [15:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27480 and previous config saved to /var/cache/conftool/dbconfig/20220504-155919-ladsgroup.json [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:25] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [16:01:00] (03CR) 10CDanis: [C: 03+1] reqestctl: add unit tests for grammar parsing [software/conftool] - 10https://gerrit.wikimedia.org/r/789153 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [16:02:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27481 and previous config saved to /var/cache/conftool/dbconfig/20220504-160255-root.json [16:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5002.eqsin.wmnet with OS buster [16:03:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:20] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster comple... [16:04:30] (03PS8) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [16:05:21] (03CR) 10jerkins-bot: [V: 04-1] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [16:14:09] (03CR) 10STran: [C: 04-1] "There's an AC that this survey should always show up and if a user answers the survey then it no longer shows up for them in later submiss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [16:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27482 and previous config saved to /var/cache/conftool/dbconfig/20220504-161424-ladsgroup.json [16:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:31] (03CR) 10STran: "Wrong patch sorry!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [16:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27483 and previous config saved to /var/cache/conftool/dbconfig/20220504-161758-root.json [16:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] (03PS13) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:20:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35080/console" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:21:11] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:21:18] RECOVERY - dump of es5 in codfw on alert1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-05-03 14:34:47 (2983 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:21:20] (03PS5) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [16:22:29] (03PS14) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:22:31] (03PS6) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [16:22:50] (03PS6) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [16:24:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35081/console" [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [16:24:49] (03CR) 10Jbond: [V: 03+1] "ready for review latest PCC is essentially a noop" [puppet] - 
10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [16:24:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35082/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [16:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P27484 and previous config saved to /var/cache/conftool/dbconfig/20220504-162929-ladsgroup.json [16:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:01] (03PS7) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [16:31:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35083/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [16:32:51] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka jumbo-eqiad cluster: Reboot kafka nodes [16:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27485 and previous config saved to /var/cache/conftool/dbconfig/20220504-163302-root.json [16:33:04] (03CR) 10Btullis: [V: 03+1 C: 03+2] Increase the heap size of the HDFS namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/789098 (https://phabricator.wikimedia.org/T307549) (owner: 10Btullis) [16:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:06] (03CR) 10Stang: [C: 04-1] "Waiting T279645#7903949" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) (owner: 10Stang) [16:33:23] (03CR) 10Jbond: [V: 03+1] "Ready for review see pcc diffs e.g." 
[puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [16:33:36] (03PS8) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [16:34:42] (03PS9) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [16:36:43] (03PS9) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [16:38:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35086/console" [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [16:38:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [16:39:54] (03CR) 10Jbond: "This is ready to review however the PCC will look cleaner once the other changes in the chain get merged" [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [16:44:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27486 and previous config saved to /var/cache/conftool/dbconfig/20220504-164434-ladsgroup.json [16:44:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:44:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:44:40] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T307525 [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27487 and previous config saved to /var/cache/conftool/dbconfig/20220504-164448-ladsgroup.json [16:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:59] (03PS2) 10Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) [16:47:51] (03CR) 10Dzahn: "this would add a separate check for each conf* host that always checks the same certificate" [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [16:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After optimizing a table', diff saved to https://phabricator.wikimedia.org/P27488 and previous config saved to /var/cache/conftool/dbconfig/20220504-164806-root.json [16:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:10] (03CR) 10Dzahn: [C: 03+2] profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [16:52:48] PROBLEM - Check systemd state on ms-be2046 is CRITICAL: CRITICAL - degraded: The following units failed: 
wmf_auto_restart_prometheus-node-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:34] (03CR) 10Andrew Bogott: [C: 03+2] prometheus-node-cloudvirt-libvirt-stats.py: handle newer VM xml data [puppet] - 10https://gerrit.wikimedia.org/r/788781 (owner: 10Andrew Bogott) [16:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27489 and previous config saved to /var/cache/conftool/dbconfig/20220504-165800-ladsgroup.json [16:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [16:59:36] (03CR) 10Dave Pifke: [C: 03+1] Remove webperf1001/webperf2001 from Kafka Ferm config [puppet] - 10https://gerrit.wikimedia.org/r/789084 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [16:59:47] (03CR) 10Dzahn: [C: 03+2] "this will do the job but if the certificate expires it will mean 6 alerts for one thing" [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [17:00:03] (03CR) 10Dave Pifke: [C: 03+1] Failover performance discovery records to the new Bullseye hosts [dns] - 10https://gerrit.wikimedia.org/r/789079 (owner: 10Muehlenhoff) [17:02:28] (03CR) 10Ahmon Dancy: scap: add new `scap` user to deployment hosts and scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [17:04:15] (03CR) 10Hnowlan: "One minor style comment, otherwise LGTM. Thanks!" 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/789163 (https://phabricator.wikimedia.org/T303866) (owner: 10Roman Stolar) [17:04:17] PROBLEM - etcd tlsproxy SSL conf1006.eqiad.wmnet:4001 on conf1006 is CRITICAL: SSL CRITICAL - failed to verify conf1006-conf1006.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 https://wikitech.wikimedia.org/wiki/Cergen [17:06:49] PROBLEM - etcd tlsproxy SSL conf2005.codfw.wmnet:4001 on conf2005 is CRITICAL: SSL CRITICAL - failed to verify conf2005-conf2005.codfw.wmnet against etcd-v3.codfw.wmnet, conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, conf2004, conf2005, conf2006, etcd.codfw.wmnet https://wikitech.wikimedia.org/wiki/Cergen [17:07:06] ^ that is new monitoring I just merged [17:07:26] obviously there is an issue in there with the host name, I will revert it [17:08:19] (03PS1) 10Dzahn: Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789169 [17:09:00] (03CR) 10Ahmon Dancy: [C: 03+1] scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [17:10:03] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789169 (owner: 10Dzahn) [17:10:31] (03PS2) 10Dzahn: Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789169 [17:10:55] (03CR) 10Dzahn: [C: 03+2] "17:04 <+icinga-wm> PROBLEM - etcd tlsproxy SSL conf1006.eqiad.wmnet:4001 on conf1006 is CRITICAL: SSL CRITICAL - failed to verify conf1006" [puppet] - 10https://gerrit.wikimedia.org/r/789169 (owner: 10Dzahn) [17:12:16] (03CR) 10jerkins-bot: [V: 04-1] Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789169 
(owner: 10Dzahn) [17:12:29] (03PS1) 10CDanis: Expand stick-table test to three other hosts [puppet] - 10https://gerrit.wikimedia.org/r/789219 [17:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27490 and previous config saved to /var/cache/conftool/dbconfig/20220504-171305-ladsgroup.json [17:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:16] (03PS3) 10Dzahn: Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789169 [17:14:53] PROBLEM - etcd tlsproxy SSL conf1005.eqiad.wmnet:4001 on conf1005 is CRITICAL: SSL CRITICAL - failed to verify conf1005-conf1005.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 https://wikitech.wikimedia.org/wiki/Cergen [17:15:13] (03PS2) 10CDanis: Expand stick-table test to three other hosts [puppet] - 10https://gerrit.wikimedia.org/r/789219 [17:20:05] PROBLEM - etcd tlsproxy SSL conf2004.codfw.wmnet:4001 on conf2004 is CRITICAL: SSL CRITICAL - failed to verify conf2004-conf2004.codfw.wmnet against etcd-v3.codfw.wmnet, conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, conf2004, conf2005, conf2006, etcd.codfw.wmnet https://wikitech.wikimedia.org/wiki/Cergen [17:22:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:22:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27491 and previous config saved to /var/cache/conftool/dbconfig/20220504-172214-ladsgroup.json [17:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:18] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:22:39] (03PS3) 10CDanis: Expand stick-table test to three other hosts [puppet] - 10https://gerrit.wikimedia.org/r/789219 (https://phabricator.wikimedia.org/T306580) [17:22:43] ACKNOWLEDGEMENT - etcd tlsproxy SSL conf1005.eqiad.wmnet:4001 on conf1005 is CRITICAL: SSL CRITICAL - failed to verify conf1005-conf1005.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 daniel_zahn reverting https://wikitech.wikimedia.org/wiki/Cergen [17:22:43] ACKNOWLEDGEMENT - etcd tlsproxy SSL conf1006.eqiad.wmnet:4001 on conf1006 is CRITICAL: SSL CRITICAL - failed to verify conf1006-conf1006.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 daniel_zahn reverting https://wikitech.wikimedia.org/wiki/Cergen [17:22:43] ACKNOWLEDGEMENT - etcd tlsproxy SSL conf2004.codfw.wmnet:4001 on conf2004 is CRITICAL: SSL CRITICAL - failed to verify conf2004-conf2004.codfw.wmnet against etcd-v3.codfw.wmnet, conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, conf2004, conf2005, conf2006, etcd.codfw.wmnet daniel_zahn reverting https://wikitech.wikimedia.org/wiki/Cergen [17:22:43] ACKNOWLEDGEMENT - etcd tlsproxy SSL conf2005.codfw.wmnet:4001 on conf2005 is CRITICAL: SSL CRITICAL - failed to verify conf2005-conf2005.codfw.wmnet against etcd-v3.codfw.wmnet, conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, conf2004, conf2005, conf2006, etcd.codfw.wmnet daniel_zahn reverting 
https://wikitech.wikimedia.org/wiki/Cergen [17:23:01] (03CR) 10CDanis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/35089/" [puppet] - 10https://gerrit.wikimedia.org/r/789219 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [17:23:46] (03CR) 10Dzahn: "+ check_command check_ssl_on_host_port!conf2006-conf2006.codfw.wmnet!conf2006.codfw.wmnet!4001" [puppet] - 10https://gerrit.wikimedia.org/r/789169 (owner: 10Dzahn) [17:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27492 and previous config saved to /var/cache/conftool/dbconfig/20220504-172739-ladsgroup.json [17:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:45] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:28:04] (03PS1) 10Dduvall: ci: Provide basic `.pipeline/config.yaml` [software/tegola] - 10https://gerrit.wikimedia.org/r/789222 (https://phabricator.wikimedia.org/T307507) [17:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P27493 and previous config saved to /var/cache/conftool/dbconfig/20220504-172810-ladsgroup.json [17:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:50] (03PS2) 10Jbond: P:acme_chief: Improve type checking for certificates [puppet] - 10https://gerrit.wikimedia.org/r/789188 [17:32:10] (03CR) 10Dzahn: [C: 03+2] add gitlab-runner role on new physical server gitlab-runner2002 [puppet] - 10https://gerrit.wikimedia.org/r/787820 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [17:32:16] (03PS2) 10Dzahn: add gitlab-runner role on new physical server gitlab-runner2002 [puppet] - 10https://gerrit.wikimedia.org/r/787820 (https://phabricator.wikimedia.org/T307142) 
[17:32:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35090/console" [puppet] - 10https://gerrit.wikimedia.org/r/789188 (owner: 10Jbond) [17:35:32] (03CR) 10Dduvall: [C: 03+1] ci: Provide basic `.pipeline/config.yaml` [software/tegola] - 10https://gerrit.wikimedia.org/r/789222 (https://phabricator.wikimedia.org/T307507) (owner: 10Dduvall) [17:42:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27494 and previous config saved to /var/cache/conftool/dbconfig/20220504-174244-ladsgroup.json [17:42:46] (03CR) 10Krinkle: [C: 03+1] "LGTM" [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [17:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T307525)', diff saved to https://phabricator.wikimedia.org/P27495 and previous config saved to /var/cache/conftool/dbconfig/20220504-174317-ladsgroup.json [17:43:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [17:43:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [17:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:22] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27496 and previous config saved to /var/cache/conftool/dbconfig/20220504-174325-ladsgroup.json 
[17:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:43] (03PS1) 10Krinkle: clinic-duty: Combine the various conditionals in Message#work [software] - 10https://gerrit.wikimedia.org/r/789224 [17:50:20] (03CR) 10Dzahn: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 14.10 [puppet] - 10https://gerrit.wikimedia.org/r/788752 (https://phabricator.wikimedia.org/T307471) (owner: 10Jelto) [17:52:16] (03PS2) 10Krinkle: clinic-duty: Combine the various conditionals in Message#work [software] - 10https://gerrit.wikimedia.org/r/789224 [17:56:10] (03CR) 10EllenR: Set log level to 'debug' for mediamoderation (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: 10EllenR) [17:56:13] (03PS1) 10Dzahn: site: use gitlab-runner role on all new physical gitlab-runner servers [puppet] - 10https://gerrit.wikimedia.org/r/789229 (https://phabricator.wikimedia.org/T307142) [17:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27497 and previous config saved to /var/cache/conftool/dbconfig/20220504-175747-ladsgroup.json [17:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:57:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27498 and previous config saved to /var/cache/conftool/dbconfig/20220504-175755-ladsgroup.json [17:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] hashar and brennen: It is that lovely time of the day again! 
You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T1800). [18:00:04] hashar and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T1800). [18:00:45] (JobUnavailable) firing: Reduced availability for job gitlab_runner in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:05:45] train's blocked, i see ebernhardson has a patch - i'll wait for review on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/789227/ [18:06:50] (03PS1) 10Ottomata: Bump eventgate-main image version for T302925 [deployment-charts] - 10https://gerrit.wikimedia.org/r/789231 (https://phabricator.wikimedia.org/T302925) [18:07:17] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Bump eventgate-main image version for T302925 [deployment-charts] - 10https://gerrit.wikimedia.org/r/789231 (https://phabricator.wikimedia.org/T302925) (owner: 10Ottomata) [18:08:14] (03CR) 10Dzahn: [C: 03+2] site: use gitlab-runner role on all new physical gitlab-runner servers [puppet] - 10https://gerrit.wikimedia.org/r/789229 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [18:11:02] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:11:03] 10SRE, 10RESTBase-API: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Dzahn) [18:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:44] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:07] 
10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Dzahn) [18:12:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27499 and previous config saved to /var/cache/conftool/dbconfig/20220504-181252-ladsgroup.json [18:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27500 and previous config saved to /var/cache/conftool/dbconfig/20220504-181301-ladsgroup.json [18:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:06] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:13:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:13:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27501 and previous config saved to /var/cache/conftool/dbconfig/20220504-181356-ladsgroup.json [18:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:15:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1168.eqiad.wmnet with reason: Maintenance [18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27502 and previous config saved to /var/cache/conftool/dbconfig/20220504-181518-ladsgroup.json [18:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:54] (03PS1) 10Ottomata: eventgate-main - pre cache image-suggestions-feedback schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/789235 (https://phabricator.wikimedia.org/T302925) [18:17:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-main - pre cache image-suggestions-feedback schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/789235 (https://phabricator.wikimedia.org/T302925) (owner: 10Ottomata) [18:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27503 and previous config saved to /var/cache/conftool/dbconfig/20220504-181919-ladsgroup.json [18:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:25] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27504 and previous config saved to /var/cache/conftool/dbconfig/20220504-181934-ladsgroup.json [18:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:19:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:16] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:33] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:38] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [18:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:39] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [18:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [18:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P27505 and previous config saved to /var/cache/conftool/dbconfig/20220504-182757-ladsgroup.json [18:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:32:12] (03PS1) 10Jbond: C:raid: update hpsa logic to install ssacli tools on > buster [puppet] - 10https://gerrit.wikimedia.org/r/789240 (https://phabricator.wikimedia.org/T306354) [18:33:00] (03CR) 10jenkins-bot: [V: 04-1] C:raid: update hpsa logic to install ssacli tools on > buster [puppet] - 10https://gerrit.wikimedia.org/r/789240 (https://phabricator.wikimedia.org/T306354) (owner: 10Jbond) [18:33:19] 
!log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=ats-be [18:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=varnish-fe [18:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5002.eqsin.wmnet,service=ats-tls [18:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:53] (03PS2) 10Jbond: C:raid: update hpsa logic to install ssacli tools on > buster [puppet] - 10https://gerrit.wikimedia.org/r/789240 (https://phabricator.wikimedia.org/T306354) [18:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27506 and previous config saved to /var/cache/conftool/dbconfig/20220504-183424-ladsgroup.json [18:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27507 and previous config saved to /var/cache/conftool/dbconfig/20220504-183441-ladsgroup.json [18:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [18:37:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [18:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:32] (03PS1) 10Dzahn: site/gitlab: add gitlab-runner role also on eqiad physical machines [puppet] - 10https://gerrit.wikimedia.org/r/789242 (https://phabricator.wikimedia.org/T307142) [18:40:32] (03CR) 10Dzahn: [C: 03+2] site/gitlab: add gitlab-runner role also on eqiad physical machines [puppet] - 10https://gerrit.wikimedia.org/r/789242 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [18:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T307525)', diff saved to https://phabricator.wikimedia.org/P27509 and previous config saved to /var/cache/conftool/dbconfig/20220504-184302-ladsgroup.json [18:43:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:43:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:11] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:45] (03CR) 10Cathal Mooney: [C: 03+2] "Thanks for the feedback. I'll submit a patch tomorrow, need to discuss how to avoid using the for() loop to get the loopback IP." 
[homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [18:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27510 and previous config saved to /var/cache/conftool/dbconfig/20220504-184929-ladsgroup.json [18:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27511 and previous config saved to /var/cache/conftool/dbconfig/20220504-184946-ladsgroup.json [18:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:36] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/789240 (https://phabricator.wikimedia.org/T306354) (owner: 10Jbond) [18:57:21] (03PS3) 10EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [19:00:07] !log planet2002 - apt autoremove [19:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:40] (03CR) 10EllenR: Set log level to 'debug' for mediamoderation (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: 10EllenR) [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:46] (03PS1) 10Brennen Bearnes: Search against index instead of type [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789174 (https://phabricator.wikimedia.org/T307586) [19:03:40] (03PS1) 
10Dzahn: Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789175 [19:04:10] (03Abandoned) 10Dzahn: Revert "profile: add etcd tlsproxy certificate monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/789175 (owner: 10Dzahn) [19:04:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P27512 and previous config saved to /var/cache/conftool/dbconfig/20220504-190435-ladsgroup.json [19:04:36] (03PS1) 10Dzahn: Revert "Revert "profile: add etcd tlsproxy certificate monitoring"" [puppet] - 10https://gerrit.wikimedia.org/r/789176 [19:04:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:04:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:40] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [19:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T307525)', diff saved to https://phabricator.wikimedia.org/P27513 and previous config saved to /var/cache/conftool/dbconfig/20220504-190451-ladsgroup.json [19:04:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:04:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [19:04:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27514 and previous config saved to /var/cache/conftool/dbconfig/20220504-190459-ladsgroup.json [19:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27515 and previous config saved to /var/cache/conftool/dbconfig/20220504-190919-ladsgroup.json [19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:06] brennen: i don't know that anyone is going to be around to review my patch. I suppose we might as well ship it :) [19:17:35] ebernhardson: yeah, trey's +1 is good enough for me i think. 
clicked backport a bit ago: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/789174 [19:18:34] (03PS1) 10Tchanders: Add SimilarEditors extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) [19:18:49] jouncebot nowandnext [19:18:50] For the next 0 hour(s) and 41 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T1800) [19:18:50] In 0 hour(s) and 41 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T2000) [19:19:14] (03CR) 10Brennen Bearnes: [C: 03+2] Search against index instead of type [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789174 (https://phabricator.wikimedia.org/T307586) (owner: 10Brennen Bearnes) [19:19:29] ok sounds reasonable. I'll be out for 15 minutes on each side of the next hour, but otherwise around if it continues to be problematic (but that patch should do the trick) [19:20:05] (03CR) 10Tchanders: [C: 04-2] "Awaiting code to be deployed to production, following Ic401f672783" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [19:20:09] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:02] ebernhardson: cool, thanks for the fix; it seems testable enough, so i'll go ahead with the backport once tests do their thing. 
[19:22:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [19:22:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27516 and previous config saved to /var/cache/conftool/dbconfig/20220504-192240-ladsgroup.json [19:22:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:22:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:45] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [19:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P27517 and previous config saved to /var/cache/conftool/dbconfig/20220504-192249-ladsgroup.json [19:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27518 and previous config saved to /var/cache/conftool/dbconfig/20220504-192425-ladsgroup.json [19:24:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:47] (03CR) 10Gehel: [C: 04-1] "see inline comments." [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [19:36:01] (03Merged) 10jenkins-bot: Search against index instead of type [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789174 (https://phabricator.wikimedia.org/T307586) (owner: 10Brennen Bearnes) [19:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27519 and previous config saved to /var/cache/conftool/dbconfig/20220504-193930-ladsgroup.json [19:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:57] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:44:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:23] patch tested, syncing and rolling train forward. 
[19:46:40] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:47:51] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/WikibaseCirrusSearch/src/WikibasePrefixSearcher.php: Backport: [[gerrit:789174|Search against index instead of type (T307586)]] (duration: 00m 52s) [19:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:56] T307586: wbsearchentities produces no results on 1.39.0-wmf.10 - https://phabricator.wikimedia.org/T307586 [19:48:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:48:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:12] (03PS2) 10Cwhite: Revert "Revert "profile: add etcd tlsproxy certificate monitoring"" [puppet] - 10https://gerrit.wikimedia.org/r/789176 (owner: 10Dzahn) [19:50:51] hashar: i see there's a revert patch here not pushed up to gerrit, going to push that then go forward to group1 [19:51:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:20] ah, nm, just needed to fetch & rebase. 
[19:53:51] (03CR) 10BryanDavis: [C: 03+1] labswiki: Enable extension SubPageList3 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) (owner: 10Stang) [19:54:04] (03PS1) 10Brennen Bearnes: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789257 [19:54:06] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789257 (owner: 10Brennen Bearnes) [19:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T307525)', diff saved to https://phabricator.wikimedia.org/P27520 and previous config saved to /var/cache/conftool/dbconfig/20220504-195435-ladsgroup.json [19:54:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:54:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:40] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [19:54:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:49] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27521 and previous config saved to /var/cache/conftool/dbconfig/20220504-195448-ladsgroup.json [19:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:42] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789257 (owner: 10Brennen Bearnes) [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:53] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:57:06] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.10 refs T305216 [19:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:10] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [19:57:58] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.10 refs T305216 (duration: 00m 52s) [19:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:05] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27522 and previous config saved to /var/cache/conftool/dbconfig/20220504-195903-ladsgroup.json [19:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:41] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Jdforrester-WMF) A month later, it'd be really nice to get this finally don... [20:00:04] RoanKattouw, Urbanecm, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T2000). [20:00:04] cjming, Tchanders, tgr, bwang, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:34] (03CR) 10Cwhite: [C: 03+2] Revert "Revert "profile: add etcd tlsproxy certificate monitoring"" [puppet] - 10https://gerrit.wikimedia.org/r/789176 (owner: 10Dzahn) [20:01:08] * urbanecm waves [20:01:13] cjming: do you want to deploy, or should i? [20:01:27] Hi o/ [20:01:40] happy to deploy - if anything goes haywire, mind if I ping you urbanecm? [20:01:43] (03Abandoned) 10Andrea Denisse: Use diff --color instead of colordiff as colordiff is not standard [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 (owner: 10Andrea Denisse) [20:01:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:01:50] cjming: not at all. 
[20:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:52] and thanks [20:02:11] and those who want to self-server, please feel free - I'll start with my own patch since it's at the top of the list [20:02:18] *self-serve [20:02:22] hey all, train _just_ hit group1, so do keep in mind there might be log weirdness unrelated to backports. [20:02:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:02:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:02] feel free to ping me if anything looks out of the ordinary. [20:03:02] (03CR) 10Clare Ming: [C: 03+2] Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788849 (https://phabricator.wikimedia.org/T307019) (owner: 10Clare Ming) [20:03:37] I won't be around for a while. My patch is a no-op, can be deployed without me. Alternatively I can deploy it towards the end of the window. [20:03:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:39] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me! This is ready to get put on the backport calendar, IMO." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: 10EllenR) [20:04:53] (03Merged) 10jenkins-bot: Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788849 (https://phabricator.wikimedia.org/T307019) (owner: 10Clare Ming) [20:07:36] Tchanders: you're up next [20:07:46] cjming: thanks [20:07:54] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/WikimediaEvents/includes/PageSplitter/PageSplitterInstrumentation.php: Backport: [[gerrit:788849|Fix undefined offset error (T307019)]] (duration: 00m 50s) [20:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:58] (03CR) 10Clare Ming: [C: 03+2] Enable IPInfo instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480) (owner: 10STran) [20:07:59] T307019: PHP Notice: Undefined offset: 2 in WikimediaEvents\PageSplitter\PageSplitterInstrumentation->getBucket - https://phabricator.wikimedia.org/T307019 [20:08:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:52] (03Merged) 10jenkins-bot: Enable IPInfo instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480) (owner: 10STran) [20:09:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:09:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P27523 and previous config saved to /var/cache/conftool/dbconfig/20220504-201002-ladsgroup.json [20:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:07] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [20:10:52] Tchanders: your patch is on mwdebug1001 - is it something you can verify? [20:11:09] cjming: Should be - taking a look [20:13:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:40] cjming: looks great [20:13:47] cool - syncing now then [20:13:53] Thank you! [20:14:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27524 and previous config saved to /var/cache/conftool/dbconfig/20220504-201409-ladsgroup.json [20:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:08] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788872|Enable IPInfo instrumentation on all wikis (T296480)]] (duration: 00m 56s) [20:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:12] T296480: Enable the ipinfo instrument in production - https://phabricator.wikimedia.org/T296480 [20:15:13] Tchanders: np! 
your change should be live [20:15:26] tgr: i'll go ahead and deploy your change [20:16:51] (03PS2) 10Clare Ming: Duplicate eswiki Growth campaign config to itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788776 (owner: 10Gergő Tisza) [20:18:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:43] (03PS1) 10Ssingh: varnish: mask the varnishncsa service [puppet] - 10https://gerrit.wikimedia.org/r/789262 [20:20:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35092/console" [puppet] - 10https://gerrit.wikimedia.org/r/789262 (owner: 10Ssingh) [20:22:57] (03CR) 10Clare Ming: [C: 03+2] Duplicate eswiki Growth campaign config to itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788776 (owner: 10Gergő Tisza) [20:23:54] (03PS1) 10Bernard Wang: Fix TOC fadeout placement [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789263 (https://phabricator.wikimedia.org/T306893) [20:24:42] (03CR) 10Clare Ming: [C: 03+1] Fix TOC fadeout placement [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789263 (https://phabricator.wikimedia.org/T306893) (owner: 10Bernard Wang) [20:24:52] (03Merged) 10jenkins-bot: Duplicate eswiki Growth campaign config to itwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/788776 (owner: 10Gergő Tisza) [20:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27525 and previous config saved to /var/cache/conftool/dbconfig/20220504-202508-ladsgroup.json [20:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:41] !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka jumbo-eqiad cluster: Reboot kafka nodes [20:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:50] alrighty @bwang - your patch is next [20:26:57] sounds good! [20:27:01] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788776|Duplicate eswiki Growth campaign config to itwiki]] (duration: 00m 51s) [20:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:18] (03PS3) 10Stang: labswiki: Enable extension SubPageList3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) [20:28:10] bwang: just waiting for CI to let me +2 [20:28:34] (03PS1) 10Dwisehaupt: Add new on disk certificate checks [puppet] - 10https://gerrit.wikimedia.org/r/789267 (https://phabricator.wikimedia.org/T307476) [20:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27526 and previous config saved to /var/cache/conftool/dbconfig/20220504-202914-ladsgroup.json [20:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:48] tgr: fyi your change is live [20:30:22] (03PS10) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [20:30:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [20:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27527 and previous config saved to /var/cache/conftool/dbconfig/20220504-203051-ladsgroup.json [20:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:56] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [20:31:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:31] hrm, seeing a bunch of these monolog errors: https://phabricator.wikimedia.org/T307626 [20:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:53] (03PS2) 10Ssingh: varnish: mask the varnishncsa service [puppet] - 10https://gerrit.wikimedia.org/r/789262 [20:31:58] all wikitech, seems plausibly train-related. [20:32:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35093/console" [puppet] - 10https://gerrit.wikimedia.org/r/789262 (owner: 10Ssingh) [20:33:06] (at least on the timing.) 
[20:34:51] (03PS2) 10Ladsgroup: maintain-views: Drop views on revision_actor_temp [puppet] - 10https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) [20:35:10] (03CR) 10Razzi: [C: 03+1] maintain-views: Drop views on revision_actor_temp [puppet] - 10https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [20:35:34] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] maintain-views: Drop views on revision_actor_temp [puppet] - 10https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [20:36:02] brennen: ok to carry on? [20:36:16] yeah, it seems to have tailed off and timing could be coincidental. [20:36:31] i'd say go ahead. [20:36:42] cool thanks [20:38:43] (03PS1) 10Dzahn: etcd::tlsproxy: add monitoring for TLS cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/789270 [20:39:43] (03CR) 10Jgreen: [C: 03+2] Add new on disk certificate checks [puppet] - 10https://gerrit.wikimedia.org/r/789267 (https://phabricator.wikimedia.org/T307476) (owner: 10Dwisehaupt) [20:40:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27528 and previous config saved to /var/cache/conftool/dbconfig/20220504-204013-ladsgroup.json [20:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:41] (03PS1) 10Razzi: dbproxy: depool clouddb1013-1017 to update views [puppet] - 10https://gerrit.wikimedia.org/r/789271 (https://phabricator.wikimedia.org/T275246) [20:41:21] (03CR) 10Ladsgroup: [C: 03+1] dbproxy: depool clouddb1013-1017 to update views [puppet] - 10https://gerrit.wikimedia.org/r/789271 (https://phabricator.wikimedia.org/T275246) (owner: 10Razzi) [20:41:39] (03CR) 10Razzi: [C: 03+2] dbproxy: depool clouddb1013-1017 to update views [puppet] - 10https://gerrit.wikimedia.org/r/789271 (https://phabricator.wikimedia.org/T275246) (owner: 10Razzi) [20:42:27] 
(03PS2) 10Dzahn: etcd::tlsproxy: add monitoring for TLS cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/789270 (https://phabricator.wikimedia.org/T307383) [20:43:21] (03CR) 10Andrew Bogott: "Just for my peace of mind: this is behind Keystone now? Both ports?" [puppet] - 10https://gerrit.wikimedia.org/r/788761 (owner: 10Majavah) [20:44:10] koi: are you here? I can merge yours while waiting for CI to finish [20:44:14] yeah [20:44:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T307525)', diff saved to https://phabricator.wikimedia.org/P27529 and previous config saved to /var/cache/conftool/dbconfig/20220504-204419-ladsgroup.json [20:44:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:44:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:24] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [20:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27530 and previous config saved to /var/cache/conftool/dbconfig/20220504-204427-ladsgroup.json [20:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:20] (03CR) 10Dzahn: [C: 03+2] "I just wondered for quite some time why Gerrit rejects my amended patch until I realized it was merged. heh" [puppet] - 10https://gerrit.wikimedia.org/r/789169 (owner: 10Dzahn) [20:45:28] cjming could you please manually "recheck" my patch? 
I changed my email and allowlist is still at the old version [20:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27531 and previous config saved to /var/cache/conftool/dbconfig/20220504-204556-ladsgroup.json [20:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:05] (03CR) 10Clare Ming: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) (owner: 10Stang) [20:46:18] (03CR) 10Dzahn: "sorry, but I did not mean to merge this either. Just could not figure out why Gerrit would reject my amended patch" [puppet] - 10https://gerrit.wikimedia.org/r/789176 (owner: 10Dzahn) [20:48:09] (03CR) 10Dzahn: "with 1st and 2nd parameter being identical there is actually no reason to use "check_ssl_on_host_port"" [puppet] - 10https://gerrit.wikimedia.org/r/789176 (owner: 10Dzahn) [20:48:55] thanks cjming! [20:49:58] (03CR) 10Clare Ming: [C: 03+2] Fix TOC fadeout placement [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789263 (https://phabricator.wikimedia.org/T306893) (owner: 10Bernard Wang) [20:50:31] queueing for really a long time today.. [20:51:01] yes [20:51:52] (03CR) 10Dzahn: "it works! 
just in 1821-60 days we will get 6 alerts" [puppet] - 10https://gerrit.wikimedia.org/r/789176 (owner: 10Dzahn) [20:52:31] (03CR) 10Clare Ming: [C: 03+2] labswiki: Enable extension SubPageList3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) (owner: 10Stang) [20:52:42] (03Abandoned) 10Dzahn: etcd::tlsproxy: add monitoring for TLS cert expiration [puppet] - 10https://gerrit.wikimedia.org/r/789270 (https://phabricator.wikimedia.org/T307383) (owner: 10Dzahn) [20:53:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10Multichill) Is this upgrade supposed to be without impact? Does it impact the performance of the Swift cluster? I've got quite a few of these for the last couple of days: WARNING:... [20:53:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27532 and previous config saved to /var/cache/conftool/dbconfig/20220504-205349-ladsgroup.json [20:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:54] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [20:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P27533 and previous config saved to /var/cache/conftool/dbconfig/20220504-205518-ladsgroup.json [20:55:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:55:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:25] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27534 and previous config saved to /var/cache/conftool/dbconfig/20220504-205526-ladsgroup.json [20:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:41] 10SRE-OnFire-Incident-Docs, 10Observability-Alerting, 10serviceops-radar, 10Patch-For-Review, and 2 others: Certificate expiration monitoring - https://phabricator.wikimedia.org/T307383 (10Dzahn) monitoring has been added in Icinga and works now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?sea... [20:56:54] (03Merged) 10jenkins-bot: labswiki: Enable extension SubPageList3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181) (owner: 10Stang) [20:58:37] koi: since I'm still waiting for bwang's patch, I'll go ahead and sync yours [20:59:10] oh thanks! is that ok without process on mwdebug? [20:59:32] it's on mwdebug1001 now - can you confirm? [20:59:44] then I'll sync [20:59:50] looking [21:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P27535 and previous config saved to /var/cache/conftool/dbconfig/20220504-210101-ladsgroup.json [21:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:44] strange, could not found this extension in Special:Version [21:02:35] 10SRE-OnFire-Incident-Docs, 10Observability-Alerting, 10serviceops-radar, 10Patch-For-Review, and 2 others: Certificate expiration monitoring - https://phabricator.wikimedia.org/T307383 (10colewhite) 05Open→03Resolved a:03colewhite >>! In T307383#7905029, @Dzahn wrote: > only slight issue I see is we... 
[21:02:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:03:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:06] koi: what's the url? [21:04:13] https://wikitech.wikimedia.org/wiki/Special:Version [21:04:33] there should have a `` tag inside [21:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:46] urbanecm: can I trouble you about koi's patch? neither of us is seeing the extension [21:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27536 and previous config saved to /var/cache/conftool/dbconfig/20220504-210854-ladsgroup.json [21:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:18] 10SRE-OnFire-Incident-Docs, 10Observability-Alerting, 10serviceops-radar, 10Patch-For-Review, and 2 others: Certificate expiration monitoring - https://phabricator.wikimedia.org/T307383 (10Dzahn) Good points, especially about the reload notify! Alright, yep. [21:11:05] !log running extensions/GrowthExperiments/maintenance/changeWikiConfig.php for T306792 [21:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:11] T306792: initWikiConfig should set excludedSections for link-recommendation task type - https://phabricator.wikimedia.org/T306792 [21:11:26] cjming: does mwdebug work for wikitech? 
[21:11:51] it used to be on a separate server group or something like that [21:11:53] i have no idea - should I go ahead and sync? [21:12:10] or should I scap pull to another debug server? [21:12:17] cjming: sorry, i just saw the ping [21:12:20] mwdebug does not work for wikitech [21:12:25] ah - gtk [21:12:37] it's served separately from the main servers (so a lot of things are different) [21:12:44] (there are plans to change that, but...needs some time) [21:12:50] so, just sync :) [21:12:57] will do - thanks [21:13:03] no problem [21:13:24] (03PS2) 10Razzi: superset-next: disable require_u2f for now [puppet] - 10https://gerrit.wikimedia.org/r/788774 (https://phabricator.wikimedia.org/T275575) [21:13:58] this behavior is documented at https://wikitech.wikimedia.org/wiki/WikimediaDebug#Limitations, but that can certainly be made more visible :)) [21:14:05] (03CR) 10ODimitrijevic: [C: 03+1] "Lgtm. This is the same behavior as prod superset." [puppet] - 10https://gerrit.wikimedia.org/r/788774 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [21:14:07] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788819|labswiki: Enable extension SubPageList3 (T304181)]] (duration: 00m 51s) [21:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:12] T304181: Request to enable the SubPageList3 extension on wikitech - https://phabricator.wikimedia.org/T304181 [21:14:21] alrighty - koi: your patch should be live [21:14:42] yeah I see that extension installed [21:14:43] ty! 
[21:14:45] i see the extension at special:Version, so i guess it works :) [21:14:49] (03CR) 10Razzi: [C: 03+2] superset-next: disable require_u2f for now [puppet] - 10https://gerrit.wikimedia.org/r/788774 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [21:14:58] nice - thanks for confirming [21:15:10] (03Merged) 10jenkins-bot: Fix TOC fadeout placement [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789263 (https://phabricator.wikimedia.org/T306893) (owner: 10Bernard Wang) [21:15:43] ^^ 25 minutes! [21:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27537 and previous config saved to /var/cache/conftool/dbconfig/20220504-211607-ladsgroup.json [21:16:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:16:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:12] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [21:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:04] bwang: mind verifying your change on mwdebug1001? [21:17:11] yes! [21:17:21] oh BTW urbanecm, would you mind have a look at the message I left on your meta talkpage? 
[21:17:29] koi: sure [21:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27538 and previous config saved to /var/cache/conftool/dbconfig/20220504-211853-ladsgroup.json [21:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:33] (03PS3) 10Ladsgroup: wikireplicas: Add linktarget to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/788278 (https://phabricator.wikimedia.org/T305064) [21:19:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:21] (03PS4) 10Ladsgroup: wikireplicas: Add linktarget to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/788278 (https://phabricator.wikimedia.org/T305064) [21:20:29] urbanecm: another Q -- for wmf10 changes, they should still be visible on mwdebug right? [21:20:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:20:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wikireplicas: Add linktarget to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/788278 (https://phabricator.wikimedia.org/T305064) (owner: 10Ladsgroup) [21:20:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:52] cjming: yes, but only on group0/group1 wikis (as that's where the train is now) [21:21:16] (and of course, if we exclude wikitech [21:21:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:05] urbanecm: got it - of course [21:23:13] no problem [21:23:17] koi: you've a re there. 
hth. [21:23:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:23:41] hmm that's not me :( [21:23:56] cjming: feeling ok about things generally? logs are looking good from a train perspective so i'm about to step away for a break. [21:24:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27539 and previous config saved to /var/cache/conftool/dbconfig/20220504-212401-ladsgroup.json [21:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:12] koi: sorry! Replying to the other one too :D [21:24:23] brennen: seems good - just syncing last patch and i'll close window here shortly [21:24:26] 1001 looks good to me! [21:24:31] woohoo - syncing [21:24:33] right on. 
[21:25:43] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/skins/Vector/resources/skins.vector.styles/components: Backport: [[gerrit:789263|Fix TOC fadeout placement (T306893)]] (duration: 00m 51s) [21:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:48] T306893: [ToC] Bottom fade issue - https://phabricator.wikimedia.org/T306893 [21:26:15] bwang: your change should be live - will roll out to group2 tomorrow [21:27:21] !log end of UTC late backport & config window [21:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:55] cjming: assuming you're done with B&C, I'll squeeze sth in :) [21:28:26] urbanecm: all yours [21:28:31] thank you [21:30:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:31:58] (03PS1) 10JHathaway: admin: Add Ariel Gutman to LDAP only accounts [puppet] - 10https://gerrit.wikimedia.org/r/789288 (https://phabricator.wikimedia.org/T307582) [21:32:10] actually, decided not to do it now [21:33:31] (03CR) 10JHathaway: "Kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/789288 (https://phabricator.wikimedia.org/T307582) (owner: 10JHathaway) [21:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27540 and previous config saved to /var/cache/conftool/dbconfig/20220504-213358-ladsgroup.json [21:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:02] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Ariel Gutman - https://phabricator.wikimedia.org/T307582 (10jhathaway) @AGutman-WMF I assume you don't need shell access? 
[21:35:12] (03CR) 10Urbanecm: [C: 03+2] Restrict 'flow-hide' right to autoconfirmed users on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631930 (https://phabricator.wikimedia.org/T264489) (owner: 10Urbanecm) [21:35:24] (this was not a C+2, but a re) [21:35:43] urbanecm: i think this is the new Gerrit version [21:35:48] yeah [21:35:52] you can't "change vote" [21:35:52] but it's confusing, too :/ [21:36:02] but it used to be no problem to leave comments with vote 0 [21:36:07] even after voting before [21:36:10] yeah [21:36:11] agreed [21:36:49] people didn't receive notification on patch if it is merged, seems [21:37:10] and that's...also confusing [21:37:14] also wonder if it is allowed but could I have a look at task :) [21:38:07] koi: don't see why not. added you. note it's confidential [21:38:22] usually at some point after being resolved they move to public [21:38:25] but security decides [21:38:49] mutante: yep, but in this case, _another_ sec ticket is discussed there, so...not time to publish yet [21:39:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T307525)', diff saved to https://phabricator.wikimedia.org/P27541 and previous config saved to /var/cache/conftool/dbconfig/20220504-213908-ladsgroup.json [21:39:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:39:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:13] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [21:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27542 and previous config saved to /var/cache/conftool/dbconfig/20220504-213916-ladsgroup.json [21:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:23] aha an old friend [21:39:30] I will have a talk with them [21:39:39] thanks [21:45:06] !log Start server-side upload of 1 TIFF file (~2.1G; T300857) [21:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:11] T300857: Server-side upload request for Eatdirt - https://phabricator.wikimedia.org/T300857 [21:46:32] (03CR) 10Dduvall: Revert "Revert "contint: Install docker 20.10 from thirdparty/ci on buster"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768774 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [21:47:42] (03PS1) 10Ladsgroup: dbproxy: Repool the old batch, Depool the new one [puppet] - 10https://gerrit.wikimedia.org/r/789291 (https://phabricator.wikimedia.org/T275246) [21:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27543 and previous config saved to /var/cache/conftool/dbconfig/20220504-214903-ladsgroup.json [21:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:35] (03CR) 10Ladsgroup: [C: 03+2] dbproxy: Repool the old batch, Depool the new one [puppet] - 10https://gerrit.wikimedia.org/r/789291 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [21:52:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27544 and previous config saved to /var/cache/conftool/dbconfig/20220504-215217-ladsgroup.json [21:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:22] T307525: Fix 
mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [21:52:31] (03PS1) 10Andrea Denisse: Add denisse to sms group [puppet] - 10https://gerrit.wikimedia.org/r/789301 [21:53:36] (03CR) 10Dzahn: [C: 03+1] "yep, aware this _might_ not be needed anymore since VictorOps but I am just going through the classic steps for Icinga and it's like a dem" [puppet] - 10https://gerrit.wikimedia.org/r/789301 (owner: 10Andrea Denisse) [21:56:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:56:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [21:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [21:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:26] (03CR) 10Andrea Denisse: [C: 03+2] Add denisse to sms group [puppet] - 10https://gerrit.wikimedia.org/r/789301 (owner: 10Andrea Denisse) [22:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P27545 and previous config saved to /var/cache/conftool/dbconfig/20220504-220409-ladsgroup.json [22:04:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [22:04:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [22:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:15] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [22:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27546 and previous config saved to /var/cache/conftool/dbconfig/20220504-220417-ladsgroup.json [22:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27547 and previous config saved to /var/cache/conftool/dbconfig/20220504-220722-ladsgroup.json [22:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:35] (03PS2) 10Andrea Denisse: admin: Add Andrea Denisse to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/788817 [22:19:17] (03CR) 10Andrea Denisse: [C: 03+2] admin: Add Andrea Denisse to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/788817 (owner: 10Andrea Denisse) [22:20:54] (03PS1) 10Ladsgroup: dbproxy: Repool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/789308 (https://phabricator.wikimedia.org/T304733) [22:21:33] (03PS2) 10Ladsgroup: dbproxy: Repool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/789308 (https://phabricator.wikimedia.org/T304733) [22:22:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27548 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-222227-ladsgroup.json [22:22:30] (03CR) 10Ladsgroup: [V: 03+2] dbproxy: Repool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/789308 (https://phabricator.wikimedia.org/T304733) (owner: 10Ladsgroup) [22:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:33] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] dbproxy: Repool clouddb10(17|18|19|20) [puppet] - 10https://gerrit.wikimedia.org/r/789308 (https://phabricator.wikimedia.org/T304733) (owner: 10Ladsgroup) [22:23:34] (03PS1) 10Razzi: matomo: enable maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/789310 [22:25:25] (03CR) 10Razzi: [C: 03+2] matomo: enable maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/789310 (owner: 10Razzi) [22:26:48] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet [22:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:34] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:29:20] CUSTOM - puppet last run on planet1002 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:29:24] 10SRE, 10SRE-Access-Requests, 10Generated Data Platform: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10jhathaway) @WDoranWMF happy to help on this access request. Would you be so kind as to update this ticket with the access request form det... [22:29:31] (03PS1) 10Razzi: Revert "matomo: enable maintenance mode" [puppet] - 10https://gerrit.wikimedia.org/r/789313 [22:29:32] denisse: see the CUSTOM line above, that was you :) [22:30:15] mutante: Awesome!! 
:D [22:30:20] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet [22:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:29] (CR) Razzi: [C: +2] Revert "matomo: enable maintenance mode" [puppet] - https://gerrit.wikimedia.org/r/789313 (owner: Razzi) [22:31:34] (PS1) Ladsgroup: labs: Set actor migration to write new only [mediawiki-config] - https://gerrit.wikimedia.org/r/789314 (https://phabricator.wikimedia.org/T275246) [22:32:50] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [22:32:54] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:34:49] (CR) Ladsgroup: [C: +2] labs: Set actor migration to write new only [mediawiki-config] - https://gerrit.wikimedia.org/r/789314 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [22:35:03] jouncebot: nowandnext [22:35:03] No deployments scheduled for the next 7 hour(s) and 24 minute(s) [22:35:03] In 7 hour(s) and 24 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220505T0600) [22:35:18] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:35:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [22:35:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [22:35:57] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27550 and previous config saved to /var/cache/conftool/dbconfig/20220504-223601-ladsgroup.json [22:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [22:36:20] (Merged) jenkins-bot: labs: Set actor migration to write new only [mediawiki-config] - https://gerrit.wikimedia.org/r/789314 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [22:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27551 and previous config saved to /var/cache/conftool/dbconfig/20220504-223732-ladsgroup.json [22:37:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:37:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:39:18] (ProbeDown) firing: (2) Service ncredir:80 has
failed probes (http_ncredir_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:25] here [22:40:37] well, I did not see the page but I saw IRC [22:40:39] so here [22:40:49] I'm here too [22:41:11] note ncredir isn't in the critical flow of any primary traffic [22:41:26] what's more worrying is whether it's the canary for something else that will page shortly as well [22:41:29] this is only paging us because it's the new type of checks, right [22:41:39] what Filippo mailed about [22:41:45] hence "jinxer-wm" [22:41:46] mutante: I believe so [22:42:06] esams and eqsin both [22:42:11] and he already said to watch out and how we had some false positives due to thresholds [22:42:16] ok [22:42:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:41] mail thread "[Ops] Paging network probes for service::catalog" [22:43:04] yeah I see it [22:43:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:43:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:45] looks like we had a legit surge of requests to ncredir, FWIW [22:44:06] https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1 [22:44:09] reading https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown [22:44:18] (ProbeDown) resolved: (2) Service ncredir:80 has failed probes (http_ncredir_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:46] in any case, seems to have self-resolved for now [22:44:48] wikipedia.com heh [22:44:51] (PS10) Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [22:44:56] yeah [22:44:58] something accessed .com and not .org [22:45:16] ncredir's whole job in life is to listen to all of those "other" domains we own like wikipedia.com. The ones we don't intend anyone to really use. [22:45:23] and respond with a simple redirect over to our real sites [22:45:23] ok, so, should we somehow ACK those? not sure yet where [22:45:37] so I'd say, it's nothing to panic about in general [22:45:58] ok, so perhaps a scraper hitting wikipedia.com?
[22:46:02] yes [22:46:21] (or something similar) [22:46:23] (CR) Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version (3 comments) [puppet] - https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: Bking) [22:46:32] we should leave a comment on that ticket [22:47:08] ncredir is our extremely creative shortening of "non-canonical redirector", meaning non-canonical domain names (ones we don't consider official, all very low traffic normally) [22:47:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:47:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [22:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [22:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:47] bblack: your description was much easier to parse than the one on wikitech https://wikitech.wikimedia.org/wiki/Ncredir [22:48:44] jhathaway: I'll see if I can make it clearer right quick, add something at the top [22:48:50] jhathaway: did you get the SMS from victorops just like "normal" pages?
[22:48:55] (CR) Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version (1 comment) [puppet] - https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: Bking) [22:48:56] bblack: thanks [22:49:06] mutante: yes [22:49:27] jhathaway: it has the "ACK" and "resolve" and numbers next to it. let's reply with the ACK number [22:49:52] yes, I acked it [22:49:56] cool, thanks [22:50:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27552 and previous config saved to /var/cache/conftool/dbconfig/20220504-225048-ladsgroup.json [22:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [22:51:08] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:51:24] I need to step away for a moment, if someone is able to find the logs in logstash, please post a link [22:51:29] ^ this is because Icinga thinks the IP for those mgmt hosts is 0.0.0.0 btw [22:56:40] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:57:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:57:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:44] updated
https://wikitech.wikimedia.org/wiki/Ncredir a bit, hopefully clearer [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27553 and previous config saved to /var/cache/conftool/dbconfig/20220504-230432-ladsgroup.json [23:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:37] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [23:05:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27554 and previous config saved to /var/cache/conftool/dbconfig/20220504-230553-ladsgroup.json [23:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:07:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:07:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27555 and previous config saved to /var/cache/conftool/dbconfig/20220504-230727-ladsgroup.json [23:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:16] bblack: thanks [23:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27556 and previous config saved to /var/cache/conftool/dbconfig/20220504-231937-ladsgroup.json [23:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T307525)', diff saved to https://phabricator.wikimedia.org/P27557 and previous config saved to /var/cache/conftool/dbconfig/20220504-232007-ladsgroup.json [23:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:15] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [23:20:28] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:20:30] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to 
https://phabricator.wikimedia.org/P27558 and previous config saved to /var/cache/conftool/dbconfig/20220504-232058-ladsgroup.json [23:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:25] I pinged about the codfw DBs there [23:24:34] reimaging is going on though [23:27:30] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:28:21] I can confirm db2101 is up and xtrabackup is running [23:28:32] and pigz [23:30:38] it's only _from_ db2101 and that is backup source [23:30:42] and pigz uses tons of CPU [23:31:44] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:31:44] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:32:04] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:32:14] ok then [23:32:23] going afk before the next thing [23:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27559 and previous config saved to /var/cache/conftool/dbconfig/20220504-233442-ladsgroup.json [23:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27560 and previous config saved to 
/var/cache/conftool/dbconfig/20220504-233515-ladsgroup.json [23:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T307525)', diff saved to https://phabricator.wikimedia.org/P27561 and previous config saved to /var/cache/conftool/dbconfig/20220504-233604-ladsgroup.json [23:36:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:36:07] SRE, DBA, observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (Dzahn) [23:36:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [23:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:08] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [23:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T307525)', diff saved to https://phabricator.wikimedia.org/P27562 and previous config saved to /var/cache/conftool/dbconfig/20220504-233611-ladsgroup.json [23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:42] SRE, DBA, observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (Dzahn) [23:37:11] SRE, DBA, observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (Dzahn) [23:38:01] SRE, DBA, observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (Dzahn) [23:46:54]
(NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T307525)', diff saved to https://phabricator.wikimedia.org/P27563 and previous config saved to /var/cache/conftool/dbconfig/20220504-234947-ladsgroup.json [23:49:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:49:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:49:53] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [23:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T307525)', diff saved to https://phabricator.wikimedia.org/P27564 and previous config saved to /var/cache/conftool/dbconfig/20220504-235000-ladsgroup.json [23:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27565 and previous config saved to /var/cache/conftool/dbconfig/20220504-235020-ladsgroup.json [23:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:14] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale