[00:00:12] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:48] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:00] <wikibugs>	 (PS1) Jdlrobson: Enable ResourceLoader client preferences on beta cluster"" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:41:19] <wikibugs>	 (PS2) Jdlrobson: Enable ResourceLoader client preferences on beta cluster"" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:41:59] <wikibugs>	 (CR) Jdlrobson: "Hey Zabe thanks for catching that. (FWIW luckily this is a NOOP in production servers right now, but would have led to this rolling out wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[00:45:33] <wikibugs>	 (PS3) Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:47:29] <wikibugs>	 (PS1) Eevans: cassandra-dev: enable internode encryption [puppet] - https://gerrit.wikimedia.org/r/883682
[00:48:25] <wikibugs>	 (CR) Eevans: "check_experimental" [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:48:55] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2028.codfw.wmnet,service=cdn
[00:48:55] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2028.codfw.wmnet,service=ats-be
[00:49:11] <sukhe>	 !log depool cp2028 for testing firmware update cookbook: T321309
[00:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:15] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[00:50:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[00:50:56] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:51:26] <wikibugs>	 (PS4) Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:51:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2028.codfw.wmnet
[00:53:24] <wikibugs>	 (CR) Eevans: [C: +2] cassandra-dev: enable internode encryption [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:53:49] <wikibugs>	 (CR) Jdrewniak: [C: +2] Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[00:54:32] <wikibugs>	 (Merged) jenkins-bot: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[01:00:12] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2*: Enable internode encryption - eevans@cumin1001
[01:02:58] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:03:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2028.codfw.wmnet
[01:03:10] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[01:05:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2028.codfw.wmnet with OS bullseye
[01:05:15] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2028.codfw.wmnet with OS bullseye
[01:19:28] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2*: Enable internode encryption - eevans@cumin1001
[01:20:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage
[01:23:24] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage
[01:28:25] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[01:46:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2028.codfw.wmnet with OS bullseye
[01:46:49] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2028.codfw.wmnet with OS bullseye completed: - cp2028 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[01:46:52] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2027.codfw.wmnet,service=cdn
[01:46:52] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2027.codfw.wmnet,service=ats-be
[01:47:08] <icinga-wm>	 PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:00] <sukhe>	 ^ fixing
[01:48:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2027.codfw.wmnet with reason: firmware test
[01:48:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2027.codfw.wmnet with reason: firmware test
[01:49:48] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2028.codfw.wmnet,service=cdn
[01:49:48] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2028.codfw.wmnet,service=ats-be
[01:51:04] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Archive metavid-l - https://phabricator.wikimedia.org/T327971 (Ladsgroup) Open→Resolved a:Ladsgroup {{done}}
[01:53:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[01:53:30] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[01:53:58] <icinga-wm>	 RECOVERY - Host cp2027 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[01:55:59] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (ssingh) Since we started reimaging the cp hosts to bullseye, this has come up again and I was loo...
[01:59:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:01:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:05:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:09:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 28 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:10] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:15:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:35] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[02:17:41] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[02:17:54] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[02:18:00] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:05] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) `cp2027`, for later debugging:  ` Jan 26 02:23:56 partman-auto-raid: Selected spare count: 0 Jan 26 02:23:56 partman-auto-raid: Spare devices count: 0 Jan 26 02:23:56 partman-auto-raid: mdadm: cannot open...
[02:30:30] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[02:30:36] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[02:41:44] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS bullseye
[02:41:50] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6013.drmrs.wmnet with OS bullseye
[02:46:44] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:01:33] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[03:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:33] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[03:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:26:35] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS bullseye
[03:26:39] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6013.drmrs.wmnet with OS bullseye completed: - cp6013 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[03:27:57] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6013.drmrs.wmnet
[03:28:07] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:29:00] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[03:29:15] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS bullseye
[03:29:21] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6005.drmrs.wmnet with OS bullseye
[03:49:12] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[03:52:04] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[03:59:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:19] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:17:55] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS bullseye
[04:18:01] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6005.drmrs.wmnet with OS bullseye completed: - cp6005 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[04:22:28] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6005.drmrs.wmnet
[04:23:07] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[04:24:01] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS bullseye
[04:24:07] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6014.drmrs.wmnet with OS bullseye
[04:42:19] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[04:45:31] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[04:47:48] <wikibugs>	 (PS1) Ladsgroup: Revert "Disable PHP L10n in beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/883707
[04:48:01] <wikibugs>	 (PS2) Ladsgroup: Revert "Disable PHP L10n in beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/883707 (https://phabricator.wikimedia.org/T99740)
[05:04:17] <wikibugs>	 SRE, Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (JJMC89)
[05:07:00] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS bullseye
[05:07:06] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6014.drmrs.wmnet with OS bullseye completed: - cp6014 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[05:09:15] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6014.drmrs.wmnet
[05:10:00] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[05:10:15] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS bullseye
[05:10:21] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6006.drmrs.wmnet with OS bullseye
[05:11:34] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:28:40] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[05:32:30] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[05:42:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:53:21] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS bullseye
[05:53:26] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6006.drmrs.wmnet with OS bullseye completed: - cp6006 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[05:53:57] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6006.drmrs.wmnet
[05:54:25] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[05:57:16] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS bullseye
[05:57:24] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6015.drmrs.wmnet with OS bullseye
[06:10:13] <Amir1>	 jouncebot: nowandnext
[06:10:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[06:10:14] <jouncebot>	 In 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[06:10:14] <jouncebot>	 In 0 hour(s) and 49 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[06:13:51] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:16:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327861
[06:16:48] <stashbot>	 T327861: Switchover x1 master (db1120 -> db1103) - https://phabricator.wikimedia.org/T327861
[06:17:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327861
[06:17:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1103 with weight 0 T327861', diff saved to https://phabricator.wikimedia.org/P43350 and previous config saved to /var/cache/conftool/dbconfig/20230126-061751-root.json
[06:18:08] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[06:18:32] <wikibugs>	 (CR) Marostegui: mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[06:18:37] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[06:19:11] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:20:41] <Amir1>	 hmm
[06:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:52] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[06:22:17] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:22:24] <Amir1>	 false alert
[06:24:25] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:45] <Amir1>	 so wikiuser2023 works just fine in mwdebug, syncing 
[06:26:35] <Amir1>	 37	______▇	0622 ○	0626 ●	DBReadOnlyError.....	.19 i/l/r/d/Database:675  Database is read-only: The database is read-only until replication lag decreases.
[06:26:43] <Amir1>	 only 37 though 
[06:26:56] <Amir1>	 It's Manuel's x1 switchover
[06:28:16] <marostegui>	 yep
[06:30:09] <wikibugs>	 (PS1) Marostegui: Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/883710
[06:30:14] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:30:33] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/883710 (owner: Marostegui)
[06:32:49] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Rotating wikiuser password (T326802) (duration: 07m 23s)
[06:32:53] <stashbot>	 T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802
[06:37:52] <wikibugs>	 (PS1) Ladsgroup: mariadb: Rotate wikiuser to wikiuser2023 [puppet] - https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802)
[06:38:47] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Rotate wikiuser to wikiuser2023 [puppet] - https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:42:12] <wikibugs>	 (PS1) Marostegui: db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/883694 (https://phabricator.wikimedia.org/T327861)
[06:42:35] <wikibugs>	 (CR) Marostegui: [C: +2] db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/883694 (https://phabricator.wikimedia.org/T327861) (owner: Marostegui)
[06:42:40] <wikibugs>	 (PS1) Ladsgroup: dbtools: Rotate wikiuser [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802)
[06:42:46] <wikibugs>	 (CR) CI reject: [V: -1] dbtools: Rotate wikiuser [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:43:08] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:43:28] <marostegui>	 Amir1: ^ especially with the triggers, let's wait for the dc switch to avoid running into race conditions?
[06:43:40] <marostegui>	 Amir1: We'd need to change the triggers across all the hosts in production
[06:43:59] <Amir1>	 marostegui: I can automate that
[06:44:11] <Amir1>	 the whole thing is mostly automated
[06:44:17] <marostegui>	 Amir1: Sure, what I mean is, the switchover is in 15 minutes, let's wait until it is done
[06:44:26] <Amir1>	 oh that one
[06:44:32] <Amir1>	 sure, you said dc switchover 
[06:44:40] <wikibugs>	 (PS1) Majavah: fix nova-metadata firewall rules [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980)
[06:44:49] <Amir1>	 I thought I need to wait months :D
[06:44:49] <marostegui>	 oh sorry
[06:44:51] <marostegui>	 I meant x1 
[06:44:58] <Amir1>	 so many switchovers
[06:45:34] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: remove grants and settings for racktables db (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[06:47:00] <wikibugs>	 (PS2) Majavah: fix nova-metadata firewall rules [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980)
[06:48:11] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS bullseye
[06:48:17] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6015.drmrs.wmnet with OS bullseye completed: - cp6015 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[06:48:24] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39260/console" [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980) (owner: Majavah)
[06:48:34] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6015.drmrs.wmnet
[06:49:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) Script used to generate the servers lists: {P43345}
[06:49:21] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:49:33] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[06:50:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[06:51:42] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:52:45] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[06:52:57] <wikibugs>	 (CR) Ladsgroup: "recheck" [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:53:05] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Adding Jaime for the backup related hosts
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700).
[07:00:08] <marostegui>	 !log Starting x1 eqiad failover from db1120 to db1103 - T327861
[07:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:12] <stashbot>	 T327861: Switchover x1 master (db1120 -> db1103) - https://phabricator.wikimedia.org/T327861
[07:00:19] <Amir1>	 o/
[07:00:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1103 to x1 primary and set section read-write T327861', diff saved to https://phabricator.wikimedia.org/P43351 and previous config saved to /var/cache/conftool/dbconfig/20230126-070035-marostegui.json
[07:01:12] <wikibugs>	 (CR) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:14] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:17] <wikibugs>	 (PS2) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T327861', diff saved to https://phabricator.wikimedia.org/P43352 and previous config saved to /var/cache/conftool/dbconfig/20230126-070158-root.json
[07:02:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add some weight to db1103', diff saved to https://phabricator.wikimedia.org/P43353 and previous config saved to /var/cache/conftool/dbconfig/20230126-070220-marostegui.json
[07:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:05:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43354 and previous config saved to /var/cache/conftool/dbconfig/20230126-070512-root.json
[07:06:54] <wikibugs>	 (03PS1) 10Marostegui: ProductionServices.php: Depool pc2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925)
[07:07:09] <marostegui>	 Amir1: can you review ^?
[07:07:17] <Amir1>	 on it
[07:07:52] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Depool pc2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: 10Marostegui)
[07:08:06] <wikibugs>	 (03PS1) 10Ayounsi: Remove single contact feature [puppet] - 10https://gerrit.wikimedia.org/r/883700
[07:09:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Depool pc2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: 10Marostegui)
[07:10:05] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices.php: Depool pc2011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: 10Marostegui)
[07:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:10:14] <marostegui>	 Amir1:  what was that URL for the deployment commands?
[07:10:36] <Amir1>	 scap backport 883699
[07:10:38] <Amir1>	 ?
[07:10:44] <marostegui>	 ah that :)
[07:10:45] <marostegui>	 thanks
[07:11:02] <Amir1>	 https://deploy-commands.toolforge.org/bacc/883699 This is also useful, e.g. how to revert
[07:11:13] <marostegui>	 yeah that is what I was looking for :)
[07:11:53] <marostegui>	 mmm there seems to be something pending to be deployed?
[07:12:14] <marostegui>	 07:11:23 The following are unexpected commits pulled from origin for /srv/mediawiki-staging:
[07:12:14] <marostegui>	 commit 4d798447521b90a0bf8af199981789c9e53fc41c
[07:12:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800
[07:12:53] <stashbot>	 T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800
[07:12:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800
[07:14:35] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]]
[07:14:39] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[07:15:04] <wikibugs>	 (03CR) 10Ladsgroup: "Hi, please rebase this in deploy1002 after merge, it doesn't need to follow backport window but if it's not rebased, it'll confuse future " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883621 (owner: 10Jdlrobson)
[07:16:25] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:16:34] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:17:04] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backupmon1001.eqiad.wmnet with reason: m1 switchover
[07:17:17] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backupmon1001.eqiad.wmnet with reason: m1 switchover
[07:17:39] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1001.eqiad.wmnet with reason: m1 switchover
[07:18:03] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1001.eqiad.wmnet with reason: m1 switchover
[07:18:10] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1176 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800)
[07:18:57] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Promote db1176 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800)
[07:20:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43356 and previous config saved to /var/cache/conftool/dbconfig/20230126-072017-root.json
[07:21:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1176 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800) (owner: 10Marostegui)
[07:21:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1176 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800) (owner: 10Marostegui)
[07:23:04] <marostegui>	 !log Failover m1 from db1195 to db1176 - T327800
[07:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:08] <stashbot>	 T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800
[07:23:10] <marostegui>	 jynus: I am starting, ok?
[07:23:21] <jynus>	 green light for me
[07:23:49] <marostegui>	 done
[07:23:58] <wikibugs>	 (03CR) 10Ayounsi: "Follow up from Ia0a4b2b9605a1c795fb0345e52234c5a32187887" [puppet] - 10https://gerrit.wikimedia.org/r/883700 (owner: 10Ayounsi)
[07:24:20] <jynus>	 etherpad is working for me
[07:24:22] <marostegui>	 yeah
[07:24:23] <marostegui>	 same
[07:25:05] <jynus>	 let me update racktables and move it to archived
[07:25:12] <marostegui>	 cool
[07:25:17] <dcausse>	 !log T322869: depooling wdqs2009 wdqs2010 wdqs2011 wdqs2012: these hosts should not serve user traffic yet, as they don't have the database loaded
[07:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:21] <stashbot>	 T322869: Fewer results from wdqs nodes running in codfw than eqiad - https://phabricator.wikimedia.org/T322869
[07:25:37] <jynus>	 do you see any process on the old host?
[07:25:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:25:55] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]] (duration: 11m 19s)
[07:25:59] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[07:26:07] <jynus>	 ^bacula is me, will resolve when I start it up
[07:26:11] <marostegui>	 jynus: nope
[07:27:05] <jynus>	 let me start up bacula to 100% finalize the process
[07:27:06] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:27:06] <wikibugs>	 (03PS1) 10Marostegui: monitoring.yaml: Change master for m1 [puppet] - 10https://gerrit.wikimedia.org/r/883705 (https://phabricator.wikimedia.org/T327800)
[07:27:08] <marostegui>	 jynus: you can merge ^ as you wish
[07:27:25] <jynus>	 oh, true, I forgot
[07:27:50] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] monitoring.yaml: Change master for m1 [puppet] - 10https://gerrit.wikimedia.org/r/883705 (https://phabricator.wikimedia.org/T327800) (owner: 10Marostegui)
[07:28:33] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[07:30:02] <jynus>	 I think there is one more patch I have to do before starting up stuff
[07:30:10] <marostegui>	 which one?
[07:31:26] <wikibugs>	 (03PS1) 10Marostegui: db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883726 (https://phabricator.wikimedia.org/T327995)
[07:32:42] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Switchover m1 primary at which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800)
[07:33:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbbackups: Switchover m1 primary at which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800) (owner: 10Jcrespo)
[07:33:06] <jynus>	  T327800
[07:33:06] <stashbot>	 T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800
[07:33:11] <jynus>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/883727
[07:33:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883726 (https://phabricator.wikimedia.org/T327995) (owner: 10Marostegui)
[07:33:29] <jynus>	 this is all because proxy & tls only
[07:34:49] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/883513 (https://phabricator.wikimedia.org/T327997)
[07:35:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43357 and previous config saved to /var/cache/conftool/dbconfig/20230126-073523-root.json
[07:35:26] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[07:35:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switchover m1 primary at which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800) (owner: 10Jcrespo)
[07:35:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s1 T327997
[07:35:54] <stashbot>	 T327997: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T327997
[07:36:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s1 T327997
[07:36:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2112 with weight 0 T327997', diff saved to https://phabricator.wikimedia.org/P43358 and previous config saved to /var/cache/conftool/dbconfig/20230126-073616-root.json
[07:36:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/883513 (https://phabricator.wikimedia.org/T327997) (owner: 10Gerrit maintenance bot)
[07:45:00] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2107 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/883514 (https://phabricator.wikimedia.org/T327998)
[07:45:28] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[07:48:39] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2009.*
[07:49:21] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2010.*
[07:49:28] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2011.*
[07:49:42] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2012.*
[07:50:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43359 and previous config saved to /var/cache/conftool/dbconfig/20230126-075028-root.json
[07:56:00] <wikibugs>	 (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883820
[07:57:52] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883820 (owner: 10Marostegui)
[08:00:05] <jouncebot>	 Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0800).
[08:00:07] <marostegui>	 !log Starting s1 codfw failover from db2103 to db2112 - T327997
[08:00:08] <apergos>	 as often happens, there are no trainees signed up to learn the ropes today, and there are no patches scheduled for deployment, so enjoy a quiet morning! 
[08:00:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:11] <stashbot>	 T327997: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T327997
[08:00:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 primary T327997', diff saved to https://phabricator.wikimedia.org/P43360 and previous config saved to /var/cache/conftool/dbconfig/20230126-080033-root.json
[08:02:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103 T327997', diff saved to https://phabricator.wikimedia.org/P43361 and previous config saved to /var/cache/conftool/dbconfig/20230126-080159-root.json
[08:02:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P43362 and previous config saved to /var/cache/conftool/dbconfig/20230126-080233-root.json
[08:04:08] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:04:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2107 with weight 0 T327998', diff saved to https://phabricator.wikimedia.org/P43363 and previous config saved to /var/cache/conftool/dbconfig/20230126-080427-root.json
[08:04:32] <stashbot>	 T327998: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T327998
[08:04:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T327998
[08:05:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T327998
[08:05:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43364 and previous config saved to /var/cache/conftool/dbconfig/20230126-080533-root.json
[08:05:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883700 (owner: 10Ayounsi)
[08:06:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2107 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/883514 (https://phabricator.wikimedia.org/T327998) (owner: 10Gerrit maintenance bot)
[08:07:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:08:02] <wikibugs>	 (03PS4) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249
[08:09:25] <wikibugs>	 (03CR) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[08:14:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:15:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) @Papaul could you rename (Netbox, label, console, etc) the switch cloudsw**1**-b1-codfw? For co...
[08:16:42] <wikibugs>	 (03CR) 10Jcrespo: mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn)
[08:17:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43365 and previous config saved to /var/cache/conftool/dbconfig/20230126-081738-root.json
[08:17:43] <marostegui>	 !log Starting s2 codfw failover from db2104 to db2107 - T327998
[08:17:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:47] <stashbot>	 T327998: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T327998
[08:18:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2107 to s2 primary T327998', diff saved to https://phabricator.wikimedia.org/P43366 and previous config saved to /var/cache/conftool/dbconfig/20230126-081818-root.json
[08:18:31] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39262/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[08:19:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T327998', diff saved to https://phabricator.wikimedia.org/P43367 and previous config saved to /var/cache/conftool/dbconfig/20230126-081916-root.json
[08:20:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43368 and previous config saved to /var/cache/conftool/dbconfig/20230126-082038-root.json
[08:20:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43369 and previous config saved to /var/cache/conftool/dbconfig/20230126-082055-root.json
[08:21:53] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:22:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Adapt cookbooks to installserver role rename [cookbooks] - 10https://gerrit.wikimedia.org/r/883833
[08:22:53] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:23:06] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/883515 (https://phabricator.wikimedia.org/T327999)
[08:23:43] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:24:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999
[08:24:29] <stashbot>	 T327999: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T327999
[08:24:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2127 with weight 0 T327999', diff saved to https://phabricator.wikimedia.org/P43370 and previous config saved to /var/cache/conftool/dbconfig/20230126-082432-root.json
[08:24:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999
[08:25:15] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/883515 (https://phabricator.wikimedia.org/T327999) (owner: 10Gerrit maintenance bot)
[08:26:46] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.reimage: add new cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[08:27:56] <wikibugs>	 (03CR) 10Muehlenhoff: Rename installserver role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[08:32:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM.  No point doing anything more complex if we're not gonna have it elsewhere I think." [homer/public] - 10https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi)
[08:32:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43371 and previous config saved to /var/cache/conftool/dbconfig/20230126-083243-root.json
[08:34:36] <marostegui>	 !log Starting s3 codfw failover from db2105 to db2127 - T327999
[08:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:40] <stashbot>	 T327999: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T327999
[08:35:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2127 to s3 primary T327999', diff saved to https://phabricator.wikimedia.org/P43372 and previous config saved to /var/cache/conftool/dbconfig/20230126-083459-root.json
[08:35:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2105 T327999', diff saved to https://phabricator.wikimedia.org/P43373 and previous config saved to /var/cache/conftool/dbconfig/20230126-083543-root.json
[08:36:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43374 and previous config saved to /var/cache/conftool/dbconfig/20230126-083600-root.json
[08:36:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43375 and previous config saved to /var/cache/conftool/dbconfig/20230126-083640-root.json
[08:37:25] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:38:27] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:39:28] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2118 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/883516 (https://phabricator.wikimedia.org/T328000)
[08:40:16] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:40:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:41:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000
[08:41:05] <stashbot>	 T328000: Switchover s7 master (db2121 -> db2118) - https://phabricator.wikimedia.org/T328000
[08:41:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2118 with weight 0 T328000', diff saved to https://phabricator.wikimedia.org/P43376 and previous config saved to /var/cache/conftool/dbconfig/20230126-084112-root.json
[08:41:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000
[08:41:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2118 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/883516 (https://phabricator.wikimedia.org/T328000) (owner: 10Gerrit maintenance bot)
[08:44:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) @cmooney  Thinking more about it... Your approach is great and careful and would suit well live...
[08:44:37] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:44:37] <moritzm>	 !log added Eoghan to pwstore
[08:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Disable Telemetry on eqsin switches [homer/public] - 10https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi)
[08:46:42] <wikibugs>	 (03PS5) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249
[08:47:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43377 and previous config saved to /var/cache/conftool/dbconfig/20230126-084748-root.json
[08:48:50] <volans>	 gerrit seems unavailable again
[08:49:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi)
[08:49:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:51:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43378 and previous config saved to /var/cache/conftool/dbconfig/20230126-085105-root.json
[08:51:34] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:51:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43379 and previous config saved to /var/cache/conftool/dbconfig/20230126-085145-root.json
[08:53:02] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65172 bytes in 8.989 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:54:33] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155)
[08:55:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, depends on I813c36b4deb4992e44a848ddc3c3a5c738914661" [cookbooks] - 10https://gerrit.wikimedia.org/r/883833 (owner: 10Muehlenhoff)
[08:55:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8560178, @ayounsi wrote: >> B connection is probably sufficient, this does mean...
[08:56:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[08:56:36] <wikibugs>	 (03PS3) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587
[08:57:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney)
[09:00:05] <jouncebot>	 brennen and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0900).
[09:00:28] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:00:28] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:00:48] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:01:37] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 4.196 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:01:37] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:02:00] <marostegui>	 !log Starting s7 codfw failover from db2121 to db2118 - T328000
[09:02:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] fix nova-metadata firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980) (owner: 10Majavah)
[09:02:03] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66291 bytes in 7.481 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:05] <stashbot>	 T328000: Switchover s7 master (db2121 -> db2118) - https://phabricator.wikimedia.org/T328000
[09:02:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2118 to s7 primary T328000', diff saved to https://phabricator.wikimedia.org/P43380 and previous config saved to /var/cache/conftool/dbconfig/20230126-090212-root.json
[09:02:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43381 and previous config saved to /var/cache/conftool/dbconfig/20230126-090253-root.json
[09:03:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2121 T328000', diff saved to https://phabricator.wikimedia.org/P43382 and previous config saved to /var/cache/conftool/dbconfig/20230126-090302-root.json
[09:04:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P43383 and previous config saved to /var/cache/conftool/dbconfig/20230126-090418-root.json
[09:05:19] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:05:44] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@e5fdd6e]: (no justification provided)
[09:05:50] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@e5fdd6e]: (no justification provided) (duration: 00m 06s)
[09:05:57] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:06:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43384 and previous config saved to /var/cache/conftool/dbconfig/20230126-090610-root.json
[09:06:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43385 and previous config saved to /var/cache/conftool/dbconfig/20230126-090650-root.json
[09:08:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) LGTM!
[09:11:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff)
[09:12:11] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155)
[09:12:23] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155)
[09:14:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Adapt cookbooks to installserver role rename [cookbooks] - 10https://gerrit.wikimedia.org/r/883833 (owner: 10Muehlenhoff)
[09:17:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43386 and previous config saved to /var/cache/conftool/dbconfig/20230126-091758-root.json
[09:19:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001
[09:19:07] <stashbot>	 T328001: Switchover x2 master (db2142 -> db2144) - https://phabricator.wikimedia.org/T328001
[09:19:09] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/883836 (https://phabricator.wikimedia.org/T328001)
[09:19:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001
[09:19:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43387 and previous config saved to /var/cache/conftool/dbconfig/20230126-091923-root.json
[09:19:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo)
[09:20:27] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:21:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43388 and previous config saved to /var/cache/conftool/dbconfig/20230126-092115-root.json
[09:21:40] <wikibugs>	 (03CR) 10DCausse: "this distribution does not seem to have the required deps in the opt folder:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata)
[09:21:41] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69123 bytes in 0.055 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:21:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43389 and previous config saved to /var/cache/conftool/dbconfig/20230126-092155-root.json
[09:22:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001
[09:22:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001
[09:24:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) Thanks for the summary!  Some additional notes/thoughts: * public1-a/b-codfw host might be better grouped in a single rack per row, providing still redundancy (...
[09:24:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/883836 (https://phabricator.wikimedia.org/T328001) (owner: 10Marostegui)
[09:24:45] <marostegui>	 !log Starting x2 codfw failover from db2142 to db2144 - T328001
[09:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:55] <stashbot>	 T328001: Switchover x2 master (db2142 -> db2144) - https://phabricator.wikimedia.org/T328001
[09:25:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P43390 and previous config saved to /var/cache/conftool/dbconfig/20230126-092512-root.json
[09:25:17] <stashbot>	 T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[09:30:22] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39263/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[09:30:27] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez)
[09:30:32] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:33:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43391 and previous config saved to /var/cache/conftool/dbconfig/20230126-093303-root.json
[09:34:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43392 and previous config saved to /var/cache/conftool/dbconfig/20230126-093428-root.json
[09:35:05] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:36:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43393 and previous config saved to /var/cache/conftool/dbconfig/20230126-093620-root.json
[09:36:47] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add new fake pems for the mlserve's pki intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:37:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43394 and previous config saved to /var/cache/conftool/dbconfig/20230126-093700-root.json
[09:37:05] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[09:37:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:37:27] <marostegui>	 ^ checking
[09:37:50] <marostegui>	 jynus: that is a backup source
[09:38:08] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo)
[09:38:15] <marostegui>	 looks overloaded
[09:39:08] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo)
[09:39:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm: some minor nits which could also be addressed in a future change" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[09:39:23] <jynus>	 overloaded? there are no running backups
[09:40:12] <jynus>	 there is a backup running now? why?
[09:40:26] <marostegui>	 I don't know but the host is very very very slow
[09:40:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:40:33] <marostegui>	 So it is either that or HW
[09:40:45] <jynus>	 no, there is something going on, but not sure why
[09:41:04] <marostegui>	 HW logs are clean
[09:41:06] <wikibugs>	 (03PS3) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767)
[09:41:43] <marostegui>	 jynus: there are actually two backups running, right? one for s1 and one for s6
[09:41:46] <jynus>	 backups just started now
[09:41:54] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:42:08] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39265/console" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:42:11] <jynus>	 maybe the scheduler got weird because of the time change
[09:42:34] <marostegui>	 yeah could be
[09:42:37] <wikibugs>	 (03CR) 10Elukey: "John: fixed the name of one of the pem files, missed a _, pcc complained but now it seems ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:42:42] <jynus>	 I will kill all of those processes, but I don't like that the systemd timer retroactively runs stuff
[09:42:52] <marostegui>	 jynus: might happen on the other sources too?
[09:42:57] <jynus>	 yeah
[09:43:07] <jynus>	 I mean, it shouldn't overload anyway
[09:43:16] <jynus>	 but it may happen as it is not night
[09:43:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thx, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[09:44:56] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:46:27] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:47:37] <jynus>	 even if backups started wrongly, db2141 shouldn't have overloaded
[09:47:42] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:47:51] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:47:52] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:47:56] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:48:00] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:48:11] <wikibugs>	 (03PS6) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249
[09:48:16] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis About to be decommed https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:48:19] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet
[09:49:07] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435]: Regular analytics weekly train [analytics/refinery@8ed8435]
[09:49:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43395 and previous config saved to /var/cache/conftool/dbconfig/20230126-094933-root.json
[09:52:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43396 and previous config saved to /var/cache/conftool/dbconfig/20230126-095205-root.json
[09:52:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43397 and previous config saved to /var/cache/conftool/dbconfig/20230126-095257-root.json
[09:53:30] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/883711
[09:54:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/883711 (owner: 10Marostegui)
[09:56:07] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435]: Regular analytics weekly train [analytics/refinery@8ed8435] (duration: 07m 00s)
[09:57:09] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (thin): Regular analytics weekly train THIN [analytics/refinery@8ed8435]
[09:57:15] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (thin): Regular analytics weekly train THIN [analytics/refinery@8ed8435] (duration: 00m 05s)
[09:57:24] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8ed8435]
[09:58:32] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8ed8435] (duration: 01m 08s)
[09:58:58] <wikibugs>	 (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582
[09:59:25] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:59:31] <wikibugs>	 (03PS3) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582
[09:59:54] <wikibugs>	 (03CR) 10Marostegui: "This requires applying all the events live" [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[09:59:59] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[10:00:41] <wikibugs>	 (03CR) 10Ladsgroup: dbtools: Rotate wikiuser (031 comment) [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[10:03:16] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847
[10:04:15] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Test on db1206." [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff)
[10:04:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43398 and previous config saved to /var/cache/conftool/dbconfig/20230126-100438-root.json
[10:05:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond)
[10:07:31] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff) We can't migrate the puppetdb2002 VM (it's being moved to baremetal, but that is unlikely completed by then), so we'll need to disable Puppet f...
[10:08:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43399 and previous config saved to /var/cache/conftool/dbconfig/20230126-100802-root.json
[10:08:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582 (owner: 10Muehlenhoff)
[10:08:30] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[10:08:31] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts sretest1002.eqiad.wmnet
[10:08:38] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Only "concern" is if someone were to use this with a system that parses the "nagios" output, it might get confused about the topology inform" [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff)
[10:10:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "patch LGTM, not +1'ing yet though because centrallog1002 is failing its rsyslog probes: https://logstash.wikimedia.org/goto/2155b6c052cd06" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[10:10:53] <wikibugs>	 (03PS1) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712
[10:11:13] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "Thanks for finding this fix for the start issues of phd." [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[10:11:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, modulo CI failure that doesn't look related?" [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[10:11:58] <wikibugs>	 (03PS2) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712
[10:12:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff)
[10:12:10] <wikibugs>	 (03PS3) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712
[10:13:33] <wikibugs>	 (03Abandoned) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712 (owner: 10Jcrespo)
[10:14:40] <wikibugs>	 (03PS4) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756)
[10:19:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43400 and previous config saved to /var/cache/conftool/dbconfig/20230126-101943-root.json
[10:21:40] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - Second after failure [analytics/refinery@8ed8435]
[10:21:45] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - Second after failure [analytics/refinery@8ed8435] (duration: 00m 04s)
[10:22:33] <wikibugs>	 (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847
[10:23:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43401 and previous config saved to /var/cache/conftool/dbconfig/20230126-102307-root.json
[10:24:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond)
[10:24:19] <wikibugs>	 (03PS3) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847
[10:31:24] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[10:31:27] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - third after failure [analytics/refinery@8ed8435]
[10:31:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff)
[10:32:08] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Rotate wikiuser to wikiuser2023 [puppet] - 10https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802)
[10:32:12] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Rotate wikiuser to wikiuser2023 [puppet] - 10https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[10:32:43] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - third after failure [analytics/refinery@8ed8435] (duration: 01m 16s)
[10:32:52] <wikibugs>	 (03PS5) 10Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876
[10:33:01] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[10:34:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[10:34:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[10:34:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43402 and previous config saved to /var/cache/conftool/dbconfig/20230126-103448-root.json
[10:35:32] <wikibugs>	 (03Merged) 10jenkins-bot: dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[10:35:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo the fact that I don't know what (if any) things will need to be removed (e.g. left behind/unmanaged by puppet)" [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[10:36:11] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:38:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43403 and previous config saved to /var/cache/conftool/dbconfig/20230126-103812-root.json
[10:40:13] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15)
[10:41:28] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[10:41:50] <claime>	 !log cgoubert@authdns1001:~$ sudo -i authdns-update
[10:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:33] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@e52205b]: (no justification provided)
[10:42:44] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@e52205b]: (no justification provided) (duration: 00m 11s)
[10:43:43] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[10:45:17] <moritzm>	 !log installing postgresql-13 security updates
[10:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:15] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename aux-k8s-ingress service to k8s-ingress-aux - cgoubert@cumin1001"
[10:49:45] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:49:55] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:50:14] <hashar>	 ^ I am on those gerrit alarms
[10:50:17] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:50:43] <jgleeson>	 gerrit is back hashar 
[10:51:03] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:51:07] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155)
[10:51:11] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 972 bytes in 0.027 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:51:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo)
[10:51:37] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61922 bytes in 0.042 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:52:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:53:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[10:53:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43404 and previous config saved to /var/cache/conftool/dbconfig/20230126-105317-root.json
[10:54:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[10:54:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[10:55:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename aux-k8s-ingress service to k8s-ingress-aux - cgoubert@cumin1001"
[10:55:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:55:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[10:55:26] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] service: Rename aux-k8s-ingress service to k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[10:56:20] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo)
[10:57:01] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "https://www.php.net/manual/en/timezones.asia.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15)
[10:57:12] <Amir1>	 jouncebot: nowandnext
[10:57:12] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0900)
[10:57:12] <jouncebot>	 In 0 hour(s) and 2 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100)
[10:57:12] <jouncebot>	 In 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100)
[10:57:17] <Amir1>	 sad
[11:00:05] <jouncebot>	 mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100). Please do the needful.
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100)
[11:01:59] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:02:07] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:02:33] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:03:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Rename ceph profiles to cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis)
[11:03:27] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:03:37] <wikibugs>	 (03CR) 10Btullis: Rename ceph profiles to cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis)
[11:03:37] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:03:50] <hashar>	 !log Restarted Apache 2 on gerrit.wikimedia.org
[11:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:01] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66644 bytes in 0.046 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:04:05] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis)
[11:08:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43405 and previous config saved to /var/cache/conftool/dbconfig/20230126-110822-root.json
[11:10:02] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863
[11:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:10:46] <wikibugs>	 (03PS1) 10Slyngshede: PERC RAID: Fix formatting for Nagios output. [puppet] - 10https://gerrit.wikimedia.org/r/883864
[11:12:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond)
[11:12:53] <icinga-wm>	 PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10jbond) @ssingh i have created a patch to defer reboots until all drivers have been uploaded.  Are...
[11:23:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[11:24:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865
[11:26:39] <wikibugs>	 (03PS1) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868
[11:26:41] <wikibugs>	 (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869
[11:26:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883864 (owner: 10Slyngshede)
[11:28:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39267/console" [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond)
[11:28:47] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:29:29] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[11:29:36] <wikibugs>	 (03CR) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:29:49] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[11:30:17] <icinga-wm>	 RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[11:31:03] <wikibugs>	 (03PS2) 10Muehlenhoff: Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865
[11:31:14] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond)
[11:32:01] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond)
[11:33:36] <wikibugs>	 (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869
[11:33:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond)
[11:36:41] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux
[11:37:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865 (owner: 10Muehlenhoff)
[11:39:46] <wikibugs>	 (03PS1) 10Jbond: Revert "gerrit: Add requestctl support to ferm gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/883725
[11:40:18] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts flowspec1001
[11:40:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "gerrit: Add requestctl support to ferm gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/883725 (owner: 10Jbond)
[11:41:17] <wikibugs>	 (03PS1) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886
[11:42:57] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:43:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:43:48] <wikibugs>	 (03PS1) 10Ayounsi: flowspec1001: remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009)
[11:44:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:44:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:44:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[11:46:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flowspec1001 decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001"
[11:48:03] <wikibugs>	 (03CR) 10Muehlenhoff: flowspec1001: remove everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi)
[11:48:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flowspec1001 decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001"
[11:48:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:48:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flowspec1001
[11:48:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decom flowspec1001 - https://phabricator.wikimedia.org/T328009 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `flowspec1001` - flowspec1001 (**PASS**)   - Downtimed host on Icinga/Alertmanag...
[11:48:55] <wikibugs>	 (03PS1) 10Jbond: wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825)
[11:49:04] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet
[11:49:18] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet
[11:50:05] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:50:14] <wikibugs>	 (03PS2) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886
[11:50:20] <wikibugs>	 (03PS2) 10Ayounsi: flowspec1001: remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009)
[11:50:52] <wikibugs>	 (03CR) 10Ayounsi: flowspec1001: remove everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi)
[11:52:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi)
[11:52:45] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] flowspec1001: remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi)
[11:53:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[11:54:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[11:54:24] <wikibugs>	 (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869
[11:55:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886 (owner: 10Jbond)
[11:56:08] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155)
[11:56:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo)
[11:56:34] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd-client-ssl._tcp.wikimedia.org on all recursors
[11:56:38] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd-client-ssl._tcp.wikimedia.org on all recursors
[11:57:30] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[11:57:56] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155)
[11:59:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:00:05] <jinxer-wm>	 (ConfdResourceFailed) resolved: confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:00:13] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39268/console" [puppet] - 10https://gerrit.wikimedia.org/r/883869 (owner: 10Jbond)
[12:02:23] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869 (owner: 10Jbond)
[12:03:31] <jbond>	 !log enable profile::base::firewall::defs_from_etcd: true globally
[12:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:10] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo)
[12:04:53] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:09:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-proxies rolling restart_daemons on A:eqiad and not A:thanos-fe and A:swift-fe or A:thanos-fe
[12:10:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[12:10:39] <arturo>	 jbond: :-( the global ferm thing made puppet agent sad on some of our servers
[12:10:56] <jbond>	 arturo: can you give me an example? I'll take a look
[12:11:10] <arturo>	 jbond:  https://www.irccloud.com/pastebin/BuAfS8XD/
[12:12:03] * jbond looking
[12:12:11] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:12:45] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01105 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[12:12:58] <jbond>	 ok rolling back
[12:13:19] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:24] <wikibugs>	 (03PS1) 10Jbond: Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883887
[12:13:41] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:14:02] <arturo>	 jbond: sorry :-(
[12:14:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883887 (owner: 10Jbond)
[12:15:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
[12:16:22] <wikibugs>	 (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[12:16:42] <jbond>	 arturo: should be fixed now, sorry about that
[12:16:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[12:17:01] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:18:37] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:46] <wikibugs>	 (03PS1) 10Jaime Nuche: scap3 Jenkins deployment (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/883913
[12:20:47] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002513 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[12:21:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] puppet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:21:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[12:22:05] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:23:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:29:05] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-proxies (exit_code=0) rolling restart_daemons on A:eqiad and not A:thanos-fe and A:swift-fe or A:thanos-fe
[12:29:53] <wikibugs>	 (03PS6) 10Clément Goubert: httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876
[12:31:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:31:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[12:35:12] <icinga-wm>	 PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100%
[12:35:17] <wikibugs>	 (03PS1) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926
[12:35:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:35:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:37:06] <wikibugs>	 (03PS2) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926
[12:37:12] <icinga-wm>	 RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[12:37:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Abhas)
[12:37:39] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:38:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:38:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:38:18] <wikibugs>	 (03CR) 10Clément Goubert: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:38:46] <wikibugs>	 (03PS3) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926
[12:39:18] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:39:20] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:39:37] <claime>	 Haha jynus fixing things before I can write a comment lol
[12:39:52] <jynus>	 yeah, I thought the . was a ./
[12:40:03] <claime>	 Same at first
[12:40:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:40:11] <jynus>	 I am not familiar with Path, I usually use os.path (join)
[12:40:16] <claime>	 Looks good now
[12:40:42] <jynus>	 yeah, but it's better if jbond has a look; a minimal change is sometimes not what the code is supposed to do
[12:40:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet
[12:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:40:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:40:58] <claime>	 jynus: agreed
[12:41:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
[12:41:40] <sukhe>	 !log depool cp3051.esams.wmnet for firmware update testing: T323717
[12:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:43] <stashbot>	 T323717: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717
[12:42:05] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3051.esams.wmnet,service=cdn
[12:42:05] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3051.esams.wmnet,service=ats-be
[12:42:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp3051.esams.wmnet with reason: T323717
[12:43:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp3051.esams.wmnet with reason: T323717
[12:45:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) >>! In T323717#8559564, @ssingh wrote: > Since we started reimaging the cp hosts to bulls...
[12:46:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) If I don't upgrade the iDRAC firmware, the NIC firmware fails to update for me so I have...
[12:46:39] <wikibugs>	 (03PS4) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767)
[12:46:43] <wikibugs>	 (03CR) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[12:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:47:07] <wikibugs>	 (03PS5) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783)
[12:47:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet
[12:49:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Jclark-ctr can you add the disk back?
[12:49:14] <wikibugs>	 (03CR) 10Jcrespo: "please test on stretch to make sure it works as intended :-D" [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:50:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff)
[12:51:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:52:25] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: sre.swift.roll-restart-reboot-proxies fails on thanos hosts, which lack nginx - https://phabricator.wikimedia.org/T327783 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been fixed by splitting the restart cookbooks i...
[12:53:23] <wikibugs>	 (03Abandoned) 10Muehlenhoff: sre.swift.roll-restart-reboot-proxies: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff)
[12:53:27] <wikibugs>	 (03PS4) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:53:29] <wikibugs>	 (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[12:53:32] <jbond>	 jynus: claime: I have made a small update, can you both take another look?
[12:53:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium
[12:53:42] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) a:03Clement_Goubert
[12:53:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:54:16] <wikibugs>	 (03PS5) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:54:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:55:33] <jynus>	 jbond: idea looks good, needs a concrete exception
[12:55:54] <wikibugs>	 (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[12:55:55] <jbond>	 updated 
[12:56:04] <jynus>	 ImportError I guess?
[12:56:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) [] Approval from @Ottomata or @odimitrijevic as group approvers [] Approval from @JanWMF as manager [] Out of band key verification
[12:56:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:57:25] <wikibugs>	 (03PS6) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:57:26] <jbond>	 ok not updated :)
[12:57:27] <wikibugs>	 (03PS4) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[12:57:44] <jynus>	 jbond: looks good to me, feel free to squash both changes; as long as it works on all versions it is ok with me
[12:57:48] <wikibugs>	 (03PS1) 10Elukey: ml-services: update revscoring model servers to the latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/883929 (https://phabricator.wikimedia.org/T325528)
[12:57:53] <wikibugs>	 (03PS7) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[12:58:02] <jbond>	 ack thanks jynus
[12:58:20] <jbond>	 I'll push the other through later though, as there is also an ACL blocking
[12:58:51] <jynus>	 going to lunch, but please you or someone else have a look at the thumbor hosts complaining (probably the same fix as swift)
[12:59:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Ottomata) Approved.  I'm not certain this will need kerberos access, but I'd go ahead and give it for good measure.  I'd expect there to be times when it will just be easier t...
[13:00:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[13:00:28] <wikibugs>	 (03PS4) 10Muehlenhoff: trafficserver: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013)
[13:02:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] disc_desired_state: Add k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[13:04:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[13:04:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster
[13:04:26] <wikibugs>	 (03PS7) 10Jbond: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[13:04:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring model servers to the latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/883929 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey)
[13:04:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+2] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[13:04:54] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert)
[13:05:50] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] PERC RAID: Fix formatting for Nagios output. [puppet] - 10https://gerrit.wikimedia.org/r/883864 (owner: 10Slyngshede)
[13:06:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:07:06] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:03] <hashar>	 !log Rebooting gerrit2002.wikimedia.org host to validate Apache 2 services starts AFTER network went online | T326125
[13:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:07] <stashbot>	 T326125: apache2 fails to start after gerrit hosts are rebooted - https://phabricator.wikimedia.org/T326125
[13:10:01] <moritzm>	 !log installing nodejs security updates on bullseye
[13:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:42] <icinga-wm>	 ACKNOWLEDGEMENT - Host gerrit2002 is DOWN: PING CRITICAL - Packet loss = 100% amusso reboot!
[13:12:24] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-staging2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:13:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] disc_desired_state: Add k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert)
[13:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:18:43] <wikibugs>	 (03PS1) 10Jbond: confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935
[13:18:45] <Amir1>	 jouncebot: nowandnext
[13:18:45] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 41 minute(s)
[13:18:45] <jouncebot>	 In 0 hour(s) and 41 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400)
[13:18:45] <jouncebot>	 In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400)
[13:19:11] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Change time zone setting on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15)
[13:19:55] <wikibugs>	 (03Merged) 10jenkins-bot: Change time zone setting on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15)
[13:20:05] <wikibugs>	 (03PS5) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[13:20:08] <icinga-wm>	 PROBLEM - Check systemd state on ml-staging2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo)
[13:20:51] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]]
[13:20:55] <stashbot>	 T327986: Change time zone setting in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327986
[13:21:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[13:22:39] <logmsgbot>	 !log ladsgroup@deploy1002 superpes and ladsgroup: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:22:39] <wikibugs>	 (03PS6) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825)
[13:25:47] <moritzm>	 !log restarting turnilo for nodejs security update
[13:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:42] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) Adding Jaime for the backup hosts.
[13:32:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:32:53] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]] (duration: 12m 02s)
[13:32:57] <stashbot>	 T327986: Change time zone setting in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327986
[13:33:44] <wikibugs>	 (03CR) 10Jforrester: "You'll need to patch scap (or the puppet controling code) to generate the PHP i18n first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883707 (https://phabricator.wikimedia.org/T99740) (owner: 10Ladsgroup)
[13:33:47] <wikibugs>	 (03PS1) 10Stevemunene: Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884)
[13:33:52] <wikibugs>	 (03PS2) 10Jbond: confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935
[13:35:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 (owner: 10Jbond)
[13:35:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 (owner: 10Jbond)
[13:36:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 (owner: 10Clément Goubert)
[13:36:28] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[13:37:18] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove DNS records for removed esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[13:37:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert)
[13:38:18] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove DNS records for removed esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[13:38:18] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:38:38] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/883519 (https://phabricator.wikimedia.org/T328022)
[13:38:58] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[13:39:30] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:39:44] <wikibugs>	 (03CR) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[13:39:51] <wikibugs>	 (03PS1) 10Ayounsi: BGPalerter: switch to email noc@ [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600)
[13:40:12] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2113 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/883520 (https://phabricator.wikimedia.org/T328023)
[13:40:34] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[13:41:28] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:42:36] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/883521 (https://phabricator.wikimedia.org/T328024)
[13:42:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-staging2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:43:13] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[13:43:54] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:43:58] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove include for reverse zone for  2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266)
[13:44:44] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[13:44:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:44:56] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede)
[13:45:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:45:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:46:42] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:18] <icinga-wm>	 RECOVERY - Check systemd state on ml-staging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:56] <wikibugs>	 (03PS1) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943
[13:51:15] <wikibugs>	 (03PS2) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields on group 0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004)
[13:51:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[13:51:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T328023
[13:52:03] <stashbot>	 T328023: Switchover s5 master (db2123 -> db2113) - https://phabricator.wikimedia.org/T328023
[13:52:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2113 with weight 0 T328023', diff saved to https://phabricator.wikimedia.org/P43408 and previous config saved to /var/cache/conftool/dbconfig/20230126-135215-root.json
[13:52:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T328023
[13:52:40] <wikibugs>	 (03PS2) 10Jbond: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[13:52:52] <wikibugs>	 (03PS3) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943
[13:53:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2113 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/883520 (https://phabricator.wikimedia.org/T328023) (owner: 10Gerrit maintenance bot)
[13:53:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[13:53:49] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39270/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[13:54:52] <wikibugs>	 (03CR) 10Jgiannelos: "This has already been tested in our last OSM import. I think that its better to merge as it is and file a ticket for further improvements." [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[13:55:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove vslow from db2113, future s5 codfw master T328023', diff saved to https://phabricator.wikimedia.org/P43409 and previous config saved to /var/cache/conftool/dbconfig/20230126-135509-marostegui.json
[13:55:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991)
[13:56:10] <wikibugs>	 (03Abandoned) 10Jgiannelos: maps: Disable tilerator on codfw replicas [puppet] - 10https://gerrit.wikimedia.org/r/811737 (owner: 10Jgiannelos)
[13:57:59] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39271/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[13:58:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: Add missing index script on import [puppet] - 10https://gerrit.wikimedia.org/r/883197 (owner: 10Jgiannelos)
[13:58:24] <wikibugs>	 (03PS1) 10Jbond: cr/interfaces: check for ips key before accessing it [homer/public] - 10https://gerrit.wikimedia.org/r/883947
[13:58:27] <wikibugs>	 (03CR) 10Ayounsi: "one comment then lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[14:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400). nyaa~
[14:00:04] <jouncebot>	 Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:11] <Dreamy_Jazz>	 \o
[14:00:14] <Lucas_WMDE>	 o/
[14:00:22] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:00:24] <wikibugs>	 (03PS4) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943
[14:00:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[14:00:48] <moritzm>	 !log restarting etherpad-lite to pick up nodejs security update
[14:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:52] <Lucas_WMDE>	 I can deploy
[14:01:12] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:01:13] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300 (10ayounsi)
[14:01:15] <Dreamy_Jazz>	 Won't be able to test myself as I do not have CheckUser permissions on group 0 or 1. Any steward or WMF employee with staff rights should be able to load Special:CheckUserLog to test.
[14:01:22] <wikibugs>	 (03Abandoned) 10Jbond: cr/interfaces: check for ips key before accessing it [homer/public] - 10https://gerrit.wikimedia.org/r/883947 (owner: 10Jbond)
[14:01:30] <wikibugs>	 (03CR) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata)
[14:01:45] <urbanecm>	 Dreamy_Jazz: i have like ~15 minutes now
[14:01:54] <wikibugs>	 (03PS5) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943
[14:01:57] <Dreamy_Jazz>	 Okay. Thanks.
[14:02:00] <Lucas_WMDE>	 urbanecm: want to do the deployment?
[14:02:06] <Lucas_WMDE>	 (or I can deploy and let you verify ^^)
[14:02:12] <urbanecm>	 Lucas_WMDE: can you do it please? :)
[14:02:16] <Lucas_WMDE>	 sure!
[14:02:19] <urbanecm>	 ty
[14:02:55] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39272/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[14:03:05] <Lucas_WMDE>	 Dreamy_Jazz: I assume the data was backfilled via maintenance script or something like that?
[14:03:31] <Dreamy_Jazz>	 Yes. See https://phabricator.wikimedia.org/T327290
[14:03:51] <Lucas_WMDE>	 got it, thanks
[14:03:57] <Lucas_WMDE>	 (I got lost in the many updates on https://phabricator.wikimedia.org/T233004 ^^)
[14:04:10] <Dreamy_Jazz>	 Np
[14:04:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz)
[14:05:33] <wikibugs>	 (03Merged) 10jenkins-bot: Enable write new for CheckUserLog comment fields on group 0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz)
[14:06:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]]
[14:06:05] <marostegui>	 !log Starting s5 codfw failover from db2123 to db2113 - T328023
[14:06:05] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[14:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:09] <stashbot>	 T328023: Switchover s5 master (db2123 -> db2113) - https://phabricator.wikimedia.org/T328023
[14:06:24] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede)
[14:06:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2113 to s5 primary T328023', diff saved to https://phabricator.wikimedia.org/P43410 and previous config saved to /var/cache/conftool/dbconfig/20230126-140630-root.json
[14:06:36] <Dreamy_Jazz>	 urbanecm: Test instructions are:
[14:06:36] <Dreamy_Jazz>	 * Load Special:CheckUserLog
[14:06:36] <Dreamy_Jazz>	 * Find (or make) an entry with a wikilink in its reason
[14:06:36] <Dreamy_Jazz>	 * Copy the reason as shown in the CheckUserLog - It should be the reason without the "[[" and "]]" markup for the wikilink
[14:06:36] <Dreamy_Jazz>	 * Paste this into the 'reason' search field
[14:06:36] <Dreamy_Jazz>	 * Search the log
[14:06:36] <Dreamy_Jazz>	 * The test passes if you see the entry with the wikilink shown
[14:06:37] <Dreamy_Jazz>	 This works because the method to search changes once read new is set so that the wikilink structure is ignored when searching
[14:06:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991)
[14:07:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2123 T328023', diff saved to https://phabricator.wikimedia.org/P43411 and previous config saved to /var/cache/conftool/dbconfig/20230126-140716-root.json
[14:07:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:07:35] <urbanecm>	 Dreamy_Jazz: which wiki please? :-)
[14:07:44] <Dreamy_Jazz>	 Any group 0 or 1 wiki
[14:07:46] <urbanecm>	 okay
[14:07:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:07:54] <Dreamy_Jazz>	 test wikis have already had the change made
[14:08:05] <Lucas_WMDE>	 urbanecm: should be on mwdebug now
[14:08:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43412 and previous config saved to /var/cache/conftool/dbconfig/20230126-140804-root.json
[14:08:10] <urbanecm>	 okay, so non-testwiki group0/1
[14:08:16] <urbanecm>	 metawiki should work?
[14:08:18] <Dreamy_Jazz>	 Sure
[14:08:50] <Dreamy_Jazz>	 As far as I am aware yes, as it's shown in group 1 on toolforge versions list
[14:08:54] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[14:08:56] <Dreamy_Jazz>	 ( https://versions.toolforge.org/ )
[14:09:10] <wikibugs>	 (03PS2) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[14:09:15] <urbanecm>	 Lucas_WMDE: Dreamy_Jazz: change works correctly :)
[14:09:20] <Lucas_WMDE>	 I think even test2wiki might work, since READ_NEW is set on testwiki (a wiki) rather than testwikis (a dblist) afaict
[14:09:22] <Lucas_WMDE>	 okay, yay
[14:09:22] <wikibugs>	 (03PS3) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[14:09:23] <Dreamy_Jazz>	 Great thanks for the merge!
[14:09:29] <Lucas_WMDE>	 syncing :)
[14:09:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991)
[14:09:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:09:32] <Dreamy_Jazz>	 And testing
[14:09:33] <Lucas_WMDE>	 thanks for testing!
[14:09:49] <urbanecm>	 no problem
[14:10:12] <Lucas_WMDE>	 hmm, mw1448 Special:Version returned 500 according to scap
[14:10:15] <Lucas_WMDE>	 let’s hope that was just flaky
[14:10:31] <Lucas_WMDE>	 (it’s continuing so far, iirc one failed canary isn’t enough to stop the sync)
[14:10:38] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[14:10:58] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:11:14] <jbond>	 !log disable puppet fleet wide to roll out etcd ferm change gerrit:883888
[14:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:11:52] <Dreamy_Jazz>	 Yeah. Special:Version shouldn't have been affected by that config change, so probably just being flaky
[14:12:08] <Lucas_WMDE>	 yeah, really doesn’t seem like it should be related
[14:12:54] <Lucas_WMDE>	 ok I can see it in logstash, it was “shellbox server returned status code 503”
[14:13:01] <Lucas_WMDE>	 (reporting the lilypond version)
[14:13:20] <urbanecm>	 out of caution, i tested special:Version at mw1448. it works fine.
[14:13:28] <urbanecm>	 so, a onetime error
[14:13:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Drive 2 reinserted.
[14:13:32] <Dreamy_Jazz>	 Thanks!
[14:13:34] <Lucas_WMDE>	 thanks
[14:13:42] <Lucas_WMDE>	 how do you test a specific non-mwdebug server?
[14:13:57] <urbanecm>	 Lucas_WMDE: ssh there, `curl -i --connect-to ::$HOSTNAME 'https://test.wikipedia.org/wiki/Special:Version'`
[14:14:10] <Lucas_WMDE>	 ok, thanks!
[14:14:17] <urbanecm>	 if you have the proxy env variables set in your bashrc like i do, you need to unset those first
[14:14:51] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544)
[14:14:53] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950
[14:15:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2062 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]] (duration: 09m 16s)
[14:15:21] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[14:16:21] <Amir1>	 jouncebot: nowandnext
[14:16:21] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400)
[14:16:22] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400)
[14:16:22] <jouncebot>	 In 2 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700)
[14:16:31] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:37] <Amir1>	 I was about to ask :D
[14:16:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto)
[14:16:41] <Lucas_WMDE>	 :)
[14:16:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 (owner: 10Giuseppe Lavagetto)
[14:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:17:07] <Lucas_WMDE>	 (unimportant thing that *I* was about to ask earlier: the s5 master is on codfw?)
[14:17:33] <Lucas_WMDE>	 *primary
[14:17:41] <Amir1>	 nope, codfw has its own master but it's a replica of the eqiad one
[14:17:48] <Lucas_WMDE>	 ok, thx ^^
[14:17:50] <Amir1>	 https://orchestrator.wikimedia.org/web/cluster/alias/s5
[14:18:36] <Amir1>	 it still needs switchovers for maint because if we stop replication on it, it'll break replication to the whole codfw :D
[14:19:33] <Lucas_WMDE>	 makes sense
[14:20:22] <wikibugs>	 (03PS1) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004)
[14:21:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) I see it rebuilding, I will ping you once the alert recovers so we can pull it out again: ` perccli64 /c0 show rebuildrate CLI Version =...
[14:22:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2062 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:22:39] <wikibugs>	 (03PS1) 10Jbond: Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883890
[14:23:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43413 and previous config saved to /var/cache/conftool/dbconfig/20230126-142309-root.json
[14:23:24] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[14:23:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883890 (owner: 10Jbond)
[14:24:29] <wikibugs>	 (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883891 (https://phabricator.wikimedia.org/T313825)
[14:24:46] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:24:58] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544)
[14:25:00] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950
[14:27:11] <moritzm>	 !log installing containerd security updates
[14:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:22] <wikibugs>	 (03PS2) 10Cathal Mooney: Remove include for reverse zone for  2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266)
[14:28:57] <wikibugs>	 (03CR) 10Cathal Mooney: Remove include for reverse zone for  2620:0:861:fe03::/64 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[14:29:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove include for reverse zone for  2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[14:30:36] <wikibugs>	 (03CR) 10Btullis: "You'll need to increment the `version` value in charts/datahub/Chart.yaml and charts/datahub/charts/datahub-frontend/Chart.yaml as well, o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene)
[14:31:01] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Rotating wikiadmin password (T326802) (duration: 07m 04s)
[14:31:05] <stashbot>	 T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802
[14:31:23] <wikibugs>	 (03PS1) 10Ladsgroup: dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802)
[14:31:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[14:31:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:31:46] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:32:50] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:33:15] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266)
[14:34:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:23] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:35:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) I pasted the wrong command above: ` root@db1206:~# perccli64 /c0/e252/s2 show rebuild CLI Version = 007.1910.0000.0000 Oct 08, 2021 Opera...
[14:36:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[14:37:14] <wikibugs>	 (03PS1) 10Jbond: cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960
[14:37:24] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[14:37:26] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[14:37:26] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:32] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[14:38:12] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43414 and previous config saved to /var/cache/conftool/dbconfig/20230126-143814-root.json
[14:39:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "My bad!" [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond)
[14:39:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[14:39:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond)
[14:40:19] <wikibugs>	 (03Merged) 10jenkins-bot: cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond)
[14:40:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[14:40:39] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001"
[14:40:39] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:40:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:42:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:42] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[14:44:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:44:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi)
[14:45:10] <icinga-wm>	 RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid
[14:45:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:46:18] <wikibugs>	 (03CR) 10Bking: [C: 03+2] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking)
[14:46:46] <wikibugs>	 (03PS3) 10Bking: dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409)
[14:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:47:03] <wikibugs>	 (03CR) 10Bking: dse-k8s: add rdf-streaming-updater namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409) (owner: 10Bking)
[14:47:05] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:47:07] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409) (owner: 10Bking)
[14:47:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) RAID is now back in optimal status, waiting for Icinga to recover before pulling the disk out again ` VD LIST : =======  ----------------...
[14:48:40] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802)
[14:49:13] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802)
[14:49:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:49:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche)
[14:49:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] BGPalerter: switch to email noc@ [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi)
[14:51:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) ` root@db1206:~#  sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication: 0 OK | controller: 0 OK | physical_disk: 0 OK...
[14:52:06] <jinxer-wm>	 (ConfdResourceFailed) resolved: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:52:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2062 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:53:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43415 and previous config saved to /var/cache/conftool/dbconfig/20230126-145319-root.json
[14:54:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Jclark-ctr whenever you can, pull the disk out again. Thank you
[14:55:18] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[14:55:20] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:55:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:56:52] <wikibugs>	 (03CR) 10Ladsgroup: "PCC looks fine: https://puppet-compiler.wmflabs.org/output/883961/39273/" [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[14:57:26] <wikibugs>	 (03PS2) 10Stevemunene: Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884)
[14:59:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:00:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[15:00:36] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[15:01:16] <wikibugs>	 (03Merged) 10jenkins-bot: Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney)
[15:02:16] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add SANs to the inference endpoints for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302)
[15:02:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[15:02:33] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[15:02:36] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye
[15:02:41] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[15:04:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[15:04:07] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[15:04:12] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye
[15:04:14] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[15:04:16] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[15:04:32] <wikibugs>	 (03PS2) 10Elukey: admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302)
[15:04:41] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[15:05:59] <wikibugs>	 (03PS1) 10Hashar: gerrit: listen on all port, DROP requests to host [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125)
[15:08:05] <wikibugs>	 (03CR) 10Hashar: "For the record, that did not work cause `network-online.target` is reached immediately after the interface script have completed and they " [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[15:08:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43417 and previous config saved to /var/cache/conftool/dbconfig/20230126-150824-root.json
[15:08:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[15:09:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564
[15:09:10] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[15:09:14] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[15:09:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey)
[15:09:21] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564
[15:09:23] <elukey>	 ah lovely
[15:09:23] <sukhe>	 !log stop pybal on lvs2007: T326564
[15:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:38] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39274/console" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[15:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup)
[15:10:05] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[15:10:11] <wikibugs>	 (03CR) 10Hashar: "That is a continuation of https://gerrit.wikimedia.org/r/c/operations/puppet/+/875315/ which did not work ;)" [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[15:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:11:12] <wikibugs>	 10SRE-swift-storage, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10MatthewVernon)
[15:11:55] <wikibugs>	 (03PS3) 10Elukey: admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302)
[15:12:35] <jbond>	 !log disable-puppet deploy requestctl ferm change gerrit:883935
[15:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:46] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:13:04] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:13:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:14:00] <sukhe>	 ^ BGP alerts on cr*-codfw expected as lvs2007 is depooled
[15:15:34] <wikibugs>	 (03PS4) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[15:15:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883891 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[15:15:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:16:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[15:16:31] <wikibugs>	 (03PS5) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[15:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:18:19] <wikibugs>	 (03PS1) 10Jbond: Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883892
[15:18:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883892 (owner: 10Jbond)
[15:18:45] <wikibugs>	 (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825)
[15:19:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:07] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[15:22:10] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[15:23:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43418 and previous config saved to /var/cache/conftool/dbconfig/20230126-152329-root.json
[15:25:08] <claime>	 jouncebot: nowandnext
[15:25:08] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 34 minute(s)
[15:25:08] <jouncebot>	 In 1 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700)
[15:25:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos)
[15:26:06] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 (owner: 10Clément Goubert)
[15:27:47] <sukhe>	 !log poweroff lvs2007: T326564
[15:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:52] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[15:29:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[15:29:32] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[15:29:34] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye
[15:29:39] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[15:29:50] <sukhe>	 hmm ttyS1-115200/cp2027.conf exists, removing it doesn't help either
[15:29:52] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:30:17] <sukhe>	 !log install2003: rm /etc/dhcp/automation/ttyS1-115200/cp2027.conf
[15:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[15:30:43] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye
[15:30:46] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[15:30:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:30:59] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[15:31:20] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:35:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:39:54] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1009.eqiad.wmnet
[15:40:47] <wikibugs>	 (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825)
[15:40:50] <wikibugs>	 (03PS1) 10Jbond: confd::file: allow to specify fully qualified prefix [puppet] - 10https://gerrit.wikimedia.org/r/883973 (https://phabricator.wikimedia.org/T313825)
[15:41:00] <icinga-wm>	 PROBLEM - Host flowspec1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:50] <wikibugs>	 (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825)
[15:44:52] <wikibugs>	 (03PS1) 10Jbond: P:firewall: use fully qualified confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/883974
[15:46:54] <hashar>	 !log Restart Jenkins for upgrade
[15:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39276/console" [puppet] - 10https://gerrit.wikimedia.org/r/883974 (owner: 10Jbond)
[15:48:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd::file: allow to specify fully qualified prefix [puppet] - 10https://gerrit.wikimedia.org/r/883973 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[15:48:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:firewall: use fully qualified confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/883974 (owner: 10Jbond)
[15:49:20] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudgw2001-dev.codfw.wmnet
[15:49:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s8 T328024
[15:49:45] <stashbot>	 T328024: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T328024
[15:50:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T328024', diff saved to https://phabricator.wikimedia.org/P43419 and previous config saved to /var/cache/conftool/dbconfig/20230126-155000-root.json
[15:50:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s8 T328024
[15:50:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/883521 (https://phabricator.wikimedia.org/T328024) (owner: 10Gerrit maintenance bot)
[15:51:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[15:51:25] <hashar>	 !log Restarting CI Jenkins for upgrade
[15:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:24] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[15:55:42] <jbond>	 !log enable-puppet post deploy requestctl ferm change gerrit:883935
[15:55:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:48] <wikibugs>	 (03PS2) 10Muehlenhoff: slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942)
[16:02:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff)
[16:04:47] <wikibugs>	 (03PS1) 10Clément Goubert: httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 (https://phabricator.wikimedia.org/T326794)
[16:05:34] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1009.eqiad.wmnet
[16:05:52] <wikibugs>	 (03PS3) 10Muehlenhoff: slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942)
[16:06:02] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[16:06:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert)
[16:06:29] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert)
[16:08:03] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002"
[16:08:03] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:08:04] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudgw2001-dev.codfw.wmnet
[16:08:05] <wikibugs>	 (03PS2) 10Jelto: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569)
[16:08:07] <wikibugs>	 (03PS1) 10Jbond: confd: ensure python package [puppet] - 10https://gerrit.wikimedia.org/r/883979
[16:09:06] <moritzm>	 !log installing distro-info-data updates from Bullseye point release
[16:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:06] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff)
[16:10:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] confd: ensure python package [puppet] - 10https://gerrit.wikimedia.org/r/883979 (owner: 10Jbond)
[16:10:37] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[16:10:38] <marostegui>	 !log Starting s8 codfw failover from db2161 to db2165 - T328024
[16:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:42] <stashbot>	 T328024: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T328024
[16:10:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 primary T328024', diff saved to https://phabricator.wikimedia.org/P43420 and previous config saved to /var/cache/conftool/dbconfig/20230126-161058-marostegui.json
[16:11:06] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert)
[16:11:14] <wikibugs>	 (03PS5) 10Herron: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[16:11:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2161 T328024', diff saved to https://phabricator.wikimedia.org/P43421 and previous config saved to /var/cache/conftool/dbconfig/20230126-161137-root.json
[16:12:37] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[16:12:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43422 and previous config saved to /var/cache/conftool/dbconfig/20230126-161242-root.json
[16:13:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[16:13:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717
[16:13:37] <stashbot>	 T323717: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717
[16:13:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717
[16:14:13] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui)
[16:14:45] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet
[16:17:00] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert)
[16:17:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983
[16:18:01] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS bullseye
[16:18:07] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6007.drmrs.wmnet with OS bullseye
[16:19:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:08] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.dns.netbox
[16:19:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[16:19:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1080.eqiad.wmnet
[16:20:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[16:20:19] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[16:21:01] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM. Let's give it a go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene)
[16:21:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983 (owner: 10Muehlenhoff)
[16:21:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[16:21:32] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet
[16:21:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983 (owner: 10Muehlenhoff)
[16:23:06] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet
[16:23:34] <icinga-wm>	 PROBLEM - Check systemd state on mw1411 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:44] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001-dev
[16:24:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:24:42] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001-dev
[16:25:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:26:33] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1080.eqiad.wmnet
[16:27:16] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:36] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:40] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[16:27:45] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[16:27:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43423 and previous config saved to /var/cache/conftool/dbconfig/20230126-162747-root.json
[16:27:49] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye
[16:27:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1084.eqiad.wmnet
[16:27:57] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[16:28:02] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw2001-dev: rename server to cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884027 (https://phabricator.wikimedia.org/T327908)
[16:28:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[16:28:29] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[16:28:56] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2001-dev: rename server to cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884027 (https://phabricator.wikimedia.org/T327908) (owner: 10Arturo Borrero Gonzalez)
[16:30:04] <wikibugs>	 (03PS4) 10Muehlenhoff: slapd: Add support to configure MDB storage backend [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942)
[16:31:09] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet
[16:32:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:33:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1084.eqiad.wmnet
[16:34:35] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:30] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage
[16:38:32] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[16:38:36] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Puppet and PuppetDB if p...
[16:39:08] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10colewhite)
[16:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:41:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2027']
[16:41:49] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage
[16:42:36] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:42:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43424 and previous config saved to /var/cache/conftool/dbconfig/20230126-164252-root.json
[16:45:08] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jcrespo)
[16:46:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:48:05] <sukhe>	 !log pooling lvs2009 after T326564
[16:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:09] <stashbot>	 T326564: codfw: Relocate servers to make space for new switches  in rowA and rowB - https://phabricator.wikimedia.org/T326564
[16:48:14] <sukhe>	 !log correcting earlier log: pooling lvs2007 after T326564
[16:48:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:22] <wikibugs>	 10SRE, 10Incident Tooling: Pagination parameters required for Statuspage's authenticated REST API - https://phabricator.wikimedia.org/T328044 (10lmata)
[16:49:42] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['cp2027']
[16:50:52] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 169, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:51:00] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:51:39] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[16:51:44] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[16:52:46] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10herron)
[16:53:42] <claime>	 !log Running scap sync-file -D php_fpm_restart_script:/bin/true tox.ini "Rebuilding mediawiki-webserver image" - T326794
[16:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:46] <stashbot>	 T326794: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794
[16:54:14] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767)
[16:54:19] <wikibugs>	 (03PS17) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:54:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:56:05] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1012.eqiad.wmnet
[16:57:15] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39278/console" [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[16:57:31] <wikibugs>	 (03PS1) 10Jbond: wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036
[16:57:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43425 and previous config saved to /var/cache/conftool/dbconfig/20230126-165757-root.json
[16:58:31] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: add separate ensure for docker::network [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949)
[16:58:42] <sukhe>	 jbond: lol that looks great
[16:59:04] <wikibugs>	 (03PS18) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:59:16] <wikibugs>	 (03PS1) 10Elukey: admin_ng: update ml-serve-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767)
[16:59:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond)
[16:59:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:59:37] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thanks, looks amazing!" [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond)
[16:59:52] <logmsgbot>	 !log cgoubert@deploy1002 Synchronized tox.ini: Rebuilding mediawiki-webserver (duration: 07m 19s)
[16:59:59] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Looks like a reasonable solution" [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: 10Jelto)
[17:00:04] <jouncebot>	 jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:54] <Lucas_WMDE>	 jouncebot: what about when your hammer is puppetlang though
[17:02:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey)
[17:02:10] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:50] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1012.eqiad.wmnet
[17:03:05] <wikibugs>	 (03PS19) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:03:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:03:28] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6007.drmrs.wmnet with OS bullseye
[17:03:34] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6007.drmrs.wmnet with OS bullseye completed: - cp6007 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[17:04:32] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6007.drmrs.wmnet
[17:05:09] <wikibugs>	 (03PS20) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:05:10] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[17:05:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[17:05:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[17:05:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039
[17:05:26] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet
[17:05:51] <wikibugs>	 (03PS2) 10Jbond: wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036
[17:06:12] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS bullseye
[17:06:19] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6016.drmrs.wmnet with OS bullseye
[17:06:35] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1013.eqiad.wmnet
[17:07:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage
[17:07:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond)
[17:10:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage
[17:12:31] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1013.eqiad.wmnet
[17:13:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43426 and previous config saved to /var/cache/conftool/dbconfig/20230126-171302-root.json
[17:14:34] <wikibugs>	 (03PS21) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:14:52] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Eevans)
[17:16:13] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet
[17:16:45] <wikibugs>	 (03PS22) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:17:44] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Eevans)
[17:18:05] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:19:04] <logmsgbot>	 !log dancy@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[17:19:10] <wikibugs>	 (03PS23) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:19:15] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 11s)
[17:21:31] <wikibugs>	 (03PS24) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:22:36] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1014.eqiad.wmnet
[17:22:43] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39285/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:23:21] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Eevans)
[17:24:25] <wikibugs>	 (03PS1) 10Jbond: network: drop abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/884040
[17:24:40] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[17:24:41] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet
[17:26:43] <wikibugs>	 (03CR) 10Jbond: network: drop abuse_networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884040 (owner: 10Jbond)
[17:27:42] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[17:28:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43427 and previous config saved to /var/cache/conftool/dbconfig/20230126-172806-root.json
[17:28:51] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:30:37] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1015.eqiad.wmnet
[17:33:57] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:45] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:39:53] <wikibugs>	 (PS25) Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:41:37] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Fix syslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884045 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[17:44:51] <wikibugs>	 SRE, SRE-OnFire, ops-codfw, Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (Papaul) This is the return address  Seagrove C/O Celestica Killam Industrial Park 13701 N Lamar Dr. Laredo, TX 78045 USA Project: CLS HUB Laredo, TX Attn: Juniper Returns...
[17:46:36] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Fix syslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884045 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[17:47:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:49:28] <wikibugs>	 (PS26) Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:49:45] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS bullseye
[17:49:50] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6016.drmrs.wmnet with OS bullseye completed: - cp6016 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[17:50:37] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39288/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:52:52] <wikibugs>	 (CR) Ottomata: [V: +1] "Okay I finally think I got it!" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:54:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:55:00] <wikibugs>	 (PS27) Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:55:22] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet
[17:56:12] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39289/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[17:58:02] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (BCornwall)
[17:58:57] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[17:59:12] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS bullseye
[17:59:18] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6008.drmrs.wmnet with OS bullseye
[18:00:04] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800).
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800)
[18:00:23] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Fix syslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884046 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[18:01:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:02:13] <bd808>	 I don't have anything for the Technical Engagement window this week.
[18:06:02] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Fix syslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884046 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[18:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:09:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:10:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:10:14] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:10:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:11:44] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:11:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[18:12:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[18:12:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[18:13:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[18:13:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[18:14:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[18:14:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[18:14:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[18:15:03] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[18:15:03] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[18:15:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[18:15:23] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[18:15:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[18:15:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[18:15:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[18:15:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[18:15:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:16:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:16:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:16:48] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[18:17:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:17:25] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (herron)
[18:17:36] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[18:20:18] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[18:27:07] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:34:36] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (BCornwall)
[18:36:10] <TheresNoTime>	 jouncebot: nowandnext
[18:36:11] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800)
[18:36:11] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800)
[18:36:11] <jouncebot>	 In 0 hour(s) and 23 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1900)
[18:40:59] <wikibugs>	 (PS2) Ebernhardson: Configure search platform airflow 2 instance [puppet] - https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970)
[18:46:40] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS bullseye
[18:46:47] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6008.drmrs.wmnet with OS bullseye completed: - cp6008 (**WARN**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[18:57:52] <wikibugs>	 (PS3) Jdlrobson: Remove redundant block for search descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859)
[18:57:56] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet
[18:59:11] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[18:59:53] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[18:59:59] <wikibugs>	 SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye
[19:00:04] <jouncebot>	 brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1900).
[19:00:17] <brennen>	 o/
[19:00:45] <brennen>	 !log 1.40.0-wmf.20 train (T325583): no current blockers, rolling to all wikis.
[19:00:48] <wikibugs>	 (PS1) BBlack: esitest: compat with haproxy >= 2.5 [puppet] - https://gerrit.wikimedia.org/r/884052 (https://phabricator.wikimedia.org/T321775)
[19:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:49] <stashbot>	 T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583
[19:01:28] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583)
[19:01:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583) (owner: TrainBranchBot)
[19:02:08] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583) (owner: TrainBranchBot)
[19:04:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:05:40] <claime>	 brennen: Has the train already deployed to k8s or not?
[19:06:13] <brennen>	 claime: just started i believe
[19:06:20] <claime>	 ok I'll wait for it to be done then
[19:06:40] <claime>	 I have a config fix for mw-on-k8s but I don't want to step on scap's toes :p
[19:06:45] <wikibugs>	 (PS1) Jdlrobson: Increase threshold for table of contents collapsing [mediawiki-config] - https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045)
[19:06:50] <brennen>	 claime: cool, thx
[19:06:53] <bd808>	 https://test2.wikipedia.org/wiki/Special:Version shows wmf.20 and https://versions.toolforge.org/ shows wmf.20 for all wikis
[19:07:22] <brennen>	 19:06:53 Finished sync-prod-k8s (duration: 00m 54s)
[19:07:30] <claime>	 Fantastic, thanks
[19:07:50] <brennen>	 note overall train deploy is still underway.
[19:08:07] <wikibugs>	 (PS1) Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309)
[19:09:14] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39290/console" [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:09:41] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.20  refs T325583
[19:09:41] <claime>	 Noted, I'm not touching anything but the mw-on-k8s deployment. Once scap is done with it, what I'm doing shouldn't interfere with the train
[19:09:45] <stashbot>	 T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583
[19:09:54] <wikibugs>	 (CR) Clément Goubert: [C: +2] mediawiki: Fix php-slowlog rsyslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884051 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[19:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:10:13] <icinga-wm>	 PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: esitest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:23] <brennen>	 yeah, just always a good window of time to keep in mind i might be running another deploy if a rollback is needed for anything.
[19:10:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp2027.codfw.wmnet with reason: reimaging
[19:10:59] <claime>	 brennen: ack, I won't be long
[19:11:03] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2027.codfw.wmnet with reason: reimaging
[19:11:25] <claime>	 if jenkins would get in gear :P
[19:16:04] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Fix php-slowlog rsyslog json [deployment-charts] - https://gerrit.wikimedia.org/r/884051 (https://phabricator.wikimedia.org/T326794) (owner: Clément Goubert)
[19:20:24] <wikibugs>	 (PS2) Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309)
[19:21:05] <claime>	 brennen: I'm all done.
[19:21:27] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39291/console" [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:29:38] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:35:56] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:39:50] <wikibugs>	 (PS3) Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309)
[19:40:52] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39292/console" [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:44:37] <wikibugs>	 (PS1) Dwisehaupt: Swap fundraising db origin to frdb1005 [dns] - https://gerrit.wikimedia.org/r/884066 (https://phabricator.wikimedia.org/T315601)
[19:46:18] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:49:15] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM. Jenkins failures seems unrelated and should not be fixed as part of this CR." [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[19:49:54] <wikibugs>	 (PS4) Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494)
[19:50:29] <wikibugs>	 (CR) Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: Ottomata)
[19:53:20] <wikibugs>	 (PS4) Ssingh: esitest: remove deprecated nbproc config option [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309)
[19:54:23] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39293/console" [puppet] - https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:56:31] <logmsgbot>	 !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4038.ulsfo.wmnet with OS bullseye
[19:56:40] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4038 (**FAIL**)   - Downtimed on Ic...
[19:56:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:56:54] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:57:18] <wikibugs>	 (PS6) Gehel: wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[19:57:21] <wikibugs>	 (PS1) Gehel: idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070
[19:58:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:58:26] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:59:23] <wikibugs>	 (PS2) Gehel: idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070
[19:59:25] <wikibugs>	 (PS7) Gehel: wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[19:59:42] <wikibugs>	 (CR) CI reject: [V: -1] idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070 (owner: Gehel)
[19:59:59] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:02:03] <wikibugs>	 (CR) Bking: [C: +1] idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070 (owner: Gehel)
[20:02:05] <wikibugs>	 (CR) Ryan Kemper: [C: +1] idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070 (owner: Gehel)
[20:02:11] <wikibugs>	 (CR) Gehel: [C: +2] idp: comment out unused imports in models.py [puppet] - https://gerrit.wikimedia.org/r/884070 (owner: Gehel)
[20:02:13] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:04:30] <wikibugs>	 (PS8) Gehel: wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:05:43] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[20:06:02] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**)   - Removed from Pu...
[20:06:12] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on cp2027.codfw.wmnet with reason: reimaging
[20:06:16] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on cp2027.codfw.wmnet with reason: reimaging
[20:06:24] <wikibugs>	 (PS1) Ryan Kemper: django_oidc: fix formatting [puppet] - https://gerrit.wikimedia.org/r/884077
[20:07:05] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/884077 (owner: Ryan Kemper)
[20:07:14] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:08:48] <wikibugs>	 (CR) Ryan Kemper: [C: +2] django_oidc: fix formatting [puppet] - https://gerrit.wikimedia.org/r/884077 (owner: Ryan Kemper)
[20:09:40] <wikibugs>	 (PS9) Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064)
[20:12:11] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:13:02] <ryankemper>	 !log `ryankemper@thanos-fe1001:~$ sudo run-puppet-agent` following merge of wdqs recording rule patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883610
[20:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:14:04] <wikibugs>	 (PS5) Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494)
[20:15:47] <wikibugs>	 (CR) Ottomata: [C: +2] flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: Ottomata)
[20:15:49] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: Ottomata)
[20:18:22] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:18:35] <wikibugs>	 (CR) Jgreen: [C: +2] Swap fundraising db origin to frdb1005 [dns] - https://gerrit.wikimedia.org/r/884066 (https://phabricator.wikimedia.org/T315601) (owner: Dwisehaupt)
[20:25:31] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall) Update: This happened again when imaging cp4038. I was unable to ping the interfaces but was able to connect to the mgmt interface/iDRAC....
[20:26:57] <wikibugs>	 (PS1) Jgreen: Switch fundraising database queue icinga reporting from frdb1004 to frdb1005. [puppet] - https://gerrit.wikimedia.org/r/884081
[20:29:02] <wikibugs>	 (CR) Jgreen: [C: +2] Switch fundraising database queue icinga reporting from frdb1004 to frdb1005. [puppet] - https://gerrit.wikimedia.org/r/884081 (owner: Jgreen)
[20:36:22] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[20:36:32] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye
[20:40:33] <wikibugs>	 SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (RKemper)
[20:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:41:06] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall) Re-running the cookbook and I watched it get past that screen with no delay  {F36521655}
[20:43:45] <wikibugs>	 (PS1) Andrew Bogott: valid_section: update a comment reflect that labtestwiki has moved [puppet] - https://gerrit.wikimedia.org/r/884085
[20:47:23] <wikibugs>	 (PS2) Andrew Bogott: valid_section: update a comment reflect that labtestwiki has moved [puppet] - https://gerrit.wikimedia.org/r/884085 (https://phabricator.wikimedia.org/T328079)
[20:47:25] <wikibugs>	 (PS1) Andrew Bogott: Move clouddb2001-dev to spare [puppet] - https://gerrit.wikimedia.org/r/884086 (https://phabricator.wikimedia.org/T328079)
[20:47:27] <wikibugs>	 (PS1) Andrew Bogott: Remove puppet refs to clouddb2001-dev [puppet] - https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079)
[20:49:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] valid_section: update a comment reflect that labtestwiki has moved [puppet] - https://gerrit.wikimedia.org/r/884085 (https://phabricator.wikimedia.org/T328079) (owner: Andrew Bogott)
[20:49:42] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move clouddb2001-dev to spare [puppet] - https://gerrit.wikimedia.org/r/884086 (https://phabricator.wikimedia.org/T328079) (owner: Andrew Bogott)
[20:56:16] <wikibugs>	 (PS1) Bartosz Dziewoński: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704)
[20:56:40] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[21:00:04] <jouncebot>	 brennen and TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T2100). Please do the needful.
[21:00:04] <jouncebot>	 Dreamy_Jazz, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <Jdlrobson>	 present o/
[21:01:01] <MatmaRex>	 hi
[21:01:22] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[21:05:31] <thcipriani>	 o/ I can deploy
[21:06:18] <thcipriani>	 Dreamy_Jazz: around for backports?
[21:06:52] <Dreamy_Jazz>	 Sorry didn't hear the ping
[21:06:54] <wikibugs>	 (PS1) Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612)
[21:06:55] <Dreamy_Jazz>	 I'm here
[21:07:04] <thcipriani>	 no problem
[21:07:29] <thcipriani>	 you're up first
[21:07:44] <Dreamy_Jazz>	 Nice. Okay. I can test this one as I have checkuser on enwiki.
[21:08:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[21:08:38] <wikibugs>	 (CR) Sbailey: "Preparing for monday backport window for linter write code enable on group 0 only." [mediawiki-config] - https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey)
[21:09:10] <wikibugs>	 (Merged) jenkins-bot: Enable write new for CheckUserLog comment fields everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[21:09:25] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]]
[21:09:30] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:11:08] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and dreamyjazz: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:11:22] <thcipriani>	 ^ Dreamy_Jazz ok, should be on mwdebug, check please
[21:11:44] <Dreamy_Jazz>	 Sure. Testing now.
[21:12:23] <Dreamy_Jazz>	 Test complete - working as expected
[21:13:53] <wikibugs>	 (03Abandoned) 10BBlack: esitest: compat with haproxy >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/884052 (https://phabricator.wikimedia.org/T321775) (owner: 10BBlack)
[21:13:55] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704)
[21:14:48] <thcipriani>	 Dreamy_Jazz: great, thanks for checking, going live
[21:19:53] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) (owner: 10Bartosz Dziewoński)
[21:20:21] <thcipriani>	 (^ I'll get that one going while we're waiting)
[21:20:44] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]] (duration: 11m 18s)
[21:20:48] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:20:53] <thcipriani>	 ^ Dreamy_Jazz should be live now
[21:21:01] <Dreamy_Jazz>	 Thanks
[21:21:16] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:21:22] <Dreamy_Jazz>	 Yes. It looks live to me. Thanks for the backport.
[21:21:37] <thcipriani>	 nice, yw :)
[21:23:55] <thcipriani>	 alright Jdlrobson you're up
[21:24:06] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye
[21:24:15] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**PASS**)   - Removed from Puppet and Pu...
[21:24:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson)
[21:25:14] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet
[21:25:15] <wikibugs>	 (03PS4) 10Thcipriani: Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson)
[21:25:20] <wikibugs>	 (03Merged) 10jenkins-bot: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) (owner: 10Bartosz Dziewoński)
[21:25:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[21:25:48] <thcipriani>	 Jdlrobson: bah, wait, did I just break the relation chain with that rebase?
[21:25:56] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye
[21:26:03] <thcipriani>	 is the "increase threshold" supposed to go first?
[21:26:05] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye
[21:26:32] <thcipriani>	 well. hold that thought. looks like MatmaRex 's just merged
[21:27:12] <Jdlrobson>	 thcipriani: looking
[21:27:18] <Jdlrobson>	 they can go in any order
[21:27:21] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]]
[21:27:25] <stashbot>	 T327704: DiscussionTools: unable to save comment on metawiki with comment-became-transcluded error - https://phabricator.wikimedia.org/T327704
[21:27:27] <Jdlrobson>	 the first one is a NOOP
[21:27:32] * MatmaRex waiting
[21:27:32] <Jdlrobson>	 just configuration cleanup
[21:27:48] <thcipriani>	 ah, ok, thanks for checking. I never remember which way the ordering runs in the ancestor chain in gerrit :\
[21:28:19] <thcipriani>	 I'll push them both together once the discussiontools change is out
[21:28:59] <logmsgbot>	 !log thcipriani@deploy1002 matmarex and thcipriani: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:29:18] <thcipriani>	 ^ MatmaRex live on mwdebug, check please :)
[21:29:48] <MatmaRex>	 thcipriani: yup, looks good!
[21:30:10] <thcipriani>	 cool, going live
[21:33:22] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS bullseye
[21:33:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**)   - Downtimed on Ic...
[21:33:34] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye
[21:33:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye
[21:33:43] <logmsgbot>	 !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS bullseye
[21:33:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**)   - Removed from Pu...
[21:34:54] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye
[21:35:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye
[21:35:03] <logmsgbot>	 !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS bullseye
[21:35:11] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**)   - Removed from Pu...
[21:36:04] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]] (duration: 08m 43s)
[21:36:08] <stashbot>	 T327704: DiscussionTools: unable to save comment on metawiki with comment-became-transcluded error - https://phabricator.wikimedia.org/T327704
[21:36:14] <thcipriani>	 ^ MatmaRex should be live now
[21:36:33] <MatmaRex>	 thanks thcipriani
[21:36:34] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson)
[21:37:05] <wikibugs>	 (03PS2) 10Thcipriani: Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson)
[21:37:28] <thcipriani>	 sure thing :)
[21:37:33] <wikibugs>	 (03Merged) 10jenkins-bot: Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson)
[21:37:51] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson)
[21:38:43] <wikibugs>	 (03Merged) 10jenkins-bot: Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson)
[21:39:03] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]]
[21:39:09] <stashbot>	 T328045: Increase threshold for table of contents collapsing - https://phabricator.wikimedia.org/T328045
[21:39:09] <stashbot>	 T324859: frwiktionary search config does not properly set showDescription to false - https://phabricator.wikimedia.org/T324859
[21:40:42] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:41:00] <thcipriani>	 ^ Jdlrobson okie doke, both your patches should be on mwdebug, check please
[21:41:03] <Jdlrobson>	 checking
[21:41:58] <Jdlrobson>	 LGTM please sync!
[21:42:06] * thcipriani does
[21:47:53] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]] (duration: 08m 49s)
[21:47:59] <stashbot>	 T328045: Increase threshold for table of contents collapsing - https://phabricator.wikimedia.org/T328045
[21:47:59] <stashbot>	 T324859: frwiktionary search config does not properly set showDescription to false - https://phabricator.wikimedia.org/T324859
[21:48:00] <thcipriani>	 ^ Jdlrobson all done
[21:58:06] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye
[21:58:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye
[22:02:44] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:09:26] <Jdlrobson>	 thanks thcipriani
[22:09:38] <Jdlrobson>	 (sorry for the delay got distracted and forgot to press enter:))
[22:09:55] <thcipriani>	 heh, no worries, yw ;)
[22:16:30] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:18:57] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[22:20:08] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:22:03] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[22:23:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:23:57] <zabe>	 !log running migrateRevisionCommentTemp.php in cebwiki in screen with --sleep 2 # T275246
[22:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:01] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[22:41:50] <wikibugs>	 (03PS4) 10Dreamy Jazz: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907)
[22:42:48] <wikibugs>	 (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[22:43:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[22:44:31] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4046.ulsfo.wmnet with OS bullseye
[22:44:41] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye completed: - cp4046 (**PASS**)   - Removed from Puppet and Pu...
[22:44:44] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet
[22:45:23] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[22:45:36] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye
[22:45:45] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye
[22:46:39] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:53:37] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) (owner: 10Dreamy Jazz)
[22:54:20] <wikibugs>	 (03Merged) 10jenkins-bot: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) (owner: 10Dreamy Jazz)
[22:54:51] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:881390|Pin CheckUserEventTablesMigrationStage to read and write old (T324907)]]
[22:54:56] <stashbot>	 T324907: Create seperate tables for log events in CheckUser - https://phabricator.wikimedia.org/T324907
[22:56:31] <logmsgbot>	 !log zabe@deploy1002 dreamyjazz and zabe: Backport for [[gerrit:881390|Pin CheckUserEventTablesMigrationStage to read and write old (T324907)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[22:56:38] <sbassett>	 Hey all - going to scap out PS.php (removed emergency spam mitigations)
[22:58:05] <zabe>	 sbassett, could you wait a sec, currently deploying
[22:58:09] <zabe>	 I can ping you
[22:58:29] <sbassett>	 Yes, got the lock warn, thanks.
[23:03:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:03:28] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881390|Pin CheckUserEventTablesMigrationStage to read and write old (T324907)]] (duration: 08m 36s)
[23:03:33] <stashbot>	 T324907: Create seperate tables for log events in CheckUser - https://phabricator.wikimedia.org/T324907
[23:04:05] <zabe>	 sbassett, done
[23:04:40] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS bullseye
[23:04:50] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye executed with errors: - cp4039 (**FAIL**)   - Downtimed on Ic...
[23:04:59] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye
[23:05:08] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye
[23:05:25] <wikibugs>	 (03CR) 10Jdlrobson: "Jan: Should this be abandoned?" [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[23:05:31] <wikibugs>	 (03CR) 10Jdlrobson: "Jan: Should this be abandoned?" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[23:06:38] <sbassett>	 Tx, Zabe
[23:07:04] <wikibugs>	 (03PS3) 10Superpes15: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987)
[23:07:48] <wikibugs>	 (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[23:10:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:10:48] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) This is happening the first time I run the cookbooks on any of the newer servers. I've now adapted to the workflow of running the cookbook...
[23:13:24] <logmsgbot>	 !log sbassett@deploy1002 Synchronized private/PrivateSettings.php: T326691 - remove mitigation and monitor (duration: 06m 52s)
[23:19:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:22:52] <wikibugs>	 (03PS4) 10Zabe: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[23:22:56] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[23:23:36] <wikibugs>	 (03Merged) 10jenkins-bot: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15)
[23:24:27] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]]
[23:24:30] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[23:24:32] <stashbot>	 T327987: Change project logo in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327987
[23:25:41] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage
[23:25:54] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:26:06] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:26:06] <logmsgbot>	 !log zabe@deploy1002 zabe and superpes: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[23:28:49] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage
[23:46:25] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:51:50] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4039.ulsfo.wmnet with OS bullseye
[23:51:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye completed: - cp4039 (**PASS**)   - Removed from Puppet and Pu...
[23:52:20] <logmsgbot>	 !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4039.ulsfo.wmnet
[23:53:50] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[23:54:30] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye
[23:54:40] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye
[23:59:10] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]] (duration: 34m 42s)
[23:59:14] <stashbot>	 T327987: Change project logo in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327987