[00:00:12] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:48] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:00] (PS1) Jdlrobson: Enable ResourceLoader client preferences on beta cluster"" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:41:19] (PS2) Jdlrobson: Enable ResourceLoader client preferences on beta cluster"" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:41:59] (CR) Jdlrobson: "Hey Zabe thanks for catching that. (FWIW luckily this is a NOOP in production servers right now, but would have led to this rolling out wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[00:45:33] (PS3) Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:47:29] (PS1) Eevans: cassandra-dev: enable internode encryption [puppet] - https://gerrit.wikimedia.org/r/883682
[00:48:25] (CR) Eevans: "check_experimental" [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:48:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2028.codfw.wmnet,service=cdn
[00:48:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2028.codfw.wmnet,service=ats-be
[00:49:11] !log depool cp2028 for testing firmware update cookbook: T321309
[00:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:15] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[00:50:02] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2028.codfw.wmnet
[00:50:56] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:51:26] (PS4) Jdlrobson: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621
[00:51:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2028.codfw.wmnet
[00:53:24] (CR) Eevans: [C: +2] cassandra-dev: enable internode encryption [puppet] - https://gerrit.wikimedia.org/r/883682 (owner: Eevans)
[00:53:49] (CR) Jdrewniak: [C: +2] Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[00:54:32] (Merged) jenkins-bot: Enable ResourceLoader client preferences on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[01:00:12] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2*: Enable internode encryption - eevans@cumin1001
[01:02:58] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:03:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2028.codfw.wmnet
[01:03:10] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp2028.codfw.wmnet
[01:05:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2028.codfw.wmnet with OS bullseye
[01:05:15] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2028.codfw.wmnet with OS bullseye
[01:19:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2*: Enable internode encryption - eevans@cumin1001
[01:20:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage
[01:23:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage
[01:28:25] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[01:46:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2028.codfw.wmnet with OS bullseye
[01:46:49] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2028.codfw.wmnet with OS bullseye completed: - cp2028 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[01:46:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2027.codfw.wmnet,service=cdn
[01:46:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2027.codfw.wmnet,service=ats-be
[01:47:08] PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:00] ^ fixing
[01:48:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp2027.codfw.wmnet with reason: firmware test
[01:48:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2027.codfw.wmnet with reason: firmware test
[01:49:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2028.codfw.wmnet,service=cdn
[01:49:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2028.codfw.wmnet,service=ats-be
[01:51:04] SRE, Wikimedia-Mailing-lists: Archive metavid-l - https://phabricator.wikimedia.org/T327971 (Ladsgroup) Open→Resolved a:Ladsgroup {{done}}
[01:53:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[01:53:30] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[01:53:58] RECOVERY - Host cp2027 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[01:55:59] SRE, Infrastructure-Foundations, Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (ssingh) Since we started reimaging the cp hosts to bullseye, this has come up again and I was loo...
[01:59:22] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:01:40] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:05:48] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:09:04] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 28 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:10:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:10] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:15:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:35] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[02:17:41] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Downtimed on Icinga/Alertmanager -...
[02:17:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye
[02:18:00] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye
[02:20:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:05] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) `cp2027`, for later debugging: ` Jan 26 02:23:56 partman-auto-raid: Selected spare count: 0 Jan 26 02:23:56 partman-auto-raid: Spare devices count: 0 Jan 26 02:23:56 partman-auto-raid: mdadm: cannot open...
[02:30:30] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye
[02:30:36] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p...
[02:41:44] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS bullseye
[02:41:50] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6013.drmrs.wmnet with OS bullseye
[02:46:44] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:01:33] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[03:04:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:33] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6013.drmrs.wmnet with reason: host reimage
[03:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:26:35] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS bullseye
[03:26:39] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6013.drmrs.wmnet with OS bullseye completed: - cp6013 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[03:27:57] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6013.drmrs.wmnet
[03:28:07] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:29:00] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[03:29:15] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS bullseye
[03:29:21] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6005.drmrs.wmnet with OS bullseye
[03:49:12] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[03:52:04] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6005.drmrs.wmnet with reason: host reimage
[03:59:47] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:19] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:17:55] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS bullseye
[04:18:01] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6005.drmrs.wmnet with OS bullseye completed: - cp6005 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[04:22:28] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6005.drmrs.wmnet
[04:23:07] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[04:24:01] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS bullseye
[04:24:07] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6014.drmrs.wmnet with OS bullseye
[04:42:19] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[04:45:31] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6014.drmrs.wmnet with reason: host reimage
[04:47:48] (PS1) Ladsgroup: Revert "Disable PHP L10n in beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/883707
[04:48:01] (PS2) Ladsgroup: Revert "Disable PHP L10n in beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/883707 (https://phabricator.wikimedia.org/T99740)
[05:04:17] SRE, Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (JJMC89)
[05:07:00] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS bullseye
[05:07:06] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6014.drmrs.wmnet with OS bullseye completed: - cp6014 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[05:09:15] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6014.drmrs.wmnet
[05:10:00] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[05:10:15] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS bullseye
[05:10:21] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6006.drmrs.wmnet with OS bullseye
[05:11:34] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:28:40] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[05:32:30] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6006.drmrs.wmnet with reason: host reimage
[05:42:21] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:53:21] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS bullseye
[05:53:26] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6006.drmrs.wmnet with OS bullseye completed: - cp6006 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[05:53:57] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6006.drmrs.wmnet
[05:54:25] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[05:57:16] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS bullseye
[05:57:24] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6015.drmrs.wmnet with OS bullseye
[06:10:13] jouncebot: nowandnext
[06:10:14] No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[06:10:14] In 0 hour(s) and 49 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[06:10:14] In 0 hour(s) and 49 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[06:13:51] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:16:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327861
[06:16:48] T327861: Switchover x1 master (db1120 -> db1103) - https://phabricator.wikimedia.org/T327861
[06:17:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T327861
[06:17:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1103 with weight 0 T327861', diff saved to https://phabricator.wikimedia.org/P43350 and previous config saved to /var/cache/conftool/dbconfig/20230126-061751-root.json
[06:18:08] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[06:18:32] (CR) Marostegui: mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[06:18:37] (CR) Marostegui: [C: +2] mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/882785 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[06:19:11] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:20:41] hmm
[06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:52] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage
[06:22:17] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 42 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:22:24] false alert
[06:24:25] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:45] so wikiuser2023 works just fine in mwdebug, syncing
[06:26:35] 37 ______▇ 0622 ○ 0626 ● DBReadOnlyError..... .19 i/l/r/d/Database:675 Database is read-only: The database is read-only until replication lag decreases.
[06:26:43] only 37 though
[06:26:56] It's Manuel's x1 switchover
[06:28:16] yep
[06:30:09] (PS1) Marostegui: Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/883710
[06:30:14] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:30:33] (CR) Marostegui: [C: +2] Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/883710 (owner: Marostegui)
[06:32:49] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Rotating wikiuser password (T326802) (duration: 07m 23s)
[06:32:53] T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802
[06:37:52] (PS1) Ladsgroup: mariadb: Rotate wikiuser to wikiuser2023 [puppet] - https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802)
[06:38:47] (CR) Marostegui: [C: +1] mariadb: Rotate wikiuser to wikiuser2023 [puppet] - https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:42:12] (PS1) Marostegui: db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/883694 (https://phabricator.wikimedia.org/T327861)
[06:42:35] (CR) Marostegui: [C: +2] db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/883694 (https://phabricator.wikimedia.org/T327861) (owner: Marostegui)
[06:42:40] (PS1) Ladsgroup: dbtools: Rotate wikiuser [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802)
[06:42:46] (CR) CI reject: [V: -1] dbtools: Rotate wikiuser [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:43:08] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[06:43:28] Amir1: ^ especially with the triggers, let's wait for the dc switch to avoid running into race conditions?
[06:43:40] Amir1: We'd need to change the triggers across all the hosts in production
[06:43:59] marostegui: I can automate that
[06:44:11] the whole thing is mostly automated
[06:44:17] Amir1: Sure, what I mean is, the switchover is in 15 minutes, let's wait until it is done
[06:44:26] oh that one
[06:44:32] sure, you said dc switchover
[06:44:40] (PS1) Majavah: fix nova-metadata firewall rules [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980)
[06:44:49] I thought I need to wait months :D
[06:44:49] oh sorry
[06:44:51] I meant x1
[06:44:58] so many switchovers
[06:45:34] (CR) Marostegui: [C: +1] mariadb: remove grants and settings for racktables db (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[06:47:00] (PS2) Majavah: fix nova-metadata firewall rules [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980)
[06:48:11] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS bullseye
[06:48:17] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6015.drmrs.wmnet with OS bullseye completed: - cp6015 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu...
[06:48:24] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39260/console" [puppet] - https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980) (owner: Majavah)
[06:48:34] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6015.drmrs.wmnet
[06:49:08] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) Script used to generate the servers lists: {P43345}
[06:49:21] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:49:33] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[06:50:16] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[06:51:42] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui)
[06:52:45] SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[06:52:57] (CR) Ladsgroup: "recheck" [software] - https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: Ladsgroup)
[06:53:05] SRE, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) Adding Jaime for the backup related hosts
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700)
[07:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0700).
[07:00:08] !log Starting x1 eqiad failover from db1120 to db1103 - T327861
[07:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:12] T327861: Switchover x1 master (db1120 -> db1103) - https://phabricator.wikimedia.org/T327861
[07:00:19] o/
[07:00:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1103 to x1 primary and set section read-write T327861', diff saved to https://phabricator.wikimedia.org/P43351 and previous config saved to /var/cache/conftool/dbconfig/20230126-070035-marostegui.json
[07:01:12] (CR) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:14] (CR) Marostegui: [C: +2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:17] (PS2) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/883506 (https://phabricator.wikimedia.org/T327861) (owner: Gerrit maintenance bot)
[07:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T327861', diff saved to https://phabricator.wikimedia.org/P43352 and previous config saved to /var/cache/conftool/dbconfig/20230126-070158-root.json
[07:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add some weight to db1103', diff saved to https://phabricator.wikimedia.org/P43353 and previous config saved to /var/cache/conftool/dbconfig/20230126-070220-marostegui.json
[07:04:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43354 and previous config saved to /var/cache/conftool/dbconfig/20230126-070512-root.json
[07:06:54] (PS1) Marostegui: ProductionServices.php: Depool pc2011 [mediawiki-config] - https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925)
[07:07:09] Amir1: can you review ^?
[07:07:17] on it
[07:07:52] (CR) Ladsgroup: [C: +1] ProductionServices.php: Depool pc2011 [mediawiki-config] - https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: Marostegui)
[07:08:06] (PS1) Ayounsi: Remove single contact feature [puppet] - https://gerrit.wikimedia.org/r/883700
[07:09:24] (CR) Marostegui: [C: +2] ProductionServices.php: Depool pc2011 [mediawiki-config] - https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: Marostegui)
[07:10:05] (Merged) jenkins-bot: ProductionServices.php: Depool pc2011 [mediawiki-config] - https://gerrit.wikimedia.org/r/883699 (https://phabricator.wikimedia.org/T327925) (owner: Marostegui)
[07:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:10:14] Amir1: what was that URL for the deployment commands?
[07:10:36] scap backport 883699
[07:10:38] ?
[07:10:44] ah that :)
[07:10:45] thanks
[07:11:02] https://deploy-commands.toolforge.org/bacc/883699 This is also useful, e.g. how to revert
[07:11:13] yeah that is what I was looking for :)
[07:11:53] mmm there seem to be something pending to be deployed?
[07:12:14] 07:11:23 The following are unexpected commits pulled from origin for /srv/mediawiki-staging:
[07:12:14] commit 4d798447521b90a0bf8af199981789c9e53fc41c
[07:12:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800
[07:12:53] T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800
[07:12:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800
[07:14:35] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]]
[07:14:39] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[07:15:04] (CR) Ladsgroup: "Hi, please rebase this in deploy1002 after merge, it doesn't need to follow backport window but if it's not rebased, it'll confuse future " [mediawiki-config] - https://gerrit.wikimedia.org/r/883621 (owner: Jdlrobson)
[07:16:25] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:16:34] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:17:04] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backupmon1001.eqiad.wmnet with reason: m1 switchover
[07:17:17] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backupmon1001.eqiad.wmnet with reason: m1 switchover
[07:17:39] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1001.eqiad.wmnet with reason: m1 switchover
[07:18:03] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1001.eqiad.wmnet with reason: m1 switchover
[07:18:10] (PS1) Marostegui: mariadb: Promote db1176 to m1 master [puppet] - https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800)
[07:18:57] (PS2) Marostegui: mariadb: Promote db1176 to m1 master [puppet] - https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800)
[07:20:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43356 and previous config saved to /var/cache/conftool/dbconfig/20230126-072017-root.json
[07:21:42] (CR) Jcrespo: [C: +1] mariadb: Promote db1176 to m1 master [puppet] - https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800) (owner: Marostegui)
[07:21:53] (CR) Marostegui: [C: +2] mariadb: Promote db1176 to m1 master [puppet] - https://gerrit.wikimedia.org/r/883703 (https://phabricator.wikimedia.org/T327800) (owner: Marostegui)
[07:23:04] !log Failover m1 from db1195 to db1176 - T327800
[07:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:08] T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800
[07:23:10] jynus: I am starting ok?
[07:23:21] green light for me [07:23:49] done [07:23:58] (03CR) 10Ayounsi: "Follow up from Ia0a4b2b9605a1c795fb0345e52234c5a32187887" [puppet] - 10https://gerrit.wikimedia.org/r/883700 (owner: 10Ayounsi) [07:24:20] etherpad is working for me [07:24:22] yeah [07:24:23] same [07:25:05] let me update racktables and move it to archived [07:25:12] cool [07:25:17] !log T322869: depooling wdqs2009 wdqs2010 wdqs2011 wdqs2012 these hosts should not serve user traffic yet they don't have the database loaded [07:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:21] T322869: Fewer results from wdqs nodes running in codfw than eqiad - https://phabricator.wikimedia.org/T322869 [07:25:37] do you see any process on the old host? [07:25:45] (JobUnavailable) firing: (2) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:55] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:883699|ProductionServices.php: Depool pc2011 (T327925)]] (duration: 11m 19s) [07:25:59] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [07:26:07] ^bacula is me, will resolve when I start it up [07:26:11] jynus: nope [07:27:05] let me start up bacula to 100% finalize the process [07:27:06] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:27:06] (03PS1) 10Marostegui: monitoring.yaml: Change master for m1 [puppet] - 10https://gerrit.wikimedia.org/r/883705 (https://phabricator.wikimedia.org/T327800) [07:27:08] jynus: you can merge ^ as you wish [07:27:25] oh, true, I forgot [07:27:50] (03CR) 10Jcrespo: [C: 03+2] monitoring.yaml: Change master for m1 [puppet] - 10https://gerrit.wikimedia.org/r/883705
(https://phabricator.wikimedia.org/T327800) (owner: 10Marostegui) [07:28:33] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:30:02] I think there is one more patch I have to do before starting up stuff [07:30:10] which one? [07:31:26] (03PS1) 10Marostegui: db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883726 (https://phabricator.wikimedia.org/T327995) [07:32:42] (03PS1) 10Jcrespo: dbbackups: Switchover m1 primary at which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800) [07:33:04] (03CR) 10Marostegui: [C: 03+1] dbbackups: Switchover m1 primary at which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800) (owner: 10Jcrespo) [07:33:06] T327800 [07:33:06] T327800: Switchover m1 master (db1195 -> db1176) - https://phabricator.wikimedia.org/T327800 [07:33:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/883727 [07:33:22] (03CR) 10Marostegui: [C: 03+2] db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883726 (https://phabricator.wikimedia.org/T327995) (owner: 10Marostegui) [07:33:29] this is all because proxy & tls only [07:34:49] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/883513 (https://phabricator.wikimedia.org/T327997) [07:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43357 and previous config saved to /var/cache/conftool/dbconfig/20230126-073523-root.json [07:35:26] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:35:42] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switchover m1 primary at 
which stats are pointing [puppet] - 10https://gerrit.wikimedia.org/r/883727 (https://phabricator.wikimedia.org/T327800) (owner: 10Jcrespo) [07:35:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s1 T327997 [07:35:54] T327997: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T327997 [07:36:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s1 T327997 [07:36:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2112 with weight 0 T327997', diff saved to https://phabricator.wikimedia.org/P43358 and previous config saved to /var/cache/conftool/dbconfig/20230126-073616-root.json [07:36:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/883513 (https://phabricator.wikimedia.org/T327997) (owner: 10Gerrit maintenance bot) [07:45:00] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2107 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/883514 (https://phabricator.wikimedia.org/T327998) [07:45:28] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:48:39] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2009.* [07:49:21] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2010.* [07:49:28] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2011.* [07:49:42] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2012.* [07:50:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43359 and previous config saved 
to /var/cache/conftool/dbconfig/20230126-075028-root.json [07:56:00] (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883820 [07:57:52] (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/883820 (owner: 10Marostegui) [08:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0800). [08:00:07] !log Starting s1 codfw failover from db2103 to db2112 - T327997 [08:00:08] as often happens, there are no trainees signed up to learn the ropes today, and there are no patches scheduled for deployment, so enjoy a quiet morning! [08:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:11] T327997: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T327997 [08:00:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2112 to s1 primary T327997', diff saved to https://phabricator.wikimedia.org/P43360 and previous config saved to /var/cache/conftool/dbconfig/20230126-080033-root.json [08:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103 T327997', diff saved to https://phabricator.wikimedia.org/P43361 and previous config saved to /var/cache/conftool/dbconfig/20230126-080159-root.json [08:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P43362 and previous config saved to /var/cache/conftool/dbconfig/20230126-080233-root.json [08:04:08] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:04:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2107 with weight 0 T327998', diff saved to 
https://phabricator.wikimedia.org/P43363 and previous config saved to /var/cache/conftool/dbconfig/20230126-080427-root.json [08:04:32] T327998: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T327998 [08:04:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T327998 [08:05:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T327998 [08:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43364 and previous config saved to /var/cache/conftool/dbconfig/20230126-080533-root.json [08:05:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883700 (owner: 10Ayounsi) [08:06:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2107 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/883514 (https://phabricator.wikimedia.org/T327998) (owner: 10Gerrit maintenance bot) [08:07:16] (03CR) 10Muehlenhoff: [C: 03+2] Add new install servers [puppet] - 10https://gerrit.wikimedia.org/r/883581 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:08:02] (03PS4) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [08:09:25] (03CR) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [08:14:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud 
hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) @Papaul could you rename (Netbox, label, console, etc) the switch cloudsw**1**-b1-codfw? For co... [08:16:42] (03CR) 10Jcrespo: mariadb: remove grants and settings for racktables db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [08:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43365 and previous config saved to /var/cache/conftool/dbconfig/20230126-081738-root.json [08:17:43] !log Starting s2 codfw failover from db2104 to db2107 - T327998 [08:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:47] T327998: Switchover s2 master (db2104 -> db2107) - https://phabricator.wikimedia.org/T327998 [08:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2107 to s2 primary T327998', diff saved to https://phabricator.wikimedia.org/P43366 and previous config saved to /var/cache/conftool/dbconfig/20230126-081818-root.json [08:18:31] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39262/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [08:19:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T327998', diff saved to https://phabricator.wikimedia.org/P43367 and previous config saved to /var/cache/conftool/dbconfig/20230126-081916-root.json [08:20:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: After DIMM replacement', diff saved to https://phabricator.wikimedia.org/P43368 and previous config saved to /var/cache/conftool/dbconfig/20230126-082038-root.json [08:20:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43369 and 
previous config saved to /var/cache/conftool/dbconfig/20230126-082055-root.json [08:21:53] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:22:05] (03PS1) 10Muehlenhoff: Adapt cookbooks to installserver role rename [cookbooks] - 10https://gerrit.wikimedia.org/r/883833 [08:22:53] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:23:06] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/883515 (https://phabricator.wikimedia.org/T327999) [08:23:43] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:24:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999 [08:24:29] T327999: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T327999 [08:24:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2127 with weight 0 T327999', diff saved to https://phabricator.wikimedia.org/P43370 and previous config saved to /var/cache/conftool/dbconfig/20230126-082432-root.json [08:24:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999 [08:25:15] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/883515 (https://phabricator.wikimedia.org/T327999) (owner: 10Gerrit maintenance bot) [08:26:46] (03CR) 10Muehlenhoff: sre.ganeti.reimage: add new cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [08:27:56] (03CR) 10Muehlenhoff: Rename installserver 
role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [08:32:12] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. No point doing anything more complex if we're not gonna have it elsewhere I think." [homer/public] - 10https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [08:32:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43371 and previous config saved to /var/cache/conftool/dbconfig/20230126-083243-root.json [08:34:36] !log Starting s3 codfw failover from db2105 to db2127 - T327999 [08:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:40] T327999: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T327999 [08:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2127 to s3 primary T327999', diff saved to https://phabricator.wikimedia.org/P43372 and previous config saved to /var/cache/conftool/dbconfig/20230126-083459-root.json [08:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2105 T327999', diff saved to https://phabricator.wikimedia.org/P43373 and previous config saved to /var/cache/conftool/dbconfig/20230126-083543-root.json [08:36:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43374 and previous config saved to /var/cache/conftool/dbconfig/20230126-083600-root.json [08:36:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43375 and previous config saved to /var/cache/conftool/dbconfig/20230126-083640-root.json [08:37:25] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:38:27] 10SRE, 
10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:39:28] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2118 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/883516 (https://phabricator.wikimedia.org/T328000) [08:40:16] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:40:45] (JobUnavailable) firing: (2) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:41:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000 [08:41:05] T328000: Switchover s7 master (db2121 -> db2118) - https://phabricator.wikimedia.org/T328000 [08:41:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2118 with weight 0 T328000', diff saved to https://phabricator.wikimedia.org/P43376 and previous config saved to /var/cache/conftool/dbconfig/20230126-084112-root.json [08:41:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000 [08:41:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2118 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/883516 (https://phabricator.wikimedia.org/T328000) (owner: 10Gerrit maintenance bot) [08:44:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) @cmooney Thinking more about it... Your approach is great and careful and would suit well live... 
[08:44:37] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:44:37] !log added Eoghan to pwstore [08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:14] (03CR) 10Ayounsi: [C: 03+2] Disable Telemetry on eqsin switches [homer/public] - 10https://gerrit.wikimedia.org/r/883497 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [08:46:42] (03PS5) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [08:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43377 and previous config saved to /var/cache/conftool/dbconfig/20230126-084748-root.json [08:48:50] gerrit seems unavailable again [08:49:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) [08:49:16] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:51:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43378 and previous config saved to /var/cache/conftool/dbconfig/20230126-085105-root.json [08:51:34] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [08:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 
(re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43379 and previous config saved to /var/cache/conftool/dbconfig/20230126-085145-root.json [08:53:02] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65172 bytes in 8.989 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [08:54:33] (03PS1) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155) [08:55:07] (03CR) 10Volans: [C: 03+1] "LGTM, depends on I813c36b4deb4992e44a848ddc3c3a5c738914661" [cookbooks] - 10https://gerrit.wikimedia.org/r/883833 (owner: 10Muehlenhoff) [08:55:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8560178, @ayounsi wrote: >> B connection is probably sufficient, this does mean... [08:56:29] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [08:56:36] (03PS3) 10Muehlenhoff: Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587 [08:57:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [09:00:05] brennen and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0900). 
[09:00:28] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:00:28] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:00:48] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:01:37] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 4.196 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:01:37] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:02:00] !log Starting s7 codfw failover from db2121 to db2118 - T328000 [09:02:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] fix nova-metadata firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/883696 (https://phabricator.wikimedia.org/T327980) (owner: 10Majavah) [09:02:03] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66291 bytes in 7.481 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:05] T328000: Switchover s7 master (db2121 -> db2118) - https://phabricator.wikimedia.org/T328000 [09:02:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2118 to s7 primary T328000', diff saved to https://phabricator.wikimedia.org/P43380 and previous config saved to /var/cache/conftool/dbconfig/20230126-090212-root.json [09:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 
50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43381 and previous config saved to /var/cache/conftool/dbconfig/20230126-090253-root.json [09:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2121 T328000', diff saved to https://phabricator.wikimedia.org/P43382 and previous config saved to /var/cache/conftool/dbconfig/20230126-090302-root.json [09:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P43383 and previous config saved to /var/cache/conftool/dbconfig/20230126-090418-root.json [09:05:19] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:05:44] !log phedenskog@deploy1002 Started deploy [performance/navtiming@e5fdd6e]: (no justification provided) [09:05:50] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@e5fdd6e]: (no justification provided) (duration: 00m 06s) [09:05:57] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:06:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43384 and previous config saved to /var/cache/conftool/dbconfig/20230126-090610-root.json [09:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43385 and previous config saved to /var/cache/conftool/dbconfig/20230126-090650-root.json [09:08:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) LGTM! 
[09:11:57] (03CR) 10Muehlenhoff: [C: 03+2] Rename installserver role [puppet] - 10https://gerrit.wikimedia.org/r/883587 (owner: 10Muehlenhoff) [09:12:11] (03PS2) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155) [09:12:23] (03PS3) 10Jcrespo: dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155) [09:14:27] (03CR) 10Muehlenhoff: [C: 03+2] Adapt cookbooks to installserver role rename [cookbooks] - 10https://gerrit.wikimedia.org/r/883833 (owner: 10Muehlenhoff) [09:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43386 and previous config saved to /var/cache/conftool/dbconfig/20230126-091758-root.json [09:19:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001 [09:19:07] T328001: Switchover x2 master (db2142 -> db2144) - https://phabricator.wikimedia.org/T328001 [09:19:09] (03PS1) 10Marostegui: mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/883836 (https://phabricator.wikimedia.org/T328001) [09:19:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001 [09:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43387 and previous config saved to /var/cache/conftool/dbconfig/20230126-091923-root.json [09:19:49] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Optimize execution time and delay backups [puppet] - 10https://gerrit.wikimedia.org/r/883834 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [09:20:27] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43388 and previous config saved to /var/cache/conftool/dbconfig/20230126-092115-root.json [09:21:40] (03CR) 10DCausse: "this distribution does not seem to have the required deps in the opt folder:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [09:21:41] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69123 bytes in 0.055 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:21:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43389 and previous config saved to /var/cache/conftool/dbconfig/20230126-092155-root.json [09:22:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001 [09:22:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T328001 [09:24:01] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10ayounsi) Thanks for the summary! Some additional notes/thoughts: * public1-a/b-codfw host might be better grouped in a single rack per row, providing still redundancy (... 
[09:24:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2144 to x2 master [puppet] - 10https://gerrit.wikimedia.org/r/883836 (https://phabricator.wikimedia.org/T328001) (owner: 10Marostegui) [09:24:45] !log Starting x2 codfw failover from db2142 to db2144 - T328001 [09:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:55] T328001: Switchover x2 master (db2142 -> db2144) - https://phabricator.wikimedia.org/T328001 [09:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P43390 and previous config saved to /var/cache/conftool/dbconfig/20230126-092512-root.json [09:25:17] T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811 [09:30:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39263/console" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [09:30:27] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez) [09:30:32] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:33:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43391 and previous config saved to /var/cache/conftool/dbconfig/20230126-093303-root.json [09:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43392 and previous config saved to /var/cache/conftool/dbconfig/20230126-093428-root.json [09:35:05] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - 
https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43393 and previous config saved to /var/cache/conftool/dbconfig/20230126-093620-root.json [09:36:47] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add new fake pems for the mlserve's pki intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/883632 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:37:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43394 and previous config saved to /var/cache/conftool/dbconfig/20230126-093700-root.json [09:37:05] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:37:14] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:37:27] ^ checking [09:37:50] jynus: that is a backup source [09:38:08] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo) [09:38:15] looks overloaded [09:39:08] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo) [09:39:20] (03CR) 10Jbond: [C: 03+1] "lgtm: some minor nits which could also be addressed in a future change" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [09:39:23] overloaded? 
there are no running backups [09:40:12] there is a backup running now? why? [09:40:26] I don't know but the host is very very very slow [09:40:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:40:33] So it is either that or HW [09:40:45] no, there is something going on, but not sure why [09:41:04] HW logs are clean [09:41:06] (03PS3) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) [09:41:43] jynus: there are actually two backups running, right? one for s1 and one for s6 [09:41:46] backups just started now [09:41:54] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:42:08] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39265/console" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:42:11] maybe the scheduler got weird because of the time change [09:42:34] yeah could be [09:42:37] (03CR) 10Elukey: "John: fixed the name of one of the pem files, missed a _, pcc complained but now it seems ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:42:42] I will kill all of those processes, but don't like that systemd timer retroactively runs stuff [09:42:52] jynus: might happen on the other sources too? 
[09:42:57] yeah [09:43:07] I mean, it shouldn't overload anyway [09:43:16] but may happen as it is not the night [09:43:21] (03CR) 10Jbond: [C: 03+1] "thx, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:44:56] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:46:27] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [09:47:37] even if backups started wrongly, db2141 shouldn't have overloaded [09:47:42] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [09:47:51] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [09:47:52] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet [09:47:56] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [09:48:00] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet [09:48:11] (03PS6) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 [09:48:16] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough Btullis About to be decommed https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:48:19] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [09:49:07] !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435]: Regular analytics weekly train [analytics/refinery@8ed8435] [09:49:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43395 and previous config saved to /var/cache/conftool/dbconfig/20230126-094933-root.json [09:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43396 and previous config saved to /var/cache/conftool/dbconfig/20230126-095205-root.json [09:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43397 and previous config saved to /var/cache/conftool/dbconfig/20230126-095257-root.json [09:53:30] (03PS1) 10Marostegui: Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/883711 [09:54:12] (03CR) 10Marostegui: [C: 03+2] Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/883711 (owner: 10Marostegui) [09:56:07] !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435]: Regular analytics weekly train [analytics/refinery@8ed8435] (duration: 07m 00s) [09:57:09] !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (thin): Regular analytics weekly train THIN [analytics/refinery@8ed8435] [09:57:15] !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (thin): Regular analytics weekly train THIN [analytics/refinery@8ed8435] (duration: 00m 05s) [09:57:24] !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST 
[analytics/refinery@8ed8435] [09:58:32] !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8ed8435] (duration: 01m 08s) [09:58:58] (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582 [09:59:25] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [09:59:31] (03PS3) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582 [09:59:54] (03CR) 10Marostegui: "This requires applying all the events live" [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [09:59:59] (03CR) 10Marostegui: [C: 03+1] dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [10:00:41] (03CR) 10Ladsgroup: dbtools: Rotate wikiuser (031 comment) [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [10:03:16] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 [10:04:15] (03CR) 10Slyngshede: [C: 03+1] "Test on db1206." 
[puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [10:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43398 and previous config saved to /var/cache/conftool/dbconfig/20230126-100438-root.json [10:05:05] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond) [10:07:31] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff) We can't migrate the puppetdb2002 VM (it's being moved to baremetal, but that is unlikely completed by then), so we'll need to disable Puppet f... [10:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43399 and previous config saved to /var/cache/conftool/dbconfig/20230126-100802-root.json [10:08:21] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/883582 (owner: 10Muehlenhoff) [10:08:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:08:31] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts sretest1002.eqiad.wmnet [10:08:38] (03CR) 10Slyngshede: [C: 03+1] "Only "concern" is if someone were to use this with a system that parses the "nagios" output, it might get confused about the topology inform" [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [10:10:23] (03CR) 10Filippo Giunchedi: "patch LGTM, not +1'ing yet though because centrallog1002 is failing its rsyslog probes: https://logstash.wikimedia.org/goto/2155b6c052cd06" [puppet] - 
10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [10:10:53] (03PS1) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712 [10:11:13] (03CR) 10Jelto: [C: 04-1] "Thanks for finding this fix for the start issues of phd." [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [10:11:37] (03CR) 10Filippo Giunchedi: "LGTM, modulo CI failure that doesn't look related?" [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [10:11:58] (03PS2) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712 [10:12:01] (03CR) 10Filippo Giunchedi: [C: 03+1] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [10:12:10] (03PS3) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712 [10:13:33] (03Abandoned) 10Jcrespo: Revert "dbbackups: Optimize execution time and delay backups" [puppet] - 10https://gerrit.wikimedia.org/r/883712 (owner: 10Jcrespo) [10:14:40] (03PS4) 10Clément Goubert: wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) [10:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43400 and previous config saved to /var/cache/conftool/dbconfig/20230126-101943-root.json [10:21:40] !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - Second after failure [analytics/refinery@8ed8435] [10:21:45] !log joal@deploy1002 Finished deploy 
[analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - Second after failure [analytics/refinery@8ed8435] (duration: 00m 04s) [10:22:33] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 [10:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43401 and previous config saved to /var/cache/conftool/dbconfig/20230126-102307-root.json [10:24:11] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 (owner: 10Jbond) [10:24:19] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: Add additional logging [cookbooks] - 10https://gerrit.wikimedia.org/r/883847 [10:31:24] (03CR) 10JMeybohm: [C: 04-1] pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:31:27] !log joal@deploy1002 Started deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - third after failure [analytics/refinery@8ed8435] [10:31:56] (03CR) 10Muehlenhoff: [C: 03+2] perccli: Print human-readable topology information on disk failure [puppet] - 10https://gerrit.wikimedia.org/r/883600 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [10:32:08] (03PS2) 10Ladsgroup: mariadb: Rotate wikiuser to wikiuser2023 [puppet] - 10https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802) [10:32:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Rotate wikiuser to wikiuser2023 [puppet] - 10https://gerrit.wikimedia.org/r/883693 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [10:32:43] !log joal@deploy1002 Finished deploy [analytics/refinery@8ed8435] (hadoop-test): Regular analytics weekly train TEST - third after 
failure [analytics/refinery@8ed8435] (duration: 01m 16s) [10:32:52] (03PS5) 10Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 [10:33:01] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [10:34:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [10:34:31] (03CR) 10Ladsgroup: [C: 03+2] dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [10:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43402 and previous config saved to /var/cache/conftool/dbconfig/20230126-103448-root.json [10:35:32] (03Merged) 10jenkins-bot: dbtools: Rotate wikiuser [software] - 10https://gerrit.wikimedia.org/r/883695 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [10:35:36] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo the fact that I don't know what (if any) things will need to be removed (e.g. left behind/unmanaged by puppet)" [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [10:36:11] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:38:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43403 and previous config saved to /var/cache/conftool/dbconfig/20230126-103812-root.json [10:40:13] (03CR) 10Superpes15: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15) [10:41:28] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Rename aux-k8s-ingress service to k8s-ingress-aux [dns] - 10https://gerrit.wikimedia.org/r/883551 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [10:41:50] !log cgoubert@authdns1001:~$ sudo -i authdns-update [10:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:33] !log joal@deploy1002 Started deploy [airflow-dags/analytics@e52205b]: (no justification provided) [10:42:44] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@e52205b]: (no justification provided) (duration: 00m 11s) [10:43:43] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [10:45:17] !log installing postgresql-13 security updates [10:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:15] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename aux-k8s-ingress service to k8s-ingress-aux - cgoubert@cumin1001" [10:49:45] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:49:55] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:50:14] ^ I am on those gerrit alarms [10:50:17] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:50:43] gerrit is back hashar [10:51:03] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:51:07] (03PS1) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) [10:51:11] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 972 bytes in 0.027 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:51:27] (03CR) 10CI reject: [V: 04-1] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [10:51:37] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61922 bytes in 0.042 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:52:21] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:53:02] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [10:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43404 and previous config saved to /var/cache/conftool/dbconfig/20230126-105317-root.json [10:54:29] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [10:54:39] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [10:55:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename aux-k8s-ingress service to k8s-ingress-aux - cgoubert@cumin1001" [10:55:04] !log 
cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:16] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [10:55:26] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] service: Rename aux-k8s-ingress service to k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883552 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [10:56:20] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [10:57:01] (03CR) 10Ladsgroup: [C: 03+1] "https://www.php.net/manual/en/timezones.asia.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15) [10:57:12] jouncebot: nowandnext [10:57:12] For the next 0 hour(s) and 2 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T0900) [10:57:12] In 0 hour(s) and 2 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100) [10:57:12] In 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100) [10:57:17] sad [11:00:05] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100). Please do the needful. 
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1100) [11:01:59] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:02:07] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:02:33] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:03:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Rename ceph profiles to cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [11:03:27] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 01 Mar 2023 09:47:05 PM GMT +0000. 
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:03:37] (03CR) 10Btullis: Rename ceph profiles to cloudceph (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [11:03:37] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:03:50] !log Restarted Apache 2 on gerrit.wikimedia.org [11:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:01] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66644 bytes in 0.046 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:04:05] (03CR) 10Btullis: [C: 03+2] Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [11:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43405 and previous config saved to /var/cache/conftool/dbconfig/20230126-110822-root.json [11:10:02] (03PS1) 10Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 [11:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:10:46] (03PS1) 10Slyngshede: PERC RAID: Fix formatting for Nagios output. 
[puppet] - 10https://gerrit.wikimedia.org/r/883864 [11:12:26] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/883863 (owner: 10Jbond) [11:12:53] PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100% [11:13:22] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10jbond) @ssingh i have created a patch to defer reboots until all drivers have been uploaded. Are... [11:23:02] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [11:24:14] (03PS1) 10Muehlenhoff: Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865 [11:26:39] (03PS1) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 [11:26:41] (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869 [11:26:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883864 (owner: 10Slyngshede) [11:28:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39267/console" [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond) [11:28:47] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:29:29] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [11:29:36] (03CR) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:29:49] !log jgiannelos@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [11:30:17] RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:31:03] (03PS2) 10Muehlenhoff: Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865 [11:31:14] (03CR) 10Hashar: [C: 03+1] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond) [11:32:01] (03CR) 10Clément Goubert: [C: 03+1] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond) [11:33:36] (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869 [11:33:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883868 (owner: 10Jbond) [11:36:41] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux [11:37:31] (03CR) 10Muehlenhoff: [C: 03+2] Add new hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/883865 (owner: 10Muehlenhoff) [11:39:46] (03PS1) 10Jbond: Revert "gerrit: Add requestctl support to ferm gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/883725 [11:40:18] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts flowspec1001 [11:40:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "gerrit: Add requestctl support to ferm gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/883725 (owner: 10Jbond) [11:41:17] (03PS1) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886 [11:42:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:43:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:43:48] (03PS1) 10Ayounsi: flowspec1001: 
remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) [11:44:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:30] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:46:39] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flowspec1001 decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001" [11:48:03] (03CR) 10Muehlenhoff: flowspec1001: remove everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi) [11:48:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flowspec1001 decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1001" [11:48:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:48:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flowspec1001 [11:48:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Decom flowspec1001 - https://phabricator.wikimedia.org/T328009 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `flowspec1001` - flowspec1001 (**PASS**) - Downtimed host on Icinga/Alertmanag... 
[11:48:55] (03PS1) 10Jbond: wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825) [11:49:04] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes1010.eqiad.wmnet [11:49:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [11:50:05] (ConfdResourceFailed) firing: confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:50:14] (03PS2) 10Jbond: gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886 [11:50:20] (03PS2) 10Ayounsi: flowspec1001: remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) [11:50:52] (03CR) 10Ayounsi: flowspec1001: remove everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi) [11:52:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi) [11:52:45] (03CR) 10Ayounsi: [C: 03+2] flowspec1001: remove everything [puppet] - 10https://gerrit.wikimedia.org/r/883877 (https://phabricator.wikimedia.org/T328009) (owner: 10Ayounsi) [11:53:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:54:08] (03CR) 10Jbond: [C: 03+2] wikimedia.org: add cond SRV records [dns] - 10https://gerrit.wikimedia.org/r/883878 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:54:24] (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 
10https://gerrit.wikimedia.org/r/883869 [11:55:31] (03CR) 10Jbond: [C: 03+2] gerrit: Add requestctl support to ferm gerrit [puppet] - 10https://gerrit.wikimedia.org/r/883886 (owner: 10Jbond) [11:56:08] (03PS2) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) [11:56:33] (03CR) 10CI reject: [V: 04-1] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [11:56:34] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache _etcd-client-ssl._tcp.wikimedia.org on all recursors [11:56:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd-client-ssl._tcp.wikimedia.org on all recursors [11:57:30] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [11:57:56] (03PS3) 10Jcrespo: dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) [11:59:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:05] (ConfdResourceFailed) resolved: confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:00:13] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39268/console" [puppet] - 
10https://gerrit.wikimedia.org/r/883869 (owner: 10Jbond) [12:02:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883869 (owner: 10Jbond) [12:03:31] !log enable profile::base::firewall::defs_from_etcd: true globally [12:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:10] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups to avoid overload [puppet] - 10https://gerrit.wikimedia.org/r/883857 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [12:04:53] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:09:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:14] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-proxies rolling restart_daemons on A:eqiad and not A:thanos-fe and A:swift-fe or A:thanos-fe [12:10:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [12:10:39] jbond: :-( the global ferm thing made puppet agent sad in some of our servers [12:10:56] arturo: can you give me an example ill take a look [12:11:10] jbond: https://www.irccloud.com/pastebin/BuAfS8XD/ [12:12:03] * jbond looking [12:12:11] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} 
(retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [12:12:45] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01105 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:12:58] ok rolling back [12:13:19] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:24] (03PS1) 10Jbond: Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883887 [12:13:41] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:13:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:14:02] jbond: sorry :-( [12:14:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883887 (owner: 10Jbond) [12:15:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet [12:16:22] (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) [12:16:42] arturo: shuld be fixed now sorry about that [12:16:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change 
compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:17:01] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:18:37] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:46] (03PS1) 10Jaime Nuche: scap3 Jenkins deployment (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/883913 [12:20:47] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002513 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:21:35] (03CR) 10Muehlenhoff: [C: 03+2] puppet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:21:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:22:05] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:23:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:26:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:29:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-proxies (exit_code=0) rolling restart_daemons on A:eqiad and not A:thanos-fe and A:swift-fe or A:thanos-fe [12:29:53] (03PS6) 10Clément Goubert: httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 [12:31:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:31:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:35:12] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [12:35:17] (03PS1) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 [12:35:37] (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:35:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:37:06] (03PS2) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 
10https://gerrit.wikimedia.org/r/883926 [12:37:12] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [12:37:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Abhas) [12:37:39] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:38:10] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:38:14] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:38:18] (03CR) 10Clément Goubert: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:38:46] (03PS3) 10Jcrespo: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 [12:39:18] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:39:20] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:39:37] Haha jynus fixing things before I can redact a comment lol [12:39:52] yeah, I thought the . 
was a ./ [12:40:03] Same at first [12:40:09] (03CR) 10Clément Goubert: [C: 03+1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:40:11] I am not familiar with Path, I usually use os.path (join) [12:40:16] Looks good now [12:40:42] yeah, but better jbond can have a look, minimal changes sometimes are not what it is supposed to do [12:40:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet [12:40:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:40:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:40:58] jynus: agreed [12:41:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet [12:41:40] !log depool cp3051.esams.wmnet for firmware update testing: T323717 [12:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:43] T323717: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 [12:42:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3051.esams.wmnet,service=cdn [12:42:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3051.esams.wmnet,service=ats-be [12:42:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp3051.esams.wmnet with 
reason: T323717 [12:43:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp3051.esams.wmnet with reason: T323717 [12:45:19] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) >>! In T323717#8559564, @ssingh wrote: > Since we started reimaging the cp hosts to bulls... [12:46:31] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10ssingh) If I don't upgrade the iDRAC firmware, the NIC firmware fails to update for me so I have... [12:46:39] (03PS4) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) [12:46:43] (03CR) 10Elukey: pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [12:46:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:47:07] (03PS5) 10Muehlenhoff: Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) [12:47:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet [12:49:10] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) 
@Jclark-ctr can you add the disk back? [12:49:14] (03CR) 10Jcrespo: "please test on strech to make sure it works as intended :-D" [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:50:26] (03CR) 10Muehlenhoff: [C: 03+2] Split Swift cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/883228 (https://phabricator.wikimedia.org/T327783) (owner: 10Muehlenhoff) [12:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:52:25] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: sre.swift.roll-restart-reboot-proxies fails on thanos hosts, which lack nginx - https://phabricator.wikimedia.org/T327783 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been fixed by splitting the restart cookbooks i... 
[12:53:23] (03Abandoned) 10Muehlenhoff: sre.swift.roll-restart-reboot-proxies: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff) [12:53:27] (03PS4) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:53:29] (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) [12:53:32] jynus: claime: i have made a small update can you both take another look [12:53:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [12:53:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) a:03Clement_Goubert [12:53:49] (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:54:16] (03PS5) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:54:37] (03CR) 10CI reject: [V: 04-1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:55:33] jbond: ideal looks good, needs a concrete exception [12:55:54] (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) [12:55:55] updated [12:56:04] ImportError I guess= [12:56:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) [] Approval from @Ottomata or @odimitrijevic as group approvers [] Approval from @JanWMF as 
manager [] Out of band key verification [12:56:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:57:25] (03PS6) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:57:26] ok not updated :) [12:57:27] (03PS4) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) [12:57:44] jbond: looks good to me, feel free to sqash both changes, as long as it works on all versions it is ok to me [12:57:48] (03PS1) 10Elukey: ml-services: update revscoring model servers to the latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/883929 (https://phabricator.wikimedia.org/T325528) [12:57:53] (03PS7) 10Jbond: confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [12:58:02] ack thanks jynus [12:58:20] ill push the other through later though as there is also a acl blocking [12:58:51] going to lunch, but please you or someone else have a look at the thumbor hosts complaining (probably same fix than swift) [12:59:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Ottomata) Approved. I'm not certain this will need kerberos access, but I'd go ahead and give it for good measure. I'd expect there to be times when it will just be easier t... 
[13:00:26] (03CR) 10Jcrespo: [C: 03+1] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [13:00:28] (03PS4) 10Muehlenhoff: trafficserver: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) [13:02:47] (03CR) 10Filippo Giunchedi: [C: 03+1] disc_desired_state: Add k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [13:04:18] (03CR) 10Jbond: [C: 03+2] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:04:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [13:04:26] (03PS7) 10Jbond: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:04:31] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring model servers to the latest docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/883929 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [13:04:33] (03CR) 10Jbond: [V: 03+2] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:04:54] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Abhas - https://phabricator.wikimedia.org/T328015 (10Clement_Goubert) [13:05:50] (03CR) 10Slyngshede: [C: 03+2] PERC RAID: Fix formatting for Nagios output. 
[puppet] - 10https://gerrit.wikimedia.org/r/883864 (owner: 10Slyngshede) [13:06:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:07:06] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:03] !log Rebooting gerrit2002.wikimedia.org host to validate Apache 2 services starts AFTER network went online | T326125 [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:07] T326125: apache2 fails to start after gerrit hosts are rebooted - https://phabricator.wikimedia.org/T326125 [13:10:01] !log installing nodejs security updates on bullseye [13:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:42] ACKNOWLEDGEMENT - Host gerrit2002 is DOWN: PING CRITICAL - Packet loss = 100% amusso reboot! [13:12:24] PROBLEM - Check whether ferm is active by checking the default input chain on ml-staging2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:13:22] (03CR) 10Clément Goubert: [C: 03+2] disc_desired_state: Add k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/883880 (https://phabricator.wikimedia.org/T327756) (owner: 10Clément Goubert) [13:16:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:43] (03PS1) 10Jbond: confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 [13:18:45] jouncebot: nowandnext [13:18:45] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [13:18:45] In 0 hour(s) and 41 minute(s): 
Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400) [13:18:45] In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400) [13:19:11] (03CR) 10Ladsgroup: [C: 03+2] Change time zone setting on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15) [13:19:55] (03Merged) 10jenkins-bot: Change time zone setting on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883723 (https://phabricator.wikimedia.org/T327986) (owner: 10Superpes15) [13:20:05] (03PS5) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) [13:20:08] PROBLEM - Check systemd state on ml-staging2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:16] (03CR) 10Jbond: [C: 03+2] confd: Make confd_prometheus_metrics.py 3.4-compatible [puppet] - 10https://gerrit.wikimedia.org/r/883926 (owner: 10Jcrespo) [13:20:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]] [13:20:55] T327986: Change time zone setting in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327986 [13:21:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [13:22:39] !log ladsgroup@deploy1002 superpes and ladsgroup: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:22:39] (03PS6) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 
(https://phabricator.wikimedia.org/T313825) [13:25:47] !log restarting turnilo for nodejs security update [13:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) Adding Jaime for the backup hosts. [13:32:15] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:32:53] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:883723|Change time zone setting on gorwiktionary (T327986)]] (duration: 12m 02s) [13:32:57] T327986: Change time zone setting in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327986 [13:33:44] (03CR) 10Jforrester: "You'll need to patch scap (or the puppet controling code) to generate the PHP i18n first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883707 (https://phabricator.wikimedia.org/T99740) (owner: 10Ladsgroup) [13:33:47] (03PS1) 10Stevemunene: Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) [13:33:52] (03PS2) 10Jbond: confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 [13:35:20] (03CR) 10Ayounsi: [C: 03+1] confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 (owner: 10Jbond) [13:35:59] (03CR) 10Jbond: [C: 03+2] confd: allow cloud infrastructure to talk to confd [homer/public] - 10https://gerrit.wikimedia.org/r/883935 (owner: 10Jbond) [13:36:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 (owner: 10Clément Goubert) [13:36:28] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [13:37:18] !log 
cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove DNS records for removed esams eqiad GRE tunnel link IPs. - cmooney@cumin1001" [13:37:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [13:38:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove DNS records for removed esams eqiad GRE tunnel link IPs. - cmooney@cumin1001" [13:38:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:38] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/883519 (https://phabricator.wikimedia.org/T328022) [13:38:58] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [13:39:30] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:39:44] (03CR) 10Slyngshede: D:apereo_cas::service: Map memberOf to OIDC (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [13:39:51] (03PS1) 10Ayounsi: BGPalerter: switch to email noc@ [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600) [13:40:12] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2113 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/883520 (https://phabricator.wikimedia.org/T328023) [13:40:34] 10SRE, 10DBA, 10Data-Persistence, 
10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [13:41:28] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:42:36] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/883521 (https://phabricator.wikimedia.org/T328024) [13:42:38] RECOVERY - Check whether ferm is active by checking the default input chain on ml-staging2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:43:13] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [13:43:54] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:43:58] (03PS1) 10Cathal Mooney: Remove include for reverse zone for 2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) [13:44:44] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [13:44:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:44:56] (03CR) 10Slyngshede: [C: 03+2] D:apereo_cas::service: Map memberOf to OIDC [puppet] - 10https://gerrit.wikimedia.org/r/883249 (owner: 10Slyngshede) [13:45:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:45:46] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[13:46:42] PROBLEM - Check systemd state on kubernetes2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:18] RECOVERY - Check systemd state on ml-staging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:56] (03PS1) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 [13:51:15] (03PS2) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields on group 0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) [13:51:18] (03CR) 10CI reject: [V: 04-1] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [13:51:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T328023 [13:52:03] T328023: Switchover s5 master (db2123 -> db2113) - https://phabricator.wikimedia.org/T328023 [13:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2113 with weight 0 T328023', diff saved to https://phabricator.wikimedia.org/P43408 and previous config saved to /var/cache/conftool/dbconfig/20230126-135215-root.json [13:52:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T328023 [13:52:40] (03PS2) 10Jbond: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [13:52:52] (03PS3) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 [13:53:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2113 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/883520 (https://phabricator.wikimedia.org/T328023) (owner: 10Gerrit maintenance bot) [13:53:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [13:53:49] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39270/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [13:54:52] (03CR) 10Jgiannelos: "This has already been tested in our last OSM import. I think that its better to merge as it is and file a ticket for further improvements." [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [13:55:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove vslow from db2113, future s5 codfw master T328023', diff saved to https://phabricator.wikimedia.org/P43409 and previous config saved to /var/cache/conftool/dbconfig/20230126-135509-marostegui.json [13:55:55] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991) [13:56:10] (03Abandoned) 10Jgiannelos: maps: Disable tilerator on codfw replicas [puppet] - 10https://gerrit.wikimedia.org/r/811737 (owner: 10Jgiannelos) [13:57:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39271/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [13:58:01] (03CR) 10Effie Mouzeli: [C: 03+2] maps: Add missing index script on import [puppet] - 10https://gerrit.wikimedia.org/r/883197 (owner: 10Jgiannelos) [13:58:24] (03PS1) 10Jbond: cr/interfaces: check for ips key before accessing it [homer/public] - 10https://gerrit.wikimedia.org/r/883947 [13:58:27] (03CR) 10Ayounsi: "one comment then lgtm!" 
[dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400). nyaa~ [14:00:04] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] \o [14:00:14] o/ [14:00:22] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:00:24] (03PS4) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 [14:00:44] (03CR) 10CI reject: [V: 04-1] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [14:00:48] !log restarting etherpad-lite to pick up nodejs security update [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:52] I can deploy [14:01:12] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:01:13] 10SRE-tools, 10Infrastructure-Foundations: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300 (10ayounsi) [14:01:15] Won't be able to test myself as I do not have CheckUser permissions on group 0 or 1. Any steward or WMF employee with staff rights should be able to load Special:CheckUserLog to test. 
[14:01:22] (03Abandoned) 10Jbond: cr/interfaces: check for ips key before accessing it [homer/public] - 10https://gerrit.wikimedia.org/r/883947 (owner: 10Jbond) [14:01:30] (03CR) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [14:01:45] Dreamy_Jazz: i have like ~15 minutes now [14:01:54] (03PS5) 10Slyngshede: C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 [14:01:57] Okay. Thanks. [14:02:00] urbanecm: want to do the deployment? [14:02:06] (or I can deploy and let you verify ^^) [14:02:12] Lucas_WMDE: can you do it please? :) [14:02:16] sure! [14:02:19] ty [14:02:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39272/console" [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [14:03:05] Dreamy_Jazz: I assume the data was backfilled via maintenance script or something like that? [14:03:31] Yes. 
See https://phabricator.wikimedia.org/T327290 [14:03:51] got it, thanks [14:03:57] (I got lost in the many updates on https://phabricator.wikimedia.org/T233004 ^^) [14:04:10] Np [14:04:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:05:33] (03Merged) 10jenkins-bot: Enable write new for CheckUserLog comment fields on group 0 and 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883122 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [14:06:01] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]] [14:06:05] !log Starting s5 codfw failover from db2123 to db2113 - T328023 [14:06:05] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:09] T328023: Switchover s5 master (db2123 -> db2113) - https://phabricator.wikimedia.org/T328023 [14:06:24] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:apereo_cas fix missing hash [puppet] - 10https://gerrit.wikimedia.org/r/883943 (owner: 10Slyngshede) [14:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2113 to s5 primary T328023', diff saved to https://phabricator.wikimedia.org/P43410 and previous config saved to /var/cache/conftool/dbconfig/20230126-140630-root.json [14:06:36] urbanecm: Test instructions are: [14:06:36] * Load Special:CheckUserLog [14:06:36] * Find (or make) an entry with a wikilink in its reason [14:06:36] * Copy the reason as shown in the CheckUserLog - It should be the reason without the "[[" and "]]" markup for the wikilink [14:06:36] * Paste this into the 'reason' search field [14:06:36] * Search the log [14:06:36] * The test passes if you see the
entry with the wikilink shown [14:06:37] This works because the method to search changes once read new is set so that the wikilink structure is ignored when searching [14:06:43] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) [14:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2123 T328023', diff saved to https://phabricator.wikimedia.org/P43411 and previous config saved to /var/cache/conftool/dbconfig/20230126-140716-root.json [14:07:19] (03CR) 10CI reject: [V: 04-1] Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:07:35] Dreamy_Jazz: which wiki please? :-) [14:07:44] Any group 0 or 1 wiki [14:07:46] okay [14:07:51] !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:07:54] test wikis have already had the change made [14:08:05] urbanecm: should be on mwdebug now [14:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P43412 and previous config saved to /var/cache/conftool/dbconfig/20230126-140804-root.json [14:08:10] okay, so non-testwiki group0/1 [14:08:16] metawiki should work? 
[14:08:18] Sure [14:08:50] As far as I am aware yes, as it's shown in group 1 on toolforge versions list [14:08:54] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [14:08:56] ( https://versions.toolforge.org/ ) [14:09:10] (03PS2) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [14:09:15] Lucas_WMDE: Dreamy_Jazz: change works correctly :) [14:09:20] I think even test2wiki might work, since READ_NEW is set on testwiki (a wiki) rather than testwikis (a dblist) afaict [14:09:22] okay, yay [14:09:22] (03PS3) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [14:09:23] Great thanks for the merge! [14:09:29] syncing :) [14:09:30] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/883949 (https://phabricator.wikimedia.org/T135991) [14:09:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:09:32] And testing [14:09:33] thanks for testing! 
[14:09:49] no problem [14:10:12] hmm, mw1448 Special:Version returned 500 according to scap [14:10:15] let’s hope that was just flaky [14:10:31] (it’s continuing so far, iirc one failed canary isn’t enough to stop the sync) [14:10:38] RECOVERY - Check systemd state on kubernetes2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:54] (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883888 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [14:10:58] PROBLEM - Check systemd state on kubernetes2014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:11:14] !log disable puppet fleet wide to roll out etcd ferm change gerrit:883888 [14:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:11:52] Yeah. Special:Version shouldn't have been affected by that config change, so probably just being flaky [14:12:08] yeah, really doesn’t seem like it should be related [14:12:54] ok I can see it in logstash, it was “shellbox server returned status code 503” [14:13:01] (reporting the lilypond version) [14:13:20] out of caution, i tested special:Version at mw1448. it works fine. [14:13:28] so, a one-time error [14:13:32] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Jclark-ctr) Drive 2 reinserted. [14:13:32] Thanks! [14:13:34] thanks [14:13:42] how do you test a specific non-mwdebug server?
[14:13:57] Lucas_WMDE: ssh there, `curl -i --connect-to ::$HOSTNAME 'https://test.wikipedia.org/wiki/Special:Version'` [14:14:10] ok, thanks! [14:14:17] if you have the proxy env variables set in your bashrc like i do, you need to unset those first [14:14:51] (03PS2) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) [14:14:53] (03PS1) 10Giuseppe Lavagetto: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 [14:15:12] PROBLEM - Check systemd state on ms-be2062 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:17] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:883122|Enable write new for CheckUserLog comment fields on group 0 and 1 (T233004)]] (duration: 09m 16s) [14:15:21] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:16:21] jouncebot: nowandnext [14:16:21] For the next 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400) [14:16:22] For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1400) [14:16:22] In 2 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700) [14:16:31] !log UTC afternoon backport+config window done [14:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] I was about to ask :D [14:16:41] (03CR) 10CI reject: [V: 04-1] sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) (owner: 10Giuseppe Lavagetto) [14:16:41] :) [14:16:43] (03CR) 10CI reject: [V: 
04-1] sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 (owner: 10Giuseppe Lavagetto) [14:16:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:07] (unimportant thing that *I* was about to ask earlier: the s5 master is on codfw?) [14:17:33] *primary [14:17:41] nope, codfw has its own master but it's a replica of the eqiad one [14:17:48] ok, thx ^^ [14:17:50] https://orchestrator.wikimedia.org/web/cluster/alias/s5 [14:18:36] it still needs switchovers for maint because if we stop replication on it, it'll break replication to the whole codfw :D [14:19:33] makes sense [14:20:22] (03PS1) 10Dreamy Jazz: Enable write new for CheckUserLog comment fields everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004) [14:21:06] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) I see it rebuilding, I will ping you once the alert recovers so we can pull it out again: ` perccli64 /c0 show rebuildrate CLI Version =...
[14:22:08] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2062 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:39] (03PS1) 10Jbond: Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883890 [14:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43413 and previous config saved to /var/cache/conftool/dbconfig/20230126-142309-root.json [14:23:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [14:23:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883890 (owner: 10Jbond) [14:24:29] (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883891 (https://phabricator.wikimedia.org/T313825) [14:24:46] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:24:58] (03PS3) 10Giuseppe Lavagetto: sre-mediawiki: add mean latency alerts [alerts] - 10https://gerrit.wikimedia.org/r/883502 (https://phabricator.wikimedia.org/T326544) [14:25:00] (03PS2) 10Giuseppe Lavagetto: sre-mediawiki: port the other prometheus-based alerts [alerts] - 10https://gerrit.wikimedia.org/r/883950 [14:27:11] !log installing containerd security updates [14:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] (03PS2) 10Cathal Mooney: Remove include for reverse zone for 2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 
(https://phabricator.wikimedia.org/T327266) [14:28:57] (03CR) 10Cathal Mooney: Remove include for reverse zone for 2620:0:861:fe03::/64 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [14:29:02] (03CR) 10Cathal Mooney: [C: 03+2] Remove include for reverse zone for 2620:0:861:fe03::/64 [dns] - 10https://gerrit.wikimedia.org/r/883942 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [14:30:36] (03CR) 10Btullis: "You'll need to increment the `version` value in charts/datahub/Chart.yaml and charts/datahub/charts/datahub-frontend/Chart.yaml as well, o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [14:31:01] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Rotating wikiadmin password (T326802) (duration: 07m 04s) [14:31:05] T326802: Rotate wikiuser and wikiadmin passwords - https://phabricator.wikimedia.org/T326802 [14:31:23] (03PS1) 10Ladsgroup: dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) [14:31:33] (03CR) 10CI reject: [V: 04-1] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [14:31:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:31:46] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:32:50] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:15] (03PS1) 10Cathal Mooney: Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) [14:34:03] (ProbeDown) firing: (3) Service 
centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:35:43] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) I pasted the wrong command above: ` root@db1206:~# perccli64 /c0/e252/s2 show rebuild CLI Version = 007.1910.0000.0000 Oct 08, 2021 Opera... [14:36:25] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001" [14:37:14] (03PS1) 10Jbond: cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960 [14:37:24] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:37:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. 
- cmooney@cumin1001" [14:37:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:38:12] RECOVERY - Check systemd state on kubernetes2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43414 and previous config saved to /var/cache/conftool/dbconfig/20230126-143814-root.json [14:39:21] (03CR) 10Ayounsi: [C: 03+1] "My bad!" [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond) [14:39:36] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. - cmooney@cumin1001" [14:39:40] (03CR) 10Jbond: [C: 03+2] cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond) [14:40:19] (03Merged) 10jenkins-bot: cr-labs: move confd-client rule to cr-labs [homer/public] - 10https://gerrit.wikimedia.org/r/883960 (owner: 10Jbond) [14:40:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:eqiad and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:40:39] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-run DNS cookbook after updating zone files - remove esams eqiad GRE tunnel link IPs. 
- cmooney@cumin1001" [14:40:39] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:40:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:42:24] RECOVERY - Check systemd state on ms-be2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:42] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [14:44:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [14:45:10] RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [14:45:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:46:18] (03CR) 10Bking: [C: 03+2] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T289836) (owner: 10Bking) [14:46:46] (03PS3) 10Bking: dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409) [14:46:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:03] (03CR) 10Bking: dse-k8s: add rdf-streaming-updater namespace (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409) (owner: 10Bking) [14:47:05] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:47:07] (03CR) 10Bking: [V: 03+2 C: 03+2] dse-k8s: add rdf-streaming-updater namespace [puppet] - 10https://gerrit.wikimedia.org/r/882748 (https://phabricator.wikimedia.org/T326409) (owner: 10Bking) [14:47:15] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) RAID is now back in optimal status, waiting for Icinga to recover before pulling the disk out again ` VD LIST : ======= ----------------... [14:48:40] (03PS1) 10Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) [14:49:13] (03PS2) 10Ladsgroup: mariadb: Centralize and change wikiadmin user grants [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) [14:49:29] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/883946 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:49:42] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) [14:49:52] (03CR) 10Ayounsi: [C: 03+2] BGPalerter: switch to email noc@ [puppet] - 10https://gerrit.wikimedia.org/r/883941 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [14:51:07] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring 
on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) ` root@db1206:~# sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication: 0 OK | controller: 0 OK | physical_disk: 0 OK... [14:52:06] (ConfdResourceFailed) resolved: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:52:44] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2062 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43415 and previous config saved to /var/cache/conftool/dbconfig/20230126-145319-root.json [14:54:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:06] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) @Jclark-ctr whenever you can, pull the disk out again. 
Thank you [14:55:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [14:55:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:55:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:56:52] (03CR) 10Ladsgroup: "PCC looks fine: https://puppet-compiler.wmflabs.org/output/883961/39273/" [puppet] - 10https://gerrit.wikimedia.org/r/883961 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [14:57:26] (03PS2) 10Stevemunene: Enable oidc env vars for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) [14:59:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:27] (03CR) 10Ayounsi: [C: 03+1] Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [15:00:36] (03CR) 10Ayounsi: [C: 03+2] Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [15:01:16] (03Merged) 10jenkins-bot: Remove OSPF interface configuration for old esams<->eqiad GRE tunnel [homer/public] - 10https://gerrit.wikimedia.org/r/883959 (https://phabricator.wikimedia.org/T327266) (owner: 10Cathal Mooney) [15:02:16] (03PS1) 10Elukey: admin_ng: add SANs to the inference endpoints for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 
(https://phabricator.wikimedia.org/T327302) [15:02:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [15:02:33] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [15:02:36] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye [15:02:41] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... [15:04:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [15:04:07] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [15:04:12] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye [15:04:14] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [15:04:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[15:04:32] (03PS2) 10Elukey: admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) [15:04:41] (03CR) 10Ladsgroup: "recheck" [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [15:05:59] (03PS1) 10Hashar: gerrit: listen on all port, DROP requests to host [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) [15:08:05] (03CR) 10Hashar: "For the record, that did not work cause `network-online.target` is reached immediately after the interface script have completed and they " [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43417 and previous config saved to /var/cache/conftool/dbconfig/20230126-150824-root.json [15:08:39] (03CR) 10Marostegui: [C: 03+1] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [15:09:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564 [15:09:10] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [15:09:14] (03CR) 10Ladsgroup: [C: 03+2] dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [15:09:16] (03CR) 10CI reject: [V: 04-1] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [15:09:21] !log sukhe@cumin2002 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564 [15:09:23] ah lovely [15:09:23] !log stop pybal on lvs2007: T326564 [15:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:38] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39274/console" [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:09:47] (03Merged) 10jenkins-bot: dbtools: Update call to wikiadmin [software] - 10https://gerrit.wikimedia.org/r/883957 (https://phabricator.wikimedia.org/T326802) (owner: 10Ladsgroup) [15:10:05] (03CR) 10Jgiannelos: [C: 03+1] maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [15:10:11] (03CR) 10Hashar: "That is a continuation of https://gerrit.wikimedia.org/r/c/operations/puppet/+/875315/ which did not work ;)" [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:11:12] 10SRE-swift-storage, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10MatthewVernon) [15:11:55] (03PS3) 10Elukey: admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) [15:12:35] !log disabl-puppet deplot requestctl ferm chage gerrit:883935 [15:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:46] PROBLEM - BGP status on cr2-codfw is 
CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:04] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:32] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:14:00] ^ BGP alerts on cr*-codfw expected as lvs2007 is depooled [15:15:34] (03PS4) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [15:15:38] (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883891 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [15:15:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:16:01] (03CR) 10JMeybohm: [C: 03+1] pki: Add public certs and config for mlserve clusters' intermediates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:16:31] (03PS5) 10Effie Mouzeli: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [15:16:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:19] (03PS1) 10Jbond: Revert "firewall: Add 
requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883892 [15:18:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "firewall: Add requestctl support to ferm globaly" [puppet] - 10https://gerrit.wikimedia.org/r/883892 (owner: 10Jbond) [15:18:45] (03PS1) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825) [15:19:03] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/883965 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:22:10] (03CR) 10Elukey: [V: 03+1 C: 03+2] pki: Add public certs and config for mlserve clusters' intermediates [puppet] - 10https://gerrit.wikimedia.org/r/883630 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:23:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43418 and previous config saved to /var/cache/conftool/dbconfig/20230126-152329-root.json [15:25:08] jouncebot: nowandnext [15:25:08] No deployments scheduled for the next 1 hour(s) and 34 minute(s) [15:25:08] In 1 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700) [15:25:43] (03CR) 10Effie Mouzeli: [C: 03+2] maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [15:26:06] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] httpd-fcgi: Bump ecs version to 1.11.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881876 (owner: 10Clément Goubert) [15:27:47] !log poweroff lvs2007: T326564 
[15:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:52] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [15:29:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [15:29:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [15:29:34] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye [15:29:39] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[15:29:50] hmm ttyS1-115200/cp2027.conf exists, removing doesn't help too [15:29:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:30:17] !log install2003: rm /etc/dhcp/automation/ttyS1-115200/cp2027.conf [15:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [15:30:43] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye [15:30:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [15:30:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:30:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... 
[15:31:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:35:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:39:54] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1009.eqiad.wmnet [15:40:47] (03PS2) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825) [15:40:50] (03PS1) 10Jbond: confd::file: allow to specify fully qualified prefix [puppet] - 10https://gerrit.wikimedia.org/r/883973 (https://phabricator.wikimedia.org/T313825) [15:41:00] PROBLEM - Host flowspec1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:50] (03PS3) 10Jbond: firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825) [15:44:52] (03PS1) 10Jbond: P:firewall: use fully qualified confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/883974 [15:46:54] !log Restart Jenkins for upgrade [15:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39276/console" [puppet] - 10https://gerrit.wikimedia.org/r/883974 (owner: 10Jbond) [15:48:21] (03CR) 10Jbond: [C: 03+2] confd::file: allow to specify fully qualified prefix [puppet] - 10https://gerrit.wikimedia.org/r/883973 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [15:48:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:firewall: use fully qualified confd prefix [puppet] - 10https://gerrit.wikimedia.org/r/883974 (owner: 10Jbond) [15:49:20] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts 
cloudgw2001-dev.codfw.wmnet [15:49:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s8 T328024 [15:49:45] T328024: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T328024 [15:50:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T328024', diff saved to https://phabricator.wikimedia.org/P43419 and previous config saved to /var/cache/conftool/dbconfig/20230126-155000-root.json [15:50:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s8 T328024 [15:50:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/883521 (https://phabricator.wikimedia.org/T328024) (owner: 10Gerrit maintenance bot) [15:51:00] (03CR) 10Jbond: [C: 03+2] firewall: Add requestctl support to ferm globaly [puppet] - 10https://gerrit.wikimedia.org/r/883893 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [15:51:25] !log Restarting CI Jenkins for upgrade [15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [15:55:42] !log enable-puppet post deploy requestctl ferm chage gerrit:883935 [15:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:48] (03PS2) 10Muehlenhoff: slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) [16:02:16] (03CR) 10CI reject: [V: 04-1] slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [16:04:47] (03PS1) 10Clément Goubert: httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 
(https://phabricator.wikimedia.org/T326794) [16:05:34] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1009.eqiad.wmnet [16:05:52] (03PS3) 10Muehlenhoff: slapd: Add support to configure MDB storage backend (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) [16:06:02] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [16:06:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [16:06:29] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] httpd-fcgi: Fix system logs test [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883978 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [16:08:03] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [16:08:03] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:08:04] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudgw2001-dev.codfw.wmnet [16:08:05] (03PS2) 10Jelto: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) [16:08:07] (03PS1) 10Jbond: confd: ensure python package [puppet] - 10https://gerrit.wikimedia.org/r/883979 [16:09:06] !log installing distro-info-data updates from Bullseye point release [16:09:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [16:10:23] (03CR) 10Jbond: [C: 03+2] confd: ensure python package [puppet] - 10https://gerrit.wikimedia.org/r/883979 (owner: 10Jbond) [16:10:37] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:10:38] !log Starting s8 codfw failover from db2161 to db2165 - T328024 [16:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:42] T328024: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T328024 [16:10:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 primary T328024', diff saved to https://phabricator.wikimedia.org/P43420 and previous config saved to /var/cache/conftool/dbconfig/20230126-161058-marostegui.json [16:11:06] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [16:11:14] (03PS5) 10Herron: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [16:11:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2161 T328024', diff saved to https://phabricator.wikimedia.org/P43421 and previous config saved to /var/cache/conftool/dbconfig/20230126-161137-root.json [16:12:37] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: use all=True parameter to disable pagination [cookbooks] - 10https://gerrit.wikimedia.org/r/883593 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [16:12:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 5%: After 
switchover', diff saved to https://phabricator.wikimedia.org/P43422 and previous config saved to /var/cache/conftool/dbconfig/20230126-161242-root.json [16:13:18] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [16:13:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717 [16:13:37] T323717: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 [16:13:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717 [16:14:13] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) [16:14:45] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet [16:17:00] (03Merged) 10jenkins-bot: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/881877 (owner: 10Clément Goubert) [16:17:58] (03PS1) 10Muehlenhoff: Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983 [16:18:01] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS bullseye [16:18:07] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6007.drmrs.wmnet with OS bullseye [16:19:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:08] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [16:19:22] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:19:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1080.eqiad.wmnet [16:20:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:20:19] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:21:01] (03CR) 10Btullis: [C: 03+1] "LGTM. Let's give it a go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/883939 (https://phabricator.wikimedia.org/T327884) (owner: 10Stevemunene) [16:21:19] (03CR) 10Ayounsi: [C: 03+1] Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983 (owner: 10Muehlenhoff) [16:21:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:21:32] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet [16:21:32] (03CR) 10Muehlenhoff: [C: 03+2] Fix up installserver Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/883983 (owner: 10Muehlenhoff) [16:23:06] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet [16:23:34] PROBLEM - Check systemd state on mw1411 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:44] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb1001-dev [16:24:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:42] !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb1001-dev [16:25:42] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:26:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1080.eqiad.wmnet [16:27:16] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:36] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [16:27:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [16:27:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P43423 and previous config saved to /var/cache/conftool/dbconfig/20230126-162747-root.json [16:27:49] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2027.codfw.wmnet with OS bullseye [16:27:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1084.eqiad.wmnet [16:27:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... [16:28:02] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2001-dev: rename server to cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884027 (https://phabricator.wikimedia.org/T327908) [16:28:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [16:28:29] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [16:28:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2001-dev: rename server to cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884027 (https://phabricator.wikimedia.org/T327908) (owner: 10Arturo Borrero Gonzalez) [16:30:04] (03PS4) 10Muehlenhoff: slapd: Add support to configure MDB storage backend [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) [16:31:09] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet [16:32:35] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:33:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1084.eqiad.wmnet [16:34:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:30] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage [16:38:32] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host cp2027.codfw.wmnet with OS bullseye [16:38:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Puppet and PuppetDB if p... [16:39:08] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10colewhite) [16:40:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:21] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2027'] [16:41:49] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6007.drmrs.wmnet with reason: host reimage [16:42:36] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:42:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P43424 and previous config saved to /var/cache/conftool/dbconfig/20230126-164252-root.json [16:45:08] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jcrespo) [16:46:48] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:48:05] !log pooling lvs2009 after T326564 [16:48:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:09] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [16:48:14] !log correcting earlier log: pooling lvs2007 after T326564 [16:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:22] 10SRE, 10Incident Tooling: Pagination parameters required for Statuspage's authenticated REST API - https://phabricator.wikimedia.org/T328044 (10lmata) [16:49:42] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=True) upgrade firmware for hosts ['cp2027'] [16:50:52] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 169, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2027.codfw.wmnet with OS bullseye [16:51:44] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye [16:52:46] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10herron) [16:53:42] !log Running scap sync-file -D php_fpm_restart_script:/bin/true tox.ini "Rebuilding mediawiki-webserver image" - T326794 [16:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:46] T326794: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 [16:54:14] (03PS1) 10Elukey: role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) [16:54:19] 
(03PS17) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:54:40] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:56:05] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1012.eqiad.wmnet [16:57:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39278/console" [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:57:31] (03PS1) 10Jbond: wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 [16:57:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P43425 and previous config saved to /var/cache/conftool/dbconfig/20230126-165757-root.json [16:58:31] (03PS1) 10Jelto: gitlab_runner: add separate ensure for docker::network [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) [16:58:42] jbond: lol that looks great [16:59:04] (03PS18) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:59:16] (03PS1) 10Elukey: admin_ng: update ml-serve-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) [16:59:24] (03CR) 10CI reject: [V: 04-1] wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond) [16:59:26] (03CR) 10CI reject: 
[V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:59:37] (03CR) 10Ssingh: [C: 03+1] "Thanks, looks amazing!" [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond) [16:59:52] !log cgoubert@deploy1002 Synchronized tox.ini: Rebuilding mediawiki-webserver (duration: 07m 19s) [16:59:59] (03CR) 10Ahmon Dancy: [C: 03+1] "Looks like a reasonable solution" [puppet] - 10https://gerrit.wikimedia.org/r/884037 (https://phabricator.wikimedia.org/T327949) (owner: 10Jelto) [17:00:04] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:54] jouncebot: what about when your hammer is puppetlang though [17:02:08] (03CR) 10Elukey: [C: 03+2] admin_ng: add SANs to the inference endpoints for mlserve staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/883964 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [17:02:10] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:50] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1012.eqiad.wmnet [17:03:05] (03PS19) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:03:26] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) 
(owner: 10Stevemunene) [17:03:28] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6007.drmrs.wmnet with OS bullseye [17:03:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6007.drmrs.wmnet with OS bullseye completed: - cp6007 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [17:04:32] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6007.drmrs.wmnet [17:05:09] (03PS20) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:05:10] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:05:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:05:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[17:05:19] (03PS1) 10Giuseppe Lavagetto: Add missing paging alert for high backend errors in trafficserver [alerts] - 10https://gerrit.wikimedia.org/r/884039 [17:05:26] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet [17:05:51] (03PS2) 10Jbond: wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 [17:06:12] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS bullseye [17:06:19] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6016.drmrs.wmnet with OS bullseye [17:06:35] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1013.eqiad.wmnet [17:07:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [17:07:42] (03CR) 10Jbond: [C: 03+2] wikidough: add some colour to HAL [puppet] - 10https://gerrit.wikimedia.org/r/884036 (owner: 10Jbond) [17:10:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2027.codfw.wmnet with reason: host reimage [17:12:31] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1013.eqiad.wmnet [17:13:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P43426 and previous config saved to /var/cache/conftool/dbconfig/20230126-171302-root.json [17:14:34] (03PS21) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:14:52] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 
(10Eevans) [17:16:13] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet [17:16:45] (03PS22) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:17:44] 10SRE, 10DBA, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Eevans) [17:18:05] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:19:04] !log dancy@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [17:19:10] (03PS23) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:19:15] !log dancy@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 11s) [17:21:31] (03PS24) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:22:36] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1014.eqiad.wmnet [17:22:43] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39285/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:23:21] 10SRE, 10DBA, 10Data-Persistence, 
10Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Eevans) [17:24:25] (03PS1) 10Jbond: network: drop abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/884040 [17:24:40] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [17:24:41] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet [17:26:43] (03CR) 10Jbond: network: drop abuse_networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/884040 (owner: 10Jbond) [17:27:42] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [17:28:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P43427 and previous config saved to /var/cache/conftool/dbconfig/20230126-172806-root.json [17:28:51] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:15] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:37] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1015.eqiad.wmnet [17:33:57] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:45] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:39:53] (03PS25) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 
10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:41:37] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix syslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884045 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [17:44:51] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul) This is the return address Seagrove C/O Celestica Killam Industrial Park 13701 N Lamar Dr. Laredo, TX 78045 USA Project: CLS HUB Laredo, TX Attn: Juniper Returns... [17:46:36] (03Merged) 10jenkins-bot: mediawiki: Fix syslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884045 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [17:47:55] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:49:28] (03PS26) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:49:45] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS bullseye [17:49:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6016.drmrs.wmnet with OS bullseye completed: - cp6016 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[17:50:37] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39288/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:52:52] (03CR) 10Ottomata: [V: 03+1] "Okay I finally think I got it!" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:54:35] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:55:00] (03PS27) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:55:22] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet [17:56:12] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39289/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:58:02] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) [17:58:57] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:59:12] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS bullseye [17:59:18] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp6008.drmrs.wmnet with OS bullseye [18:00:04] bd808: #bothumor My software never has bugs. 
It just develops random features. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800). [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800) [18:00:23] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix syslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884046 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [18:01:58] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:02:13] I don't have anything for the Technical Engagement window this week. [18:06:02] (03Merged) 10jenkins-bot: mediawiki: Fix syslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884046 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [18:06:58] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:09:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:10:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:10:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:10:47] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:11:44] !log cgoubert@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:11:45] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:12:40] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:12:41] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:13:37] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:13:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:14:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:14:20] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:14:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:15:03] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [18:15:03] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [18:15:22] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [18:15:23] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [18:15:28] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [18:15:28] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [18:15:48] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [18:15:48] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [18:15:50] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:16:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:16:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:16:48] 10SRE, 10DBA, 10Data-Engineering, 
10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [18:17:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:17:25] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [18:17:36] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [18:20:18] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [18:27:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:34:36] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10BCornwall) [18:36:10] jouncebot: nowandnext [18:36:11] For the next 0 hour(s) and 23 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800) [18:36:11] For the next 0 hour(s) and 23 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1800) [18:36:11] In 0 hour(s) and 23 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1900) [18:40:59] (03PS2) 10Ebernhardson: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) [18:46:40] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS bullseye [18:46:47] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp6008.drmrs.wmnet with OS bullseye completed: - cp6008 (**WARN**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [18:57:52] (03PS3) 10Jdlrobson: Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) [18:57:56] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet [18:59:11] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [18:59:53] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [18:59:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye [19:00:04] brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T1900). [19:00:17] o/ [19:00:45] !log 1.40.0-wmf.20 train (T325583): no current blockers, rolling to all wikis. 
[19:00:48] (03PS1) 10BBlack: esitest: compat with haproxy >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/884052 (https://phabricator.wikimedia.org/T321775) [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:49] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:01:28] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583) [19:01:30] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:02:08] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884054 (https://phabricator.wikimedia.org/T325583) (owner: 10TrainBranchBot) [19:04:23] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:05:40] brennen: Has the train already deployed to k8s or not? [19:06:13] claime: just started i believe [19:06:20] ok I'll wait for it to be done then [19:06:40] I have a config fix for mw-on-k8s but I don't want to step on scap's toes :p [19:06:45] (03PS1) 10Jdlrobson: Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) [19:06:50] claime: cool, thx [19:06:53] https://test2.wikipedia.org/wiki/Special:Version shows wmf.20 and https://versions.toolforge.org/ shows wmf.20 for all wikis [19:07:22] 19:06:53 Finished sync-prod-k8s (duration: 00m 54s) [19:07:30] Fantastic, thanks [19:07:50] note overall train deploy is still underway. 
[19:08:07] (03PS1) 10Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) [19:09:14] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39290/console" [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:09:41] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.20 refs T325583 [19:09:41] Noted, I'm not touching anything but the mw-on-k8s deployment. Once scap is done with it, what I'm doing shouldn't interfere with the train [19:09:45] T325583: 1.40.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T325583 [19:09:54] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix php-slowlog rsyslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884051 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [19:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:10:13] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: esitest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:23] yeah, just always a good window of time to keep in mind i might be running another deploy if a rollback is needed for anything. 
[19:10:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp2027.codfw.wmnet with reason: reimaging [19:10:59] brennen: ack, I won't be long [19:11:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2027.codfw.wmnet with reason: reimaging [19:11:25] if jenkins would get in gear :P [19:16:04] (03Merged) 10jenkins-bot: mediawiki: Fix php-slowlog rsyslog json [deployment-charts] - 10https://gerrit.wikimedia.org/r/884051 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [19:20:24] (03PS2) 10Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) [19:21:05] brennen: I'm all done. [19:21:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39291/console" [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:29:38] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:35:56] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:39:50] (03PS3) 10Ssingh: esitest: add conditional for bullseye in esitest.cfg [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) [19:40:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39292/console" [puppet] - 10https://gerrit.wikimedia.org/r/884056 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:44:37] (03PS1) 10Dwisehaupt: Swap fundraising db origin to frdb1005 [dns] - 10https://gerrit.wikimedia.org/r/884066 (https://phabricator.wikimedia.org/T315601) [19:46:18] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:49:15] (03CR) 10Gehel: [C: 03+1] "LGTM. Jenkins failures seems unrelated and should not be fixed as part of this CR." [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [19:49:54] (03PS4) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) [19:50:29] (03CR) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [19:53:20] (03PS4) 10Ssingh: esitest: remove deprecated nbproc config option [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) [19:54:23] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39293/console" [puppet] - 10https://gerrit.wikimedia.org/r/884056 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:56:31] !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4038.ulsfo.wmnet with OS bullseye [19:56:40] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4038 (**FAIL**) - Downtimed on Ic... 
[19:56:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:18] (03PS6) 10Gehel: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [19:57:21] (03PS1) 10Gehel: idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 [19:58:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:58:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:23] (03PS2) 10Gehel: idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 [19:59:25] (03PS7) 10Gehel: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [19:59:42] (03CR) 10CI reject: [V: 04-1] idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 (owner: 10Gehel) [19:59:59] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:02:03] (03CR) 10Bking: [C: 03+1] idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 (owner: 10Gehel) [20:02:05] (03CR) 10Ryan Kemper: [C: 03+1] idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 (owner: 10Gehel) 
[20:02:11] (03CR) 10Gehel: [C: 03+2] idp: comment out unused imports in models.py [puppet] - 10https://gerrit.wikimedia.org/r/884070 (owner: 10Gehel) [20:02:13] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:04:30] (03PS8) 10Gehel: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:05:43] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2027.codfw.wmnet with OS bullseye [20:06:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2027.codfw.wmnet with OS bullseye executed with errors: - cp2027 (**FAIL**) - Removed from Pu... 
[20:06:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on cp2027.codfw.wmnet with reason: reimaging [20:06:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on cp2027.codfw.wmnet with reason: reimaging [20:06:24] (03PS1) 10Ryan Kemper: django_oidc: fix formatting [puppet] - 10https://gerrit.wikimedia.org/r/884077 [20:07:05] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/884077 (owner: 10Ryan Kemper) [20:07:14] (03CR) 10CI reject: [V: 04-1] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:08:48] (03CR) 10Ryan Kemper: [C: 03+2] django_oidc: fix formatting [puppet] - 10https://gerrit.wikimedia.org/r/884077 (owner: 10Ryan Kemper) [20:09:40] (03PS9) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) [20:12:11] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/883610 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:13:02] !log `ryankemper@thanos-fe1001:~$ sudo run-puppet-agent` following merge of wdqs recording rule patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883610 [20:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:04] (03PS5) 10Ottomata: flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) [20:15:47] (03CR) 10Ottomata: [C: 03+2] flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [20:15:49] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink-1.16.0-wmf4 - Install flink via `pip install apache-flink`. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883653 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [20:18:22] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:18:35] (03CR) 10Jgreen: [C: 03+2] Swap fundraising db origin to frdb1005 [dns] - 10https://gerrit.wikimedia.org/r/884066 (https://phabricator.wikimedia.org/T315601) (owner: 10Dwisehaupt) [20:25:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) Update: This happened again when imaging cp4038. I was unable to ping the interfaces but was able to connect to the mgmt interface/iDRAC.... [20:26:57] (03PS1) 10Jgreen: Switch fundraising database queue icinga reporting from frdb1004 to frdb1005. [puppet] - 10https://gerrit.wikimedia.org/r/884081 [20:29:02] (03CR) 10Jgreen: [C: 03+2] Switch fundraising database queue icinga reporting from frdb1004 to frdb1005. 
[puppet] - 10https://gerrit.wikimedia.org/r/884081 (owner: 10Jgreen) [20:36:22] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [20:36:32] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye [20:40:33] 10SRE, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [20:40:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:41:06] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) Re-running the cookbook and I watched it get past that screen with no delay {F36521655} [20:43:45] (03PS1) 10Andrew Bogott: valid_section: update a comment reflect that labtestwiki has moved [puppet] - 10https://gerrit.wikimedia.org/r/884085 [20:47:23] (03PS2) 10Andrew Bogott: valid_section: update a comment reflect that labtestwiki has moved [puppet] - 10https://gerrit.wikimedia.org/r/884085 (https://phabricator.wikimedia.org/T328079) [20:47:25] (03PS1) 10Andrew Bogott: Move clouddb2001-dev to spare [puppet] - 10https://gerrit.wikimedia.org/r/884086 (https://phabricator.wikimedia.org/T328079) [20:47:27] (03PS1) 10Andrew Bogott: Remove puppet refs to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/884087 (https://phabricator.wikimedia.org/T328079) [20:49:33] (03CR) 10Andrew Bogott: [C: 03+2] valid_section: update a comment reflect that labtestwiki has moved [puppet] - 
10https://gerrit.wikimedia.org/r/884085 (https://phabricator.wikimedia.org/T328079) (owner: 10Andrew Bogott) [20:49:42] (03CR) 10Andrew Bogott: [C: 03+2] Move clouddb2001-dev to spare [puppet] - 10https://gerrit.wikimedia.org/r/884086 (https://phabricator.wikimedia.org/T328079) (owner: 10Andrew Bogott) [20:56:16] (03PS1) 10Bartosz Dziewoński: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) [20:56:40] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [21:00:04] brennen and TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230126T2100). Please do the needful. [21:00:04] Dreamy_Jazz, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] present o/ [21:01:01] hi [21:01:22] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [21:05:31] o/ I can deploy [21:06:18] Dreamy_Jazz: around for backports? [21:06:52] Sorry didn't hear the ping [21:06:54] (03PS1) 10Sbailey: Enable Linter write namespace, tag and template from core, group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) [21:06:55] I'm here [21:07:04] no problem [21:07:29] you're up first [21:07:44] Nice. Okay. I can test this one as I have checkuser on enwiki. 
[21:08:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:08:38] (03CR) 10Sbailey: "Preparing for monday backport window for linter write code enable on group 0 only." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884090 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:09:10] (03Merged) 10jenkins-bot: Enable write new for CheckUserLog comment fields everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883952 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:09:25] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]] [21:09:30] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:11:08] !log thcipriani@deploy1002 thcipriani and dreamyjazz: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:11:22] ^ Dreamy_Jazz ok, should be on mwdebug, check please [21:11:44] Sure. Testing now. 
[21:12:23] Test complete - working as expected [21:13:53] (03Abandoned) 10BBlack: esitest: compat with haproxy >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/884052 (https://phabricator.wikimedia.org/T321775) (owner: 10BBlack) [21:13:55] (03PS2) 10Bartosz Dziewoński: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) [21:14:48] Dreamy_Jazz: great, thanks for checking, going live [21:19:53] (03CR) 10Thcipriani: [C: 03+2] ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) (owner: 10Bartosz Dziewoński) [21:20:21] (^ I'll get that one going while we're waiting) [21:20:44] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:883952|Enable write new for CheckUserLog comment fields everywhere (T233004)]] (duration: 11m 18s) [21:20:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:20:53] ^ Dreamy_Jazz should be live now [21:21:01] Thanks [21:21:16] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:21:22] Yes. It looks live to me. Thanks for the backport. [21:21:37] nice, yw :) [21:23:55] alright Jdlrobson you're up [21:24:06] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye [21:24:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**PASS**) - Removed from Puppet and Pu... 
[21:24:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson) [21:25:14] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [21:25:15] (03PS4) 10Thcipriani: Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson) [21:25:20] (03Merged) 10jenkins-bot: ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing [extensions/DiscussionTools] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/884013 (https://phabricator.wikimedia.org/T327704) (owner: 10Bartosz Dziewoński) [21:25:43] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:25:48] Jdlrobson: bah, wait, did I just break the relation chain with that rebase? [21:25:56] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye [21:26:03] is the "increase threshold" supposed to go first? [21:26:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye [21:26:32] well. hold that thought. 
looks like MatmaRex's just merged [21:27:12] thcipriani: looking [21:27:18] they can go in any order [21:27:21] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]] [21:27:25] T327704: DiscussionTools: unable to save comment on metawiki with comment-became-transcluded error - https://phabricator.wikimedia.org/T327704 [21:27:27] the first one is a NOOP [21:27:32] * MatmaRex waiting [21:27:32] just configuration cleanup [21:27:48] ah, ok, thanks for checking. I never remember which way the ordering runs in the ancestor chain in gerrit :\ [21:28:19] I'll push them both together once the discussiontools change is out [21:28:59] !log thcipriani@deploy1002 matmarex and thcipriani: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:29:18] ^ MatmaRex live on mwdebug, check please :) [21:29:48] thcipriani: yup, looks good! [21:30:10] cool, going live [21:33:22] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4046.ulsfo.wmnet with OS bullseye [21:33:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**) - Downtimed on Ic...
[21:33:34] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye [21:33:43] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye [21:33:43] !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS bullseye [21:33:51] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**) - Removed from Pu... [21:34:54] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye [21:35:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye [21:35:03] !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4046.ulsfo.wmnet with OS bullseye [21:35:11] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye executed with errors: - cp4046 (**FAIL**) - Removed from Pu... 
[21:36:04] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:884013|ApiDiscussionToolsEdit: Unwrap Parsoid sections before parsing (T327704)]] (duration: 08m 43s) [21:36:08] T327704: DiscussionTools: unable to save comment on metawiki with comment-became-transcluded error - https://phabricator.wikimedia.org/T327704 [21:36:14] ^ MatmaRex should be live now [21:36:33] thanks thcipriani [21:36:34] (03CR) 10Thcipriani: [C: 03+2] Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson) [21:37:05] (03PS2) 10Thcipriani: Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson) [21:37:28] sure thing :) [21:37:33] (03Merged) 10jenkins-bot: Remove redundant block for search descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859) (owner: 10Jdlrobson) [21:37:51] (03CR) 10Thcipriani: [C: 03+2] Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson) [21:38:43] (03Merged) 10jenkins-bot: Increase threshold for table of contents collapsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884055 (https://phabricator.wikimedia.org/T328045) (owner: 10Jdlrobson) [21:39:03] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]] [21:39:09] T328045: Increase threshold for table of contents collapsing - https://phabricator.wikimedia.org/T328045 [21:39:09] T324859: frwiktionary search config does not properly set showDescription to false - https://phabricator.wikimedia.org/T324859 [21:40:42] !log 
thcipriani@deploy1002 thcipriani and jdlrobson: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:41:00] ^ Jdlrobson okie doke, both your patches should be on mwdebug, check please [21:41:03] checking [21:41:58] LGTM please sync! [21:42:06] * thcipriani does [21:47:53] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:884055|Increase threshold for table of contents collapsing (T328045)]], [[gerrit:879664|Remove redundant block for search descriptions (T324859)]] (duration: 08m 49s) [21:47:59] T328045: Increase threshold for table of contents collapsing - https://phabricator.wikimedia.org/T328045 [21:47:59] T324859: frwiktionary search config does not properly set showDescription to false - https://phabricator.wikimedia.org/T324859 [21:48:00] ^ Jdlrobson all done [21:58:06] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS bullseye [21:58:14] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye [22:02:44] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:06:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:09:26] thanks thcipriani [22:09:38] (sorry for the delay got distracted and forgot to press enter:)) [22:09:55] heh, no worries, yw ;) [22:16:30] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:18:57] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [22:20:08] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:22:03] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [22:23:31] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:23:57] !log running migrateRevisionCommentTemp.php in cebwiki in screen with --sleep 2 # T275246 [22:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:01] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [22:41:50] (03PS4) 10Dreamy Jazz: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) [22:42:48] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [22:43:23] (03CR) 10CI reject: [V: 04-1] Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [22:44:31] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cp4046.ulsfo.wmnet with OS bullseye [22:44:41] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4046.ulsfo.wmnet with OS bullseye completed: - cp4046 (**PASS**) - Removed from Puppet and Pu... [22:44:44] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet [22:45:23] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:45:36] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye [22:45:45] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye [22:46:39] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:53:37] (03CR) 10Zabe: [C: 03+2] Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) (owner: 10Dreamy Jazz) [22:54:20] (03Merged) 10jenkins-bot: Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) (owner: 10Dreamy Jazz) [22:54:51] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881390|Pin CheckUserEventTablesMigrationStage to read and write old (T324907)]] [22:54:56] T324907: Create seperate tables for log events in CheckUser - https://phabricator.wikimedia.org/T324907 [22:56:31] !log zabe@deploy1002 dreamyjazz and zabe: Backport for [[gerrit:881390|Pin 
CheckUserEventTablesMigrationStage to read and write old (T324907)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:56:38] Hey all - going to scap out PS.php (removed emergency spam mitigations) [22:58:05] sbassett, could you wait a sec, currently deploying [22:58:09] I can ping you [22:58:29] Yes, got the lock warn, thanks. [23:03:20] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:28] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881390|Pin CheckUserEventTablesMigrationStage to read and write old (T324907)]] (duration: 08m 36s) [23:03:33] T324907: Create seperate tables for log events in CheckUser - https://phabricator.wikimedia.org/T324907 [23:04:05] sbassett, done [23:04:40] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS bullseye [23:04:50] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye executed with errors: - cp4039 (**FAIL**) - Downtimed on Ic... [23:04:59] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye [23:05:08] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye [23:05:25] (03CR) 10Jdlrobson: "Jan: Should this be abandoned?" 
[skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [23:05:31] (03CR) 10Jdlrobson: "Jan: Should this be abandoned?" [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [23:06:38] Tx, Zabe [23:07:04] (03PS3) 10Superpes15: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) [23:07:48] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [23:10:11] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:10:48] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) This is happening the first time I run the cookbooks on any of the newer servers. I've now adapted to the workflow of running the cookbook... 
[23:13:24] !log sbassett@deploy1002 Synchronized private/PrivateSettings.php: T326691 - remove mitigation and monitor (duration: 06m 52s) [23:19:24] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:52] (03PS4) 10Zabe: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [23:22:56] (03CR) 10Zabe: [C: 03+2] Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [23:23:36] (03Merged) 10jenkins-bot: Add a project logo on gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883724 (https://phabricator.wikimedia.org/T327987) (owner: 10Superpes15) [23:24:27] !log zabe@deploy1002 Started scap: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]] [23:24:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [23:24:32] T327987: Change project logo in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327987 [23:25:41] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [23:25:54] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:26:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Citoid [23:26:06] !log zabe@deploy1002 zabe and superpes: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [23:28:49] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [23:46:25] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:51:50] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4039.ulsfo.wmnet with OS bullseye [23:51:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp4039.ulsfo.wmnet with OS bullseye completed: - cp4039 (**PASS**) - Removed from Puppet and Pu... [23:52:20] !log brett@cumin1001 conftool action : set/pooled=yes; selector: name=cp4039.ulsfo.wmnet [23:53:50] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [23:54:30] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS bullseye [23:54:40] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp4047.ulsfo.wmnet with OS bullseye [23:59:10] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:883724|Add a project logo on gorwiktionary (T327987)]] (duration: 34m 42s) [23:59:14] T327987: Change project logo in Wiktionary Gorontalo - https://phabricator.wikimedia.org/T327987