[00:02:40] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:03:26] 10SRE, 10Traffic-Icebox: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10RobH) [00:03:39] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) 05In progress→03Open will resume tomorrow late evening for esams / afternoon for me. [00:07:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:09:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:07:10] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:02:48] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:21:48] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:23:58] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:01] (03PS1) 10KartikMistry: Enable Section Translation in cs, el, he, ko, sw and tr WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791107 (https://phabricator.wikimedia.org/T304855) [03:42:31] (03PS1) 10Andrew Bogott: nova-fullstack: switch to using the secondary bastion for ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/791108 (https://phabricator.wikimedia.org/T305909) [03:43:09] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: switch to using the secondary bastion for ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/791108 (https://phabricator.wikimedia.org/T305909) (owner: 10Andrew Bogott) [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:47:17] * kart_ updating cxserver [03:47:56] (03PS2) 10Andrew Bogott: nova-fullstack: switch to using the secondary bastion for ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/791108 (https://phabricator.wikimedia.org/T305909) [03:47:58] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-05-11-135122-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/791052 (https://phabricator.wikimedia.org/T307967) (owner: 10KartikMistry) [03:49:56] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: switch to using the secondary bastion for ssh tests [puppet] - 10https://gerrit.wikimedia.org/r/791108 (https://phabricator.wikimedia.org/T305909) (owner: 10Andrew Bogott) [03:52:07] (03Merged) 10jenkins-bot: Update cxserver to 2022-05-11-135122-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/791052 (https://phabricator.wikimedia.org/T307967) (owner: 10KartikMistry) [03:56:23] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [03:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:01] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [03:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:07] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [04:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:54] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [04:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:04:28] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [04:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:19] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [04:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:00] !log Updated cxserver to 2022-05-11-135122-production (T307967, T306999, T298239, T304853, T307507, T308039) [04:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:13] T308039: cxserver CI fails with missing file - https://phabricator.wikimedia.org/T308039 [04:07:13] T307967: Ongoing translations 
refuse to load in the Content translation tool - https://phabricator.wikimedia.org/T307967 [04:07:13] T304853: Enable Content and Section Translation for Turkish Wikipedia - https://phabricator.wikimedia.org/T304853 [04:07:14] T306999: Content lost after translation at euwiki - https://phabricator.wikimedia.org/T306999 [04:07:14] T307507: Fully deprecate service-pipeline-test and service-pipeline-test-and-publish jobs - https://phabricator.wikimedia.org/T307507 [04:07:14] T298239: Enable Content and Section Translation for Korean Wikipedia - https://phabricator.wikimedia.org/T298239 [04:16:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:27:34] PROBLEM - MariaDB Replica SQL: s4 on db2140 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:28:02] PROBLEM - MariaDB Replica IO: s4 on db2140 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:28:08] PROBLEM - MariaDB read only s4 on db2140 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:28:38] PROBLEM - MariaDB disk space on db2140 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:28:44] PROBLEM - Check systemd state on db2140 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb.service,prometheus-debian-version-textfile.service,prometheus-mysqld-exporter.service,prometheus-nic-firmware-textfile.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:54] PROBLEM - mysqld processes on db2140 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:34:42] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [04:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:45:14] PROBLEM - MariaDB Replica Lag: s4 on db2140 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:48:52] PROBLEM - Disk space on db2140 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2140&var-datasource=codfw+prometheus/ops [05:03:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.822 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:10:03] 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 (10Marostegui) [05:10:13] 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 
(10Marostegui) p:05Triage→03Medium [05:11:01] (03PS1) 10Marostegui: db2140: Broken host [puppet] - 10https://gerrit.wikimedia.org/r/791112 (https://phabricator.wikimedia.org/T308202) [05:11:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2140 T308202', diff saved to https://phabricator.wikimedia.org/P27791 and previous config saved to /var/cache/conftool/dbconfig/20220512-051106-marostegui.json [05:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:12] T308202: db2140 broken storage - https://phabricator.wikimedia.org/T308202 [05:13:57] (03CR) 10Marostegui: [C: 03+2] db2140: Broken host [puppet] - 10https://gerrit.wikimedia.org/r/791112 (https://phabricator.wikimedia.org/T308202) (owner: 10Marostegui) [05:24:59] ACKNOWLEDGEMENT - Check systemd state on db2140 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,mariadb.service,prometheus-debian-version-textfile.service,prometheus-mysqld-exporter.service,prometheus-nic-firmware-textfile.service,prometheus_intel_microcode.service,prometheus_puppet_agent_stats.service Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:59] ACKNOWLEDGEMENT - Disk space on db2140 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2140&var-datasource=codfw+prometheus/ops [05:24:59] ACKNOWLEDGEMENT - MariaDB Replica IO: s4 on db2140 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:24:59] ACKNOWLEDGEMENT - MariaDB Replica Lag: s4 on db2140 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui 
https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:24:59] ACKNOWLEDGEMENT - MariaDB Replica SQL: s4 on db2140 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:24:59] ACKNOWLEDGEMENT - MariaDB disk space on db2140 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:25:00] ACKNOWLEDGEMENT - MariaDB read only s4 on db2140 is CRITICAL: Could not connect to localhost:3306 Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:25:00] ACKNOWLEDGEMENT - mysqld processes on db2140 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui https://phabricator.wikimedia.org/T308202 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:34:24] (03PS1) 10Marostegui: db2122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791230 (https://phabricator.wikimedia.org/T308126) [05:34:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27792 and previous config saved to /var/cache/conftool/dbconfig/20220512-053444-marostegui.json [05:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:51] T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 [05:35:14] (03CR) 10Marostegui: [C: 03+2] db2122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791230 (https://phabricator.wikimedia.org/T308126) (owner: 10Marostegui) [05:41:17] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability 
(Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) I have migrated db2140 to 10.6 and it looks like the query, after the alter... [05:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2122 T307501', diff saved to https://phabricator.wikimedia.org/P27793 and previous config saved to /var/cache/conftool/dbconfig/20220512-054138-marostegui.json [05:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:44] T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 [05:41:49] (03PS1) 10Marostegui: Revert "db2122: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/791246 [05:42:30] (03CR) 10Marostegui: [C: 03+2] Revert "db2122: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/791246 (owner: 10Marostegui) [05:44:09] (03PS1) 10Marostegui: db2122: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/791231 (https://phabricator.wikimedia.org/T308126) [05:45:37] (03CR) 10Marostegui: [C: 03+2] db2122: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/791231 (https://phabricator.wikimedia.org/T308126) (owner: 10Marostegui) [05:59:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 T308126', diff saved to https://phabricator.wikimedia.org/P27794 and previous config saved to /var/cache/conftool/dbconfig/20220512-055918-marostegui.json [05:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:24] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T0600). 
[06:00:13] (03PS1) 10Marostegui: db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791232 (https://phabricator.wikimedia.org/T308126) [06:01:24] (03CR) 10Marostegui: [C: 03+2] db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791232 (https://phabricator.wikimedia.org/T308126) (owner: 10Marostegui) [06:03:57] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) And of course, I alter db1127 before migrating to 10.6 and now the optimizer... [06:08:36] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) The migration to 10.6 at least, keeps the optimizer working fine. [06:10:11] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) Reverted the alter table on db1127 and db2140 to leave those hosts consisten... 
[06:11:27] (03PS1) 10Marostegui: db1127: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/791234 (https://phabricator.wikimedia.org/T308126) [06:12:10] (03CR) 10Marostegui: [C: 03+2] db1127: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/791234 (https://phabricator.wikimedia.org/T308126) (owner: 10Marostegui) [06:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1127 with low weight T308126', diff saved to https://phabricator.wikimedia.org/P27795 and previous config saved to /var/cache/conftool/dbconfig/20220512-061305-marostegui.json [06:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:12] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [06:14:29] (03PS1) 10Marostegui: db1127: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791291 (https://phabricator.wikimedia.org/T308126) [06:15:57] (03CR) 10Marostegui: [C: 03+2] db1127: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791291 (https://phabricator.wikimedia.org/T308126) (owner: 10Marostegui) [06:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27796 and previous config saved to /var/cache/conftool/dbconfig/20220512-062241-marostegui.json [06:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:47] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [06:32:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27797 and previous config saved to /var/cache/conftool/dbconfig/20220512-063217-marostegui.json [06:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:23] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 
[06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:39:03] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:42:56] (03PS1) 10Elukey: Set celery 5 settings for ores1001 [puppet] - 10https://gerrit.wikimedia.org/r/791295 (https://phabricator.wikimedia.org/T303801) [06:43:37] (03CR) 10Elukey: [C: 03+2] Set celery 5 settings for ores1001 [puppet] - 10https://gerrit.wikimedia.org/r/791295 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [06:44:18] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores1001.eqiad.wmnet with OS buster [06:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, apergos, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:33] * kart_ is here [07:00:41] I can self-deploy.. [07:01:03] hello [07:01:07] no trainees today [07:01:38] go for it, kart_ [07:01:39] yeah. 
[07:01:48] you're the only patch today so it's all you [07:01:51] (03CR) 10KartikMistry: [C: 03+2] Enable Section Translation in cs, el, he, ko, sw and tr WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791107 (https://phabricator.wikimedia.org/T304855) (owner: 10KartikMistry) [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:41] (03Merged) 10jenkins-bot: Enable Section Translation in cs, el, he, ko, sw and tr WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791107 (https://phabricator.wikimedia.org/T304855) (owner: 10KartikMistry) [07:03:32] zippy! [07:04:14] Testing on mwdebug1001 [07:05:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:05:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:21] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1001.eqiad.wmnet with reason: host reimage [07:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:50] Tests looks good! 
[07:06:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:07] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791107|Enable Section Translation in cs, el, he, ko, sw and tr WPs (T304855 T304854 T298239 T304863 T304853 T304828)]] (duration: 00m 51s) [07:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:21] T304854: Enable Content and Section Translation for Greek Wikipedia - https://phabricator.wikimedia.org/T304854 [07:08:21] T304855: Enable Content and Section Translation for Czech Wikipedia - https://phabricator.wikimedia.org/T304855 [07:08:21] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828 [07:08:22] T304863: Enable Content and Section Translation for Hebrew Wikipedia - https://phabricator.wikimedia.org/T304863 [07:08:22] T304853: Enable Content and Section Translation for Turkish Wikipedia - https://phabricator.wikimedia.org/T304853 [07:08:22] T298239: Enable Content and Section Translation for Korean Wikipedia - https://phabricator.wikimedia.org/T298239 [07:08:32] apergos: I'm done :) [07:08:42] zooooom that was fast [07:09:15] guess that's it for the window. 
I mean if someone else shows up who wants to self deploy, they can still do it [07:09:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1001.eqiad.wmnet with reason: host reimage [07:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:34] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:12:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:13:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:24] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye [07:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:28] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye [07:18:52] !log dbmaint s7@codfw T308206 [07:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:57] T308206: Slow query on echo_event table - https://phabricator.wikimedia.org/T308206 [07:22:13] (03PS3) 10Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - 
10https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) [07:22:48] (03CR) 10jerkins-bot: [V: 04-1] zuul: disable core.logAllRefUpdates at clone time [puppet] - 10https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [07:26:36] RECOVERY - Check systemd state on cp3058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:53] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) Quick question about how to proceed. Would it make sense to start testing adding manual labels in the ml-serve-eqiad clu... [07:29:12] !log dbmaint s3@eqiad T308206 [07:29:14] !log dbmaint s3@codfw T308206 [07:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:17] T308206: Slow query on echo_event table - https://phabricator.wikimedia.org/T308206 [07:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:58] (03PS1) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc [software] - 10https://gerrit.wikimedia.org/r/791297 [07:32:18] !log dbmaint s6@codfw T308206 [07:32:21] !log dbmaint s6@eqiad T308206 [07:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:55] !log dbmaint s7@codfw T308206 [07:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:56] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4001.ulsfo.wmnet with OS bullseye [07:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:59] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - 
https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**) - Removed from... [07:36:38] (03PS4) 10Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - 10https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) [07:37:00] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:43:28] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:45:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1001.eqiad.wmnet with OS buster [07:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) Without a firmware update of the system firmware and the NIC firmware, there was no network link in the Debian installer. Rob updated everything to the latest version, but it... [07:47:04] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:47:55] (03CR) 10Volans: [C: 03+1] "LGTM, verified that all settings are the default ones and no changes in prod." 
[puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [07:48:32] (03PS1) 10Elukey: Set celery 5 settings for ores1002 [puppet] - 10https://gerrit.wikimedia.org/r/791299 (https://phabricator.wikimedia.org/T303801) [07:49:23] (03CR) 10Volans: [C: 03+2] "Thanks for the typo fix!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/789923 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [07:53:19] anybody experiencing some slowdown with gerrit? [07:53:49] 10SRE, 10SRE-Access-Requests, 10Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10jnuche) @RLazarus awesome, thank you so much! I'll verify we can use the new identity once https://gerrit.wikimedia.org/r/c/operations/puppet/+/789146 has been merged. [07:53:51] elukey: it works fine for me at the moment [07:54:23] marostegui: thanks, than it may be my ISP [07:54:26] *then [07:54:47] (03PS13) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [07:54:49] (03PS1) 10Muehlenhoff: Remove webperf1001/2001 from Scap config [puppet] - 10https://gerrit.wikimedia.org/r/791300 [07:56:48] (03Merged) 10jenkins-bot: Fix typo [software/spicerack] - 10https://gerrit.wikimedia.org/r/789923 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [07:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27798 and previous config saved to /var/cache/conftool/dbconfig/20220512-075703-marostegui.json [07:57:06] (03CR) 10Volans: "post-merge nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [07:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:11] T308126: MIgrate a s7 DB host to mariadb 10.6 - 
https://phabricator.wikimedia.org/T308126 [07:57:25] (03CR) 10Volans: "post-merge nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [07:59:22] (03PS7) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [08:00:19] (03CR) 10Volans: "It would be nice to migrate all the way to the same version of homer for simplicity of use and script maintenance, basically just having t" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/789571 (owner: 10Ayounsi) [08:07:30] (03PS1) 10Kosta Harlan: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) [08:08:27] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @elukey yes I think that makes sense, no need to hold off on testing. Your suggested label naming makes sense so let's... 
[08:12:13] (03PS6) 10Jaime Nuche: scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) [08:12:53] (03PS1) 10Kosta Harlan: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) [08:18:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27799 and previous config saved to /var/cache/conftool/dbconfig/20220512-081814-marostegui.json [08:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:19] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [08:25:51] (03CR) 10Jaime Nuche: "https://phabricator.wikimedia.org/T307351 has been resolved" [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [08:25:54] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [08:27:16] (03PS1) 10Muehlenhoff: Enable Ganeti3 component on eqsin servers [puppet] - 10https://gerrit.wikimedia.org/r/791306 (https://phabricator.wikimedia.org/T308211) [08:31:56] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host idp-test2002.wikimedia.org [08:31:57] !log jmm@cumin1001 START - Cookbook sre.dns.netbox [08:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:33:48] 10SRE, 10Infrastructure-Foundations: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 
(10MoritzMuehlenhoff) [08:33:53] mvolz: Hi, would it interfere with your work if I deployed a mwext-Kartographer backport? [08:34:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:34:30] awight: nope, go right ahead! [08:34:39] ty! [08:36:22] (03PS1) 10Awight: Duplicate "latest revision may be special" logic from FlaggedRevs [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791248 (https://phabricator.wikimedia.org/T304813) [08:37:17] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:37:55] (03CR) 10Awight: [C: 03+2] "Deploying." [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791248 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [08:38:09] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:38:11] (03PS1) 10Klausman: hiera: Use celery v5 on ores1003 [puppet] - 10https://gerrit.wikimedia.org/r/791308 (https://phabricator.wikimedia.org/T303801) [08:39:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:40:32] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores1003 [puppet] - 10https://gerrit.wikimedia.org/r/791308 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [08:40:56] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1003.eqiad.wmnet with OS buster [08:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:54] (NodeTextfileStale) firing: Stale textfile for 
cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:45:39] !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:33] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:50:17] (03PS2) 10Filippo Giunchedi: WIP move network routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [08:50:19] (03PS1) 10Filippo Giunchedi: wmflib: extend sites [puppet] - 10https://gerrit.wikimedia.org/r/791309 [08:50:21] (03PS1) 10Filippo Giunchedi: netops: add site/role to netops::check to cater for new data structure [puppet] - 10https://gerrit.wikimedia.org/r/791310 [08:50:46] (03CR) 10Klausman: [C: 03+2] Set celery 5 settings for ores1002 [puppet] - 10https://gerrit.wikimedia.org/r/791299 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [08:51:55] (03CR) 10jerkins-bot: [V: 04-1] Duplicate "latest revision may be special" logic from FlaggedRevs [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791248 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [08:52:01] (03PS2) 10Thiemo Kreuz (WMDE): Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [08:52:06] (03CR) 10Volans: "Should we have them split maybe?" 
[puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [08:52:27] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores1002.eqiad.wmnet with OS buster [08:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:34] (03CR) 10Muehlenhoff: [C: 03+2] scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [08:54:07] (03CR) 10Filippo Giunchedi: WIP move network routers definitions to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:54:17] (03PS3) 10Thiemo Kreuz (WMDE): Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [08:55:18] mvolz: jfyi, I'm aborting my deployment because of an unrelated test failure. Nothing was changed in mediawiki-staging. [08:55:42] (03PS8) 10Jaime Nuche: scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [08:55:49] awight: the errors are known we ignored them on the past backports [08:56:04] it's an issue on the .10 branch [08:56:21] WMDE-Fisch: Ignored and forced-submit? I'm okay with waiting for the patch to go out next week... 
[08:56:34] awight: yes [08:56:39] see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/790406 [08:56:45] and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/790329 [08:57:21] ty, I think I will pass since this isn't an emergency <_< [08:57:55] +1 [08:58:29] (03PS4) 10Thiemo Kreuz (WMDE): Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [08:59:23] my window isn't for another hour anyway [08:59:24] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Drop unused FlaggedRevs threshold level names (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [08:59:33] (03CR) 10Jbond: [C: 03+1] Netbox: Add 2.11 configuration knobs [puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [09:00:36] mvolz: O_O I see now! 
Well then, I'm happy to leave you with a blank slate :-) [09:00:47] (03CR) 10Filippo Giunchedi: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [09:03:10] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1003.eqiad.wmnet with reason: host reimage [09:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35198/console" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:05:12] (03PS1) 10Jbond: requestctl_checkip: Addressing post-merge optimisation comments [puppet] - 10https://gerrit.wikimedia.org/r/791313 [09:05:23] (03CR) 10Jbond: [C: 03+2] P:conftool::requestctl_client: add simple script to check for block ips (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [09:06:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1003.eqiad.wmnet with reason: host reimage [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:06] (03CR) 10jerkins-bot: [V: 04-1] requestctl_checkip: Addressing post-merge optimisation comments [puppet] - 10https://gerrit.wikimedia.org/r/791313 (owner: 10Jbond) [09:07:37] (03CR) 10Muehlenhoff: [C: 03+2] scap: add system package requirements for scap [puppet] - 10https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [09:08:37] (03CR) 10Vgutierrez: [C: 03+2] "merging to get some initial data, we can adjust the thresholds later" [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez) [09:09:16] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: 10WMDE-Fisch) [09:09:24] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch) [09:09:36] (03Abandoned) 10Elukey: celery: fix version comparison in systemd template [puppet] - 10https://gerrit.wikimedia.org/r/788277 (owner: 10Elukey) [09:10:51] (03Abandoned) 10Elukey: WIP - ores::base: add conditionals for buster [puppet] - 10https://gerrit.wikimedia.org/r/771947 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [09:14:17] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1002.eqiad.wmnet with reason: host reimage [09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:31] (03CR) 10Filippo Giunchedi: [V: 03+1] WIP move network routers definitions to hiera (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:14:44] (03CR) 10Jbond: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [09:15:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [09:15:47] (03PS7) 10Muehlenhoff: scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [09:16:12] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Addshore) Just trying to figure out the status of this feature from the comments. Would anyone in the know be able to write a summary? 
[09:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27800 and previous config saved to /var/cache/conftool/dbconfig/20220512-091706-marostegui.json [09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [09:17:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1002.eqiad.wmnet with reason: host reimage [09:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:19] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Deploy template search improvements to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch) [09:18:59] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:19:16] (03CR) 10Muehlenhoff: [C: 03+2] scap: clone scap code on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [09:19:51] (03CR) 10Thiemo Kreuz (WMDE): Deploy VE template dialog improvements to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: 10WMDE-Fisch) [09:22:39] (03CR) 10Volans: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [09:23:02] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Deploy template search improvements to enwiki (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch) [09:27:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:28:07] (03CR) 10Jbond: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [09:32:19] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fix!" [puppet] - 10https://gerrit.wikimedia.org/r/791313 (owner: 10Jbond) [09:32:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:15] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:36:30] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1003.eqiad.wmnet with OS buster [09:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:12] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [09:38:42] (03PS2) 10Jbond: requestctl_checkip: Addressing post-merge optimisation comments [puppet] - 10https://gerrit.wikimedia.org/r/791313 [09:40:13] (03PS1) 10Slyngshede: Switch out more OSM cronjobs from systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/791318 (https://phabricator.wikimedia.org/T273673) [09:40:48] (03CR) 10Awight: [C: 03+1] Deploy VE template dialog improvements to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: 10WMDE-Fisch) [09:41:23] 10SRE, 10SRE-Access-Requests, 10Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10jnuche) @RLazarus the new user `scap` exists now on the required hosts, but unfortunately access through Keyholder is not working: ` jnuche@deploy1002:~$ export SSH_AUTH_SOCK=/run... [09:41:48] (03CR) 10Awight: [C: 03+1] Deploy template search improvements to enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch) [09:41:54] (03PS1) 10Btullis: Double the number of eventgate_analytics_external replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) [09:42:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:37] (03PS1) 10Gergő Tisza: Send sections_to_exclude in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) [09:46:19] (03PS1) 10Aklapper: Redirect dev.wikimedia.org to developer.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) [09:46:31] (03CR) 10Volans: [C: 03+1] "Seems reasonable to me" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789089 (owner: 10Ayounsi) [09:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 
T308126', diff saved to https://phabricator.wikimedia.org/P27802 and previous config saved to /var/cache/conftool/dbconfig/20220512-094642-marostegui.json [09:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:50] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [09:46:50] (03CR) 10Aklapper: [C: 04-1] "DO NOT MERGE YET; depends on resolving T261510" [puppet] - 10https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) (owner: 10Aklapper) [09:48:23] (03PS7) 10Hashar: docker: move pruning to new profile docker::prune [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [09:48:57] (03PS3) 10Hashar: ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 [09:49:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1002.eqiad.wmnet with OS buster [09:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:29] (03PS2) 10Slyngshede: Switch out more OSM cronjobs from systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/791318 (https://phabricator.wikimedia.org/T273673) [09:50:31] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) [09:50:39] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) (owner: 10Hashar) [09:51:20] (03CR) 10Kosta Harlan: "Deployment should be synchronized with Ia88088da1b002c9d141e274653e35d594b786e95" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [09:52:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:59] (03PS1) 10Joal: Add profile::hadoop:spark3 class and resources [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) [09:55:31] (03CR) 10Joal: Add profile::hadoop:spark3 class and resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [09:58:19] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:59:46] (03CR) 10Hashar: "PPC: https://puppet-compiler.wmflabs.org/pcc-worker1001/1329/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) (owner: 10Hashar) [10:00:04] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T1000). 
[10:02:38] ACKNOWLEDGEMENT - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed Btullis Working on this: T307102 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:02:38] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed Btullis Working on this: T307102 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:02:38] ACKNOWLEDGEMENT - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check did not succeed Btullis Working on this: T307102 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:02:38] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed Btullis Working on this: T307102 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:03:30] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [10:04:25] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: migrate container stats cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:05:51] PROBLEM - puppet last run on db2140 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:08:15] (03PS7) 10Hnowlan: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) [10:09:34] (03CR) 10jerkins-bot: [V: 04-1] Send sections_to_exclude in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [10:11:10] !log installing Apache 2.4.53 updates on bullseye [10:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:17] (03PS1) 10Hnowlan: Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) [10:15:13] (03CR) 10jerkins-bot: [V: 04-1] Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:18:58] (03PS5) 10Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - 10https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) [10:19:00] (03PS1) 10Hashar: git: add define for abritrarily named config file [puppet] - 10https://gerrit.wikimedia.org/r/791327 (https://phabricator.wikimedia.org/T307620) [10:19:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye [10:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:52] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye [10:19:53] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1020.eqiad.wmnet with OS 
bullseye [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:57] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye executed with errors: - aqs1020... [10:24:50] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [10:26:10] (03CR) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [10:28:32] (03PS1) 10Volans: cumin: use homer ssh config for lsw devices [puppet] - 10https://gerrit.wikimedia.org/r/791328 [10:30:18] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (10Volans) We looked at the logs with John and Papaul during our last meeting and agreed that it took a long time for mdadm+mkfs to create the software rai... 
[10:33:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27803 and previous config saved to /var/cache/conftool/dbconfig/20220512-103333-marostegui.json [10:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:39] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:45:25] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye [10:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:30] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye [10:46:31] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test2002.wikimedia.org [10:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:56] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host idp-test1002.wikimedia.org [10:49:57] !log jmm@cumin1001 START - Cookbook sre.dns.netbox [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:25] (03PS1) 10Jbond: P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 [10:54:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 
T308126', diff saved to https://phabricator.wikimedia.org/P27804 and previous config saved to /var/cache/conftool/dbconfig/20220512-105432-marostegui.json [10:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:38] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [10:55:00] (03CR) 10jerkins-bot: [V: 04-1] P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 (owner: 10Jbond) [10:55:36] !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:14] (03PS2) 10Jbond: P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 [10:58:10] (03CR) 10jerkins-bot: [V: 04-1] P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 (owner: 10Jbond) [10:58:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [10:59:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35202/console" [puppet] - 10https://gerrit.wikimedia.org/r/791332 (owner: 10Jbond) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:05:44] (03PS3) 10Jbond: P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 [11:07:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35203/console" [puppet] - 10https://gerrit.wikimedia.org/r/791332 (owner: 10Jbond) [11:11:30] (03PS1) 10Volans: remote: increase reboot wait time 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/791335 (https://phabricator.wikimedia.org/T307260) [11:11:32] (03PS1) 10Volans: ganeti: add startup method [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 [11:12:14] (03CR) 10Volans: "This will be used by the "reimage" cookbook for VMs." [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 (owner: 10Volans) [11:13:41] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) [11:14:35] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idp-test1002.wikimedia.org [11:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:42] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) I updated the firmware of other backup affected hosts: backup2002, backup1001, backup2001 to 21.80.16.95. They all seem to work as expected and was able to upgrade them... 
[11:14:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base: add documentation and clean up [puppet] - 10https://gerrit.wikimedia.org/r/791332 (owner: 10Jbond) [11:17:16] (03PS1) 10Klausman: hiera: Use celery v5 on ores1005 [puppet] - 10https://gerrit.wikimedia.org/r/791341 (https://phabricator.wikimedia.org/T303801) [11:17:59] (03PS1) 10Muehlenhoff: Add idp-test1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/791342 (https://phabricator.wikimedia.org/T308214) [11:17:59] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1005.eqiad.wmnet with OS buster [11:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:14] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores1005 [puppet] - 10https://gerrit.wikimedia.org/r/791341 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [11:21:20] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1020.eqiad.wmnet with OS bullseye [11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye executed with errors: - aqs1020... 
[11:23:07] (03PS2) 10Zabe: swift: remove absented container stats cron [puppet] - 10https://gerrit.wikimedia.org/r/790762 (https://phabricator.wikimedia.org/T273673) [11:23:19] (03CR) 10Sergio Gimeno: Account creation: update live campaigns config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [11:23:23] (03PS5) 10Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [11:24:56] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 57.58 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [11:25:53] that was just a spike that happened 30 minutes ago [11:26:27] (03CR) 10Tacsipacsi: [C: 03+1] Drop unused FlaggedRevs threshold level names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [11:28:03] (03PS6) 10Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [11:28:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 76.18 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [11:28:33] (03PS1) 10Jbond: P:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [11:28:43] (03CR) 10Btullis: [C: 03+1] "Looks OK to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/786382 (owner: 10BryanDavis) [11:28:55] (03CR) 10jerkins-bot: [V: 04-1] P:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [11:29:22] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:07] (03PS2) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [11:30:27] (03CR) 10jerkins-bot: [V: 04-1] C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [11:31:24] (03PS1) 10Jbond: P:conftool: fix location of requestcrl_client script [puppet] - 10https://gerrit.wikimedia.org/r/791347 [11:34:43] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [11:36:50] (03CR) 10Muehlenhoff: [C: 03+2] Add idp-test1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/791342 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [11:37:43] (03CR) 10Jbond: [C: 03+2] P:conftool: fix location of requestcrl_client script [puppet] - 10https://gerrit.wikimedia.org/r/791347 (owner: 10Jbond) [11:38:14] moritzm: ok to merge your change [11:38:45] jbond: yes, please [11:38:54] done [11:38:57] thx [11:39:14] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: remove absented container stats cron [puppet] - 10https://gerrit.wikimedia.org/r/790762 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:39:27] (03PS1) 10Ladsgroup: ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/791252 (https://phabricator.wikimedia.org/T308207) [11:39:35] np [11:40:01] (03PS2) 10Ladsgroup: ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791252 (https://phabricator.wikimedia.org/T308207) [11:40:02] zabe: thank you again for your help re: cron -> systemd timers migration [11:40:19] jouncebot: nowandnext [11:40:19] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [11:40:19] In 1 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T1300) [11:40:27] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1005.eqiad.wmnet with reason: host reimage [11:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:32] (03CR) 10Ladsgroup: [C: 03+2] ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791252 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [11:40:48] (03PS1) 10Ladsgroup: ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791253 (https://phabricator.wikimedia.org/T308207) [11:41:02] (03CR) 10Ladsgroup: [C: 03+2] ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791253 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [11:42:15] (03PS1) 10David Caro: wmcs-k8s-node-upgrade: add some extra logs [puppet] - 10https://gerrit.wikimedia.org/r/791348 [11:43:53] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1005.eqiad.wmnet with reason: host reimage [11:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:00] (03CR) 10jerkins-bot: [V: 04-1] wmcs-k8s-node-upgrade: add some extra logs [puppet] - 
10https://gerrit.wikimedia.org/r/791348 (owner: 10David Caro) [11:44:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:09] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [11:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:47:26] (03PS1) 10Slyngshede: Move more OSM cronjobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [11:48:20] (03CR) 10jerkins-bot: [V: 04-1] Move more OSM cronjobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:50:18] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [11:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:40] (03PS1) 10Muehlenhoff: Validate the Ganeti node has been added to Hiera (and thus Ferm) [cookbooks] - 10https://gerrit.wikimedia.org/r/791350 [11:51:34] (03CR) 10Jbond: Fix permissions/ownership of helm directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [11:51:44] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2002.codfw.wmnet [11:51:46] jayme: see comment ^^ [11:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:09] let me know if you need more info [11:52:19] (03PS2) 10Slyngshede: Move more OSM cronjobs to 
systemd timers. Note: I'm not entirely sure that these timers are actively used. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [11:52:42] (03PS3) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [11:53:35] (03CR) 10jerkins-bot: [V: 04-1] Move more OSM cronjobs to systemd timers. Note: I'm not entirely sure that these timers are actively used. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:53:41] (03CR) 10jerkins-bot: [V: 04-1] C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [11:54:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27805 and previous config saved to /var/cache/conftool/dbconfig/20220512-115445-marostegui.json [11:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:51] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [11:57:34] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2002.codfw.wmnet [11:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:31] (03PS1) 10Jbond: P:cumin::unprivmaster: Add owneres variable to unpriv profile [puppet] - 10https://gerrit.wikimedia.org/r/791351 [11:58:47] (03CR) 10jerkins-bot: [V: 04-1] Validate the Ganeti node has been added to Hiera (and thus Ferm) [cookbooks] - 10https://gerrit.wikimedia.org/r/791350 (owner: 10Muehlenhoff) [11:59:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35207/console" [puppet] - 
10https://gerrit.wikimedia.org/r/791351 (owner: 10Jbond) [12:00:08] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2003.codfw.wmnet [12:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:52] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2003.codfw.wmnet [12:04:53] (03PS3) 10Slyngshede: Move more OSM cronjobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [12:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:42] (03CR) 10jerkins-bot: [V: 04-1] Move more OSM cronjobs to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:11:50] (03PS1) 10Elukey: Set celery 5 settings for ores1004 [puppet] - 10https://gerrit.wikimedia.org/r/791353 (https://phabricator.wikimedia.org/T303801) [12:12:34] (03CR) 10Elukey: [C: 03+2] Set celery 5 settings for ores1004 [puppet] - 10https://gerrit.wikimedia.org/r/791353 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [12:12:49] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [12:12:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores1004.eqiad.wmnet with OS buster [12:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:57] (03PS4) 10Slyngshede: Move more OSM cronjobs to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [12:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:01] (03CR) 10Ayounsi: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [12:14:07] (03PS2) 10Muehlenhoff: Validate the Ganeti node has been added to Hiera (and thus Ferm) [cookbooks] - 10https://gerrit.wikimedia.org/r/791350 [12:14:53] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1005.eqiad.wmnet with OS buster [12:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35208/console" [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:17:57] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [12:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:49] (03PS1) 10Klausman: hiera: Use celery v5 on ores1007 [puppet] - 10https://gerrit.wikimedia.org/r/791355 (https://phabricator.wikimedia.org/T303801) [12:20:08] (03Merged) 10jenkins-bot: ApiQueryInfo: Force PRIMARY index on templatelinks [core] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/791253 (https://phabricator.wikimedia.org/T308207) (owner: 10Ladsgroup) [12:20:41] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1007.eqiad.wmnet with OS buster [12:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:12] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores1007 [puppet] - 10https://gerrit.wikimedia.org/r/791355 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [12:21:33] (03CR) 10Slyngshede: [V: 03+1] 
"It looks like the populate_admin scripts are disabled via hieradata, and may not even be run." [puppet] - 10https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:22:48] (03Abandoned) 10Slyngshede: Switch out more OSM cronjobs from systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/791318 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:23:00] (03CR) 10Muehlenhoff: [C: 03+2] Enable Ganeti3 component on eqsin servers [puppet] - 10https://gerrit.wikimedia.org/r/791306 (https://phabricator.wikimedia.org/T308211) (owner: 10Muehlenhoff) [12:24:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [12:24:28] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [12:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:48] (03PS1) 10Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847) [12:26:08] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) [12:26:16] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:26:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1127 to test 10.6 T308126', diff saved to https://phabricator.wikimedia.org/P27806 and previous config saved to /var/cache/conftool/dbconfig/20220512-122707-marostegui.json [12:27:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:12] T308126: MIgrate a s7 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T308126 [12:27:16] sorry for the force merge [12:27:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:27:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:24] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:25] (03Abandoned) 10Jelto: icinga: increase retries and delay for icinga status check [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [12:30:35] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/791350 (owner: 10Muehlenhoff) [12:30:52] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/api/ApiQueryInfo.php: Backport: [[gerrit:791252|ApiQueryInfo: Force PRIMARY index on templatelinks (T308207)]] (duration: 00m 50s) [12:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:56] T308207: ApiQueryInfo::getProtectionInfo is slow on normalized templatelinks - https://phabricator.wikimedia.org/T308207 [12:31:15] (03PS2) 10Volans: P:cumin::unprivmaster: Add owners variable to unpriv profile [puppet] - 10https://gerrit.wikimedia.org/r/791351 (owner: 10Jbond) [12:31:36] (03CR) 10Volans: [C: 03+1] "LGTM, I just fixed a typo in commit message" [puppet] - 10https://gerrit.wikimedia.org/r/791351 (owner: 10Jbond) 
[12:33:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:34:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:20] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:37:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1004.eqiad.wmnet with reason: host reimage [12:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:55] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet [12:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1004.eqiad.wmnet with reason: host reimage [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:22] (03CR) 10Kosta Harlan: [C: 04-2] "waiting on new image build" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [12:42:25] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1007.eqiad.wmnet with reason: host reimage [12:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:42:43] (03PS2) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) [12:43:10] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet [12:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:43] (03Abandoned) 10Gergő Tisza: Temporarily disable link recommendation backend on hi, uk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791085 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [12:43:54] (03Abandoned) 10Gergő Tisza: Revert "Temporarily disable link recommendation backend on hi, uk" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791089 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [12:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:44:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 for optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27807 and previous config saved to /var/cache/conftool/dbconfig/20220512-124406-marostegui.json [12:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:33] (03PS2) 10Gergő Tisza: Send sections_to_exclude in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) [12:45:53] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1007.eqiad.wmnet with reason: host reimage [12:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:35] (03PS3) 10Gergő Tisza: Send sections_to_exclude 
in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) [12:48:14] (03PS1) 10Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847) [12:48:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:35] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:52:23] (03CR) 10JMeybohm: [C: 03+2] Fix permissions/ownership of helm directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [12:53:47] (03CR) 10Cathal Mooney: [C: 03+1] "Ah good stuff, yeah I'd done this locally should have thought to add it here. 
We should probably also add 'ssw' (spine switch) to the lis" [puppet] - 10https://gerrit.wikimedia.org/r/791328 (owner: 10Volans) [12:56:36] (03PS2) 10Volans: cumin: use homer ssh config for lsw devices [puppet] - 10https://gerrit.wikimedia.org/r/791328 [12:57:26] (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/791328 (owner: 10Volans) [12:58:52] (03CR) 10Muehlenhoff: [C: 03+2] Validate the Ganeti node has been added to Hiera (and thus Ferm) [cookbooks] - 10https://gerrit.wikimedia.org/r/791350 (owner: 10Muehlenhoff) [12:58:58] (03CR) 10Jbond: [C: 03+2] P:cumin::unprivmaster: Add owners variable to unpriv profile [puppet] - 10https://gerrit.wikimedia.org/r/791351 (owner: 10Jbond) [12:59:14] (03CR) 10Jbond: [C: 03+2] P:cumin::unprivmaster: Add owners variable to unpriv profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791351 (owner: 10Jbond) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T1300). [13:00:04] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:27] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:00:46] I'll self-deploy, eventually. Need to update a service dependency first. 
[13:04:44] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Zabe) [13:05:14] 10SRE-swift-storage: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) 05Open→03Resolved [13:07:08] alright [13:08:02] (03PS4) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [13:08:23] (03PS5) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [13:08:41] (03PS6) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [13:09:16] (03CR) 10jerkins-bot: [V: 04-1] C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [13:10:00] (03PS7) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [13:10:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35209/console" [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [13:10:33] (03CR) 10jerkins-bot: [V: 04-1] C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [13:10:42] (03PS3) 10Gergő Tisza: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [13:11:02] (03CR) 10Gergő 
Tisza: "updated with new image tag." [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [13:11:35] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:12:00] (03PS1) 10Giuseppe Lavagetto: requestctl: add validate command [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) [13:12:01] (03PS1) 10Giuseppe Lavagetto: requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 [13:12:04] (03PS1) 10Giuseppe Lavagetto: New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 [13:12:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1004.eqiad.wmnet with OS buster [13:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: add AND NOT and OR NOT to the parsing grammar (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [13:12:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) (owner: 10Giuseppe Lavagetto) [13:13:14] (03PS8) 10Jbond: C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) [13:13:26] (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [13:13:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: add retry-after request header when applicable [software/conftool] 
- 10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto) [13:14:07] (03CR) 10jerkins-bot: [V: 04-1] requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 (owner: 10Giuseppe Lavagetto) [13:14:09] (03CR) 10jerkins-bot: [V: 04-1] requestctl: add validate command [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) (owner: 10Giuseppe Lavagetto) [13:14:11] (03CR) 10jerkins-bot: [V: 04-1] New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 (owner: 10Giuseppe Lavagetto) [13:14:15] (03CR) 10jerkins-bot: [V: 04-1] Send sections_to_exclude in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [13:14:57] (03Merged) 10jenkins-bot: requestctl: add AND NOT and OR NOT to the parsing grammar [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [13:15:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:29] (03PS1) 10Btullis: Create new sudo rules to facilitate monitoring airflow [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) [13:15:42] (03Merged) 10jenkins-bot: requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) (owner: 10Giuseppe Lavagetto) [13:15:56] (03CR) 10Btullis: "check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:16:24] (03Merged) 10jenkins-bot: requestctl: add retry-after request header when applicable [software/conftool] - 
10https://gerrit.wikimedia.org/r/791006 (https://phabricator.wikimedia.org/T305824) (owner: 10Giuseppe Lavagetto) [13:17:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1007.eqiad.wmnet with OS buster [13:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:46] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791322 (https://phabricator.wikimedia.org/T308186) (owner: 10Kosta Harlan) [13:19:31] !log tgr@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:19] !log tgr@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:57] (03PS1) 10Slyngshede: Move rabbitmq to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) [13:22:51] (03CR) 10Jbond: [C: 03+2] C:ssh::server: migrate ssh_server_config to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/791346 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [13:22:55] (03PS1) 10Majavah: kubeadm: Disable TTLAfterFinished feature gate [puppet] - 10https://gerrit.wikimedia.org/r/791368 [13:23:00] (03PS1) 10Jbond: C:ssh::client: Add profile::ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791369 (https://phabricator.wikimedia.org/T307565) [13:23:57] !log tgr@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35210/console" [puppet] - 10https://gerrit.wikimedia.org/r/791369 (https://phabricator.wikimedia.org/T307565) 
(owner: 10Jbond) [13:24:30] (03CR) 10Aqu: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [13:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27808 and previous config saved to /var/cache/conftool/dbconfig/20220512-132434-root.json [13:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:42] (03PS2) 10Jbond: C:ssh::client: Add profile::ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791369 (https://phabricator.wikimedia.org/T307565) [13:26:20] !log tgr@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [13:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:40] (03PS2) 10Slyngshede: Move rabbitmq to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) [13:26:45] (03PS1) 10Giuseppe Lavagetto: varnish: annotate X-Analytics header with matching requestctl actions [puppet] - 10https://gerrit.wikimedia.org/r/791372 (https://phabricator.wikimedia.org/T305582) [13:26:47] (03PS1) 10Giuseppe Lavagetto: varnish: set retry-after based on throttle duration in requestctl [puppet] - 10https://gerrit.wikimedia.org/r/791373 (https://phabricator.wikimedia.org/T305824) [13:27:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35212/console" [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:28:26] !log tgr@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:23] (03PS1) 10Klausman: hiera: Use celery v5 on ores1009 [puppet] - 10https://gerrit.wikimedia.org/r/791374 
(https://phabricator.wikimedia.org/T303801) [13:29:59] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1009.eqiad.wmnet with OS buster [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:15] !log tgr@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [13:30:21] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores1009 [puppet] - 10https://gerrit.wikimedia.org/r/791374 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [13:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:33] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] Send sections_to_exclude in the POST body [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791251 (https://phabricator.wikimedia.org/T308186) (owner: 10Gergő Tisza) [13:34:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1164.eqiad.wmnet with OS bullseye [13:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:43] (03CR) 10Btullis: "Adding SREs from o11y for a review of the nagios/nrpe/sudo changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:35:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:54] (03CR) 10Jbond: [C: 03+2] C:ssh::client: Add profile::ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791369 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [13:36:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:36:26] (03CR) 10David Caro: [C: 03+2] kubeadm: Disable TTLAfterFinished feature gate [puppet] - 10https://gerrit.wikimedia.org/r/791368 (owner: 10Majavah) [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:07] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/AddLink/ServiceLinkRecommendationProvider.php: Backport: [[gerrit:791251|Send sections_to_exclude in the POST body (T308186)]] (duration: 00m 49s) [13:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:11] T308186: Support long section exclusion lists for link recommendations - https://phabricator.wikimedia.org/T308186 [13:38:05] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35213/console" [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [13:38:23] !log EU mid-day deploys done [13:38:26] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27809 and previous config saved to /var/cache/conftool/dbconfig/20220512-133938-root.json [13:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:11] (03PS1) 10Slyngshede: Move the query service autodeploy to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/791376 (https://phabricator.wikimedia.org/T273673) [13:40:32] (03CR) 10Ladsgroup: Enable "upload_by_url" feature on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [13:40:48] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791377 (https://phabricator.wikimedia.org/T303171) [13:41:05] 10SRE, 10ops-eqiad, 10DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (10Marostegui) Thanks Chris [13:41:25] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/791377 (https://phabricator.wikimedia.org/T303171) (owner: 10Marostegui) [13:42:26] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:22] (03PS2) 10Slyngshede: Unused manifest and script deleted as part of cronjob cleanup. 
[puppet] - 10https://gerrit.wikimedia.org/r/791376 (https://phabricator.wikimedia.org/T273673) [13:45:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage [13:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] !log installing ffmpeg security updates [13:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:52:01] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1009.eqiad.wmnet with reason: host reimage [13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27811 and previous config saved to /var/cache/conftool/dbconfig/20220512-135442-root.json [13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1009.eqiad.wmnet with reason: host reimage [13:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1141 depooling: Maint', diff saved to https://phabricator.wikimedia.org/P27812 and previous config saved to /var/cache/conftool/dbconfig/20220512-135848-root.json [13:58:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:31] (03CR) 10Krinkle: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [14:05:59] 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Marostegui) [14:06:10] 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Marostegui) p:05Triage→03Medium [14:06:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1164.eqiad.wmnet with OS bullseye [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27813 and previous config saved to /var/cache/conftool/dbconfig/20220512-140946-root.json [14:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:56] (03CR) 10Ottomata: "ty!" 
[puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [14:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P27814 and previous config saved to /var/cache/conftool/dbconfig/20220512-141042-root.json [14:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:24:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27815 and previous config saved to /var/cache/conftool/dbconfig/20220512-142450-root.json [14:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:24] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1009.eqiad.wmnet with OS buster [14:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P27816 and previous config saved to /var/cache/conftool/dbconfig/20220512-142546-root.json [14:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:53] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:30:39] (03CR) 10Ottomata: Add profile::hadoop:spark3 class and resources (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [14:31:18] (03CR) 10Ottomata: Add profile::hadoop:spark3 class and resources (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [14:31:49] (03CR) 10Volans: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [14:33:10] !log razzi@cumin1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:35] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Cmjohnson) @thanks marostegui, the power cord is probably not seated correctly [14:36:02] (03PS1) 10Urbanecm: Initial configuration for T305279 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791385 (https://phabricator.wikimedia.org/T305279) [14:37:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:50] (03PS2) 10Urbanecm: Initial configuration for kcgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791385 (https://phabricator.wikimedia.org/T305279) [14:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After optimizing recentchanges', diff saved to https://phabricator.wikimedia.org/P27817 and previous config saved to /var/cache/conftool/dbconfig/20220512-143954-root.json [14:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Maint done', diff saved to 
https://phabricator.wikimedia.org/P27818 and previous config saved to /var/cache/conftool/dbconfig/20220512-144050-root.json [14:40:53] (03PS1) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [14:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:09] PROBLEM - Host cloudvirt2003-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:41:45] (03PS1) 10Klausman: hiera: Use celery v5 on ores1008 [puppet] - 10https://gerrit.wikimedia.org/r/791388 (https://phabricator.wikimedia.org/T303801) [14:42:46] (03CR) 10jerkins-bot: [V: 04-1] P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [14:42:53] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores1008 [puppet] - 10https://gerrit.wikimedia.org/r/791388 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [14:43:10] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1008.eqiad.wmnet with OS buster [14:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:23] (03PS2) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [14:43:54] (03PS1) 10Muehlenhoff: Add an alias to target VMs [puppet] - 10https://gerrit.wikimedia.org/r/791391 [14:44:12] !log razzi@cumin1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:21] (03CR) 10jerkins-bot: [V: 04-1] P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [14:45:29] !log installing gnupg2 updates from Bullseye point release [14:45:32] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:40] (03PS3) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [14:46:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35216/console" [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [14:47:09] RECOVERY - Host cloudvirt2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:47:54] !log razzi@cumin1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:04] !log razzi@cumin1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:11] !log razzi@cumin1001 conftool action : set/pooled=inactive; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:14] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [14:52:58] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [14:53:03] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:53:58] (03CR) 10David Caro: [C: 03+1] "LGTM, just to confirm, the output is 
pulled from syslog by rsyslog that puts in in the logfile, right?" [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:54:26] (03CR) 10David Caro: [C: 03+2] "Probably you know, but this needs editing manually the /etc/kubernetes/manifests/* right?" [puppet] - 10https://gerrit.wikimedia.org/r/791368 (owner: 10Majavah) [14:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P27819 and previous config saved to /var/cache/conftool/dbconfig/20220512-145554-root.json [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:52] (03CR) 10Volans: [C: 03+1] "LGTM, alternative query inline" [puppet] - 10https://gerrit.wikimedia.org/r/791391 (owner: 10Muehlenhoff) [14:58:10] (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: grid: add a cookbook to reboot a grid queue (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/791030 (owner: 10Majavah) [14:58:44] (03CR) 10Brennen Bearnes: GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:37] (03CR) 10Ahmon Dancy: [V: 03+1 C: 03+1] GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [15:05:07] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1008.eqiad.wmnet with reason: host reimage [15:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:06:41] !log razzi@cumin1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [15:06:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host labstore1004.eqiad.wmnet [15:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] !log razzi@cumin1001 conftool action : set/pooled=inactive; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [15:08:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1008.eqiad.wmnet with reason: host reimage [15:14:51] !log razzi@deploy1002 Started deploy [analytics/superset/deploy@09094de]: Deploy superset 1.4.2 to production [15:15:23] !log razzi@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: Deploy superset 1.4.2 to production (duration: 00m 32s) [15:18:04] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host labstore1004.eqiad.wmnet [15:18:27] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: NRPE: Command check_nfs-exportd-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:18:31] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:29:20] RECOVERY - Disk space on db2140 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2140&var-datasource=codfw+prometheus/ops [15:29:56] RECOVERY - Check systemd state on db2140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:08] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:30:36] RECOVERY - puppet last run on 
db2140 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:32:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:38:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1008.eqiad.wmnet with OS buster [15:39:15] (03PS4) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [15:40:29] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:40:45] (03CR) 10Vgutierrez: [C: 03+1] "looking good, LGTM after refactoring the $::memorysize_mb to $facts['memory']['memorysize_mb']" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [15:40:53] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) 05Open→03Resolved This is complete. 
@Andrew thanks for all your help [15:41:51] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Patch-For-Review: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) 05Stalled→03Open [15:42:15] (03PS5) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [15:42:23] (03PS1) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:42:27] (03CR) 10jerkins-bot: [V: 04-1] C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:42:31] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) 05Open→03Resolved I forgot there's a way to upgrade a virtual machine's operating system: https://wikitech.wikimedia.org/wiki/Ganeti#Reinst... 
[15:42:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] Double the number of eventgate_analytics_external replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [15:43:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host labstore1005.eqiad.wmnet [15:43:17] (03PS2) 10Btullis: Create new sudo rules to facilitate monitoring airflow [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) [15:43:25] (03CR) 10Btullis: Create new sudo rules to facilitate monitoring airflow (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [15:43:32] RECOVERY - MariaDB disk space on db2140 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:43:37] (03PS2) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:43:41] 10SRE, 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 (10Papaul) a:05Papaul→03Marostegui Upgrading the BIOS seems to have fixed the ssh issue. @Marostegui i was getting the error below when rebooting the server ` systemd-journal [301]: failed to write entry (22 item... 
[15:43:55] (03PS2) 10Hashar: gerrit: replicate to codfw with 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) [15:44:03] (03CR) 10Hashar: "Rebased to clear out a the merge conflict state in Gerrit due to a change made to hieradata/role/common/gerrit.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) (owner: 10Hashar) [15:44:10] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:11] (03CR) 10jerkins-bot: [V: 04-1] C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:44:15] (03PS1) 10Razzi: an-tool1005: set operating system image to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/791397 (https://phabricator.wikimedia.org/T301990) [15:44:19] (03PS3) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:44:23] (03PS2) 10Razzi: an-tool1005: set operating system image to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/791397 (https://phabricator.wikimedia.org/T301990) [15:44:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 (owner: 10Volans) [15:44:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/791335 (https://phabricator.wikimedia.org/T307260) (owner: 10Volans) [15:44:43] (03CR) 10jerkins-bot: [V: 04-1] C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:45:23] (03PS22) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) 
[15:45:27] (03CR) 10Jbond: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:45:31] (03PS4) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:45:35] (03CR) 10Razzi: [C: 03+1] ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 (owner: 10Hashar) [15:45:39] (03CR) 10Razzi: [C: 03+2] ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 (owner: 10Hashar) [15:45:47] (03CR) 10Razzi: [C: 03+2] docker: move pruning to new profile docker::prune [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [15:46:03] (03PS4) 10Hashar: ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 [15:46:37] (03PS15) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:47:00] (03CR) 10Volans: [C: 03+2] remote: increase reboot wait time [software/spicerack] - 10https://gerrit.wikimedia.org/r/791335 (https://phabricator.wikimedia.org/T307260) (owner: 10Volans) [15:47:07] (03PS16) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [15:47:14] (03CR) 10Jbond: C:varnish::common: Add documentation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [15:47:53] (03CR) 10Volans: [C: 03+2] ganeti: add startup method [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 (owner: 10Volans) 
[15:48:03] (03PS5) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:49:16] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host labstore1007.wikimedia.org [15:49:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35221/console" [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:50:23] (03PS6) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [15:50:34] (03CR) 10Jbond: [C: 03+2] C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [15:51:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35222/console" [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:52:05] (03CR) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [15:53:02] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10hashar) I am not sure what happened but the `deployment` group does... 
[15:53:38] (03PS7) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [15:53:52] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host labstore1005.eqiad.wmnet [15:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:11] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) I've downgraded ganeti4001 to 21.60.22.11, flashed and confirmed. @MoritzMuehlenhoff give it a shot now! [15:54:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35223/console" [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [15:55:23] (03PS1) 10Klausman: Update default celery version for ORES to v5 [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) [15:55:40] (03CR) 10jerkins-bot: [V: 04-1] ganeti: add startup method [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 (owner: 10Volans) [15:55:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35224/console" [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [15:56:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labstore1007.wikimedia.org [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:45] (03PS2) 10Volans: remote: increase reboot wait time [software/spicerack] - 10https://gerrit.wikimedia.org/r/791335 (https://phabricator.wikimedia.org/T307260) [15:57:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single 
for host labstore1006.wikimedia.org [15:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:07] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35227/console" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:00:04] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:49] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35229/console" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:01:35] (03CR) 10Jbond: elasticsearch: Java version is a fact, does not need to be a param (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:03:18] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35231/console" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:05:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host labstore1006.wikimedia.org [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:54] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:06:26] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:36] (03PS2) 
10Klausman: Update default celery version for ORES to v5 [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) [16:07:06] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:28] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35232/console" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:08:32] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35233/console" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:14:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:47] (03CR) 10Razzi: [C: 03+2] an-tool1005: set operating system image to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/791397 (https://phabricator.wikimedia.org/T301990) (owner: 10Razzi) [16:21:20] !log gitlab2001 - trying to stop 'puma' for debugging T308089 [16:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:25] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089 [16:21:58] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Metrics, 10observability, and 4 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10colewhite) >>! In T240685#7923734, @Addshore wrote: > Would anyone in the know be able to write a summary? Yes! As I understand it, the next step... 
[16:22:58] (03PS2) 10Volans: ganeti: add startup method [software/spicerack] - 10https://gerrit.wikimedia.org/r/791336 [16:27:43] (03CR) 10Dzahn: "out of 22 ores* hosts, 17 have celery 5.2.6, 4 (the poolcounters) don't have celery and one, ores1006, has celery 4.1.1. ores1006 does not" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:28:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:28:41] klausman: hi! is ores1006 a special case? [16:30:09] (03CR) 10Klausman: [V: 03+1] Update default celery version for ORES to v5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:30:34] (03CR) 10Dzahn: [C: 03+1] "gotcha!:)" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:31:13] mutante: only in that it is the last in line :) [16:32:44] I dunno why there is a "double default", but I'll clean that up (along with the special code for celery 4) once I have talked with Luca about this [16:33:01] (03CR) 10Dzahn: [C: 03+1] "fwiw, I checked with: sudo cumin 'ores*' '/srv/deployment/ores/deploy/venv/bin/celery report'" [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:33:36] klausman: gotcha! 
I didn't mean to get in the way, just got curious how to check versions [16:33:46] PROBLEM - Memcached on an-tool1005 is CRITICAL: connect to address 10.64.36.117 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:33:48] No problem at all :) [16:35:06] !log klausman@cumin1001 START - Cookbook sre.hosts.reimage for host ores1006.eqiad.wmnet with OS buster [16:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:30] (03CR) 10Klausman: [V: 03+1 C: 03+2] Update default celery version for ORES to v5 [puppet] - 10https://gerrit.wikimedia.org/r/791401 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [16:37:36] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:40:52] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:28] PROBLEM - puppet last run on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:42:24] PROBLEM - Disk space on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-tool1005&var-datasource=eqiad+prometheus/ops [16:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:46:30] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:35] (03PS15) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:50:10] btullis, razzi: FYI an-worker1081 raid's write cache policy above (from icinga-wm) ^^^ [16:50:36] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:50:39] volans: OK thanks. Looking now. [16:51:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35234/console" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:51:58] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [16:52:10] (03CR) 10Joal: Add profile::hadoop:spark3 class and resources (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [16:52:25] (03PS2) 10Stang: Enable "upload_by_url" feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) [16:53:00] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Attempting OS upgrade [16:53:01] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-tool1005.eqiad.wmnet with reason: Attempting OS upgrade 
[16:53:02] (03PS2) 10Joal: Add profile::hadoop:spark3 class and resources [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) [16:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:24] (03PS16) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [16:57:21] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores1006.eqiad.wmnet with reason: host reimage [16:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores1006.eqiad.wmnet with reason: host reimage [17:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [17:02:25] RECOVERY - Disk space on an-tool1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-tool1005&var-datasource=eqiad+prometheus/ops [17:03:53] (03CR) 10Bking: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35235/console" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [17:04:24] 10SRE, 10Data-Persistence-Backup, 10database-backups, 10observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (10jcrespo) This is weird- backups happen every day. Alerts shouldn't happen (although it is not an immediate issue, as they don't serve user requests).... 
[17:05:09] 10SRE, 10Data-Persistence-Backup, 10database-backups, 10observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (10jcrespo) [17:08:34] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye [17:08:38] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye [17:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:49] (03PS1) 10Dzahn: gitlab: use 'gracefull-kill' instead of stop to stop puma [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) [17:09:02] (03PS2) 10Dzahn: gitlab: use 'gracefull-kill' instead of stop to stop puma [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) [17:09:08] (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791411 [17:09:15] PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.004 second response time https://wikitech.wikimedia.org/wiki/Thanos [17:09:34] (03CR) 10Bking: [V: 03+1] "Updated per Jbond's suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [17:09:46] (03CR) 10jerkins-bot: [V: 04-1] gitlab: use 'gracefull-kill' instead of stop to stop puma [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) (owner: 10Dzahn) [17:09:56] 10SRE, 10Data-Persistence-Backup, 10database-backups, 10observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (10jcrespo) p:05Triage→03High There is definitely an actionable to do here- we should reduce the 
load of the server and move away at least one sectio... [17:10:35] RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/Thanos [17:10:45] 10SRE, 10Data-Persistence-Backup, 10database-backups, 10observability: Icinga db alerts during backup/restore - https://phabricator.wikimedia.org/T307639 (10jcrespo) [17:10:57] (03PS3) 10Dzahn: gitlab: use 'gracefull-kill' instead of stop to stop puma [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) [17:12:11] ACKNOWLEDGEMENT - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Created ticket T308267 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:13:32] volans: thanks again Have created T308267 to investigate further. 
[17:13:32] T308267: RAID battery malfunction in an-worker1081 - https://phabricator.wikimedia.org/T308267 [17:17:16] (03PS17) 10Bking: elasticsearch: get java version from java class [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [17:18:45] (03CR) 10Ryan Kemper: [C: 03+1] elasticsearch: get java version from java class [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [17:20:47] (03PS2) 10Dzahn: define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - 10https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171) [17:21:00] !log razzi@deploy1002 Started deploy [analytics/turnilo/deploy@9cfdfaf]: (no justification provided) [17:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:13] (03CR) 10Bking: [C: 03+2] elasticsearch: get java version from java class [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [17:21:27] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:13] (03PS4) 10Dzahn: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) [17:22:25] (03CR) 10Dzahn: [C: 03+2] move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:22:28] (03PS1) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [17:23:50] (03PS2) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service 
[puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [17:24:24] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791411 (owner: 10Jgiannelos) [17:24:29] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) Ok, since the last update, it now shows no connection on the switch. However, previous firmware versions all successfully had a link light on the switch (I checked when I updated to the n... [17:24:42] (03PS3) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [17:24:47] (03Merged) 10jenkins-bot: move html and httpd config for 15.wp to own directory, reorganize variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791097 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:24:56] (03PS4) 10Jcrespo: dbbackups: Setup backupmon1001 as a database backups monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/791414 (https://phabricator.wikimedia.org/T283017) [17:25:33] (03PS3) 10Dzahn: define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - 10https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171) [17:25:37] (03CR) 10Dzahn: [C: 03+2] define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - 10https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:26:35] !log jmm@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti4001.ulsfo.wmnet with OS bullseye [17:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:41] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - 
https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors: - ganeti4001 (**FAIL**) - Removed from... [17:28:55] (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/791411 (owner: 10Jgiannelos) [17:31:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores1006.eqiad.wmnet with OS buster [17:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:04] (03Merged) 10jenkins-bot: define entrypoint only once instead of in each variant, simplify test variant [container/miscweb] - 10https://gerrit.wikimedia.org/r/791094 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [17:34:28] (03PS4) 10BryanDavis: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [17:37:24] (03PS3) 10Dzahn: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 [17:37:34] (03CR) 10jerkins-bot: [V: 04-1] rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (owner: 10Dzahn) [17:39:42] (03PS4) 10Dzahn: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 [17:40:03] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:41:09] (03CR) 10Majavah: [C: 04-1] "first pass, few notes inline" [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:42:31] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:43:15] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:39] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:16] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:50] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:02] (03CR) 10jerkins-bot: [V: 04-1] rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (owner: 10Dzahn) [17:47:37] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] (03CR) 10Majavah: kubeadm: Disable TTLAfterFinished feature gate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791368 (owner: 10Majavah) [17:50:12] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:21] (03PS5) 10Dzahn: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (https://phabricator.wikimedia.org/T300171) [17:50:32] !log razzi@deploy1002 Finished deploy [analytics/turnilo/deploy@9cfdfaf]: (no justification provided) (duration: 29m 32s) [17:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:36] !log razzi@deploy1002 Started deploy [analytics/turnilo/deploy@5047d7d]: (no justification provided) [17:50:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:45] !log razzi@deploy1002 Finished deploy [analytics/turnilo/deploy@5047d7d]: (no justification provided) (duration: 00m 08s) [17:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:59] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:37] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4001.ulsfo.wmnet with OS bullseye [17:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:42] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye [17:52:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:09] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4001.ulsfo.wmnet with reason: host reimage [18:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4001.ulsfo.wmnet with reason: host reimage [18:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:01] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:17:11] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: 
CRITICAL - failed 294 probes of 667 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:18:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 91 probes of 667 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:21:45] (03PS4) 10Krinkle: static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) [18:21:58] (03CR) 10Krinkle: [C: 03+2] static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [18:22:39] (03Merged) 10jenkins-bot: static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [18:22:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 58 probes of 667 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:24:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 69 probes of 667 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:25:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:17] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/mwdebug: apply [18:26:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4001.ulsfo.wmnet with OS bullseye [18:26:56] !log krinkle@deploy1002 Synchronized w/static.php: Ic0a5eae4f721a16403071d1b2136cf23d78e4fa9 (duration: 00m 49s) [18:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35236/console" [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:27:04] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye completed: - ganeti4001 (**PASS**) - Removed from Puppet and... 
[18:27:23] (03CR) 10CDanis: "looks good but fix flake8" [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) (owner: 10Giuseppe Lavagetto) [18:28:41] (03CR) 10CDanis: [C: 03+1] requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 (owner: 10Giuseppe Lavagetto) [18:28:50] (03CR) 10CDanis: [C: 03+1] New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 (owner: 10Giuseppe Lavagetto) [18:29:44] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) Summary of work: * initially asked to update firmware, updated to the very latest 22.00.07.60 ** this introduced a new pxe boot failure known via T304483, Moritz requested I match to the... [18:30:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:55] (03PS1) 10Stang: ruwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791424 (https://phabricator.wikimedia.org/T308233) [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:46] (03CR) 10Jelto: [V: 03+1 C: 03+1] "Technically that looks fine for me beside one small typo." 
[puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:40:18] (03CR) 10Jelto: [V: 03+1 C: 03+1] GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:43:02] (03CR) 10Brennen Bearnes: GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:43:38] (03CR) 10Brennen Bearnes: GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:47:44] (03CR) 10Brennen Bearnes: GitLab: enable container registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:56:19] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:55] !log restart gitlab2001 [18:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:18:05] (03CR) 10Jelto: [C: 03+1] "looks like a good solution to get backup and restore working again in T308089." [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) (owner: 10Dzahn) [19:34:20] (03CR) 10Dzahn: "thanks! I will deploy it. 
And I agree it would be even nicer to know the real reason but then also it's not just we will always kill it wi" [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) (owner: 10Dzahn) [19:38:33] (03CR) 10Dzahn: [C: 03+2] gerrit: replicate to codfw with 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) (owner: 10Hashar) [19:40:03] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:43:28] (03CR) 10Dzahn: [C: 03+2] gitlab: use 'gracefull-kill' instead of stop to stop puma [puppet] - 10https://gerrit.wikimedia.org/r/791410 (https://phabricator.wikimedia.org/T308089) (owner: 10Dzahn) [19:43:51] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:05] (03PS1) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) [19:47:11] (03CR) 10MewOphaswongse: Account creation: update live campaigns config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [19:50:02] (03PS1) 10Ssingh: dnsdist: update docstrings to use YARD-style tags [puppet] - 10https://gerrit.wikimedia.org/r/791429 [19:50:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35237/console" [puppet] - 10https://gerrit.wikimedia.org/r/791429 (owner: 10Ssingh) [19:51:12] (03PS1) 10Stang: viwiki: Enable "upload_by_url" for sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791430 (https://phabricator.wikimedia.org/T303577) [19:51:40] (NodeTextfileStale) firing: Stale textfile for 
cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:52:12] (03CR) 10Ottomata: "One more nit about a comment, +1 though!" [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [19:53:10] !log gitlab2001 - systemctl start backup-restore - systemd[1]: Started GitLab Backup Restore. after gerrit:791410 for T308089 [19:53:10] (03CR) 10Ssingh: [V: 03+1] "I have taken some liberty with specifying the data types in the docstrings; not sure if there is a better/recommended way but at least for" [puppet] - 10https://gerrit.wikimedia.org/r/791429 (owner: 10Ssingh) [19:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:16] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089 [19:57:12] !log Restarting Gerrit [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] brennen: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220512T2000). [20:00:05] koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:22] I'm here [20:00:45] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35238/console" [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [20:01:12] hey koi [20:01:17] howdy [20:02:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:05:29] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35239/console" [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [20:05:52] Hi! Greeting thcipriani and brennen [20:06:13] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [20:08:03] hey koi - we're experimenting with some new tooling again, apologies this is a bit slower than usual [20:09:14] sorry to hear that :( Is it related to toolforge maintenance today? [20:12:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:15:25] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [20:15:44] koi: no, we're just working on some improvements to deployment tooling. 
[20:16:18] (03Merged) 10jenkins-bot: Enable "upload_by_url" feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [20:17:01] !log brennen@deploy1002 prep aborted: (duration: 00m 01s) [20:17:01] !log brennen@deploy1002 backport aborted: (duration: 02m 05s) [20:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:51] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [20:18:17] aha? [20:18:25] (03CR) 10MewOphaswongse: [C: 04-1] Account creation: update live campaigns config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [20:20:03] koi: is on mwdebug1002 [20:20:12] looking [20:21:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:51] lgtm [20:22:10] koi: cool, syncing [20:22:28] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [20:22:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:48] there are some kinks to work out. 
:) [20:23:15] !log brennen@deploy1002 Started scap: Backport for [[gerrit:785229]] Enable "upload_by_url" feature on zhwiki [20:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:51] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:25:02] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:785229]] Enable "upload_by_url" feature on zhwiki (duration: 01m 46s) [20:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:59] (03PS2) 10Brennen Bearnes: ruwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791424 (https://phabricator.wikimedia.org/T308233) (owner: 10Stang) [20:26:23] stashbot not working on task :( [20:26:23] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[20:28:08] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791424 (https://phabricator.wikimedia.org/T308233) (owner: 10Stang) [20:29:03] (03Merged) 10jenkins-bot: ruwiktionary: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791424 (https://phabricator.wikimedia.org/T308233) (owner: 10Stang) [20:29:15] (03CR) 10Dzahn: [C: 03+2] rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:29:54] koi: on mwdebug1002 again [20:30:30] looks good [20:30:49] syncing [20:31:46] !log brennen@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-ru.svg: Config: [[gerrit:791424|ruwiktionary: Add localized mobile wordmark (T308233)]] (duration: 00m 49s) [20:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] T308233: Add localized wordmark to ruwiktionary mobile frontend - https://phabricator.wikimedia.org/T308233 [20:32:20] (03CR) 10BryanDavis: "PCC is finally passing everywhere: https://puppet-compiler.wmflabs.org/pcc-worker1003/35240/" [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [20:32:44] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791424|ruwiktionary: Add localized mobile wordmark (T308233)]] (duration: 00m 50s) [20:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:18] (03PS2) 10Brennen Bearnes: viwiki: Enable "upload_by_url" for sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791430 (https://phabricator.wikimedia.org/T303577) (owner: 10Stang) [20:33:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:34:40] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791430 (https://phabricator.wikimedia.org/T303577) (owner: 10Stang) [20:34:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:50] (03CR) 10Dzahn: [C: 03+1] "simple enough. does not touch existing code." [puppet] - 10https://gerrit.wikimedia.org/r/791327 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [20:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:29] (03Merged) 10jenkins-bot: viwiki: Enable "upload_by_url" for sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791430 (https://phabricator.wikimedia.org/T303577) (owner: 10Stang) [20:35:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:29] koi: last one on mwdebug1002 [20:37:02] Also looks good [20:37:05] syncing [20:37:10] (03CR) 10Brennen Bearnes: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791430 (https://phabricator.wikimedia.org/T303577) (owner: 10Stang) [20:37:30] (03Merged) 10jenkins-bot: rename the production variant to bzstatic [container/miscweb] - 10https://gerrit.wikimedia.org/r/791072 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:37:41] !log brennen@deploy1002 Started scap: Backport for [[gerrit:791430]] viwiki: Enable "upload_by_url" for sysop [20:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:18] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:791430]] viwiki: Enable "upload_by_url" for sysop (duration: 01m 36s) [20:39:21] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:06] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) 05Open→03Resolved (Fake Stashbot): {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/ZFTwuYAB6FQ6iqKio_4O} [202... [20:43:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:47] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:44:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:01] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:17] !log krinkle@mwmaint1002$ mwscript refreshLinks.php --wiki commonswiki --category 'Media_needing_categories_requiring_human_attention' (approximately 2000 tiny pages) [20:50:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:54] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) cp3060 refuses to load its idrac https interface, even when i clear browser history and do a racreset on the idrac interface, skipping it and continuing wi... [20:59:15] !log utc late backport & config window closed [20:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:43] (03PS5) 10BryanDavis: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [21:03:45] (03PS1) 10BryanDavis: striker: update codfw1dev openstack endpoint name [puppet] - 10https://gerrit.wikimedia.org/r/791456 [21:04:17] (03PS3) 10Ottomata: Add profile::hadoop:spark3 class and resources [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [21:06:37] (03PS4) 10Ottomata: Add profile::hadoop:spark3 class and resources [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [21:07:20] (03CR) 10Krinkle: [C: 03+1] TimedMediaHandler: Disabled the BetaFeature from wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788385 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [21:07:39] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35242/console" [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [21:08:08] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Add profile::hadoop:spark3 class and resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [21:08:11] (03CR) 10Krinkle: [C: 03+1] TimedMediaHandler: Drop Beta Feature, no 
longer usable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612350 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [21:08:21] (03CR) 10Krinkle: [C: 03+1] TimedMediaHandler: Don't read wmgTmhWebPlayer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612351 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [21:08:34] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add profile::hadoop:spark3 class and resources [puppet] - 10https://gerrit.wikimedia.org/r/791323 (https://phabricator.wikimedia.org/T295072) (owner: 10Joal) [21:10:14] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1003/2/" [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [21:10:35] (03CR) 10Ahmon Dancy: buildkitd: Provide buildkitd image for trusted GitLab runners (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [21:10:41] (03CR) 10BryanDavis: striker: Add profile to provision docker container (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [21:12:14] (03CR) 10Krinkle: [C: 03+1] TimedMediaHandler: Drop pre-switch config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612352 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [21:12:51] (03PS1) 10Ottomata: Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) [21:13:14] (03PS2) 10Ottomata: Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) [21:13:16] (03CR) 10jerkins-bot: [V: 04-1] Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) (owner: 10Ottomata) [21:13:49] (03CR) 10jerkins-bot: [V: 
04-1] Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) (owner: 10Ottomata) [21:14:55] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [21:19:06] (03CR) 10MewOphaswongse: [C: 04-1] "Apologies for the multi-part comment, looking at https://phabricator.wikimedia.org/T303785, it turns out we won't need skipWelcomeSurvey f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [21:23:38] (03CR) 10BryanDavis: "Majavah pointed out this stale value in his comments on Idf0b0e3c0149ee1d7f3e4931a856fc4d991344d2." [puppet] - 10https://gerrit.wikimedia.org/r/791456 (owner: 10BryanDavis) [21:24:14] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) The Junos image is now on both pfw ` root@pfw3-eqiad% ls /var/tmp/junos-srxentedge-x86-64-20.4R3-S1.3.tgz /var/tmp/junos-srxentedge-x86-64-20.4R3-S1.3.tgz... 
[21:31:08] (03PS1) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [21:34:45] (03CR) 10jerkins-bot: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [21:35:19] (03PS1) 10Cathal Mooney: Automation changes to support new cloudsw configuration / vrf [homer/public] - 10https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) [21:36:57] (03CR) 10jerkins-bot: [V: 04-1] Automation changes to support new cloudsw configuration / vrf [homer/public] - 10https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [21:38:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 39.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:38:33] (03CR) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [21:38:35] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 46.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:38:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 14.18 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:40:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 81.68 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:40:51] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 72.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:40:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:41:04] (03PS2) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) [21:41:06] (03PS2) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [21:42:38] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [21:43:30] (03CR) 10jerkins-bot: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [21:48:27] (03PS2) 10Cathal Mooney: Automation changes to support new cloudsw configuration / vrf [homer/public] - 10https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) [21:49:16] (03CR) 10jerkins-bot: [V: 04-1] Automation changes to support new cloudsw configuration / vrf [homer/public] - 10https://gerrit.wikimedia.org/r/791460 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [21:50:29] (03PS3) 10Ottomata: Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 
(https://phabricator.wikimedia.org/T295072) [21:50:47] (03PS4) 10Ottomata: Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) [21:50:50] (03CR) 10jerkins-bot: [V: 04-1] Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) (owner: 10Ottomata) [21:50:54] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Ensure spark3 conf dir exists [puppet] - 10https://gerrit.wikimedia.org/r/791457 (https://phabricator.wikimedia.org/T295072) (owner: 10Ottomata) [21:53:55] !log razzi@deploy1002 Started deploy [analytics/turnilo/deploy@a2bdc3e]: (no justification provided) [21:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:03] !log razzi@deploy1002 Finished deploy [analytics/turnilo/deploy@a2bdc3e]: (no justification provided) (duration: 02m 08s) [21:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:08] (03PS3) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [22:00:35] (03CR) 10jerkins-bot: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [22:05:43] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:07:54] (03PS1) 10Dzahn: setup build pipelines for bzstatic and fifteenwp variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791462 (https://phabricator.wikimedia.org/T300171) [22:11:35] (03CR) 10Ahmon Dancy: buildkitd: Provide buildkitd image for trusted GitLab runners (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [22:11:56] (03PS2) 10Dzahn: setup build pipelines for 
bzstatic and fifteenwp variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791462 (https://phabricator.wikimedia.org/T300171) [22:14:35] (03CR) 10Ahmon Dancy: [V: 04-1] "On my system" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [22:18:24] (03CR) 10Dzahn: "recheck" [container/miscweb] - 10https://gerrit.wikimedia.org/r/791462 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [22:20:21] (03CR) 10Dzahn: [C: 03+2] setup build pipelines for bzstatic and fifteenwp variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791462 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [22:25:56] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [22:28:22] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) p:05High→03Low all done but cp3060 which refuses to pull up https on idrac for the firmware flash, it just endlessly loads the login screen until timeo... 
[22:28:46] (03Merged) 10jenkins-bot: setup build pipelines for bzstatic and fifteenwp variants [container/miscweb] - 10https://gerrit.wikimedia.org/r/791462 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:51:51] (03CR) 10Ahmon Dancy: [V: 04-1] buildkitd: Provide buildkitd image for trusted GitLab runners (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:47:07] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale