[00:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:02:56] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:28] 10ops-eqiad, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in eqiad - https://phabricator.wikimedia.org/T394109#10818933 (10Papaul) We are still have some issues with msw-d1/d4 and d6 I think it has to do with the way the cabling is done or misconfiguration on the msw1-eqiad . looking at icinga i see 3 p...
[00:08:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1145356
[00:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1145356 (owner: 10TrainBranchBot)
[00:08:42] (03PS16) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[00:22:04] (03PS17) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[00:25:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10818937 (10Papaul) @Jclark-ctr I think @MatthewVernon is the best person for that question since he knows better then I how that server will be configured. Thanks
[00:29:06] (03PS18) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[00:29:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1145356 (owner: 10TrainBranchBot)
[00:30:11] (03CR) 10CI reject: [V:04-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[00:32:33] (03PS19) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[00:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:42:06] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[00:46:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[02:47:29] 10ops-codfw, 06DC-Ops: Unresponsive management for db2184.mgmt:22 - https://phabricator.wikimedia.org/T394118 (10phaultfinder) 03NEW
[02:47:29] 10ops-codfw, 06DC-Ops: Unresponsive management for mc2048.mgmt:22 - https://phabricator.wikimedia.org/T394119 (10phaultfinder) 03NEW
[02:47:30] 10ops-codfw, 06DC-Ops: Unresponsive management for ms-be2058.mgmt:22 - https://phabricator.wikimedia.org/T394122 (10phaultfinder) 03NEW
[02:47:31] 10ops-codfw, 06DC-Ops: Unresponsive management for backup2003.mgmt:22 - https://phabricator.wikimedia.org/T394120 (10phaultfinder) 03NEW
[02:47:32] 10ops-codfw, 06DC-Ops: Unresponsive management for logstash2035.mgmt:22 - https://phabricator.wikimedia.org/T394121 (10phaultfinder) 03NEW
[02:48:25] 10ops-codfw, 06DC-Ops: Unresponsive management for ms-be2072.mgmt:22 - https://phabricator.wikimedia.org/T394125 (10phaultfinder) 03NEW
[02:48:26] 10ops-codfw, 06DC-Ops: Unresponsive management for ms-backup2001.mgmt:22 - https://phabricator.wikimedia.org/T394124 (10phaultfinder) 03NEW
[02:48:27] 10ops-codfw, 06DC-Ops: Unresponsive management for kafka-stretch2001.mgmt:22 - https://phabricator.wikimedia.org/T394123 (10phaultfinder) 03NEW
[02:48:35] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-c6-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-c6-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[02:49:29] 10ops-codfw, 06DC-Ops: Unresponsive management for ms-be2064.mgmt:22 - https://phabricator.wikimedia.org/T394126 (10phaultfinder) 03NEW
[02:50:23] 10ops-codfw, 06DC-Ops: Unresponsive management for mc2047.mgmt:22 - https://phabricator.wikimedia.org/T394127 (10phaultfinder) 03NEW
[02:50:24] 10ops-codfw, 06DC-Ops: Unresponsive management for wdqs2011.mgmt:22 - https://phabricator.wikimedia.org/T394128 (10phaultfinder) 03NEW
[03:29:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:34:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:02:56] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:26:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:46:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:48:27] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1165.mgmt:22 - https://phabricator.wikimedia.org/T394133 (10phaultfinder) 03NEW [04:48:28] 10ops-eqiad, 06DC-Ops: Unresponsive management for conf1009.mgmt:22 - https://phabricator.wikimedia.org/T394136 (10phaultfinder) 03NEW [04:48:29] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1097.mgmt:22 - https://phabricator.wikimedia.org/T394135 (10phaultfinder) 03NEW [04:48:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1168.mgmt:22 - https://phabricator.wikimedia.org/T394134 (10phaultfinder) 03NEW [04:48:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1164.mgmt:22 - https://phabricator.wikimedia.org/T394138 (10phaultfinder) 03NEW [04:48:31] 10ops-eqiad, 06DC-Ops: Unresponsive management for kafka-main1009.mgmt:22 - https://phabricator.wikimedia.org/T394137 (10phaultfinder) 03NEW [04:48:35] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1045.mgmt:22 - https://phabricator.wikimedia.org/T394140 (10phaultfinder) 03NEW [04:48:39] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudbackup1004.mgmt:22 - https://phabricator.wikimedia.org/T394139 (10phaultfinder) 03NEW [04:48:43] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1110.mgmt:22 - https://phabricator.wikimedia.org/T394141 (10phaultfinder) 03NEW [04:48:47] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1108.mgmt:22 - https://phabricator.wikimedia.org/T394143 (10phaultfinder) 03NEW [04:48:51] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
wikikube-worker1019.mgmt:22 - https://phabricator.wikimedia.org/T394142 (10phaultfinder) 03NEW [04:48:55] 10ops-eqiad, 06DC-Ops: Unresponsive management for kubestage1004.mgmt:22 - https://phabricator.wikimedia.org/T394146 (10phaultfinder) 03NEW [04:48:59] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1232.mgmt:22 - https://phabricator.wikimedia.org/T394145 (10phaultfinder) 03NEW [04:49:03] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T394144 (10phaultfinder) 03NEW [04:49:07] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1163.mgmt:22 - https://phabricator.wikimedia.org/T394147 (10phaultfinder) 03NEW [04:49:11] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirtlocal1001.mgmt:22 - https://phabricator.wikimedia.org/T394149 (10phaultfinder) 03NEW [04:49:15] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1159.mgmt:22 - https://phabricator.wikimedia.org/T394150 (10phaultfinder) 03NEW [04:49:19] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1036.mgmt:22 - https://phabricator.wikimedia.org/T394152 (10phaultfinder) 03NEW [04:49:23] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1070.mgmt:22 - https://phabricator.wikimedia.org/T394151 (10phaultfinder) 03NEW [04:49:27] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1032.mgmt:22 - https://phabricator.wikimedia.org/T394153 (10phaultfinder) 03NEW [04:49:31] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1249.mgmt:22 - https://phabricator.wikimedia.org/T394154 (10phaultfinder) 03NEW [04:49:36] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1223.mgmt:22 - https://phabricator.wikimedia.org/T394155 (10phaultfinder) 03NEW [04:49:40] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1013.mgmt:22 - https://phabricator.wikimedia.org/T394156 (10phaultfinder) 03NEW [04:49:44] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
wikikube-worker1020.mgmt:22 - https://phabricator.wikimedia.org/T394157 (10phaultfinder) 03NEW [04:49:48] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1041.mgmt:22 - https://phabricator.wikimedia.org/T394159 (10phaultfinder) 03NEW [04:49:52] 10ops-eqiad, 06DC-Ops: Unresponsive management for an-druid1005.mgmt:22 - https://phabricator.wikimedia.org/T394158 (10phaultfinder) 03NEW [04:49:56] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1047.mgmt:22 - https://phabricator.wikimedia.org/T394163 (10phaultfinder) 03NEW [04:50:00] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1046.mgmt:22 - https://phabricator.wikimedia.org/T394162 (10phaultfinder) 03NEW [04:50:04] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1024.mgmt:22 - https://phabricator.wikimedia.org/T394165 (10phaultfinder) 03NEW [04:50:08] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1012.mgmt:22 - https://phabricator.wikimedia.org/T394164 (10phaultfinder) 03NEW [04:50:12] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1173.mgmt:22 - https://phabricator.wikimedia.org/T394172 (10phaultfinder) 03NEW [04:50:16] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1069.mgmt:22 - https://phabricator.wikimedia.org/T394170 (10phaultfinder) 03NEW [04:50:20] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1167.mgmt:22 - https://phabricator.wikimedia.org/T394166 (10phaultfinder) 03NEW [04:50:24] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1071.mgmt:22 - https://phabricator.wikimedia.org/T394169 (10phaultfinder) 03NEW [04:50:28] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1039.mgmt:22 - https://phabricator.wikimedia.org/T394171 (10phaultfinder) 03NEW [04:50:32] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1038.mgmt:22 - https://phabricator.wikimedia.org/T394167 (10phaultfinder) 03NEW [04:50:36] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
cloudcephosd1011.mgmt:22 - https://phabricator.wikimedia.org/T394168 (10phaultfinder) 03NEW [04:50:40] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudservices1005.mgmt:22 - https://phabricator.wikimedia.org/T394175 (10phaultfinder) 03NEW [04:50:44] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1020.mgmt:22 - https://phabricator.wikimedia.org/T394177 (10phaultfinder) 03NEW [04:50:48] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1109.mgmt:22 - https://phabricator.wikimedia.org/T394174 (10phaultfinder) 03NEW [04:50:52] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1042.mgmt:22 - https://phabricator.wikimedia.org/T394176 (10phaultfinder) 03NEW [04:50:56] 10ops-eqiad, 06DC-Ops: Unresponsive management for pki-root1001.mgmt:22 - https://phabricator.wikimedia.org/T394173 (10phaultfinder) 03NEW [04:51:00] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1037.mgmt:22 - https://phabricator.wikimedia.org/T394184 (10phaultfinder) 03NEW [04:51:04] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1096.mgmt:22 - https://phabricator.wikimedia.org/T394183 (10phaultfinder) 03NEW [04:51:08] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1222.mgmt:22 - https://phabricator.wikimedia.org/T394180 (10phaultfinder) 03NEW [04:51:13] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1015.mgmt:22 - https://phabricator.wikimedia.org/T394181 (10phaultfinder) 03NEW [04:51:17] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1023.mgmt:22 - https://phabricator.wikimedia.org/T394178 (10phaultfinder) 03NEW [04:51:21] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1107.mgmt:22 - https://phabricator.wikimedia.org/T394179 (10phaultfinder) 03NEW [04:51:25] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1068.mgmt:22 - https://phabricator.wikimedia.org/T394182 (10phaultfinder) 03NEW [04:51:29] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
gitlab-runner1004.mgmt:22 - https://phabricator.wikimedia.org/T394186 (10phaultfinder) 03NEW [04:51:33] 10ops-eqiad, 06DC-Ops: Unresponsive management for dbstore1007.mgmt:22 - https://phabricator.wikimedia.org/T394188 (10phaultfinder) 03NEW [04:51:37] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394190 (10phaultfinder) 03NEW [04:51:41] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394196 (10phaultfinder) 03NEW [04:51:45] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394187 (10phaultfinder) 03NEW [04:51:49] 10ops-eqiad, 06DC-Ops: Unresponsive management for an-test-coord1001.mgmt:22 - https://phabricator.wikimedia.org/T394197 (10phaultfinder) 03NEW [04:51:54] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1014.mgmt:22 - https://phabricator.wikimedia.org/T394198 (10phaultfinder) 03NEW [04:51:58] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1034.mgmt:22 - https://phabricator.wikimedia.org/T394191 (10phaultfinder) 03NEW [04:52:02] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394189 (10phaultfinder) 03NEW [04:52:06] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394185 (10phaultfinder) 03NEW [04:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:52:10] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394192 (10phaultfinder) 03NEW [04:52:14] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394194 (10phaultfinder) 03NEW [04:52:18] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394199 (10phaultfinder) 03NEW [04:52:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394195 (10phaultfinder) 03NEW [04:52:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394200 (10phaultfinder) 03NEW [04:52:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394193 (10phaultfinder) 03NEW [04:52:34] 10ops-eqiad, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394201 (10phaultfinder) 03NEW [04:52:40] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394202 (10phaultfinder) 03NEW [04:52:44] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394203 (10phaultfinder) 03NEW [04:52:48] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394204 (10phaultfinder) 03NEW [04:52:52] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394208 (10phaultfinder) 03NEW [04:52:56] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394206 (10phaultfinder) 03NEW [04:53:00] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394207 (10phaultfinder) 03NEW [04:53:04] 10ops-eqiad, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394210 (10phaultfinder) 03NEW [04:53:08] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394211 (10phaultfinder) 03NEW [04:53:12] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - 
https://phabricator.wikimedia.org/T394209 (10phaultfinder) 03NEW [04:53:16] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394213 (10phaultfinder) 03NEW [04:53:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394212 (10phaultfinder) 03NEW [04:53:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394215 (10phaultfinder) 03NEW [04:53:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394217 (10phaultfinder) 03NEW [04:53:34] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394216 (10phaultfinder) 03NEW [04:53:38] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394214 (10phaultfinder) 03NEW [04:53:42] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394219 (10phaultfinder) 03NEW [04:53:46] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394221 (10phaultfinder) 03NEW [04:53:50] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394218 (10phaultfinder) 03NEW [04:53:54] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394225 (10phaultfinder) 03NEW [04:53:58] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394222 (10phaultfinder) 03NEW [04:54:02] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394220 (10phaultfinder) 03NEW [04:54:06] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394224 
(10phaultfinder) 03NEW [04:54:10] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394223 (10phaultfinder) 03NEW [04:54:14] 10ops-eqiad, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394226 (10phaultfinder) 03NEW [04:54:18] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394227 (10phaultfinder) 03NEW [04:54:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394229 (10phaultfinder) 03NEW [04:54:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394228 (10phaultfinder) 03NEW [04:54:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394231 (10phaultfinder) 03NEW [04:54:34] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394232 (10phaultfinder) 03NEW [04:54:42] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394233 (10phaultfinder) 03NEW [04:54:46] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394235 (10phaultfinder) 03NEW [04:54:50] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394230 (10phaultfinder) 03NEW [04:54:54] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394234 (10phaultfinder) 03NEW [04:54:58] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394236 (10phaultfinder) 03NEW [04:55:02] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394237 (10phaultfinder) 03NEW [04:55:06] 10ops-eqiad, 
06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394238 (10phaultfinder) 03NEW [04:55:10] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394239 (10phaultfinder) 03NEW [04:55:14] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394240 (10phaultfinder) 03NEW [04:55:18] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394241 (10phaultfinder) 03NEW [04:55:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394243 (10phaultfinder) 03NEW [04:55:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394242 (10phaultfinder) 03NEW [04:55:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394244 (10phaultfinder) 03NEW [04:55:34] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394245 (10phaultfinder) 03NEW [04:55:38] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394246 (10phaultfinder) 03NEW [04:55:42] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394247 (10phaultfinder) 03NEW [04:55:48] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394249 (10phaultfinder) 03NEW [04:55:52] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394248 (10phaultfinder) 03NEW [04:55:56] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1175.mgmt:22 - https://phabricator.wikimedia.org/T394250 (10phaultfinder) 03NEW [04:56:00] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudnet1006.mgmt:22 - 
https://phabricator.wikimedia.org/T394254 (10phaultfinder) 03NEW [04:56:04] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1004.mgmt:22 - https://phabricator.wikimedia.org/T394252 (10phaultfinder) 03NEW [04:56:08] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1019.mgmt:22 - https://phabricator.wikimedia.org/T394253 (10phaultfinder) 03NEW [04:56:12] 10ops-eqiad, 06DC-Ops: Unresponsive management for es1034.mgmt:22 - https://phabricator.wikimedia.org/T394251 (10phaultfinder) 03NEW [04:56:16] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1040.mgmt:22 - https://phabricator.wikimedia.org/T394255 (10phaultfinder) 03NEW [04:56:20] 10ops-eqiad, 06DC-Ops: Unresponsive management for mc-wf1002.mgmt:22 - https://phabricator.wikimedia.org/T394257 (10phaultfinder) 03NEW [04:56:24] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcontrol1006.mgmt:22 - https://phabricator.wikimedia.org/T394256 (10phaultfinder) 03NEW [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:21] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258 (10Marostegui) 03NEW [05:12:50] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258#10819974 (10Marostegui) [05:12:51] 10ops-eqiad, 06DC-Ops: Unresponsive management for mc-wf1002.mgmt:22 - https://phabricator.wikimedia.org/T394257#10819975 (10Marostegui) [05:12:52] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1040.mgmt:22 - https://phabricator.wikimedia.org/T394255#10819977 (10Marostegui) [05:12:53] 10ops-eqiad, 06DC-Ops: Unresponsive management for 
cloudcontrol1006.mgmt:22 - https://phabricator.wikimedia.org/T394256#10819976 (10Marostegui) [05:12:55] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudnet1006.mgmt:22 - https://phabricator.wikimedia.org/T394254#10819978 (10Marostegui) [05:12:57] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1019.mgmt:22 - https://phabricator.wikimedia.org/T394253#10819979 (10Marostegui) [05:13:01] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1004.mgmt:22 - https://phabricator.wikimedia.org/T394252#10819980 (10Marostegui) [05:13:05] 10ops-eqiad, 06DC-Ops: Unresponsive management for es1034.mgmt:22 - https://phabricator.wikimedia.org/T394251#10819981 (10Marostegui) [05:13:09] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394249#10819983 (10Marostegui) [05:13:13] 10ops-eqiad, 06DC-Ops: Unresponsive management for db1175.mgmt:22 - https://phabricator.wikimedia.org/T394250#10819982 (10Marostegui) [05:13:17] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394247#10819985 (10Marostegui) [05:13:22] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394248#10819984 (10Marostegui) [05:13:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394246#10819986 (10Marostegui) [05:13:30] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394245#10819987 (10Marostegui) [05:13:33] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394244#10819988 (10Marostegui) [05:13:38] 10ops-eqiad, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394243#10819989 (10Marostegui) [05:13:41] 10ops-eqiad, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - 
https://phabricator.wikimedia.org/T394241#10819991 (10Marostegui)
[05:13:46] 10ops-eqiad, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394242#10819990 (10Marostegui)
[05:13:49] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394239#10819993 (10Marostegui)
[05:13:54] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394240#10819992 (10Marostegui)
[05:13:57] 10ops-eqiad, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394238#10819994 (10Marostegui)
[05:18:38] (03PS1) 10Marostegui: mariadb: Productionize db1258 [puppet] - 10https://gerrit.wikimedia.org/r/1145428 (https://phabricator.wikimedia.org/T393989)
[05:22:57] !incidents
[05:22:58] No incidents occurred in the past 24 hours for team SRE
[05:22:58] <_joe_> marostegui: is this you or just an expired downtime?
[05:23:05] I think it is expired
[05:23:07] I am checking
[05:23:07] <_joe_> I guess an expired downtime
[05:23:08] <_joe_> yes
[05:23:20] <_joe_> !ack 6118
[05:23:20] Yeah, expired
[05:23:20] 6118 (ACKED) Host es1031 (paged) - PING - Packet loss = 100%
[05:24:17] I just resolved it
[05:24:28] This is the bug that doesn't resolve pings when they recover
[05:25:18] https://phabricator.wikimedia.org/T264016
[05:29:48] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1258 [puppet] - 10https://gerrit.wikimedia.org/r/1145428 (https://phabricator.wikimedia.org/T393989) (owner: 10Marostegui)
[05:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:36:17] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1257.eqiad.wmnet onto db1258.eqiad.wmnet
[05:36:21] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1257 - Depool db1257.eqiad.wmnet to then clone it to db1258.eqiad.wmnet - marostegui@cumin1002
[05:36:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1257 - Depool db1257.eqiad.wmnet to then clone it to db1258.eqiad.wmnet - marostegui@cumin1002
[05:39:59] marostegui@cumin1002 clone (PID 1939255) is awaiting input
[05:45:56] marostegui@cumin1002 clone (PID 1939255) is awaiting input
[05:47:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1145327 (https://phabricator.wikimedia.org/T393724) (owner: 10BCornwall)
[05:48:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10820168 (10Marostegui)
[05:48:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10820169 (10Marostegui)
[05:49:11] !log Mark db1255 as x3 master in zarcillo T390530
[05:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:14] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530
[05:51:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM (once approved by manager)" [puppet] - 10https://gerrit.wikimedia.org/r/1145333 (https://phabricator.wikimedia.org/T393626) (owner: 10BCornwall)
[05:52:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1145325 (https://phabricator.wikimedia.org/T393798) (owner: 10BCornwall)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0600)
[06:01:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258#10820177 (10ayounsi) →14Duplicate dup:03T394109
[06:01:15] 10ops-eqiad, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in eqiad - https://phabricator.wikimedia.org/T394109#10820179 (10ayounsi)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:03] 10ops-codfw, 06DC-Ops: lsw1-c6-codfw: PEM 0 Not Powered - https://phabricator.wikimedia.org/T394261 (10ayounsi) 03NEW p:05Triage→03High
[06:10:59] !log Drop query killers from parsercache T387740
[06:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:02] T387740: Evaluate query killer on parsercache hosts - https://phabricator.wikimedia.org/T387740
[06:16:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1031 and es2029 to es3 masters T391921', diff saved to https://phabricator.wikimedia.org/P76074 and previous config saved to /var/cache/conftool/dbconfig/20250514-061650-marostegui.json
[06:16:54] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1034 es2034 T391921', diff saved to https://phabricator.wikimedia.org/P76075 and previous config saved to /var/cache/conftool/dbconfig/20250514-061721-marostegui.json
[06:17:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2034.codfw.wmnet,es1034.eqiad.wmnet with reason: Maintenance
[06:18:35] (03PS1) 10Marostegui: es1034: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145454 (https://phabricator.wikimedia.org/T391921)
[06:19:22] (03CR) 10Jelto: [C:03+1] "lgtm and matches the official documentation at https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_service::cata" [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[06:19:39] (03PS1) 10Kosta Harlan: Use anonymous user when creating named account from temp account [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145455 (https://phabricator.wikimedia.org/T393628)
[06:19:45] (03CR) 10Marostegui: [C:03+2] es1034: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145454 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui)
[06:19:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145455 (https://phabricator.wikimedia.org/T393628) (owner: 10Kosta Harlan)
[06:21:38] (03PS1) 10KartikMistry: Update cxserver to 2025-05-14-005542-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008)
[06:22:38] (03PS1) 10Marostegui: es2034: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145457 (https://phabricator.wikimedia.org/T391921)
[06:23:47] (03CR) 10Marostegui: [C:03+2] es2034: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145457 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui)
[06:27:16] !log es3 migrated to MariaDB 10.11 T391921
[06:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:19] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:27:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76076 and previous config saved to /var/cache/conftool/dbconfig/20250514-062733-root.json
[06:28:40] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1145339 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[06:28:51] (03CR) 10Filippo Giunchedi: [C:03+2] airflow: disable 
statsd_exporter relaying to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1144554 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [06:31:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76077 and previous config saved to /var/cache/conftool/dbconfig/20250514-063149-root.json [06:41:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76078 and previous config saved to /var/cache/conftool/dbconfig/20250514-064106-root.json [06:41:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 (10MoritzMuehlenhoff) 03NEW [06:42:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76079 and previous config saved to /var/cache/conftool/dbconfig/20250514-064238-root.json [06:45:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76080 and previous config saved to /var/cache/conftool/dbconfig/20250514-064654-root.json [06:55:25] FIRING: [2x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:56:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P76081 and previous config saved 
to /var/cache/conftool/dbconfig/20250514-065611-root.json [06:57:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76082 and previous config saved to /var/cache/conftool/dbconfig/20250514-065744-root.json [06:58:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10820290 (10ayounsi) [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0700) [07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:32] I'm here, and can sync the change [07:02:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76083 and previous config saved to /var/cache/conftool/dbconfig/20250514-070200-root.json [07:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145455 (https://phabricator.wikimedia.org/T393628) (owner: 10Kosta Harlan) [07:06:09] (03Merged) 10jenkins-bot: Use anonymous user when creating named account from temp account [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145455 (https://phabricator.wikimedia.org/T393628) (owner: 10Kosta Harlan) [07:06:52] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1145455|Use anonymous user when creating named account from temp account (T393628)]] [07:06:56] T393628: Temporary accounts: Use anonymous user performer when creating a named account - https://phabricator.wikimedia.org/T393628 [07:11:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 3%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P76084 and previous config saved to /var/cache/conftool/dbconfig/20250514-071117-root.json [07:11:36] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1145455|Use anonymous user when creating named account from temp account (T393628)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:11:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1042 es2042 T391921', diff saved to https://phabricator.wikimedia.org/P76085 and previous config saved to /var/cache/conftool/dbconfig/20250514-071159-marostegui.json [07:12:02] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [07:12:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2042.codfw.wmnet,es1042.eqiad.wmnet with reason: Maintenance [07:12:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76086 and previous config saved to /var/cache/conftool/dbconfig/20250514-071250-root.json [07:12:57] (03PS1) 10Marostegui: es1042: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145759 (https://phabricator.wikimedia.org/T391921) [07:13:12] 06SRE, 10Bitu, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10820334 (10hashar) [07:14:05] (03CR) 10Marostegui: [C:03+2] es1042: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145759 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [07:15:25] RESOLVED: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:07] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'es2034 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76087 and previous config saved to /var/cache/conftool/dbconfig/20250514-071706-root.json [07:20:04] !log kharlan@deploy1003 kharlan: Continuing with sync [07:20:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76088 and previous config saved to /var/cache/conftool/dbconfig/20250514-072042-root.json [07:20:50] (03PS1) 10Marostegui: es2042: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145761 (https://phabricator.wikimedia.org/T391921) [07:21:24] (03CR) 10JMeybohm: "I'd question that this is set up as an eqiad only service. But apart from that I think it's fine" [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [07:24:34] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:26:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P76089 and previous config saved to /var/cache/conftool/dbconfig/20250514-072622-root.json [07:26:44] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145455|Use anonymous user when creating named account from temp account (T393628)]] (duration: 19m 51s) [07:26:47] T393628: Temporary accounts: Use anonymous user performer when creating a named account - https://phabricator.wikimedia.org/T393628 [07:27:13] (03CR) 10Marostegui: [C:03+2] es2042: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1145761 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [07:27:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76090 and previous config saved
to /var/cache/conftool/dbconfig/20250514-072755-root.json [07:30:58] I'm done with deployments for now [07:31:23] !log UTC morning deploys done [07:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76091 and previous config saved to /var/cache/conftool/dbconfig/20250514-073211-root.json [07:32:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2042.codfw.wmnet with reason: Maintenance [07:34:44] !log installing glibc security updates [07:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76092 and previous config saved to /var/cache/conftool/dbconfig/20250514-073547-root.json [07:36:23] !log ayounsi@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: esams routers upgrade, T364092] [07:36:26] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [07:36:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: esams routers upgrade, T364092] [07:40:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76093 and previous config saved to /var/cache/conftool/dbconfig/20250514-074027-root.json [07:41:02] (03PS2) 10Brouberol: analytics-hive: Enable lock transaction management in prod hive metastore [puppet] - 10https://gerrit.wikimedia.org/r/1145762 (https://phabricator.wikimedia.org/T386854) [07:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76094 and previous config 
saved to /var/cache/conftool/dbconfig/20250514-074128-root.json [07:41:48] (03PS2) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [07:43:01] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-esams,cr2-esams IPv6,cr2-esams.mgmt with reason: cr2-esams upgrade [07:43:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76095 and previous config saved to /var/cache/conftool/dbconfig/20250514-074300-root.json [07:43:07] (03CR) 10CI reject: [V:04-1] airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) (owner: 10Brouberol) [07:43:08] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820417 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=239f1d24-394b-4cd2-b80b-211b30b54a1a) set by ayounsi@cumin1002 for 1:00:00 on 3 host(s) and their servic... 
[07:43:25] !log cr2-esams# set protocols bgp graceful-shutdown sender - T364092 [07:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:28] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [07:44:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1256.eqiad.wmnet with reason: Maintenance [07:44:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1257.eqiad.wmnet with reason: Maintenance [07:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76097 and previous config saved to /var/cache/conftool/dbconfig/20250514-074717-root.json [07:48:08] (03CR) 10MVernon: [C:03+2] swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [07:48:45] (03PS1) 10Marostegui: instances.yaml: Add db1258 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145799 (https://phabricator.wikimedia.org/T393989) [07:50:33] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-14-073338-production [puppet] - 10https://gerrit.wikimedia.org/r/1145800 [07:50:43] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1258 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1145799 (https://phabricator.wikimedia.org/T393989) (owner: 10Marostegui) [07:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76099 and previous config saved to /var/cache/conftool/dbconfig/20250514-075052-root.json [07:52:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1258 to dbctl T393989', diff saved to https://phabricator.wikimedia.org/P76101 and previous config saved to /var/cache/conftool/dbconfig/20250514-075254-marostegui.json [07:52:58] T393989: Productionize 
new x3 hosts - https://phabricator.wikimedia.org/T393989 [07:55:31] (03CR) 10Gkyziridis: [C:03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [07:55:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76102 and previous config saved to /var/cache/conftool/dbconfig/20250514-075532-root.json [07:55:34] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-05-14-073338-production [puppet] - 10https://gerrit.wikimedia.org/r/1145800 (owner: 10Majavah) [07:56:25] (03Merged) 10jenkins-bot: swift: split find_db_paths out into separate function (nfc) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145236 (owner: 10MVernon) [07:56:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76104 and previous config saved to /var/cache/conftool/dbconfig/20250514-075633-root.json [07:58:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76105 and previous config saved to /var/cache/conftool/dbconfig/20250514-075805-root.json [07:58:11] (03CR) 10Gkyziridis: [C:03+2] ml-inference-services: edit-check experimental prod deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1144521 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [07:58:23] !log cr2-esams - disable transit/IX BGP sessions - T364092 [07:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:25] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [07:59:10] !log cr2-esams> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-23.4R2-S3.9.tgz - T364092 [07:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:00:05] jnuche and jeena: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0800). [08:00:25] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for the Prometheus Bird exporter [puppet] - 10https://gerrit.wikimedia.org/r/1145802 (https://phabricator.wikimedia.org/T135991) [08:00:28] hi, I will roll out the train in the next few minutes [08:00:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10820507 (10Esanders) I'm still getting access denied when trying to log in to spiderpig: `Service access denied due to missing privileges.` [08:01:36] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1257 gradually with 4 steps - Pool db1257.eqiad.wmnet in after cloning [08:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76107 and previous config saved to /var/cache/conftool/dbconfig/20250514-080222-root.json [08:02:56] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10820518 (10MoritzMuehlenhoff) >>!
In T393724#10820507, @Esanders wrote: > I'm still getting access denied when trying to log in to spiderpig: `Service access denied due... [08:05:21] (03PS1) 10Giuseppe Lavagetto: New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1145805 [08:05:30] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1145805 (owner: 10Giuseppe Lavagetto) [08:05:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76108 and previous config saved to /var/cache/conftool/dbconfig/20250514-080557-root.json [08:06:22] !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "T393381 - oblivian@cumin2002" [08:06:25] T393381: FY 24/25 WE 4.3.11 Define a policy for maintenance of requestctl rules - https://phabricator.wikimedia.org/T393381 [08:06:26] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: T393381 - oblivian@cumin2002 [08:06:33] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145806 (https://phabricator.wikimedia.org/T392171) [08:06:35] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145806 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:06:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145802 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:06:59] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: T393381 - oblivian@cumin2002 [08:07:01] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with 
reason: "T393381 - oblivian@cumin2002" [08:07:27] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145806 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:10:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76109 and previous config saved to /var/cache/conftool/dbconfig/20250514-081037-root.json [08:11:29] (03CR) 10David Caro: ceph: Remove extraneous logging configuration statement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [08:11:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76110 and previous config saved to /var/cache/conftool/dbconfig/20250514-081139-root.json [08:13:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76111 and previous config saved to /var/cache/conftool/dbconfig/20250514-081311-root.json [08:13:42] !log cr2-esams> request vmhost reboot - T364092 [08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:47] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [08:15:50] (03CR) 10Elukey: "Hey folks! 
I noticed this issue due to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1145214, I have a chain of upgrades" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [08:16:03] (03CR) 10Elukey: "Blocked by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1074168" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:16:50] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145807 (https://phabricator.wikimedia.org/T392171) [08:16:51] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145807 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76113 and previous config saved to /var/cache/conftool/dbconfig/20250514-081728-root.json [08:17:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr3-knams (208.80.153.216) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr3-knams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:17:39] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145807 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:18:51] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/1/0 (Core: cr2-esams:xe-0/0/2:0 {#001}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:19:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:21:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76114 and previous config saved to /var/cache/conftool/dbconfig/20250514-082102-root.json [08:22:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:23:13] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Follow 302 redirects when fetching the checksum [puppet] - 10https://gerrit.wikimedia.org/r/1145245 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:25:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76115 and previous config saved to /var/cache/conftool/dbconfig/20250514-082543-root.json [08:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76116 and previous config saved to /var/cache/conftool/dbconfig/20250514-082644-root.json [08:27:39] RESOLVED: [8x] CoreBGPDown: Core BGP 
session down between asw1-bw27-esams and cr2-esams (185.15.59.158) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:27:58] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for cortobot [puppet] - 10https://gerrit.wikimedia.org/r/1145808 (https://phabricator.wikimedia.org/T135991) [08:28:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76117 and previous config saved to /var/cache/conftool/dbconfig/20250514-082815-root.json [08:28:19] (03PS1) 10Marostegui: wmnet: Update es3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1145809 (https://phabricator.wikimedia.org/T391921) [08:28:39] (03CR) 10Marostegui: "This is a noop" [dns] - 10https://gerrit.wikimedia.org/r/1145809 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [08:28:51] RESOLVED: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/1/0 (Core: cr2-esams:xe-0/0/2:0 {#001}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:29:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:29:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:29:20] jouncebot: nowandnext [08:29:20] For the next 1 hour(s) and 30 
minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0800) [08:29:21] In 1 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1000) [08:29:28] (03CR) 10Marostegui: [C:03+2] wmnet: Update es3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1145809 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [08:29:32] !log marostegui@dns1006 START - running authdns-update [08:30:14] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.1 refs T392171 [08:30:18] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [08:30:45] !log marostegui@dns1006 END - running authdns-update [08:31:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1257 gradually with 4 steps - Pool db1257.eqiad.wmnet in after cloning [08:31:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1257.eqiad.wmnet onto db1258.eqiad.wmnet [08:32:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76119 and previous config saved to /var/cache/conftool/dbconfig/20250514-083233-root.json [08:36:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76120 and previous config saved to /var/cache/conftool/dbconfig/20250514-083609-root.json [08:39:26] !log cr1-esams# set protocols bgp graceful-shutdown sender - T364092 [08:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:29] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [08:40:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76121 and previous config saved to /var/cache/conftool/dbconfig/20250514-084049-root.json [08:41:37] !log ayounsi@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on cr1-esams,cr1-esams IPv6,cr1-esams.mgmt with reason: cr1-esams upgrade [08:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:41:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76122 and previous config saved to /var/cache/conftool/dbconfig/20250514-084149-root.json [08:43:04] (03CR) 10Muehlenhoff: [C:03+2] imposm-initial-import: Follow 302 redirects when fetching the checksum [puppet] - 10https://gerrit.wikimedia.org/r/1145245 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:43:09] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on re0.cr1-esams.mgmt with reason: cr1-esams upgrade [08:43:18] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820629 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0ccf059a-76d1-46d7-9ee7-b67d79c235aa) set by ayounsi@cumin1002 for 1:00:00 on 1 host(s) and their servic... 
[08:43:35] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr1-esams,cr1-esams IPv6 with reason: cr1-esams upgrade [08:43:41] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820631 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ed684b09-6354-460a-9fbf-3df20fbe3f21) set by ayounsi@cumin1002 for 1:00:00 on 2 host(s) and their servic... [08:44:19] !log cr1-esams - disable transit/IX BGP sessions - T364092 [08:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:20] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1145815 (https://phabricator.wikimedia.org/T135991) [08:46:15] !log cr1-esams - Install image on backup RE - T364092 [08:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:18] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [08:47:06] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:09] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for kthxbye [puppet] - 10https://gerrit.wikimedia.org/r/1145817 (https://phabricator.wikimedia.org/T135991) [08:50:07] (03CR) 10Fabfur: cache: lua lookup experiment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [08:51:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76123 and previous config saved to 
/var/cache/conftool/dbconfig/20250514-085115-root.json [08:52:05] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10820687 (10MatthewVernon) [08:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:52:09] (03PS1) 10Brouberol: deployment_server: provision the airflow-dev kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) [08:53:57] !log Mark db2241 as x3 master in zarcillo T390530 [08:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:01] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [08:54:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10820693 (10MatthewVernon) @Jclark-ctr the boss card should be left as RAID 1, thank you (but all the other drives should be JBOD). 
[08:54:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145808 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:54:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145815 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:55:04] (03PS1) 10Brouberol: dse-k8s-eqiad: define the airflow-dev namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145824 (https://phabricator.wikimedia.org/T394001) [08:55:06] (03PS1) 10Brouberol: dse-k8s-eqiad: add airflow-dev to the PG/Ceph operator tenant NS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145825 (https://phabricator.wikimedia.org/T394001) [08:55:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145817 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:55:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76124 and previous config saved to /var/cache/conftool/dbconfig/20250514-085555-root.json [08:56:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76125 and previous config saved to /var/cache/conftool/dbconfig/20250514-085655-root.json [08:58:27] !log cr1-esams request vmhost reboot re1 - T364092 [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:30] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:02:16] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for karma [puppet] - 10https://gerrit.wikimedia.org/r/1145826 (https://phabricator.wikimedia.org/T135991) [09:02:26] (03PS3) 10Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) 
[09:02:47] (03PS4) 10Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) [09:04:34] (03CR) 10Alexandros Kosiaris: [C:03+2] function-evaluator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145267 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [09:04:55] (03PS1) 10Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145828 [09:05:53] !log cr1-esams> request chassis routing-engine master switch - T364092 [09:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:56] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76126 and previous config saved to /var/cache/conftool/dbconfig/20250514-090621-root.json [09:06:23] (03Merged) 10jenkins-bot: function-evaluator: Bump CPU requests/limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145267 (https://phabricator.wikimedia.org/T389375) (owner: 10Alexandros Kosiaris) [09:07:52] (03CR) 10Fabfur: [C:03+2] cache: remove unused allowed_methods check from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:08:33] moritzm: can I merge your changes too ? 
[09:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and cr1-esams (185.15.59.154) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=core&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:10:11] (going to merge) [09:10:20] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [09:10:43] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [09:10:44] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [09:10:47] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [09:10:49] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [09:10:51] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/0/2:0 (Core: cr1-esams:xe-0/1/0 {#001}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:10:52] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [09:10:53] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [09:11:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76127 and previous config saved to /var/cache/conftool/dbconfig/20250514-091100-root.json [09:11:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:11:36] fabfur: yes, sorry for that [09:11:41] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [09:11:43] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [09:11:50] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [09:11:51] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [09:11:58] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [09:12:00] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [09:12:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76128 and previous config saved to /var/cache/conftool/dbconfig/20250514-091200-root.json [09:12:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145826 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:12:43] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [09:12:44] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [09:12:47] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [09:12:48] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [09:12:51] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [09:13:15] np :) [09:14:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:15:51] 
FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:19:39] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:20:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:21:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:21:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76129 and previous config saved to /var/cache/conftool/dbconfig/20250514-092126-root.json [09:21:40] !log re1.cr1-esams> request vmhost reboot re0 - T364092 [09:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:43] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:24:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:25:04] !log retry full planet 
import for Bookworm maps master (the one yesterday failed due to a bug now fixed) T381565 [09:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:08] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [09:25:34] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:25:47] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:26:21] (03PS1) 10Volans: Add support for trixie [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1145833 [09:26:55] (03PS2) 10Volans: Add support for trixie [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1145833 (https://phabricator.wikimedia.org/T391083) [09:28:14] (03CR) 10Hnowlan: [C:03+2] Revert^2 "mw::maintenance: migrate mediamoderation-hourlyScan to k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1144582 (https://phabricator.wikimedia.org/T393236) (owner: 10Dreamy Jazz) [09:28:33] !log cr1-esams> request chassis routing-engine master switch - T364092 [09:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:36] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:31:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1145833 (https://phabricator.wikimedia.org/T391083) (owner: 10Volans) [09:31:26] (03PS13) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object, migrate s1 [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [09:31:45] (03PS1) 10Marostegui: dbconfig.schema: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1145834 (https://phabricator.wikimedia.org/T390530) [09:32:22] (03CR) 10Ladsgroup: [C:03+1] dbconfig.schema: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1145834 
(https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [09:32:54] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:33:29] (03CR) 10Effie Mouzeli: [C:03+2] cronjobs: update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074165 (owner: 10Effie Mouzeli) [09:33:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:34:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:34:35] (03CR) 10CI reject: [V:04-1] mw::maintenance: move refreshLinkRecommendations job to shared object, migrate s1 [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [09:34:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:34:55] (03Merged) 10jenkins-bot: cronjobs: update modules (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074165 (owner: 10Effie Mouzeli) [09:34:59] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:35:16] (03CR) 10Marostegui: [C:03+2] dbconfig.schema: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1145834 
(https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [09:35:19] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:35:43] (03CR) 10Abijeet Patro: [C:03+1] Update cxserver to 2025-05-14-005542-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008) (owner: 10KartikMistry) [09:36:06] (03PS14) 10Hnowlan: mw::maintenance: replace refreshLinkRecommendations define, s1 to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [09:36:25] (03PS7) 10Effie Mouzeli: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 [09:37:26] jouncebot: nowandnext [09:37:26] For the next 0 hour(s) and 22 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0800) [09:37:26] In 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1000) [09:38:05] (03CR) 10Effie Mouzeli: [C:03+2] cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [09:38:51] RESOLVED: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:38:56] !log repool cr1-esams - T364092 [09:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:59] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:39:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [09:39:10] RESOLVED: [2x] BFDdown: BFD session down between 
cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:39:27] (03Merged) 10jenkins-bot: cronjobs: update to 3.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074168 (owner: 10Effie Mouzeli) [09:39:54] RESOLVED: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:40:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add x3 eqiad T390530', diff saved to https://phabricator.wikimedia.org/P76131 and previous config saved to /var/cache/conftool/dbconfig/20250514-094038-marostegui.json [09:40:42] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [09:40:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76132 and previous config saved to /var/cache/conftool/dbconfig/20250514-094047-root.json [09:40:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76133 and previous config saved to /var/cache/conftool/dbconfig/20250514-094048-root.json [09:42:19] jouncebot: nowandnext [09:42:20] For the next 0 hour(s) and 17 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T0800) [09:42:20] In 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1000) [09:42:30] dcausse: should we try to backport that WikibaseCirrusSearch change right away? [09:42:50] Lucas_WMDE: sure, 17mins should be enough? 
[09:43:15] well, probably not including CI :S [09:43:22] but I hope the SREs will be okay with us trying to unblock the train [09:43:27] effie: ^ [09:43:35] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [09:43:59] Lucas_WMDE: train is more important :) [09:44:03] (03PS1) 10Lucas Werkmeister (WMDE): Also merge fields if stemming settings empty on one side [extensions/WikibaseCirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145836 (https://phabricator.wikimedia.org/T394274) [09:44:04] ack :) [09:44:10] ^ there’s the backport, I’ll start the spiderpig [09:44:29] if anyone objects, go ahead and cancel https://spiderpig.wikimedia.org/jobs/57 [09:44:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145836 (https://phabricator.wikimedia.org/T394274) (owner: 10Lucas Werkmeister (WMDE)) [09:44:37] (03PS5) 10Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) [09:46:15] (03PS2) 10Cathal Mooney: Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) [09:47:44] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [09:47:56] (03CR) 10Cathal Mooney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [09:49:56] !log ayounsi@cumin1002 START - 
Cookbook sre.dns.admin DNS admin: pool site esams [reason: esams routers upgrade finished, T364092] [09:49:59] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:50:00] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: esams routers upgrade finished, T364092] [09:50:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add x3 codfw T390530', diff saved to https://phabricator.wikimedia.org/P76135 and previous config saved to /var/cache/conftool/dbconfig/20250514-095031-marostegui.json [09:50:34] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [09:50:35] hm... not finding any images on test-commons, would have been helpful for testing :/ [09:50:58] Lucas_WMDE, dcausse: thank you for taking care of the blocker! [09:51:21] dcausse: yeah, I do not understand the rationale of closing down test commons. test wikis are useful! [09:51:29] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820938 (10ayounsi) [09:51:41] anyway, let me see if I can reproduce the exceptions on commons, for testing the backport [09:52:07] commons might no longer be on the right version [09:52:51] hmph [09:53:01] looks like it [09:53:15] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [09:53:29] I wonder if it’s still possible to hand-edit wikiversions.json on the bare-metal mwdebug’s [09:53:36] while they haven’t been decommissioned yet [09:54:51] (03PS1) 10Marostegui: wmnet: Add x3-master [dns] - 10https://gerrit.wikimedia.org/r/1145841 (https://phabricator.wikimedia.org/T390530) [09:55:26] 06SRE, 10Observability-Metrics: Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10820949 (10elukey) Thanks a lot! So the changes will not be reverted by future Pyrra filesystem syncs (new SLOs etc..) as for the time window ri... [09:55:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76136 and previous config saved to /var/cache/conftool/dbconfig/20250514-095552-root.json [09:55:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76137 and previous config saved to /var/cache/conftool/dbconfig/20250514-095553-root.json [09:56:05] (03CR) 10Ladsgroup: [C:03+1] wmnet: Add x3-master [dns] - 10https://gerrit.wikimedia.org/r/1145841 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [09:56:07] (03Merged) 10jenkins-bot: Also merge fields if stemming settings empty on one side [extensions/WikibaseCirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145836 (https://phabricator.wikimedia.org/T394274) (owner: 10Lucas Werkmeister (WMDE)) [09:56:12] (03CR) 10Marostegui: [C:03+2] wmnet: Add x3-master [dns] - 10https://gerrit.wikimedia.org/r/1145841 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [09:56:23] (03PS2) 10Elukey: ipoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 
(https://phabricator.wikimedia.org/T391333) [09:56:23] (03PS2) 10Elukey: kartotherian: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145215 (https://phabricator.wikimedia.org/T391333) [09:56:23] (03PS2) 10Elukey: kask: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) [09:56:23] (03PS2) 10Elukey: linkrecommendation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145217 (https://phabricator.wikimedia.org/T391333) [09:56:24] (03PS2) 10Elukey: machinetranslation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145218 (https://phabricator.wikimedia.org/T391333) [09:56:27] (03PS2) 10Elukey: mathoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145219 (https://phabricator.wikimedia.org/T391333) [09:56:28] bleh, I’m not allowed to `sudo -i /usr/local/sbin/restart-php8.1-fpm` on mwdebug1002 [09:56:31] (03PS2) 10Elukey: mediawiki: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) [09:56:34] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1145836|Also merge fields if stemming settings empty on one side (T394274)]] [09:56:35] (03PS2) 10Elukey: mediawiki-dumps-legacy: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145221 (https://phabricator.wikimedia.org/T391333) [09:56:37] T394274: InvalidArgumentException: Duplicate field labels for model wikibase-mediainfo - https://phabricator.wikimedia.org/T394274 [09:56:39] (03PS2) 10Elukey: miscweb: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145222 (https://phabricator.wikimedia.org/T391333) [09:56:43] (03PS2) 10Elukey: mobileapps: upgrade to 
mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145223 (https://phabricator.wikimedia.org/T391333) [09:56:47] (03PS2) 10Elukey: mpic: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145224 (https://phabricator.wikimedia.org/T391333) [09:56:51] (03PS2) 10Elukey: push-notifications: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145225 (https://phabricator.wikimedia.org/T391333) [09:56:55] (03PS2) 10Elukey: python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) [09:56:59] (03PS2) 10Elukey: recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) [09:57:03] (03PS2) 10Elukey: shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) [09:57:07] (03PS2) 10Elukey: spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) [09:57:11] !log marostegui@dns1006 START - running authdns-update [09:57:11] (03PS2) 10Elukey: superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) [09:57:15] (03PS2) 10Elukey: tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) [09:57:16] yeah, puppet only has a sudo rule for php7.4 seemingly [09:57:19] (03PS2) 10Elukey: termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) [09:57:22] that’s just great [09:57:23] (03PS2) 10Elukey: thumbor: 
upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) [09:57:29] (03CR) 10Effie Mouzeli: [C:03+2] cache.mcrouter: upgrade to 1.3.3 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141194 (https://phabricator.wikimedia.org/T393281) (owner: 10Effie Mouzeli) [09:57:33] (03PS1) 10Ladsgroup: Move production term store traffic to x3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145844 (https://phabricator.wikimedia.org/T351820) [09:57:57] (03Abandoned) 10Kosta Harlan: UserInfo: Conditionally register the REST API route [extensions/CheckUser] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145345 (https://phabricator.wikimedia.org/T394070) (owner: 10Jforrester) [09:58:03] (03Abandoned) 10Kosta Harlan: UserInfo: Conditionally register the REST API route [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145346 (https://phabricator.wikimedia.org/T394070) (owner: 10Jforrester) [09:58:23] Lucas_WMDE: can I help? [09:58:24] !log marostegui@dns1006 END - running authdns-update [09:58:34] Lucas_WMDE: can repro via https://commons.wikimedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusbuilddoc&titles=File%3ACommons%20geocoding%20graph.svg&formatversion=2 [09:58:52] hopefully the fix has not landed yet on beta [09:58:55] dcausse: that’s not gonna help me during the deployment though :S [09:59:04] effie: if you could run /usr/local/sbin/restart-php8.1-fpm on mwdebug1002 [09:59:12] sure [09:59:19] though scap is about to overwrite my wikiversions.json changes anyway so it might be needed more than once [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1000) [10:00:06] "Safe-restart is not needed here." 
haha [10:00:18] I will restart it anyway [10:00:48] done [10:01:21] hm, commons still says it’s on wmf.28 for me [10:01:44] effie: can you restart again? [10:01:47] I edited wikiversions.php this time [10:02:00] done [10:02:05] there we go! got the InvalidArgumentException [10:02:08] Lucas_WMDE: the backport won't be enough, I'll need to roll the train to group1 before you can see the changes for 1.45.0-wmf.1 on commons [10:02:11] thanks [10:02:21] jnuche: I hand-edited wikiversions on mwdebug1002 [10:02:29] gotcha [10:02:39] (03CR) 10Btullis: [V:03+1] ceph: Remove extraneous logging configuration statement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [10:02:46] (and then begged effie to restart because I’m no longer allowed to, the sudo rule was never updated from php7.4 to php8.1 in the command name :S) [10:02:52] brave old non-k8s world [10:03:15] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1145836|Also merge fields if stemming settings empty on one side (T394274)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:03:16] ok, scap has deployed the change to mwdebug [10:03:18] T394274: InvalidArgumentException: Duplicate field labels for model wikibase-mediainfo - https://phabricator.wikimedia.org/T394274 [10:03:23] which means the wikiversions change was overwritten [10:03:26] I’ll edit it again [10:03:28] Lucas_WMDE: I will also sort that one too [10:03:49] thanks [10:03:52] effie: can you do another restart? 
^^ [10:03:59] ack [10:04:18] (I’m testing with https://commons.wikimedia.org/w/api.php?action=query&format=json&cbbuilders=content|links&prop=cirrusbuilddoc&formatversion=2&format=json&pageids=30959075&meta=siteinfo btw, that URL shows the current mediawiki version as well) [10:05:02] looks good, now it responds with wmf.1 and with no error \o/ [10:05:31] 🎉 [10:05:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [10:05:50] and scap pulling on mwdebug1002 again [10:06:03] (I wonder how scap is still able to restart php-fpm without errors… or does it just not do it?) [10:06:10] thanks again, I'm going to wait for the backport to complete and then will continue to group1 [10:06:29] Lucas_WMDE: thanks! [10:06:32] for bare-metal machines it still restarts php-fpm, yeah [10:06:39] (the last scap pull message is “Checking if php-fpm restart needed” so maybe it just doesn’t do it and the restarts weren’t even required, I was just editing the wrong file at first?) 
[10:06:42] hm ok… [10:07:05] (03Merged) 10jenkins-bot: cache.mcrouter: upgrade to 1.3.3 (vanila) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141194 (https://phabricator.wikimedia.org/T393281) (owner: 10Effie Mouzeli) [10:07:19] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10820991 (10cmooney) [10:07:22] (03PS1) 10Effie Mouzeli: data.yaml: allow deployers to restart php8.1-fpm [puppet] - 10https://gerrit.wikimedia.org/r/1145845 [10:08:26] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10821000 (10cmooney) [10:08:27] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10820999 (10cmooney) [10:09:09] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1145845 (owner: 10Effie Mouzeli) [10:09:54] (03PS6) 10Fabfur: cache: add option to enable or disable varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) [10:10:04] dcausse: will go next ? [10:10:12] (03PS1) 10Aqu: airflow-main: Bump parallelism for Airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145847 (https://phabricator.wikimedia.org/T369845) [10:10:37] jnuche: do you know if there’s a phab task for “let deployers change a wiki’s version under k8s”? is that in scope for https://phabricator.wikimedia.org/T276994 ? [10:10:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76138 and previous config saved to /var/cache/conftool/dbconfig/20250514-101057-root.json [10:11:04] Lucas_WMDE: one of the scopes, yes [10:11:08] ok [10:11:26] effie: you mean my changeprop-jobqueue? 
[10:11:30] I am working on it these days [10:11:32] dcausse: yes [10:11:36] effie: nice :) [10:11:57] is the train done folks? [10:12:10] nope [10:12:16] my scap is still running [10:12:26] once the current backport finishes, then I will have to roll to group1 [10:12:26] should be done soon though [10:12:28] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145836|Also merge fields if stemming settings empty on one side (T394274)]] (duration: 15m 53s) [10:12:31] T394274: InvalidArgumentException: Duplicate field labels for model wikibase-mediainfo - https://phabricator.wikimedia.org/T394274 [10:12:33] * Lucas_WMDE done deploying [10:12:45] effie: will wait for the train to rollout and ship the change [10:13:00] excellent! [10:13:38] dcausse, effie: sry for making you wait, thanks for your patience [10:13:57] rolling out train now [10:14:01] good luck [10:14:05] it is alright, we are only people and it is only code, cheers jnuche [10:14:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Strictly speaking all changes to sudo rules need SRE IF meeting approval, but this is just a minor update to what was already " [puppet] - 10https://gerrit.wikimedia.org/r/1145845 (owner: 10Effie Mouzeli) [10:14:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:14:39] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145849 (https://phabricator.wikimedia.org/T392171) [10:14:41] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145849 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [10:15:28] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145849 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [10:18:05] jnuche: np! 
[10:24:07] (03CR) 10Btullis: [V:03+1 C:03+2] ceph: Remove extraneous logging configuration statement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144583 (https://phabricator.wikimedia.org/T384322) (owner: 10Btullis) [10:28:23] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.1 refs T392171 [10:28:27] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [10:28:36] logs look okay to me so far, only one instance of “duplicate field labels” left in logspam-watch (and that’s the one I caused manually on mwdebug1002 ^^) [10:29:14] give me a few minutes to watch the logs, I need to make sure I won't need to rollback again [10:29:25] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:31:38] the “Invalid label key: 'same-wt'” looks suspicious (but not related to search) [10:31:40] seeing PHP Warning: Invalid label key: 'same-wt' [10:31:43] yes [10:32:41] that's https://phabricator.wikimedia.org/T394053. The volume has increased significantly since group0, but I'm not sure about the severity since it's a warning? [10:32:53] I think I'm not going to rollback but will probably make this a blocker for group2 [10:32:56] RESOLVED: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:33:20] jnuche: makes sense [10:36:42] dcausse, effie: things look stable enough, I think you can go ahead [10:36:49] jnuche: thanks! 
[10:37:03] (03CR) 10DCausse: [C:03+2] changeprop: drop CirrusSearch changeprop settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 (owner: 10DCausse) [10:38:39] (03Merged) 10jenkins-bot: changeprop: drop CirrusSearch changeprop settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035732 (owner: 10DCausse) [10:39:47] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:40:37] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:40:38] not seeing the new chart version, will wait a bit and try again [10:41:29] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:41:44] hm... still not there [10:41:51] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:43:31] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:44:02] here we go [10:44:34] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:46:48] jouncebot: nowandnext [10:46:48] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1000) [10:46:48] In 0 hour(s) and 13 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1100) [10:47:00] am I okay to deploy a mw-cron change? 
[10:47:12] staging looks fine will proceed to codfw [10:47:26] hnowlan: I'm deploying changeprop-jobqueue [10:47:32] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on A:cephosd [10:49:04] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [10:49:20] dcausse: ack, my change won't conflict [10:49:27] ack [10:50:52] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [10:54:23] hnowlan: once you're done, please ping me :D [10:54:29] there is a traffic jam today [10:56:03] hnowlan: saw a couple errors at startup: "Broker: Unknown topic or partition", is it fine to ignore? [10:57:47] dcausse: I'll have a look, in codfw? [10:57:52] yes [10:58:07] https://logstash.wikimedia.org/goto/41420c21f0dc30fae073640aaeae6427 [10:59:48] I think I will drop off the deploy to production race for today [11:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1100) [11:00:20] effie: sorry :/ [11:00:22] (03CR) 10JMeybohm: [C:03+2] sre.discovery.datacenter: Raise CookbookInitSuccess on status action [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 (owner: 10JMeybohm) [11:00:26] (03CR) 10JMeybohm: [C:03+2] Refactor sre.discovery's use of resolve_with_client_ip [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) (owner: 10JMeybohm) [11:00:39] dcausse: it is ok! 
[11:00:43] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [11:01:14] dcausse: concerning message, but not your doing so don't worry about it for now [11:01:23] low-traffic jobs are still working as expected [11:01:39] hnowlan: thanks, will proceed to eqiad [11:01:55] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:02:49] (03CR) 10Ladsgroup: [C:03+1] Add mysql grants for cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1145043 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [11:03:16] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:05:10] (03PS1) 10JMeybohm: Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145857 [11:05:19] jouncebot: nowandnext [11:05:19] For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1100) [11:05:19] In 1 hour(s) and 54 minute(s): UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1300) [11:06:19] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143506 (owner: 10PipelineBot) [11:06:46] (03Merged) 10jenkins-bot: Refactor sre.discovery's use of resolve_with_client_ip [cookbooks] - 10https://gerrit.wikimedia.org/r/1145129 (https://phabricator.wikimedia.org/T393600) (owner: 10JMeybohm) [11:07:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [11:07:50] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143506 (owner: 10PipelineBot) [11:07:55] (03Merged) 10jenkins-bot: sre.discovery.datacenter: Raise CookbookInitSuccess on status action [cookbooks] - 10https://gerrit.wikimedia.org/r/1145137 (owner: 10JMeybohm) [11:08:05] ok I think I'm done, job processing rate is back to normal and not seeing anything abnormal in the logs [11:08:21] Amir1: you can go ahead [11:09:25] thank you! [11:09:37] Mvolz: if you have a deploy, please go ahead [11:09:58] (let me know once you're done) [11:10:11] Amir1: okay, will do [11:10:24] thanks! [11:10:25] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:10:56] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:12:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on A:cephosd [11:12:55] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:13:22] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:14:43] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:15:09] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:15:11] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [11:16:43] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821216 (10taavi) [11:17:06] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@22aa307]: T393561 [11:17:09] T393561: Setup pipelines to load CX 
extension tables into Data Lake, at wmf_product - https://phabricator.wikimedia.org/T393561 [11:17:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [11:17:59] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@22aa307]: T393561 (duration: 01m 10s) [11:18:58] Amir1: all done! [11:19:05] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821233 (10cmooney) >>! In T394286#10821126, @Vgutierrez wrote: > services present on service.yaml with lvs configuration that don't have an ipip_configuration entry require L2 adjacency. Than... [11:19:13] thanks! [11:19:19] marostegui: about to start the deploy [11:19:29] (03CR) 10Ladsgroup: [C:03+2] Move production term store traffic to x3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145844 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [11:20:15] (03Merged) 10jenkins-bot: Move production term store traffic to x3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145844 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [11:21:03] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]] [11:21:06] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:22:10] Amir1: ok! 
[11:27:26] (03CR) 10Fabfur: [C:03+2] cache: add option to enable or disable varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/1145282 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [11:27:27] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:27:30] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:28:03] Testing in mwdebug [11:30:21] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-05-07-003410 to 2025-05-12-235119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145863 (https://phabricator.wikimedia.org/T324616) [11:30:31] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-06-142345 to 2025-05-14-112404 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145864 (https://phabricator.wikimedia.org/T324616) [11:32:13] (03CR) 10Muehlenhoff: [C:03+2] Switch the kadmin server to krb1002 [puppet] - 10https://gerrit.wikimedia.org/r/1143574 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [11:35:13] everything works so far [11:35:19] marostegui: pushing everywhere [11:35:21] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:35:31] ok!! [11:38:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [11:41:22] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821336 (10cmooney) @ayounsi steered me the right way here, I believe these are the host types we want to avoid racking in the new cage for now. Just the K8s ones and cirrussearch: aux-k8s-ctr... 
[11:41:44] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1068.eqiad.wmnet [11:41:52] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]] (duration: 20m 48s) [11:41:55] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:42:05] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10821350 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting to check failed disk [11:42:38] okay deployed. marostegui now, let's remove one replica from dbctl s8 and one from x3 to see if the reads start to split [11:42:54] I can do it, just wanted to make sure :D [11:44:17] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821358 (10cmooney) [11:44:37] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821359 (10cmooney) p:05Triage→03Medium [11:45:24] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:52] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821360 (10cmooney) [11:46:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10821362 (10MoritzMuehlenhoff) [11:47:17] !log installing librabbitmq security updates [11:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:25] !log ladsgroup@cumin1002 dbctl commit
(dc=all): 'Remove db2243 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76142 and previous config saved to /var/cache/conftool/dbconfig/20250514-114724-ladsgroup.json [11:47:31] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:47:52] I removed only db2243 from s8 to see how it goes [11:48:06] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821376 (10cmooney) @papaul @Jhancock.wm FYI. Is there a good way to save this list somewhere so DC-ops can cross-reference? Or are you happy to refer back to this task? [11:48:16] Amir1: ok! [11:48:50] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10821380 (10cmooney) [11:49:00] (03PS1) 10Máté Szabó: TransformHandler: Return 400 for invalid titles [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145867 (https://phabricator.wikimedia.org/T394270) [11:49:05] sorry folks for all the task spam I made hard work of those for some reason [11:49:49] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for cortobot [puppet] - 10https://gerrit.wikimedia.org/r/1145808 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:49:56] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1145815 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:50:00] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for kthxbye [puppet] - 10https://gerrit.wikimedia.org/r/1145817 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:50:05] (03CR) 10Filippo Giunchedi: [C:03+1] Enable profile::auto_restarts::service for karma [puppet] - 10https://gerrit.wikimedia.org/r/1145826 (https://phabricator.wikimedia.org/T135991) (owner:
10Muehlenhoff) [11:51:18] Amir1: Before we make the real split I will need to restart mariadb on x3 hosts though, so they pick up all the flags for orchestrator [11:51:21] and show in the right cluster [11:51:26] but that's for later, but just saying [11:52:13] sure [11:55:11] (03CR) 10Brouberol: [C:03+1] airflow-main: Bump parallelism for Airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145847 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [11:58:04] (03PS1) 10Andrew Bogott: sssd.conf: add more timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/1145870 (https://phabricator.wikimedia.org/T394283) [12:01:44] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1145870 (https://phabricator.wikimedia.org/T394283) (owner: 10Andrew Bogott) [12:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:51] (03CR) 10Andrew Bogott: [C:03+2] sssd.conf: add more timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/1145870 (https://phabricator.wikimedia.org/T394283) (owner: 10Andrew Bogott) [12:05:34] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10821464 (10cmassaro) @BCornwall I put it here: https://meta.wikimedia.org/wiki/User:CMassaro_(WMF). Was there somewhere else I should have put it? 
[12:10:24] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:19] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10821481 (10MoritzMuehlenhoff) The import ran 2:21hrs, but then failed with a DB permission error: ` May 14 11:45:12 [2025-05-14T11:45:12Z] 2:18:00 [progress] 2h18m0s C: 1601000/s (... [12:12:17] (03PS1) 10Marostegui: db1258: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145873 (https://phabricator.wikimedia.org/T390530) [12:13:58] (03CR) 10Marostegui: [C:03+2] db1258: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1145873 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:14:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76143 and previous config saved to /var/cache/conftool/dbconfig/20250514-121446-root.json [12:14:57] Amir1: ^ pooling that new host in s8/x3 [12:15:16] It was racked/installed yesterday [12:16:04] awesome [12:16:09] (03PS1) 10Btullis: clouddumps: Manage directories beneath /srv/dumps/xmldatadumps_airflow_temp [puppet] - 10https://gerrit.wikimedia.org/r/1145874 (https://phabricator.wikimedia.org/T389784) [12:16:51] (03PS1) 10Ayounsi: Bump TransitPeering in/out Saturation to critical [alerts] - 10https://gerrit.wikimedia.org/r/1145875 (https://phabricator.wikimedia.org/T388641) [12:17:07] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5543/co" [puppet] - 10https://gerrit.wikimedia.org/r/1145874 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:17:23] (03PS2) 
10Ayounsi: Bump TransitPeering in/out Saturation to critical [alerts] - 10https://gerrit.wikimedia.org/r/1145875 (https://phabricator.wikimedia.org/T388641) [12:20:48] (03PS1) 10Marostegui: check_depooled: Add x3 [software] - 10https://gerrit.wikimedia.org/r/1145877 (https://phabricator.wikimedia.org/T390530) [12:20:49] (03PS1) 10Muehlenhoff: maps: Delete obsolete osm-initial-import script [puppet] - 10https://gerrit.wikimedia.org/r/1145876 (https://phabricator.wikimedia.org/T381565) [12:20:59] (03CR) 10Marostegui: "This is a NOOP" [software] - 10https://gerrit.wikimedia.org/r/1145877 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:21:34] (03CR) 10Marostegui: [C:03+2] check_depooled: Add x3 [software] - 10https://gerrit.wikimedia.org/r/1145877 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:22:01] (03Merged) 10jenkins-bot: check_depooled: Add x3 [software] - 10https://gerrit.wikimedia.org/r/1145877 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:23:24] !log joal@deploy1003 Started deploy [analytics/refinery@9d620d0]: Regular analytics weekly train [analytics/refinery@9d620d06] [12:24:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:24:58] (03CR) 10Brouberol: [C:03+1] clouddumps: Manage directories beneath /srv/dumps/xmldatadumps_airflow_temp [puppet] - 10https://gerrit.wikimedia.org/r/1145874 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:25:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10821522 (10Stevemunene) >>! In T390171#10818501, @Jclark-ctr wrote: > @Stevemunene replaced the drives on these 2 serv... 
[12:25:42] !log joal@deploy1003 Finished deploy [analytics/refinery@9d620d0]: Regular analytics weekly train [analytics/refinery@9d620d06] (duration: 02m 17s) [12:25:50] (03CR) 10Btullis: [V:03+1 C:03+2] clouddumps: Manage directories beneath /srv/dumps/xmldatadumps_airflow_temp [puppet] - 10https://gerrit.wikimedia.org/r/1145874 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:26:03] !log joal@deploy1003 Started deploy [analytics/refinery@9d620d0] (thin): Analytics webrequest migration THIN [analytics/refinery@9d620d06] [12:27:38] !log joal@deploy1003 Finished deploy [analytics/refinery@9d620d0] (thin): Analytics webrequest migration THIN [analytics/refinery@9d620d06] (duration: 01m 35s) [12:28:06] !log joal@deploy1003 Started deploy [analytics/refinery@9d620d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9d620d06] [12:28:53] !log joal@deploy1003 Finished deploy [analytics/refinery@9d620d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9d620d06] (duration: 00m 46s) [12:28:59] (03PS2) 10JMeybohm: Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145857 [12:29:47] (03CR) 10CI reject: [V:04-1] Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145857 (owner: 10JMeybohm) [12:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76144 and previous config saved to /var/cache/conftool/dbconfig/20250514-122952-root.json [12:33:26] (03PS1) 10Muehlenhoff: Remove krb1001 from list of KDCs [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) [12:37:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:37:52] (03CR) 10Btullis: 
[C:03+1] dse-k8s-eqiad: define the airflow-dev namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145824 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:37:54] (03PS3) 10JMeybohm: Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145857 [12:37:59] (03CR) 10Elukey: "So long krb1001 o/" [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:38:05] (03CR) 10Elukey: [C:03+1] Remove krb1001 from list of KDCs [puppet] - 10https://gerrit.wikimedia.org/r/1145884 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:38:09] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: add airflow-dev to the PG/Ceph operator tenant NS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145825 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:39:30] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: define the airflow-dev namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145824 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:39:33] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: add airflow-dev to the PG/Ceph operator tenant NS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145825 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:41:23] (03CR) 10Btullis: deployment_server: provision the airflow-dev kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:42:17] (03CR) 10Btullis: [C:03+1] analytics-hive: Enable 
lock transaction management in prod hive metastore [puppet] - 10https://gerrit.wikimedia.org/r/1145762 (https://phabricator.wikimedia.org/T386854) (owner: 10Brouberol) [12:42:24] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10821582 (10MoritzMuehlenhoff) [12:43:43] (03CR) 10Brouberol: deployment_server: provision the airflow-dev kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:44:15] (03PS2) 10Brouberol: deployment_server: provision the airflow-dev kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) [12:44:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76145 and previous config saved to /var/cache/conftool/dbconfig/20250514-124458-root.json [12:45:26] (03Merged) 10jenkins-bot: dse-k8s-eqiad: define the airflow-dev namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145824 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:45:41] (03CR) 10Brouberol: deployment_server: provision the airflow-dev kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:45:46] (03Merged) 10jenkins-bot: dse-k8s-eqiad: add airflow-dev to the PG/Ceph operator tenant NS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145825 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:46:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:26] (03CR) 10Brouberol: [C:03+2] 
deployment_server: provision the airflow-dev kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1145818 (https://phabricator.wikimedia.org/T394001) (owner: 10Brouberol) [12:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:53:14] (03CR) 10Elukey: [C:03+1] maps: Delete obsolete osm-initial-import script [puppet] - 10https://gerrit.wikimedia.org/r/1145876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:53:17] (03PS1) 10Filippo Giunchedi: icinga: remove HOSTOUTPUT from vo-host-notify-by-email [puppet] - 10https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) [12:54:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:54:42] (03CR) 10Filippo Giunchedi: "Starting Fri I'll be OOO until the first week of June, unless we deploy and test this today or tomorrow I won't be able to assist" [puppet] - 10https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) (owner: 10Filippo Giunchedi) [12:55:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:55:45] (03CR) 10Jgiannelos: [C:03+1] "Yeah, this is not used anymore in production, its the old osmosis setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/1145876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:56:24] (03PS1) 10Clément Goubert: mw::maintenance: Run listTaskCounts every 2 hours [puppet] - 10https://gerrit.wikimedia.org/r/1145906 (https://phabricator.wikimedia.org/T394018) [12:56:29] (03CR) 10Muehlenhoff: [C:03+2] maps: Delete obsolete osm-initial-import script [puppet] - 10https://gerrit.wikimedia.org/r/1145876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:57:53] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10821634 (10Stevemunene) Commented out the failed disk on `/etc/fstab` and rebooted the host, the host is stuck booting due to the... [12:59:50] (03PS2) 10Clément Goubert: mw::maintenance: Run listTaskCounts every 2 hours [puppet] - 10https://gerrit.wikimedia.org/r/1145906 (https://phabricator.wikimedia.org/T394018) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1300). [13:00:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76146 and previous config saved to /var/cache/conftool/dbconfig/20250514-130004-root.json [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:13] o/ [13:00:59] if someone wants to review https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1139489, I wouldn’t mind deploying that ^^ [13:01:48] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1068.eqiad.wmnet [13:02:08] (03CR) 10Jforrester: [C:03+1] manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:02:13] Lucas_WMDE: Go for it. [13:02:17] (03CR) 10Samtar: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:02:31] (03CR) 10Elukey: "As FYI :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:02:32] ok \o/ [13:02:39] two +1s *does* equal a "+2"... :D [13:02:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:03:02] looks like it ;) [13:03:04] (03CR) 10CI reject: [V:04-1] manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:03:06] bah [13:03:24] o_O why is there a merge conflict in gerrit but not locally [13:03:34] (03PS3) 10Lucas Werkmeister (WMDE): manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) [13:03:37] Lucas_WMDE: jgit powers gerrit, and it's different. 
[13:04:15] spiderpig convenience feature request: “retry this deployment” [13:04:22] (even if it just creates a new one for the same change ^^) [13:04:32] Phabricator task or it won't happen. :-) [13:04:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:05:05] (03PS1) 10Brouberol: dse-k8s-eqiad: provision the postgresql-airflow-dev PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145907 (https://phabricator.wikimedia.org/T394039) [13:05:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10821669 (10ssingh) Thanks for the task and the detailed description! Mostly sounds good to me for the VMs that I "own" but will need to check with Traffic for others. Is there a tentat... [13:05:56] (03Merged) 10jenkins-bot: manage-dblist: Rename to manage-dblist.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139489 (https://phabricator.wikimedia.org/T392819) (owner: 10Lucas Werkmeister (WMDE)) [13:05:56] !log reboot grafana1002 - hard down [13:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:21] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139489|manage-dblist: Rename to manage-dblist.php (T392819)]] [13:06:24] T392819: phpcs does not check manage-dblist in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T392819 [13:06:54] T394302 [13:06:54] T394302: Retry deployment - https://phabricator.wikimedia.org/T394302 [13:07:09] (03CR) 10Ssingh: [C:03+1] Enable profile::auto_restarts::service for the Prometheus Bird exporter [puppet] - 10https://gerrit.wikimedia.org/r/1145802 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:07:20] (03CR) 10Hnowlan: [C:03+1] 
mw::maintenance: Run listTaskCounts every 2 hours [puppet] - 10https://gerrit.wikimedia.org/r/1145906 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [13:07:35] !log correction, restart grafana-server on grafana1002 [13:07:36] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Run listTaskCounts every 2 hours [puppet] - 10https://gerrit.wikimedia.org/r/1145906 (https://phabricator.wikimedia.org/T394018) (owner: 10Clément Goubert) [13:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:44] (03PS1) 10Jforrester: TransformHandler: Return 400 for invalid titles [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145913 (https://phabricator.wikimedia.org/T394270) [13:09:46] (03CR) 10Elukey: "Ping on this - how do you feel about doing it? It seems safer as Janis mentioned to me when we were discussing it, but I wanted to double " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [13:10:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10821711 (10MoritzMuehlenhoff) Unless there's any concerns we'd like to start next week [13:11:13] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139489|manage-dblist: Rename to manage-dblist.php (T392819)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:47] nothing to test really [13:12:05] lol, just typed ?action=debug instead of ?action=purge [13:12:22] anyway, nothing in mwdebug, let’s go [13:12:25] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:13:01] (03CR) 10Filippo Giunchedi: [C:03+1] ipoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:13:20] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-main: apply [13:13:52] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:14:40] (03CR) 10Filippo Giunchedi: [C:03+1] linkrecommendation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145217 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:14:40] (03CR) 10VPuffetMichel: [C:03+1] "Looks good too." [puppet] - 10https://gerrit.wikimedia.org/r/1145327 (https://phabricator.wikimedia.org/T393724) (owner: 10BCornwall) [13:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76148 and previous config saved to /var/cache/conftool/dbconfig/20250514-131510-root.json [13:15:26] (03CR) 10Filippo Giunchedi: [C:03+1] machinetranslation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145218 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:15:33] (03PS2) 10Aqu: airflow-main: Bump parallelism for Airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145847 (https://phabricator.wikimedia.org/T369845) [13:15:56] (03CR) 10Filippo Giunchedi: [C:03+1] mathoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145219 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:16:09] (03CR) 10Filippo Giunchedi: [C:03+1] mpic: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145224 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:16:49] (03CR) 10Giuseppe Lavagetto: [C:03+1] icinga: remove HOSTOUTPUT from vo-host-notify-by-email [puppet] - 10https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) (owner: 10Filippo Giunchedi) [13:18:10] (03PS1) 10Slyngshede: SSHKey: Reimplement key suspension in Vue [software/bitu] - 
10https://gerrit.wikimedia.org/r/1145927 [13:18:57] (03PS2) 10Slyngshede: SSHKey: Reimplement key suspension in Vue [software/bitu] - 10https://gerrit.wikimedia.org/r/1145927 [13:19:10] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139489|manage-dblist: Rename to manage-dblist.php (T392819)]] (duration: 12m 48s) [13:19:15] T392819: phpcs does not check manage-dblist in operations/mediawiki-config.git - https://phabricator.wikimedia.org/T392819 [13:19:35] (03CR) 10AOkoth: "Eeerm, I'd say for now we want deploy to eqiad to verify it works as expected. Once we verify that, we can add an updating mechanism, test" [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:20:05] * Lucas_WMDE done deploying [13:20:35] (03CR) 10CI reject: [V:04-1] TransformHandler: Return 400 for invalid titles [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145913 (https://phabricator.wikimedia.org/T394270) (owner: 10Jforrester) [13:20:38] !log UTC afternoon backport+config window done [13:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:00] (03CR) 10Clément Goubert: [C:03+1] "LGTM, use https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#The_scap_way for deployment (or ask us and we'll do it)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:27:33] (03CR) 10Filippo Giunchedi: [C:03+1] kartotherian: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145215 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:27:37] (03CR) 10Filippo Giunchedi: [C:03+1] kask: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:27:43] (03CR) 10Filippo Giunchedi: [C:03+1] mediawiki: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:27:46] (03CR) 10Filippo Giunchedi: [C:03+1] mediawiki-dumps-legacy: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145221 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:27:50] (03CR) 10Filippo Giunchedi: [C:03+1] miscweb: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145222 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:30:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76149 and previous config saved to /var/cache/conftool/dbconfig/20250514-133016-root.json [13:30:20] (03CR) 10Brouberol: [C:03+2] airflow-main: Bump parallelism for Airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145847 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [13:30:41] (03PS1) 10Stevemunene: Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) [13:31:46] (03CR) 10CI reject: [V:04-1] Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - 
10https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [13:32:27] (03PS2) 10Stevemunene: Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) [13:32:36] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:32:59] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:33:28] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:33:36] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:34:18] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:34:39] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:35:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10821837 (10Stevemunene) set each disk into a single Raid0 1 disk array ` stevemunene@an-worker117... 
[13:35:13] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:35:21] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:35:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10821840 (10Stevemunene) [13:35:36] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:35:48] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:36:02] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:36:20] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:38:11] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1156.eqiad.wmnet [13:38:24] 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308 (10JTweed-WMF) 03NEW [13:39:37] (03PS1) 10Dreamy Jazz: Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145947 (https://phabricator.wikimedia.org/T394299) [13:40:08] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1156.eqiad.wmnet [13:40:41] (03PS1) 10Fabfur: hiera: disable varnishkafka in magru text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) [13:40:43] (03CR) 10Ssingh: [C:03+1] Revert^2 "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1145277 (owner: 10Ebernhardson) [13:41:06] (03CR) 10Ssingh: [C:03+1] Revert^2 "search: cname specific search clusters to the 
lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1145276 (owner: 10Ebernhardson) [13:41:09] (03CR) 10Muehlenhoff: systemd: validate units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [13:42:42] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for mc-wf1002.mgmt:22 - https://phabricator.wikimedia.org/T394257#10821866 (10Jclark-ctr) a:03Jclark-ctr [13:43:44] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcontrol1006.mgmt:22 - https://phabricator.wikimedia.org/T394256#10821873 (10Jclark-ctr) a:03Jclark-ctr No response by ping rack D5 [13:44:24] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:26] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1040.mgmt:22 - https://phabricator.wikimedia.org/T394255#10821877 (10Jclark-ctr) a:03Jclark-ctr [13:45:06] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudnet1006.mgmt:22 - https://phabricator.wikimedia.org/T394254#10821879 (10Jclark-ctr) a:03Jclark-ctr [13:45:09] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1019.mgmt:22 - https://phabricator.wikimedia.org/T394253#10821883 (10Jclark-ctr) a:03Jclark-ctr [13:45:13] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1004.mgmt:22 - https://phabricator.wikimedia.org/T394252#10821884 (10Jclark-ctr) a:03Jclark-ctr [13:45:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for es1034.mgmt:22 - https://phabricator.wikimedia.org/T394251#10821885 (10Jclark-ctr) a:03Jclark-ctr [13:45:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76150 and previous config saved to 
/var/cache/conftool/dbconfig/20250514-134521-root.json [13:46:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1175.mgmt:22 - https://phabricator.wikimedia.org/T394250#10821886 (10Jclark-ctr) a:03Jclark-ctr [13:46:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394249#10821887 (10Jclark-ctr) a:03Jclark-ctr [13:46:08] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394248#10821888 (10Jclark-ctr) a:03Jclark-ctr [13:46:14] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394247#10821889 (10Jclark-ctr) a:03Jclark-ctr [13:46:17] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394246#10821890 (10Jclark-ctr) a:03Jclark-ctr [13:46:23] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394245#10821891 (10Jclark-ctr) a:03Jclark-ctr [13:46:24] (03CR) 10Kosta Harlan: [C:03+1] Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145947 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [13:46:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394244#10821892 (10Jclark-ctr) a:03Jclark-ctr [13:46:33] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394243#10821893 (10Jclark-ctr) a:03Jclark-ctr [13:46:34] stevemunene@cumin1002 init-hadoop-workers (PID 1999309) is awaiting input [13:46:43] (03CR) 10Elukey: [C:03+2] ipoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145214 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:46:51] (03CR) 10Elukey: [C:03+2] 
kartotherian: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145215 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:46:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394242#10821897 (10Jclark-ctr) a:03Jclark-ctr [13:46:59] (03CR) 10Elukey: [C:03+2] kask: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:07] (03CR) 10Elukey: [C:03+2] linkrecommendation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145217 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:16] (03CR) 10Elukey: [C:03+2] machinetranslation: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145218 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:24] (03CR) 10Elukey: [C:03+2] mathoid: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145219 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:41] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394241#10821905 (10Jclark-ctr) a:03Jclark-ctr [13:47:45] (03CR) 10Elukey: [C:03+2] mediawiki: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145220 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:48] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394240#10821906 (10Jclark-ctr) a:03Jclark-ctr [13:47:54] (03CR) 10Elukey: [C:03+2] mediawiki-dumps-legacy: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145221 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:47:59] 10ops-eqiad, 06SRE, 
06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394239#10821915 (10Jclark-ctr) a:03Jclark-ctr [13:48:02] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394238#10821917 (10Jclark-ctr) a:03Jclark-ctr [13:48:03] (03CR) 10Elukey: [C:03+2] miscweb: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145222 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:48:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394237#10821919 (10Jclark-ctr) a:03Jclark-ctr [13:48:16] (03CR) 10Elukey: [C:03+2] mobileapps: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145223 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:48:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394236#10821921 (10Jclark-ctr) a:03Jclark-ctr [13:48:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394235#10821922 (10Jclark-ctr) a:03Jclark-ctr [13:48:40] (03CR) 10Elukey: [C:03+2] mpic: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145224 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:48:48] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394234#10821925 (10Jclark-ctr) a:03Jclark-ctr [13:48:52] (03CR) 10Elukey: [C:03+2] push-notifications: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145225 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:48:56] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394233#10821927 (10Jclark-ctr) 
a:03Jclark-ctr [13:49:00] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394232#10821929 (10Jclark-ctr) a:03Jclark-ctr [13:49:01] FIRING: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:49:43] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394231#10821943 (10Jclark-ctr) a:03Jclark-ctr [13:49:51] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394230#10821944 (10Jclark-ctr) a:03Jclark-ctr [13:49:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394229#10821945 (10Jclark-ctr) a:03Jclark-ctr [13:49:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394228#10821946 (10Jclark-ctr) a:03Jclark-ctr [13:50:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394227#10821947 (10Jclark-ctr) a:03Jclark-ctr [13:50:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394226#10821948 (10Jclark-ctr) a:03Jclark-ctr [13:50:17] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394225#10821950 (10Jclark-ctr) a:03Jclark-ctr [13:50:21] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394224#10821951 (10Jclark-ctr) a:03Jclark-ctr [13:50:25] 10ops-eqiad, 06SRE, 06DC-Ops: 
Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394223#10821952 (10Jclark-ctr) a:03Jclark-ctr [13:50:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394222#10821953 (10Jclark-ctr) a:03Jclark-ctr [13:50:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:50:33] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394221#10821954 (10Jclark-ctr) a:03Jclark-ctr [13:50:37] (03PS3) 10Slyngshede: SSHKey: Reimplement key suspension in Vue [software/bitu] - 10https://gerrit.wikimedia.org/r/1145927 [13:50:45] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394220#10821956 (10Jclark-ctr) a:03Jclark-ctr [13:50:49] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394219#10821957 (10Jclark-ctr) a:03Jclark-ctr [13:50:53] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394218#10821958 (10Jclark-ctr) a:03Jclark-ctr [13:50:57] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394217#10821959 (10Jclark-ctr) a:03Jclark-ctr [13:51:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:51:01] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394216#10821960 (10Jclark-ctr) a:03Jclark-ctr [13:51:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394215#10821961 (10Jclark-ctr) a:03Jclark-ctr [13:51:27] (03PS1) 10Dreamy Jazz: MediaModeration: Only running scanning scripts on production [puppet] - 
10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) [13:51:56] RESOLVED: HelmReleaseBadStatus: Helm release airflow-main/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-main - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:51:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394214#10821967 (10Jclark-ctr) a:03Jclark-ctr [13:52:02] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394213#10821968 (10Jclark-ctr) a:03Jclark-ctr [13:52:06] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394212#10821969 (10Jclark-ctr) a:03Jclark-ctr [13:52:11] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394211#10821970 (10Jclark-ctr) a:03Jclark-ctr [13:52:15] jouncebot: nowandnext [13:52:15] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1300) [13:52:15] In 0 hour(s) and 7 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1400) [13:52:16] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394210#10821971 (10Jclark-ctr) a:03Jclark-ctr [13:52:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394209#10821972 (10Jclark-ctr) a:03Jclark-ctr [13:52:25] 10ops-eqiad, 06SRE, 
06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394208#10821973 (10Jclark-ctr) a:03Jclark-ctr [13:52:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394207#10821974 (10Jclark-ctr) a:03Jclark-ctr [13:52:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394206#10821975 (10Jclark-ctr) a:03Jclark-ctr [13:52:39] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394205#10821976 (10Jclark-ctr) a:03Jclark-ctr [13:52:44] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394204#10821977 (10Jclark-ctr) a:03Jclark-ctr [13:52:54] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394203#10821978 (10Jclark-ctr) a:03Jclark-ctr [13:52:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394202#10821979 (10Jclark-ctr) a:03Jclark-ctr [13:53:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394201#10821980 (10Jclark-ctr) a:03Jclark-ctr [13:53:08] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394199#10821981 (10Jclark-ctr) a:03Jclark-ctr [13:55:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394200#10821985 (10Jclark-ctr) a:03Jclark-ctr [13:55:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1014.mgmt:22 - https://phabricator.wikimedia.org/T394198#10821986 (10Jclark-ctr) a:03Jclark-ctr [13:55:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-test-coord1001.mgmt:22 - 
https://phabricator.wikimedia.org/T394197#10821987 (10Jclark-ctr) a:03Jclark-ctr [13:55:30] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394196#10821988 (10Jclark-ctr) a:03Jclark-ctr [13:55:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394195#10821989 (10Jclark-ctr) a:03Jclark-ctr [13:55:38] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394194#10821990 (10Jclark-ctr) a:03Jclark-ctr [13:55:44] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394193#10821991 (10Jclark-ctr) a:03Jclark-ctr [13:55:51] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394192#10821993 (10Jclark-ctr) a:03Jclark-ctr [13:55:54] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1034.mgmt:22 - https://phabricator.wikimedia.org/T394191#10821994 (10Jclark-ctr) a:03Jclark-ctr [13:55:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394190#10821995 (10Jclark-ctr) a:03Jclark-ctr [13:56:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394189#10821996 (10Jclark-ctr) a:03Jclark-ctr [13:56:08] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for dbstore1007.mgmt:22 - https://phabricator.wikimedia.org/T394188#10821999 (10Jclark-ctr) a:03Jclark-ctr [13:56:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394187#10822000 (10Jclark-ctr) a:03Jclark-ctr [13:56:16] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for gitlab-runner1004.mgmt:22 - https://phabricator.wikimedia.org/T394186#10822001 (10Jclark-ctr) 
a:03Jclark-ctr [13:56:21] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394185#10822002 (10Jclark-ctr) a:03Jclark-ctr [13:56:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1037.mgmt:22 - https://phabricator.wikimedia.org/T394184#10822003 (10Jclark-ctr) a:03Jclark-ctr [13:56:30] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1096.mgmt:22 - https://phabricator.wikimedia.org/T394183#10822004 (10Jclark-ctr) a:03Jclark-ctr [13:56:54] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:57:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1071.mgmt:22 - https://phabricator.wikimedia.org/T394169#10822011 (10Jclark-ctr) a:03Jclark-ctr [13:57:39] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1068.mgmt:22 - https://phabricator.wikimedia.org/T394182#10822012 (10Jclark-ctr) a:03Jclark-ctr [13:57:44] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1015.mgmt:22 - https://phabricator.wikimedia.org/T394181#10822013 (10Jclark-ctr) a:03Jclark-ctr [13:57:50] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1222.mgmt:22 - https://phabricator.wikimedia.org/T394180#10822014 (10Jclark-ctr) a:03Jclark-ctr [13:57:54] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1107.mgmt:22 - https://phabricator.wikimedia.org/T394179#10822015 (10Jclark-ctr) a:03Jclark-ctr [13:58:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1023.mgmt:22 - https://phabricator.wikimedia.org/T394178#10822016 (10Jclark-ctr) a:03Jclark-ctr [13:58:08] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1020.mgmt:22 - https://phabricator.wikimedia.org/T394177#10822017 (10Jclark-ctr) a:03Jclark-ctr [13:58:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive 
management for cloudvirt1042.mgmt:22 - https://phabricator.wikimedia.org/T394176#10822018 (10Jclark-ctr) a:03Jclark-ctr [13:58:18] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudservices1005.mgmt:22 - https://phabricator.wikimedia.org/T394175#10822019 (10Jclark-ctr) a:03Jclark-ctr [13:58:28] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for pki-root1001.mgmt:22 - https://phabricator.wikimedia.org/T394173#10822024 (10Jclark-ctr) a:03Jclark-ctr [13:58:32] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1173.mgmt:22 - https://phabricator.wikimedia.org/T394172#10822025 (10Jclark-ctr) a:03Jclark-ctr [13:58:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1039.mgmt:22 - https://phabricator.wikimedia.org/T394171#10822027 (10Jclark-ctr) a:03Jclark-ctr [13:58:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10822026 (10Stevemunene) `an-worker1177` seems to have an issue after the swap, currently looking... 
[13:58:45] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1069.mgmt:22 - https://phabricator.wikimedia.org/T394170#10822028 (10Jclark-ctr) a:03Jclark-ctr [14:00:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1070.mgmt:22 - https://phabricator.wikimedia.org/T394151#10822029 (10Jclark-ctr) a:03Jclark-ctr [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1400) [14:00:09] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1011.mgmt:22 - https://phabricator.wikimedia.org/T394168#10822030 (10Jclark-ctr) a:03Jclark-ctr [14:00:14] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1038.mgmt:22 - https://phabricator.wikimedia.org/T394167#10822031 (10Jclark-ctr) a:03Jclark-ctr [14:00:18] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1167.mgmt:22 - https://phabricator.wikimedia.org/T394166#10822032 (10Jclark-ctr) a:03Jclark-ctr [14:00:24] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1024.mgmt:22 - https://phabricator.wikimedia.org/T394165#10822033 (10Jclark-ctr) a:03Jclark-ctr [14:00:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76151 and previous config saved to /var/cache/conftool/dbconfig/20250514-140027-root.json [14:00:30] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1012.mgmt:22 - https://phabricator.wikimedia.org/T394164#10822034 (10Jclark-ctr) a:03Jclark-ctr [14:00:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1047.mgmt:22 - https://phabricator.wikimedia.org/T394163#10822035 (10Jclark-ctr) a:03Jclark-ctr [14:00:39] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1046.mgmt:22 - https://phabricator.wikimedia.org/T394162#10822036 (10Jclark-ctr) a:03Jclark-ctr [14:00:45] 10ops-eqiad, 
06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1067.mgmt:22 - https://phabricator.wikimedia.org/T394161#10822037 (10Jclark-ctr) a:03Jclark-ctr [14:00:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1041.mgmt:22 - https://phabricator.wikimedia.org/T394159#10822038 (10Jclark-ctr) a:03Jclark-ctr [14:00:57] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-druid1005.mgmt:22 - https://phabricator.wikimedia.org/T394158#10822039 (10Jclark-ctr) a:03Jclark-ctr [14:01:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1020.mgmt:22 - https://phabricator.wikimedia.org/T394157#10822040 (10Jclark-ctr) a:03Jclark-ctr [14:01:07] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [14:01:11] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1223.mgmt:22 - https://phabricator.wikimedia.org/T394155#10822052 (10Jclark-ctr) a:03Jclark-ctr [14:01:17] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1249.mgmt:22 - https://phabricator.wikimedia.org/T394154#10822053 (10Jclark-ctr) a:03Jclark-ctr [14:01:23] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1032.mgmt:22 - https://phabricator.wikimedia.org/T394153#10822054 (10Jclark-ctr) a:03Jclark-ctr [14:01:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1036.mgmt:22 - https://phabricator.wikimedia.org/T394152#10822055 (10Jclark-ctr) a:03Jclark-ctr [14:02:28] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for kafka-main1009.mgmt:22 - https://phabricator.wikimedia.org/T394137#10822060 (10Jclark-ctr) a:03Jclark-ctr [14:02:31] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1013.mgmt:22 - https://phabricator.wikimedia.org/T394156#10822061 (10Jclark-ctr) a:03Jclark-ctr [14:02:40] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1159.mgmt:22 - https://phabricator.wikimedia.org/T394150#10822062 (10Jclark-ctr) 
a:03Jclark-ctr [14:02:45] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirtlocal1001.mgmt:22 - https://phabricator.wikimedia.org/T394149#10822064 (10Jclark-ctr) a:03Jclark-ctr [14:02:50] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for es1033.mgmt:22 - https://phabricator.wikimedia.org/T394148#10822065 (10Jclark-ctr) a:03Jclark-ctr [14:02:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1163.mgmt:22 - https://phabricator.wikimedia.org/T394147#10822066 (10Jclark-ctr) a:03Jclark-ctr [14:02:58] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for kubestage1004.mgmt:22 - https://phabricator.wikimedia.org/T394146#10822067 (10Jclark-ctr) a:03Jclark-ctr [14:03:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1232.mgmt:22 - https://phabricator.wikimedia.org/T394145#10822068 (10Jclark-ctr) a:03Jclark-ctr [14:03:10] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T394144#10822069 (10Jclark-ctr) a:03Jclark-ctr [14:03:14] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1108.mgmt:22 - https://phabricator.wikimedia.org/T394143#10822070 (10Jclark-ctr) a:03Jclark-ctr [14:03:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1019.mgmt:22 - https://phabricator.wikimedia.org/T394142#10822071 (10Jclark-ctr) a:03Jclark-ctr [14:03:24] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1110.mgmt:22 - https://phabricator.wikimedia.org/T394141#10822072 (10Jclark-ctr) a:03Jclark-ctr [14:03:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1045.mgmt:22 - https://phabricator.wikimedia.org/T394140#10822073 (10Jclark-ctr) a:03Jclark-ctr [14:03:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudbackup1004.mgmt:22 - https://phabricator.wikimedia.org/T394139#10822074 (10Jclark-ctr) a:03Jclark-ctr [14:03:39] 10ops-eqiad, 06SRE, 06DC-Ops: 
Unresponsive management for wikikube-worker1164.mgmt:22 - https://phabricator.wikimedia.org/T394138#10822075 (10Jclark-ctr) a:03Jclark-ctr [14:05:58] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-05-07-003410 to 2025-05-12-235119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145863 (https://phabricator.wikimedia.org/T324616) (owner: 10Jforrester) [14:07:13] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet [14:07:36] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-05-07-003410 to 2025-05-12-235119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145863 (https://phabricator.wikimedia.org/T324616) (owner: 10Jforrester) [14:07:48] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [14:07:48] !log klausman@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-lab1002.eqiad.wmnet [14:08:23] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [14:08:34] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:00] (03CR) 10Vgutierrez: [C:04-1] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:09:12] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:47] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:09:54] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on ml-lab1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:20] (03CR) 10Vgutierrez: hiera: disable varnishkafka in magru 
text|upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [14:10:26] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:10:30] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:10:59] (03PS1) 10Andrew Bogott: Magnum: give magnum even more time to detect cluster success/failure [puppet] - 10https://gerrit.wikimedia.org/r/1145953 [14:11:07] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:11:32] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-05-06-142345 to 2025-05-14-112404 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145864 (https://phabricator.wikimedia.org/T324616) (owner: 10Jforrester) [14:13:16] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-05-06-142345 to 2025-05-14-112404 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145864 (https://phabricator.wikimedia.org/T324616) (owner: 10Jforrester) [14:14:02] (03PS1) 10Hnowlan: mw::maintenance: migrate all updateMenteeData jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145962 (https://phabricator.wikimedia.org/T385782) [14:14:36] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:40] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [14:14:56] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:15:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76152 and previous config saved to /var/cache/conftool/dbconfig/20250514-141532-root.json [14:15:35] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:15:36] (03CR) 10Andrew Bogott: [C:03+2] 
Magnum: give magnum even more time to detect cluster success/failure [puppet] - 10https://gerrit.wikimedia.org/r/1145953 (owner: 10Andrew Bogott) [14:16:01] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:16:10] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:16:33] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:41] !log uploaded openjdk-8 8u452-ga-1~deb11u1 to component/jdk8 for bullseye-wikimedia [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki: upgrade to mesh.configuration 1.13 - T391333 [14:18:27] T391333: Revisit default envoy histogram buckets - https://phabricator.wikimedia.org/T391333 [14:19:38] (03CR) 10Jgiannelos: "Not sure how much of health information we can infer from GET "/". More specifically things can fail silently in kartotherian and "/" wou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [14:21:29] (03PS1) 10Cathal Mooney: Enable link-protection and BFD on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) [14:22:54] (03PS5) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) [14:22:54] (03PS5) 10Vgutierrez: trafficserver: Send /beacon/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [14:22:54] (03PS16) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [14:23:44] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate all updateMenteeData jobs to k8s [puppet] - 
10https://gerrit.wikimedia.org/r/1145962 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:24:56] (03PS1) 10Alexandros Kosiaris: calico: Allow to override the MTU via values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) [14:24:57] (03PS1) 10Alexandros Kosiaris: calico: Set veth_mtu to 1480 for staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) [14:28:54] (03CR) 10JMeybohm: calico: Set veth_mtu to 1480 for staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [14:29:32] (03CR) 10Fabfur: hiera: disable varnishkafka in magru text|upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [14:29:40] (03PS2) 10Fabfur: hiera: disable varnishkafka (webrequest) in magru text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) [14:29:58] (03CR) 10Elukey: "Yeah I think this is the idea - if the db is down for some reason, we don't get a churn of pods not responding to health checks and causin" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [14:30:19] !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki: upgrade to mesh.configuration 1.13 - T391333 (duration: 12m 33s) [14:30:22] T391333: Revisit default envoy histogram buckets - https://phabricator.wikimedia.org/T391333 [14:31:35] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet [14:33:00] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [14:33:16] 06SRE, 06Traffic: Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) - 
https://phabricator.wikimedia.org/T394312 (10ssingh) 03NEW [14:33:36] (03PS7) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [14:34:34] (03CR) 10Filippo Giunchedi: [C:03+1] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:34:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:35:17] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [14:37:06] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet [14:37:40] !log installing glib2.0 security updates [14:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:34] jouncebot: nowandnext [14:38:34] For the next 0 hour(s) and 21 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1400) [14:38:35] In 2 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1700) [14:39:29] (03CR) 10Filippo Giunchedi: [C:03+1] shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:39:36] RESOLVED: 
GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:39:53] (03CR) 10Filippo Giunchedi: [C:03+1] recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:39:55] Anyone mind if I deploy a no-op change to readme.php in mediawiki-config? [14:40:08] (03CR) 10Filippo Giunchedi: [C:03+1] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:40:24] Dreamy_Jazz: I usually just rebase them on deploy host and call it a day [14:40:32] (03CR) 10Filippo Giunchedi: [C:03+1] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:40:35] (03CR) 10Kamila Součková: [C:03+1] thumbor: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:40:48] (03CR) 10Filippo Giunchedi: [C:03+1] tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:40:54] Didn't know if doing that causes a warning in scap when someone tries to deploy next? 
[14:40:55] (03CR) 10Filippo Giunchedi: [C:03+1] termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [14:41:17] i.e. the changes you are deploying also includes this unrelated change [14:42:33] (03CR) 10Alexandros Kosiaris: [C:03+1] wikifunctions: send traces to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: 10CDanis) [14:42:36] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145762 (https://phabricator.wikimedia.org/T386854) (owner: 10Brouberol) [14:42:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145947 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [14:43:27] (03Merged) 10jenkins-bot: Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145947 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [14:43:49] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1145947|Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php (T394299)]] [14:43:53] T394299: MediaModeration: Disable extension on beta wikis - https://phabricator.wikimedia.org/T394299 [14:45:39] Dreamy_Jazz: I think it would cause a warning if you merged it in Gerrit but didn’t pull it on the deploy host [14:45:52] Sure. 
I'll continue to deploy it then [14:45:55] whereas once it’s pulled on the deploy host, I don’t think anything detects whether it’s also been deployed or not [14:46:02] Oh I see [14:46:02] FWIW I’m still in favor of properly deploying it ^ [14:46:04] * ^^ [14:48:23] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1145947|Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php (T394299)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:48:30] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [14:50:15] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all updateMenteeData jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145962 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:50:26] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc2047.mgmt:22 - https://phabricator.wikimedia.org/T394127#10822354 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:50:47] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2184.mgmt:22 - https://phabricator.wikimedia.org/T394118#10822358 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:51:03] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc2048.mgmt:22 - https://phabricator.wikimedia.org/T394119#10822362 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:51:24] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for logstash2035.mgmt:22 - https://phabricator.wikimedia.org/T394121#10822366 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:51:40] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for backup2003.mgmt:22 - https://phabricator.wikimedia.org/T394120#10822370 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:51:52] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ms-be2058.mgmt:22 - https://phabricator.wikimedia.org/T394122#10822374 (10Jhancock.wm) 05Open→03Resolved 
a:03Jhancock.wm rebooted mgmt switch [14:52:08] (03CR) 10Scott French: "Resolving this, as you fixed it long ago." [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:52:30] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for kafka-stretch2001.mgmt:22 - https://phabricator.wikimedia.org/T394123#10822378 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:52:37] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ms-backup2001.mgmt:22 - https://phabricator.wikimedia.org/T394124#10822382 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:52:53] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ms-be2072.mgmt:22 - https://phabricator.wikimedia.org/T394125#10822388 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:53:13] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for ms-be2064.mgmt:22 - https://phabricator.wikimedia.org/T394126#10822394 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:53:28] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wdqs2011.mgmt:22 - https://phabricator.wikimedia.org/T394128#10822399 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch [14:53:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db2181 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76153 and previous config saved to /var/cache/conftool/dbconfig/20250514-145336-ladsgroup.json [14:53:41] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [14:55:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145947|Set $wgMediaModerationPhotoDNASubscriptionKey as empty in readme.php (T394299)]] (duration: 11m 20s) [14:55:13] T394299: MediaModeration: Disable extension on beta wikis - https://phabricator.wikimedia.org/T394299 [14:55:34] 10ops-codfw, 06SRE, 
06DC-Ops: lsw1-c6-codfw: PEM 0 Not Powered - https://phabricator.wikimedia.org/T394261#10822405 (10Jhancock.wm) reseated power cables firmly without removing from ports so as not to accidentally disconnect. both PSUs have a solid green light. Please confirm at your convenience that the al... [14:55:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:55:49] (03CR) 10Xcollazo: [C:03+1] analytics-hive: Enable lock transaction management in prod hive metastore [puppet] - 10https://gerrit.wikimedia.org/r/1145762 (https://phabricator.wikimedia.org/T386854) (owner: 10Brouberol) [14:56:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1164.mgmt:22 - https://phabricator.wikimedia.org/T394138#10822407 (10Jclark-ctr) 05Open→03Resolved [14:56:48] (03CR) 10Kosta Harlan: [C:03+1] MediaModeration: Only running scanning scripts on production [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [14:57:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1110.mgmt:22 - https://phabricator.wikimedia.org/T394141#10822410 (10Jclark-ctr) 05Open→03Resolved [14:57:28] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1019.mgmt:22 - https://phabricator.wikimedia.org/T394142#10822411 (10Jclark-ctr) 05Open→03Resolved [14:57:32] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1108.mgmt:22 - https://phabricator.wikimedia.org/T394143#10822412 (10Jclark-ctr) 05Open→03Resolved [14:57:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for es1033.mgmt:22 - 
https://phabricator.wikimedia.org/T394148#10822413 (10Jclark-ctr) 05Open→03Resolved [14:57:41] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1249.mgmt:22 - https://phabricator.wikimedia.org/T394154#10822414 (10Jclark-ctr) 05Open→03Resolved [14:57:46] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1020.mgmt:22 - https://phabricator.wikimedia.org/T394157#10822415 (10Jclark-ctr) 05Open→03Resolved [14:57:51] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1067.mgmt:22 - https://phabricator.wikimedia.org/T394161#10822416 (10Jclark-ctr) 05Open→03Resolved [14:57:56] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1167.mgmt:22 - https://phabricator.wikimedia.org/T394166#10822417 (10Jclark-ctr) 05Open→03Resolved [14:58:01] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1070.mgmt:22 - https://phabricator.wikimedia.org/T394151#10822418 (10Jclark-ctr) 05Open→03Resolved [14:58:06] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1069.mgmt:22 - https://phabricator.wikimedia.org/T394170#10822419 (10Jclark-ctr) 05Open→03Resolved [14:58:10] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1109.mgmt:22 - https://phabricator.wikimedia.org/T394174#10822421 (10Jclark-ctr) 05Open→03Resolved [14:58:15] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1107.mgmt:22 - https://phabricator.wikimedia.org/T394179#10822422 (10Jclark-ctr) 05Open→03Resolved [14:58:17] (03PS2) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [14:58:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1068.mgmt:22 - https://phabricator.wikimedia.org/T394182#10822423 (10Jclark-ctr) 05Open→03Resolved [14:58:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for 
wikikube-worker1165.mgmt:22 - https://phabricator.wikimedia.org/T394133#10822424 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:59:15] (03CR) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [14:59:35] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1071.mgmt:22 - https://phabricator.wikimedia.org/T394169#10822428 (10Jclark-ctr) 05Open→03Resolved [14:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1096.mgmt:22 - https://phabricator.wikimedia.org/T394183#10822429 (10Jclark-ctr) 05Open→03Resolved [14:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for gitlab-runner1004.mgmt:22 - https://phabricator.wikimedia.org/T394186#10822430 (10Jclark-ctr) 05Open→03Resolved [14:59:41] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394187#10822431 (10Jclark-ctr) 05Open→03Resolved [14:59:47] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394189#10822432 (10Jclark-ctr) 05Open→03Resolved [14:59:50] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394199#10822433 (10Jclark-ctr) 05Open→03Resolved [14:59:56] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394203#10822434 (10Jclark-ctr) 05Open→03Resolved [15:00:00] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394204#10822435 (10Jclark-ctr) 05Open→03Resolved [15:00:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394212#10822436 (10Jclark-ctr) 05Open→03Resolved [15:00:09] 10ops-eqiad, 06SRE, 06DC-Ops: 
Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394213#10822437 (10Jclark-ctr) 05Open→03Resolved [15:00:14] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1248.mgmt:22 - https://phabricator.wikimedia.org/T394215#10822438 (10Jclark-ctr) 05Open→03Resolved [15:00:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394216#10822439 (10Jclark-ctr) 05Open→03Resolved [15:00:24] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394220#10822440 (10Jclark-ctr) 05Open→03Resolved [15:00:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968#10822441 (10Jhancock.wm) 05Open→03Resolved @BCornwall the alert has cleared in the idrac and I don't see anything new in the history since yesterday. We mi... [15:00:30] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394225#10822443 (10Jclark-ctr) 05Open→03Resolved [15:00:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394234#10822444 (10Jclark-ctr) 05Open→03Resolved [15:00:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:01:15] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394235#10822452 (10Jclark-ctr) 05Open→03Resolved [15:01:18] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for 
restbase1042.mgmt:22 - https://phabricator.wikimedia.org/T394242#10822454 (10Jclark-ctr) 05Open→03Resolved [15:01:23] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1037.mgmt:22 - https://phabricator.wikimedia.org/T394246#10822456 (10Jclark-ctr) 05Open→03Resolved [15:01:26] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate all updateMenteeData jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1145962 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:01:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1015.mgmt:22 - https://phabricator.wikimedia.org/T394249#10822458 (10Jclark-ctr) 05Open→03Resolved [15:01:31] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for es1034.mgmt:22 - https://phabricator.wikimedia.org/T394251#10822460 (10Jclark-ctr) 05Open→03Resolved [15:01:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1004.mgmt:22 - https://phabricator.wikimedia.org/T394252#10822463 (10Jclark-ctr) 05Open→03Resolved [15:01:42] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for mc-wf1002.mgmt:22 - https://phabricator.wikimedia.org/T394257#10822468 (10Jclark-ctr) 05Open→03Resolved [15:01:43] !log dancy@deploy1003 Installing scap version "4.167.0" for 2 host(s) [15:01:46] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393784#10822471 (10Jhancock.wm) part of testing in new cage in DH5. leaving up so the errors don't repeat into new tickets [15:02:04] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393785#10822474 (10Jhancock.wm) part of testing in new cage in DH5. 
leaving up so the errors don't repeat into new tickets [15:02:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393784#10822478 (10cmooney) >>! In T393784#10822471, @Jhancock.wm wrote: > part of testing in new cage in DH5. leaving up so the errors don't repeat i... [15:03:18] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudbackup1004.mgmt:22 - https://phabricator.wikimedia.org/T394139#10822479 (10Jclark-ctr) 05Open→03Resolved [15:03:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1045.mgmt:22 - https://phabricator.wikimedia.org/T394140#10822480 (10Jclark-ctr) 05Open→03Resolved [15:03:21] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T394144#10822481 (10Jclark-ctr) 05Open→03Resolved [15:03:22] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirtlocal1001.mgmt:22 - https://phabricator.wikimedia.org/T394149#10822482 (10Jclark-ctr) 05Open→03Resolved [15:03:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1013.mgmt:22 - https://phabricator.wikimedia.org/T394156#10822483 (10Jclark-ctr) 05Open→03Resolved [15:03:32] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1036.mgmt:22 - https://phabricator.wikimedia.org/T394152#10822484 (10Jclark-ctr) 05Open→03Resolved [15:03:33] !log dancy@deploy1003 Installation of scap version "4.167.0" completed for 2 hosts [15:03:36] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1046.mgmt:22 - https://phabricator.wikimedia.org/T394162#10822485 (10Jclark-ctr) 05Open→03Resolved [15:03:41] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1047.mgmt:22 - https://phabricator.wikimedia.org/T394163#10822486 (10Jclark-ctr) 05Open→03Resolved [15:03:46] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1012.mgmt:22 - 
https://phabricator.wikimedia.org/T394164#10822487 (10Jclark-ctr) 05Open→03Resolved [15:03:50] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1024.mgmt:22 - https://phabricator.wikimedia.org/T394165#10822488 (10Jclark-ctr) 05Open→03Resolved [15:03:54] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1038.mgmt:22 - https://phabricator.wikimedia.org/T394167#10822489 (10Jclark-ctr) 05Open→03Resolved [15:03:58] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1011.mgmt:22 - https://phabricator.wikimedia.org/T394168#10822490 (10Jclark-ctr) 05Open→03Resolved [15:04:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1039.mgmt:22 - https://phabricator.wikimedia.org/T394171#10822491 (10Jclark-ctr) 05Open→03Resolved [15:04:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudservices1005.mgmt:22 - https://phabricator.wikimedia.org/T394175#10822492 (10Jclark-ctr) 05Open→03Resolved [15:04:13] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1042.mgmt:22 - https://phabricator.wikimedia.org/T394176#10822493 (10Jclark-ctr) 05Open→03Resolved [15:04:17] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1020.mgmt:22 - https://phabricator.wikimedia.org/T394177#10822494 (10Jclark-ctr) 05Open→03Resolved [15:04:23] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1023.mgmt:22 - https://phabricator.wikimedia.org/T394178#10822495 (10Jclark-ctr) 05Open→03Resolved [15:04:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1015.mgmt:22 - https://phabricator.wikimedia.org/T394181#10822496 (10Jclark-ctr) 05Open→03Resolved [15:04:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1037.mgmt:22 - https://phabricator.wikimedia.org/T394184#10822497 (10Jclark-ctr) 05Open→03Resolved [15:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - 
https://phabricator.wikimedia.org/T394185#10822498 (10Jclark-ctr) 05Open→03Resolved [15:04:42] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394190#10822499 (10Jclark-ctr) 05Open→03Resolved [15:04:44] (03PS1) 10Arturo Borrero Gonzalez: network: data: introduce cloud-instances-octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146001 (https://phabricator.wikimedia.org/T394099) [15:05:22] (03CR) 10Ssingh: [C:03+1] hiera: disable varnishkafka (webrequest) in magru text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [15:05:34] jouncebot: nowandnext [15:05:35] No deployments scheduled for the next 1 hour(s) and 54 minute(s) [15:05:35] In 1 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1700) [15:05:58] (03PS2) 10Arturo Borrero Gonzalez: network: data: introduce cloud-instances-octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146001 (https://phabricator.wikimedia.org/T394099) [15:06:02] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394232#10822506 (10Jclark-ctr) 05Open→03Resolved [15:06:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394195#10822507 (10Jclark-ctr) 05Open→03Resolved [15:06:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394196#10822508 (10Jclark-ctr) 05Open→03Resolved [15:06:15] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1014.mgmt:22 - https://phabricator.wikimedia.org/T394198#10822509 (10Jclark-ctr) 05Open→03Resolved [15:06:18] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394200#10822510 
(10Jclark-ctr) 05Open→03Resolved [15:06:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394206#10822511 (10Jclark-ctr) 05Open→03Resolved [15:06:39] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394209#10822512 (10Jclark-ctr) 05Open→03Resolved [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:43] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394211#10822513 (10Jclark-ctr) 05Open→03Resolved [15:06:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146001 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [15:06:48] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394217#10822514 (10Jclark-ctr) 05Open→03Resolved [15:06:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394218#10822516 (10Jclark-ctr) 05Open→03Resolved [15:07:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394221#10822518 (10Jclark-ctr) 05Open→03Resolved [15:07:11] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394223#10822519 (10Jclark-ctr) 05Open→03Resolved [15:07:14] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudlb1002.mgmt:22 - https://phabricator.wikimedia.org/T394224#10822520 (10Jclark-ctr) 05Open→03Resolved [15:07:17] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for 
cloudvirt1044.mgmt:22 - https://phabricator.wikimedia.org/T394227#10822521 (10Jclark-ctr) 05Open→03Resolved [15:07:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394228#10822522 (10Jclark-ctr) 05Open→03Resolved [15:07:24] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1043.mgmt:22 - https://phabricator.wikimedia.org/T394229#10822523 (10Jclark-ctr) 05Open→03Resolved [15:07:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394230#10822524 (10Jclark-ctr) 05Open→03Resolved [15:07:38] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:08:00] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:08:15] (03CR) 10Jforrester: [C:03+2] wikifunctions: send traces to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: 10CDanis) [15:08:24] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394239#10822533 (10Jclark-ctr) 05Open→03Resolved [15:08:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudvirt1041.mgmt:22 - https://phabricator.wikimedia.org/T394244#10822535 (10Jclark-ctr) 05Open→03Resolved [15:08:35] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudgw1004.mgmt:22 - https://phabricator.wikimedia.org/T394248#10822537 (10Jclark-ctr) 05Open→03Resolved [15:08:42] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1019.mgmt:22 - https://phabricator.wikimedia.org/T394253#10822539 (10Jclark-ctr) 05Open→03Resolved [15:08:45] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudnet1006.mgmt:22 - https://phabricator.wikimedia.org/T394254#10822541 (10Jclark-ctr) 05Open→03Resolved [15:08:52] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for 
cloudvirt1040.mgmt:22 - https://phabricator.wikimedia.org/T394255#10822544 (10Jclark-ctr) 05Open→03Resolved [15:08:56] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcontrol1006.mgmt:22 - https://phabricator.wikimedia.org/T394256#10822546 (10Jclark-ctr) 05Open→03Resolved [15:09:36] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:38] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:09:49] (03Merged) 10jenkins-bot: wikifunctions: send traces to the collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145242 (https://phabricator.wikimedia.org/T390753) (owner: 10CDanis) [15:10:07] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: replace refreshLinkRecommendations define, s1 to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:10:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1232.mgmt:22 - https://phabricator.wikimedia.org/T394145#10822552 (10Jclark-ctr) 05Open→03Resolved [15:10:10] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for kubestage1004.mgmt:22 - https://phabricator.wikimedia.org/T394146#10822553 (10Jclark-ctr) 05Open→03Resolved [15:10:14] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:10:15] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1163.mgmt:22 - https://phabricator.wikimedia.org/T394147#10822554 (10Jclark-ctr) 05Open→03Resolved [15:10:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1159.mgmt:22 - https://phabricator.wikimedia.org/T394150#10822555 (10Jclark-ctr) 05Open→03Resolved [15:10:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for kafka-main1009.mgmt:22 - https://phabricator.wikimedia.org/T394137#10822557 (10Jclark-ctr) 05Open→03Resolved [15:10:39] !log jforrester@deploy1003 helmfile [staging] DONE 
helmfile.d/services/wikifunctions: apply [15:10:40] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1032.mgmt:22 - https://phabricator.wikimedia.org/T394153#10822559 (10Jclark-ctr) 05Open→03Resolved [15:10:49] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1223.mgmt:22 - https://phabricator.wikimedia.org/T394155#10822560 (10Jclark-ctr) 05Open→03Resolved [15:10:51] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-druid1005.mgmt:22 - https://phabricator.wikimedia.org/T394158#10822561 (10Jclark-ctr) 05Open→03Resolved [15:10:53] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for restbase1041.mgmt:22 - https://phabricator.wikimedia.org/T394159#10822562 (10Jclark-ctr) 05Open→03Resolved [15:10:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1173.mgmt:22 - https://phabricator.wikimedia.org/T394172#10822563 (10Jclark-ctr) 05Open→03Resolved [15:11:01] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for pki-root1001.mgmt:22 - https://phabricator.wikimedia.org/T394173#10822564 (10Jclark-ctr) 05Open→03Resolved [15:11:01] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:11:44] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:12:02] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:12:03] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394222#10822565 (10Jclark-ctr) 05Open→03Resolved [15:12:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1222.mgmt:22 - https://phabricator.wikimedia.org/T394180#10822566 (10Jclark-ctr) 05Open→03Resolved [15:12:11] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for dbstore1007.mgmt:22 - https://phabricator.wikimedia.org/T394188#10822569 (10Jclark-ctr) 05Open→03Resolved [15:12:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 
point update - https://phabricator.wikimedia.org/T389034#10822570 (10MoritzMuehlenhoff) [15:12:20] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1034.mgmt:22 - https://phabricator.wikimedia.org/T394191#10822571 (10Jclark-ctr) 05Open→03Resolved [15:12:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394192#10822572 (10Jclark-ctr) 05Open→03Resolved [15:12:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394193#10822575 (10Jclark-ctr) 05Open→03Resolved [15:12:33] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394194#10822576 (10Jclark-ctr) 05Open→03Resolved [15:12:38] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-test-coord1001.mgmt:22 - https://phabricator.wikimedia.org/T394197#10822577 (10Jclark-ctr) 05Open→03Resolved [15:12:45] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394201#10822578 (10Jclark-ctr) 05Open→03Resolved [15:12:49] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394202#10822579 (10Jclark-ctr) 05Open→03Resolved [15:12:55] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394205#10822580 (10Jclark-ctr) 05Open→03Resolved [15:12:59] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394207#10822581 (10Jclark-ctr) 05Open→03Resolved [15:13:05] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1162.mgmt:22 - https://phabricator.wikimedia.org/T394208#10822583 (10Jclark-ctr) 05Open→03Resolved [15:13:15] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - 
https://phabricator.wikimedia.org/T394210#10822584 (10Jclark-ctr) 05Open→03Resolved [15:13:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394214#10822586 (10Jclark-ctr) 05Open→03Resolved [15:13:27] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394219#10822587 (10Jclark-ctr) 05Open→03Resolved [15:13:31] (03CR) 10Fabfur: [C:03+2] hiera: disable varnishkafka (webrequest) in magru text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1145948 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [15:13:51] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:14:16] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394226#10822598 (10Jclark-ctr) 05Open→03Resolved [15:14:21] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1023.mgmt:22 - https://phabricator.wikimedia.org/T394231#10822599 (10Jclark-ctr) 05Open→03Resolved [15:14:25] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394233#10822600 (10Jclark-ctr) 05Open→03Resolved [15:14:29] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394236#10822601 (10Jclark-ctr) 05Open→03Resolved [15:14:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394237#10822602 (10Jclark-ctr) 05Open→03Resolved [15:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394238#10822603 (10Jclark-ctr) 05Open→03Resolved [15:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394240#10822605 (10Jclark-ctr) 05Open→03Resolved [15:14:50] 
10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394241#10822607 (10Jclark-ctr) 05Open→03Resolved [15:14:56] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for maps1010.mgmt:22 - https://phabricator.wikimedia.org/T394243#10822609 (10Jclark-ctr) 05Open→03Resolved [15:15:04] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for snapshot1015.mgmt:22 - https://phabricator.wikimedia.org/T394245#10822611 (10Jclark-ctr) 05Open→03Resolved [15:15:10] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for aqs1019.mgmt:22 - https://phabricator.wikimedia.org/T394247#10822613 (10Jclark-ctr) 05Open→03Resolved [15:15:16] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for db1175.mgmt:22 - https://phabricator.wikimedia.org/T394250#10822615 (10Jclark-ctr) 05Open→03Resolved [15:16:34] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ml-serve1004.mgmt:22 - https://phabricator.wikimedia.org/T394160#10822619 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:16:49] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ml-serve1004.mgmt:22 - https://phabricator.wikimedia.org/T394160#10822621 (10Jclark-ctr) pings from cumin [15:17:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:17:22] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:17:31] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for conf1009.mgmt:22 - https://phabricator.wikimedia.org/T394136#10822637 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr pings from cumin [15:17:53] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1097.mgmt:22 - https://phabricator.wikimedia.org/T394135#10822641 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr pings from cumin [15:18:38] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker1168.mgmt:22 - https://phabricator.wikimedia.org/T394134#10822645 (10Jclark-ctr) 
05Open→03Resolved a:03Jclark-ctr pings from cumin [15:22:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10822649 (10Jclark-ctr) a:03Jclark-ctr [15:22:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10822650 (10Jclark-ctr) [15:25:07] !log removing varnishkafka from magru (T393772) [15:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:11] T393772: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772 [15:27:02] (03CR) 10Herron: [C:03+1] "I'm happy to pick this one up and deploy it early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) (owner: 10Filippo Giunchedi) [15:31:53] (03CR) 10Jgiannelos: [C:03+1] kartotherian: simplify the readinessProble's path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:14] (03PS1) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [15:38:33] (03PS1) 10Fabfur: varnishkafka: disable webrequest monitoring if ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) [15:38:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10822744 (10thcipriani) [15:39:17] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146008 
(https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [15:40:27] (03CR) 10Ssingh: varnishkafka: disable webrequest monitoring if ensure => absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [15:42:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm [15:43:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10822792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm [15:43:20] (03CR) 10Lucas Werkmeister (WMDE): "And deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [15:45:20] (03PS1) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) [15:45:24] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:55] 10ops-codfw, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in codfw - https://phabricator.wikimedia.org/T394108#10822812 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted the last switch [15:46:23] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:46:29] (03CR) 10CI reject: [V:04-1] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:46:34] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: provision the 
postgresql-airflow-dev PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145907 (https://phabricator.wikimedia.org/T394039) (owner: 10Brouberol) [15:46:49] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: provision the postgresql-airflow-dev PG cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145907 (https://phabricator.wikimedia.org/T394039) (owner: 10Brouberol) [15:47:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [15:47:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [15:48:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393784#10822853 (10Jhancock.wm) i don't think so. that would be great ty [15:48:19] (03PS2) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) [15:49:23] (03CR) 10CI reject: [V:04-1] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:56:42] (03PS1) 10Gergő Tisza: [noop] Set $wgCentralAuthRestrictSharedDomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146013 (https://phabricator.wikimedia.org/T391270) [15:58:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:58:28] (03CR) 10Eevans: [C:03+1] Enable profile::auto_restarts::service for cortobot [puppet] - 10https://gerrit.wikimedia.org/r/1145808 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:58:29] (03CR) 10Cathal Mooney: [C:03+1] "Should be fine but please add it in Netbox as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1146001 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [16:00:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:00:54] (03CR) 10Cathal Mooney: [C:03+1] "You're more in tune with the new alerts but seems good to me yep." [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [16:01:56] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10822959 (10JTweed-WMF) @Bmueller I think this might need your approval. [16:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:57] (03CR) 10Cathal Mooney: [C:03+1] "Not 100% familiar with this syntax but reading it makes sense, those are the two device patterns we need." 
[puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [16:05:33] (03PS1) 10Andrew Bogott: Openstack: forward recent dalmatian changes to epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1146016 [16:08:07] (03CR) 10Andrew Bogott: [C:03+2] Openstack: forward recent dalmatian changes to epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1146016 (owner: 10Andrew Bogott) [16:09:56] (03CR) 10Brouberol: [C:03+2] analytics-hive: Enable lock transaction management in prod hive metastore [puppet] - 10https://gerrit.wikimedia.org/r/1145762 (https://phabricator.wikimedia.org/T386854) (owner: 10Brouberol) [16:11:36] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [16:11:39] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [16:14:05] (03CR) 10Hnowlan: [C:03+1] P:mw::maint::backfill_localaccounts: backfillLocalAccounts-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143227 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [16:14:23] (03CR) 10Hnowlan: [C:03+1] P:mw::maint::backfill_localaccounts: backfillLocalAccounts-loginwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143226 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [16:16:14] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:19:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo [16:19:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo [16:20:29] (03CR) 10Cyndywikime: [C:03+1] [Growth] eswiki: Bump mentorship to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm) [16:22:19] 10ops-eqiad, 06SRE, 06DC-Ops: Reboot of in rack mgmt switches in eqiad - https://phabricator.wikimedia.org/T394109#10823065 (10Papaul) 05Open→03Resolved a:03Papaul I did some trace with @VRiley-WMF to fix some misconfiguration on msw1-eqiad see below for diff. All the interface are now up so we can... [16:23:54] (03CR) 10BCornwall: [C:03+2] admin: Add esanders to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1145327 (https://phabricator.wikimedia.org/T393724) (owner: 10BCornwall) [16:23:56] (03CR) 10BCornwall: [C:03+2] admin: Add thiemowmde to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1145325 (https://phabricator.wikimedia.org/T393798) (owner: 10BCornwall) [16:24:00] (03PS1) 10Federico Ceratto: zarcillo: values.yaml: Add FQDN for SNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146018 (https://phabricator.wikimedia.org/T384212) [16:24:00] (03CR) 10Federico Ceratto: "As discussed on IRC, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146018 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [16:27:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393784#10823079 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T393784#10822853, @Jhancock.wm wrote: > i don't think so. that would be gre... 
[16:27:30] 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c6-codfw: PEM 0 Not Powered - https://phabricator.wikimedia.org/T394261#10823082 (10Papaul) @Jhancock.wm all good you can resolve the task ` sw1-c6-codfw> show chassis alarms No alarms currently active [16:27:43] (03PS5) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) [16:31:22] (03CR) 10Clément Goubert: [C:03+1] zarcillo: values.yaml: Add FQDN for SNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146018 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [16:34:33] (03PS1) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) [16:35:51] (03CR) 10CI reject: [V:04-1] Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [16:36:05] (03PS1) 10Fabfur: hiera: enable vk monitoring in magru to actually remove it [puppet] - 10https://gerrit.wikimedia.org/r/1146021 (https://phabricator.wikimedia.org/T393772) [16:38:33] (03CR) 10Fabfur: varnishkafka: disable webrequest monitoring if ensure => absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [16:38:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146021 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [16:39:43] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: values.yaml: Add FQDN for SNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146018 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [16:40:48] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install X - https://phabricator.wikimedia.org/T394333 
(10RobH) 03NEW [16:41:08] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install X - https://phabricator.wikimedia.org/T394333#10823162 (10RobH) [16:41:13] (03Merged) 10jenkins-bot: zarcillo: values.yaml: Add FQDN for SNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146018 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [16:41:30] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10823165 (10RobH) [16:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:41:51] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10823166 (10RobH) [16:42:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission restbase10[28-30].eqiad.wmnet - https://phabricator.wikimedia.org/T393617#10823167 (10VRiley-WMF) [16:42:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968#10823168 (10BCornwall) Thank you! [16:43:04] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10823169 (10RobH) a:03Andrew @andrew, Please double check the racking details, as I took the old details from the planned order of 12... 
[16:43:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet [16:46:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission restbase10[28-30].eqiad.wmnet - https://phabricator.wikimedia.org/T393617#10823188 (10VRiley-WMF) [16:46:52] (03CR) 10Vgutierrez: [C:03+1] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [16:46:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:01] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10823190 (10Eevans) [16:49:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission restbase10[28-30].eqiad.wmnet - https://phabricator.wikimedia.org/T393617#10823201 (10VRiley-WMF) 05Open→03Resolved [16:51:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10823215 (10Jclark-ctr) @MatthewVernon might need some help I am stuck unable to image. Boss card Virtual drive is not showing up under boot order will not image at this time. 
[16:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:56:34] !log updating nameservers for wiki.gives in Markmonitor to set up delegation: T379318 [16:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:38] T379318: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318 [17:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1700). [17:00:11] o/ [17:00:18] I'll be getting started in a few minutes [17:06:03] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1143226 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [17:06:09] (03CR) 10Scott French: [C:03+2] P:mw::maint::backfill_localaccounts: backfillLocalAccounts-loginwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143226 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [17:12:43] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:12:54] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:21:10] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10823395 (10BCornwall) Thanks for doing that! @Jdforrester-WMF Can you advise us what groups Cory will be need to be added to, and can you approve the addition to that group? Thanks! 
[17:22:25] (03CR) 10Ssingh: varnishkafka: disable webrequest monitoring if ensure => absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [17:22:49] (03CR) 10Ssingh: [C:03+1] "(Since this will eventually go away, I am fine with your proposed approach.)" [puppet] - 10https://gerrit.wikimedia.org/r/1146021 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [17:23:40] (03CR) 10Ssingh: [C:03+1] varnishkafka: disable webrequest monitoring if ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [17:26:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10823420 (10ssingh) OK, should be fine. We will follow up with a confirmation after discussing this in the Traffic meeting on Tuesday. [17:27:00] (03CR) 10Fabfur: [C:03+2] varnishkafka: disable webrequest monitoring if ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/1146008 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [17:29:03] (03CR) 10Scott French: [C:03+2] P:mw::maint::backfill_localaccounts: backfillLocalAccounts-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143227 (https://phabricator.wikimedia.org/T385866) (owner: 10Scott French) [17:29:47] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10823426 (10BCornwall) [17:30:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [17:30:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [17:32:26] (03CR) 10Ayounsi: "+1 for link protection, but not sure about BFD, as they're directly connected links I don't see the value of adding an extra protocol." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [17:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:35:42] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:35:52] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:39:55] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10823442 (10BCornwall) [17:40:32] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10823444 (10Esanders) |cn |[Esanders] |mail |[esanders@wikimedia.org] |memberOf |[cn=project-visualeditor,ou=groups,dc=wikimedia,dc=org, cn=project-bastion,ou=groups,dc=wikimedia,dc=org, cn=wm... [17:42:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission fran1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T393813#10823446 (10VRiley-WMF) [17:42:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission fran1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T393813#10823451 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Hey @Jgreen everything was perfect. This has been decomm'd Thank you! 
[17:43:44] (03PS1) 10BCornwall: admin: Add jtweed to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1146027 (https://phabricator.wikimedia.org/T394308) [17:43:59] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10823456 (10VRiley-WMF) [17:44:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10823460 (10VRiley-WMF) 05Open→03Resolved [17:44:24] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10823466 (10BCornwall) a:03Bmueller [17:45:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823468 (10VRiley-WMF) [17:46:18] (03CR) 10Andrew Bogott: [C:03+2] "done" [puppet] - 10https://gerrit.wikimedia.org/r/1146001 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [17:47:34] (03PS1) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) [17:48:24] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1068.eqiad.wmnet with OS bookworm [17:48:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm [17:53:33] (03PS4) 10Krinkle: Stats: Add temporary deprecation for addLabel() normalization [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) 
[17:57:30] (03CR) 10CI reject: [V:04-1] Stats: Add temporary deprecation for addLabel() normalization [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [17:58:54] (03PS5) 10Krinkle: Stats: Add temporary deprecation for addLabel() normalization [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) [18:00:05] jnuche and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T1800). nyaa~ [18:00:21] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [18:02:09] (03CR) 10BCornwall: [C:03+1] templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [18:02:10] !log aokoth@cumin1002 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [18:03:01] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10823531 (10Jdforrester-WMF) >>! In T393140#10823394, @BCornwall wrote: > Thanks for doing that! > > @Jdforrester-WMF Can you advise us what groups Cory will be need to be added to, and... [18:11:22] (03CR) 10Ssingh: "Plan is to merge tomorrow (May 15)." [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [18:11:35] (03CR) 10Jforrester: [C:03+2] Stats: Add temporary deprecation for addLabel() normalization [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:11:40] (03CR) 10Jforrester: [C:03+1] "Oops." 
[core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:15:02] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1069 [18:15:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1069 [18:15:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:18:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145274 (https://phabricator.wikimedia.org/T392520) (owner: 10Jdlrobson) [18:18:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145273 (https://phabricator.wikimedia.org/T393386) (owner: 10Jdlrobson) [18:19:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133) (owner: 10Jdlrobson) [18:22:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823646 (10VRiley-WMF) [18:25:08] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1070 [18:25:16] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1070 [18:26:04] !log vriley@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host cloudvirt1070 [18:26:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1070 [18:27:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:29:26] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:32:15] (03PS1) 10BCornwall: admin: SSH key rotation for cmassaro [puppet] - 10https://gerrit.wikimedia.org/r/1146033 (https://phabricator.wikimedia.org/T393140) [18:36:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10823657 (10BCornwall) [18:37:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10823659 (10BCornwall) 05Stalled→03In progress [18:37:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10823664 (10taavi) [18:37:42] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:38:00] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10823668 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [18:38:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:39:41] !log vriley@cumin1002 START - Cookbook 
sre.hosts.provision for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:40:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for thiemowmde - https://phabricator.wikimedia.org/T393798#10823699 (10BCornwall) 05In progress→03Resolved a:03BCornwall The access is now live. Feel free to re-open if there's anything I missed. Thanks! [18:41:42] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10823703 (10BCornwall) 05In progress→03Resolved a:03BCornwall The access changes are now live. Please re-open if I missed anything. Thanks! [18:47:18] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1055 to cirrussearch1055 [18:47:43] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:48:57] (03CR) 10Jeena Huneidi: "Is this ready for backport?" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:49:08] (03CR) 10Krinkle: "Yep" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:49:49] (03CR) 10Jeena Huneidi: "Okay, I'll start it now then if there are no objections" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:52:04] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1055 to cirrussearch1055 - bking@cumin2002" [18:53:34] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1055 to cirrussearch1055 - bking@cumin2002" [18:53:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:35] !log bking@cumin2002 START - 
Cookbook sre.dns.wipe-cache cirrussearch1055 on all recursors [18:53:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1055 on all recursors [18:53:39] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1055 [18:54:52] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1055 [18:55:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1055 to cirrussearch1055 [18:56:11] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1056 to cirrussearch1056 [18:56:25] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:56:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1055.eqiad.wmnet with OS bullseye [18:56:33] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1055 [18:56:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1055 [18:56:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [18:56:45] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1068.eqiad.wmnet with OS bookworm [18:56:50] (03CR) 10Eevans: "I do appreciate that!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1145216 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [18:56:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823714 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm [19:00:58] (03Merged) 10jenkins-bot: Stats: Add temporary deprecation for addLabel() normalization [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146029 (https://phabricator.wikimedia.org/T394053) (owner: 10Krinkle) [19:01:23] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1146029|Stats: Add temporary deprecation for addLabel() normalization (T394053)]] [19:01:27] T394053: PHP Warning: Invalid label key: 'same-wt' - https://phabricator.wikimedia.org/T394053 [19:01:48] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1056 to cirrussearch1056 - bking@cumin2002" [19:02:07] vriley@cumin1002 provision (PID 2110396) is awaiting input [19:02:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1056 to cirrussearch1056 - bking@cumin2002" [19:02:25] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:02:25] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1056 on all recursors [19:02:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1056 on all recursors [19:02:29] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1056 [19:02:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1069.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:03:47] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1056 [19:04:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1056 to cirrussearch1056 [19:04:37] bking@cumin2002 reimage (PID 3228696) is awaiting input [19:05:16] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10823734 (10thcipriani) >>! In T393724#10823444, @Esanders wrote: > |cn |[Esanders] > |mail |[esanders@wikimedia.org] > |memberOf |[cn=project-visualeditor,ou=groups,dc=wikimedia,dc=org, c... [19:05:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:05:55] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:07:19] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [19:08:18] !log jhuneidi@deploy1003 jhuneidi, krinkle: Backport for [[gerrit:1146029|Stats: Add temporary deprecation for addLabel() normalization (T394053)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:08:21] T394053: PHP Warning: Invalid label key: 'same-wt' - https://phabricator.wikimedia.org/T394053 [19:10:08] !log jhuneidi@deploy1003 jhuneidi, krinkle: Continuing with sync [19:11:40] (03PS1) 10TChin: [eventgate-analytics-external] bump version v1.12.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146038 (https://phabricator.wikimedia.org/T391959) [19:12:40] (03PS2) 10AOkoth: wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) [19:13:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1056.eqiad.wmnet with OS bullseye [19:13:36] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1056 [19:13:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1056 [19:14:11] (03CR) 10Dr0ptp4kt: [C:03+2] [eventgate-analytics-external] bump version v1.12.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146038 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [19:14:35] vriley@cumin1002 reimage (PID 2111151) is awaiting input [19:15:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1055.eqiad.wmnet with OS bullseye [19:15:54] (03Merged) 10jenkins-bot: [eventgate-analytics-external] bump version v1.12.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146038 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [19:16:17] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1055.eqiad.wmnet with OS bullseye [19:16:39] !log vriley@cumin1002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1068.eqiad.wmnet with OS bookworm [19:16:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823795 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm executed... [19:16:48] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146029|Stats: Add temporary deprecation for addLabel() normalization (T394053)]] (duration: 15m 24s) [19:16:51] T394053: PHP Warning: Invalid label key: 'same-wt' - https://phabricator.wikimedia.org/T394053 [19:17:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_ulsfo [19:17:56] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:19:04] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1070.eqiad.wmnet with OS bookworm [19:19:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm [19:19:21] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1068.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:19:48] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:20:39] (03PS3) 10Andrew Bogott: Openstack config templates: move [keystone_authtoken] out of common template [puppet] - 10https://gerrit.wikimedia.org/r/1145321 [19:20:39] (03PS10) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 
10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [19:21:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [19:21:18] bking@cumin2002 reimage (PID 3236857) is awaiting input [19:21:24] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1055.eqiad.wmnet with OS bullseye [19:21:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_ulsfo [19:21:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823844 (10VRiley-WMF) [19:24:25] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:25:41] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:26:33] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:27:23] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:28:13] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:28:56] (03PS1) 10DDesouza: Design Research participant recruitment survey on eswiki: Pre-deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146039 (https://phabricator.wikimedia.org/T394315) [19:29:44] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:31:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1068.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:16] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1068.eqiad.wmnet with OS bookworm [19:32:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm [19:33:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146039 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [19:36:10] vriley@cumin1002 reimage (PID 2114517) is awaiting input [19:36:44] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1070.eqiad.wmnet with OS bookworm [19:36:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823931 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm executed... 
[19:38:03] (03PS1) 10Bking: elastic/cirrussearch: prepare hosts for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1146041 (https://phabricator.wikimedia.org/T388610) [19:40:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_drmrs [19:40:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_drmrs [19:41:30] (03CR) 10BCornwall: [C:03+2] cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [19:41:48] (03PS5) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [19:43:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146041 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:43:53] (03CR) 10JHathaway: "@ltoscano@wikimedia.org would love if you could take another look at the updated patch" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [19:45:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:47:11] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1068.eqiad.wmnet with reason: host reimage [19:47:54] (03PS2) 10Bking: elastic/cirrussearch: prepare hosts for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1146041 (https://phabricator.wikimedia.org/T388610) [19:48:22] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1070.eqiad.wmnet with OS bookworm [19:48:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823976 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm [19:49:00] (03CR) 10CI reject: [V:04-1] elastic/cirrussearch: prepare hosts for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1146041 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:49:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1056.eqiad.wmnet with OS bullseye [19:50:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:50:36] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1069.eqiad.wmnet with OS bookworm [19:50:48] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1068.eqiad.wmnet with reason: host reimage [19:50:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10823982 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm [19:51:47] (03PS2) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) [19:51:58] (03CR) 10JHathaway: systemd: validate units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [19:52:03] (03PS3) 10Bking: elastic/cirrussearch: prepare hosts for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1146041 (https://phabricator.wikimedia.org/T388610) [19:53:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [19:53:41] (03PS6) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [19:55:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 (owner: 10C. Scott Ananian) [19:55:58] (03PS2) 10C. Scott Ananian: Remove ParserMigration configuration that matches defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 [19:58:42] (03PS2) 10Ebernhardson: Revert^2 "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1145276 [19:59:37] (03CR) 10Eevans: [C:03+2] cassandra-jbod.cfg preseed: grow volume to fill space [puppet] - 10https://gerrit.wikimedia.org/r/1145198 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T2000). Please do the needful. [20:00:05] Jdlrobson, danisztls, and cscott: A patch you scheduled for UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] o/ [20:00:16] (03PS2) 10Ebernhardson: Revert^2 "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1145277 [20:00:45] o/ [20:00:53] Party time [20:01:04] o/ [20:01:08] is there a google meet? [20:01:18] https://meet.google.com/def-fvit-mwk [20:01:49] o/ [20:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:10] join us in the meet cscott. it's super fun times [20:02:57] i'm not allowed to have fun. i'm working. [20:03:06] (03CR) 10Bking: [C:03+2] Revert^2 "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1145276 (owner: 10Ebernhardson) [20:04:04] (03PS1) 10Eevans: cassandra-dev2001: update paths for new mount points [puppet] - 10https://gerrit.wikimedia.org/r/1146057 (https://phabricator.wikimedia.org/T391544) [20:07:47] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:08:01] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [20:08:15] vriley@cumin1002 reimage (PID 2118268) is awaiting input [20:08:16] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10824029 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2001.... 
[20:09:26] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: update paths for new mount points [puppet] - 10https://gerrit.wikimedia.org/r/1146057 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:09:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145274 (https://phabricator.wikimedia.org/T392520) (owner: 10Jdlrobson) [20:09:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145273 (https://phabricator.wikimedia.org/T393386) (owner: 10Jdlrobson) [20:09:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133) (owner: 10Jdlrobson) [20:10:53] vriley@cumin1002 reimage (PID 2117347) is awaiting input [20:11:41] (03CR) 10Bking: [C:03+2] etcd data for search-{psi,omega} dns discovery [puppet] - 10https://gerrit.wikimedia.org/r/1145278 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:15:55] (03Merged) 10jenkins-bot: Add ArticleSummaries to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145274 (https://phabricator.wikimedia.org/T392520) (owner: 10Jdlrobson) [20:16:34] (03CR) 10Bking: [C:03+2] Revert^2 "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1145277 (owner: 10Ebernhardson) [20:16:44] (03Merged) 10jenkins-bot: Expand dark mode access for anons (May 2025 deployments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145273 (https://phabricator.wikimedia.org/T393386) (owner: 10Jdlrobson) [20:17:36] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:19:25] (03PS1) 10Andrew Bogott: Dummy certs and keys for Openstack Octavia [labs/private] - 10https://gerrit.wikimedia.org/r/1146060 (https://phabricator.wikimedia.org/T393783) 
[20:19:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm [20:19:39] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10824068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm [20:19:51] (03Merged) 10jenkins-bot: Nearby should show file namespace on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141904 (https://phabricator.wikimedia.org/T52133) (owner: 10Jdlrobson) [20:20:13] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1145274|Add ArticleSummaries to beta cluster (T392520)]], [[gerrit:1145273|Expand dark mode access for anons (May 2025 deployments) (T393386)]], [[gerrit:1141904|Nearby should show file namespace on Commons (T52133)]] [20:20:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:20:18] T392520: Deploy extension:ArticleSummaries to beta cluster - https://phabricator.wikimedia.org/T392520 [20:20:19] T393386: Dark mode updates (May 2025) - https://phabricator.wikimedia.org/T393386 [20:20:19] T52133: Make Nearby on Commons search NS_FILE namespace - https://phabricator.wikimedia.org/T52133 [20:21:41] https://libera.chat/guides/clients has some web clients which could potentially be embedded [20:23:24] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [20:24:08] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search-psi [20:24:39] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search-omega [20:24:55] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search [20:25:11] !log jdrewniak@deploy1003 jdlrobson, jdrewniak: Backport for [[gerrit:1145274|Add ArticleSummaries to beta cluster (T392520)]], 
[[gerrit:1145273|Expand dark mode access for anons (May 2025 deployments) (T393386)]], [[gerrit:1141904|Nearby should show file namespace on Commons (T52133)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:25:53] !log jdrewniak@deploy1003 jdlrobson, jdrewniak: Continuing with sync [20:26:00] !log bking@dns1004 START - running authdns-update [20:26:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [20:31:18] (03CR) 10Bking: [C:03+2] search: Update dnsdisc envoy upstreams [puppet] - 10https://gerrit.wikimedia.org/r/1143622 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:32:43] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145274|Add ArticleSummaries to beta cluster (T392520)]], [[gerrit:1145273|Expand dark mode access for anons (May 2025 deployments) (T393386)]], [[gerrit:1141904|Nearby should show file namespace on Commons (T52133)]] (duration: 12m 30s) [20:32:49] T392520: Deploy extension:ArticleSummaries to beta cluster - https://phabricator.wikimedia.org/T392520 [20:32:49] T393386: Dark mode updates (May 2025) - https://phabricator.wikimedia.org/T393386 [20:32:49] T52133: Make Nearby on Commons search NS_FILE namespace - https://phabricator.wikimedia.org/T52133 [20:33:42] (03PS2) 10Cathal Mooney: Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) [20:34:35] (03CR) 10Cathal Mooney: "Yeah that's a fair point tbh. Probably some really esoteric scenario in which the link doesn't fail but comms do, but yep. Latest patch " [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [20:34:47] danisztls: you're up! 
[20:36:17] !log sukhe@dns1004 START - running authdns-update [20:36:38] oh yeah, page [20:36:40] nice [20:36:53] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1070.eqiad.wmnet with OS bookworm [20:37:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm executed... [20:37:04] !incidents [20:37:04] No incidents occurred in the past 24 hours for team SRE [20:37:07] that's weird [20:37:20] the good thing is the page works :P [20:37:39] <_joe_> what's going on? [20:37:40] (03PS11) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [20:38:10] _joe_: nothing that should affect prod traffic or anything else. debugging a discovery record issue. [20:38:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 (owner: 10C. Scott Ananian) [20:38:31] <_joe_> I'm getting email alerts [20:38:32] interesting, I didn't get a page, was there something? 
[20:38:45] I see it in email, yeah [20:38:49] yeah surprising, there should be a page but not email [20:38:49] (03CR) 10CI reject: [V:04-1] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [20:38:56] I will debug that later [20:39:36] (03PS1) 10Bking: Revert^3 "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1146061 [20:39:42] (03CR) 10Bking: [V:03+2 C:03+2] Revert^3 "search: add discovery records for secondary clusters" [dns] - 10https://gerrit.wikimedia.org/r/1146061 (owner: 10Bking) [20:39:50] vriley@cumin1002 reimage (PID 2118415) is awaiting input [20:40:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1075.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:40:15] (03PS1) 10Bking: Revert^3 "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1146062 [20:40:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1075.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:40:26] (03CR) 10Bking: [V:03+2 C:03+2] Revert^3 "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1146062 (owner: 10Bking) [20:40:41] FIRING: [3x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-search-chi.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:41:09] (03PS2) 10Bking: Revert^3 "search: cname specific search 
clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1146062 [20:41:18] (03CR) 10Bking: [V:03+2 C:03+2] Revert^3 "search: cname specific search clusters to the lvs pool" [dns] - 10https://gerrit.wikimedia.org/r/1146062 (owner: 10Bking) [20:41:19] !log sukhe@dns1004 START - running authdns-update [20:41:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1069.eqiad.wmnet with OS bookworm [20:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:41:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824191 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm executed... 
[20:42:19] jclark@cumin1002 reimage (PID 2124084) is awaiting input [20:45:41] FIRING: [16x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-search-chi.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:46:09] (03PS1) 10Bking: Revert "search: Update dnsdisc envoy upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1146063 [20:46:17] (03CR) 10Bking: [V:03+2 C:03+2] Revert "search: Update dnsdisc envoy upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1146063 (owner: 10Bking) [20:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:21] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:47:55] (03Merged) 10jenkins-bot: Remove ParserMigration configuration that matches defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 (owner: 10C. 
Scott Ananian) [20:48:18] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1137443|Remove ParserMigration configuration that matches defaults]] [20:50:41] FIRING: [16x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-search-chi.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:50:48] (03CR) 10Michael Große: [C:03+1] [Growth] eswiki: Bump mentorship to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm) [20:50:57] FIRING: ProbeDown: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#search-psi-https:9643 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:14] got that one! [20:51:15] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10824238 (10Papaul) The server has be relocated and re -imaged since May8th and we haven't seen any issue so far. Can we please put the server back in production to see if we do have the same issue whe... [20:51:17] !ack 6122 [20:51:18] 6122 (ACKED) ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad) [20:51:22] thanks [20:51:23] inflatador: ^ [20:51:31] is that expected? [20:51:38] we really did not change anything in that? 
[20:51:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm) [20:51:56] FIRING: ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#search-https:9243 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:52:09] !log sukhe@dns1004 START - running authdns-update [20:52:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1070.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:52:22] bking@cumin2002 rename (PID 3279126) is awaiting input [20:52:33] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1074 to cirrussearch1074 [20:52:47] !log cscott@deploy1003 cscott: Backport for [[gerrit:1137443|Remove ParserMigration configuration that matches defaults]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:52:52] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1070.eqiad.wmnet with OS bookworm [20:52:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:53:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824252 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm [20:53:23] !log sukhe@dns1004 END - running 
authdns-update [20:53:31] !log gdnsd reload issues should be fixed [20:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:33] inflatador: do you need help with this page? [20:54:01] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for 16 hosts [20:54:01] FIRING: [3x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:08] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 16 hosts [20:54:24] rzl I might need help with conftool, this is from a failed dns discovery deploy that sukhe and myself are rolling back [20:54:43] inflatador: looking now since gdnsd is back up [20:54:51] !log cscott@deploy1003 cscott: Continuing with sync [20:55:13] nod, here if the two of you need more hands :) [20:55:41] RESOLVED: [16x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-search-chi.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:55:47] inflatador: ^ [20:55:54] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [20:55:57] RESOLVED: ProbeDown: Service search-psi-https:9643 has failed probes (http_search-psi-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#search-psi-https:9643 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:01] cool :) [20:56:01] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - 
https://phabricator.wikimedia.org/T391544#10824263 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... [20:56:08] Looks like you fixed it sukhe ! [20:56:30] inflatador: I guess we will save our joy for when we roll this out successfully :P [20:56:38] but yes please, let's set up some time for this and go over it [20:56:55] will do, thanks again [20:58:17] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1074 to cirrussearch1074 - bking@cumin2002" [20:58:28] rzl: thanks :) [20:58:35] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1074 to cirrussearch1074 - bking@cumin2002" [20:58:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:58:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1074 on all recursors [20:58:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1074 on all recursors [20:58:40] would be good to know why the emails and not the page but I am saving that for tomorrow [20:58:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1074 [20:59:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1069.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:59:11] a mystery yeah [20:59:48] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1069.eqiad.wmnet with OS bookworm [20:59:53] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1074 [20:59:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - 
https://phabricator.wikimedia.org/T382492#10824286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm [21:00:06] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T2100) [21:00:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1074 to cirrussearch1074 [21:01:03] danisztls: cscott's deploy is just wrapping up here and then i can run your config change through [21:01:27] FIRING: [2x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:28] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137443|Remove ParserMigration configuration that matches defaults]] (duration: 13m 10s) [21:01:35] hmm [21:01:37] !incidents [21:01:38] 6123 (UNACKED) ProbeDown sre (10.2.2.30 ip4 probes/service eqiad) [21:01:38] 6122 (RESOLVED) ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad) [21:01:40] !ack 6123 [21:01:41] 6123 (ACKED) ProbeDown sre (10.2.2.30 ip4 probes/service eqiad) [21:02:24] o/ here too if additional hands / eyes are needed [21:02:34] Are these still firing? Apologies all [21:02:37] thanks! this did resolve before and we thought it was fixed [21:02:39] but clearly not [21:02:43] inflatador: any idea why it is failing now? [21:03:10] (03PS2) 10Andrew Bogott: Dummy certs and keys for Openstack Octavia [labs/private] - 10https://gerrit.wikimedia.org/r/1146060 (https://phabricator.wikimedia.org/T393783) [21:03:19] swfrench-wmf inflatador sukhe: any reason to pause deploys at the moment? backport window has one left to go. 
[21:03:21] sukhe probably those confctl commands we were talking about in #traffic [21:03:34] brennen: no, should not be related. thanks for checking [21:03:36] brennen no, these are noisy alerts, not an actual problem [21:03:41] cool, thx, going ahead [21:03:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146039 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [21:03:54] brennen: ok [21:04:10] inflatador: looking [21:04:35] danisztls: will ping when ready to test [21:04:40] (03PS3) 10Cathal Mooney: Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) [21:05:15] inflatador: what was the conftool command you ran? [21:05:38] confctl --object-type discovery select 'dnsdisc=seach-psi' set/pooled=true ? [21:05:50] search [21:05:55] (03Merged) 10jenkins-bot: Design Research participant recruitment survey on eswiki: Pre-deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146039 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [21:06:03] sukhe ` sudo confctl --object-type discovery select 'dnsdisc=search-psi' set/pooled=true` ` sudo confctl --object-type discovery select 'dnsdisc=search' set/pooled=true` ` sudo confctl --object-type discovery select 'dnsdisc=search-omega' set/pooled=true` [21:06:11] thanks [21:06:12] !log sukhe@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=seach-psi [21:06:18] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1146039|Design Research participant recruitment survey on eswiki: Pre-deploy (T394315)]] [21:06:21] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [21:06:27] RESOLVED: [3x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:34] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [21:06:44] I am not sure that fixed it [21:06:48] not happy :) [21:06:56] FIRING: [3x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:07:04] !log sukhe@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=search-psi [21:07:19] dnsdisc=seach-psi (notice the missing "r") [21:07:22] sukhe Should we revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145278 ? [21:07:36] !log sukhe@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=search-omega [21:07:50] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage [21:08:11] inflatador: yeah doing it I guess. [21:08:21] since the pag.e is pretty obviously about search-psi? 
[21:08:30] ACK, reverting now [21:08:40] done [21:08:41] (03PS1) 10Ssingh: Revert "etcd data for search-{psi,omega} dns discovery" [puppet] - 10https://gerrit.wikimedia.org/r/1146075 [21:08:56] !log sukhe@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=search [21:09:01] FIRING: [3x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:16] creating a silence for this so it doesn't page again [21:09:56] sukhe: inflatador: `target=https://[10.2.2.30]:9243/ msg="Error for HTTP request" err="Get \"https://10.2.2.30:9243/\": x509: certificate is valid for search.discovery.wmnet, search.svc.eqiad.wmnet, elastic1063.eqiad.wmnet, not search-chi.discovery.wmnet"` [21:10:02] (03PS2) 10Ryan Kemper: Revert "etcd data for search-{psi,omega} dns discovery" [puppet] - 10https://gerrit.wikimedia.org/r/1146075 (https://phabricator.wikimedia.org/T143553) (owner: 10Ssingh) [21:10:03] sukhe where did you see that 'seach' typo? Was that in one of the patches? [21:10:07] (03CR) 10Ryan Kemper: [C:03+2] Revert "etcd data for search-{psi,omega} dns discovery" [puppet] - 10https://gerrit.wikimedia.org/r/1146075 (https://phabricator.wikimedia.org/T143553) (owner: 10Ssingh) [21:10:09] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] Revert "etcd data for search-{psi,omega} dns discovery" [puppet] - 10https://gerrit.wikimedia.org/r/1146075 (https://phabricator.wikimedia.org/T143553) (owner: 10Ssingh) [21:10:11] so, it looks like the SANs are wrong in the cert [21:10:13] inflatador: no, that was my typo for running confctl! 
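The `seach-psi` typo above was logged as a normal conftool action, so nothing flagged it at the time. A minimal sketch of a pre-flight check follows, assuming the only valid discovery names are the three that appear in this log (`search`, `search-psi`, `search-omega`). The helper `build_confctl_args` and the `KNOWN_DNSDISC` list are hypothetical and not part of conftool; the argument layout copies the commands quoted in the channel, and the sketch only builds the argument list without running anything.

```python
import difflib

# Assumption: the discovery names mentioned in this log are the full valid set.
KNOWN_DNSDISC = ["search", "search-psi", "search-omega"]


def build_confctl_args(name, pooled):
    """Build the confctl argument list, rejecting unknown dnsdisc names.

    Hypothetical helper: it mirrors the command shape quoted in the log
    and refuses names that are not in KNOWN_DNSDISC.
    """
    if name not in KNOWN_DNSDISC:
        # Suggest the closest known name so a typo is easy to spot.
        hint = difflib.get_close_matches(name, KNOWN_DNSDISC, n=1)
        suffix = f"; did you mean {hint[0]!r}?" if hint else ""
        raise ValueError(f"unknown dnsdisc name {name!r}{suffix}")
    value = "true" if pooled else "false"
    return [
        "confctl",
        "--object-type", "discovery",
        "select", f"dnsdisc={name}",
        f"set/pooled={value}",
    ]


if __name__ == "__main__":
    print(build_confctl_args("search-psi", pooled=True))
    try:
        build_confctl_args("seach-psi", pooled=False)
    except ValueError as err:
        print(err)
```

A fixed allow-list is the simplest guard; in practice the list of valid names would have to come from wherever the discovery objects are defined.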
[21:10:38] Merged etcd data revert [21:10:39] but it resolved itself after that, so my confctl command had nothing to do with the resolve (because it had a typo) [21:10:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage [21:11:30] swfrench-wmf that's even weirder, we shouldn't have search-chi.discovery.wmnet at all [21:11:37] swfrench-wmf: yeah, that is weird [21:11:53] That was part of the reason we did a prior rollback. Stale data somewhere? [21:12:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824310 (10VRiley-WMF) [21:12:16] inflatador: if puppet has not run on prometheus*, then it'll keep probing that [21:12:54] (03PS12) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:13:00] !log brennen@deploy1003 brennen, dani: Backport for [[gerrit:1146039|Design Research participant recruitment survey on eswiki: Pre-deploy (T394315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:03] swfrench-wmf ACK, running puppet on prom hosts now [21:13:03] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [21:13:18] danisztls: ready to test on a debug server [21:13:28] inflatador: it should be fixed in a few minutes organically by the timer, so no need [21:13:41] (assuming you've silenced, that is) [21:13:54] swfrench-wmf: yeah but I guess to check explicitly if it helps [21:14:05] (03CR) 10CI reject: [V:04-1] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:14:07] it is clearing up fwiw I think? 
[21:14:30] yep [21:14:36] https://grafana.wikimedia.org/goto/TQYfmFaNR?orgId=1 [21:14:48] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1069.eqiad.wmnet with reason: host reimage [21:15:13] brennen: looks good, thanks! [21:15:14] (03PS13) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:15:33] !log brennen@deploy1003 brennen, dani: Continuing with sync [21:15:38] goin' [21:16:30] (03CR) 10CI reject: [V:04-1] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:16:34] inflatador: looking good? [21:17:20] https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:25] psi-https still looks unhappy [21:17:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1069.eqiad.wmnet with reason: host reimage [21:19:05] sukhe yeah, I ran puppet on prom hosts everywhere, will keep looking. So far as I can tell this is all noise [21:19:26] seem all cleared up [21:19:33] but I am not fully convinced for some reason [21:19:41] something feels amiss [21:19:41] probes logstash looks clean now [21:19:44] swfrench-wmf: thanks [21:19:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824325 (10VRiley-WMF) [21:20:22] and good catch swfrench-wmf on the SAN stuff. We need to fix that for the old elastic hosts or wait till they're all gone [21:20:24] at least as far as the failing probes were concerned, these seem to have been the fact that the updated discovery names were missing from the SAN lists [21:21:19] yeah, the failed probes can be explained by that. 
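The probe error quoted earlier lists the names the certificate is valid for, and `search-chi.discovery.wmnet` is not among them. A minimal sketch of that check is below, assuming simple exact matching plus single-label wildcards. The function `san_matches` is hypothetical and is not the prober's actual verification code; the SAN list is copied from the error message in the log.

```python
def san_matches(hostname, san_list):
    """Return True if hostname matches any SAN entry.

    Simplified matching: case-insensitive exact match, or a leading
    "*." wildcard that covers exactly one leftmost label.
    """
    host = hostname.lower().rstrip(".")
    for san in san_list:
        pattern = san.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            # Drop the leftmost label of the host and compare the remainder.
            _, _, rest = host.partition(".")
            if rest and rest == pattern[2:]:
                return True
    return False


# SAN entries as reported in the probe error message in the log.
SANS_FROM_PROBE_ERROR = [
    "search.discovery.wmnet",
    "search.svc.eqiad.wmnet",
    "elastic1063.eqiad.wmnet",
]

if __name__ == "__main__":
    for name in ("search-chi.discovery.wmnet", "search.discovery.wmnet"):
        print(name, san_matches(name, SANS_FROM_PROBE_ERROR))
```

With the list from the error message, the first name fails and the second passes, which is consistent with the probes failing until the discovery names and the SAN lists agree.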
[21:21:32] yeah those were added in this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143633 but missing for elastic* hosts (as opposed to cirrussearch, the naming convention of the new opensearch cirrus hosts) [21:21:47] anyway it looks clean [21:22:03] and we will still need to figure out what was missing in the current deploy [21:22:08] most likely the ordering but let's be very sure [21:22:24] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146039|Design Research participant recruitment survey on eswiki: Pre-deploy (T394315)]] (duration: 16m 06s) [21:22:27] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [21:22:39] !log end of UTC late backport & config window (and spiderpig party) [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:43] agreed [21:22:56] thanks folks. [21:22:59] till next time :) [21:23:05] * sukhe out to cook food [21:23:18] later, thanks again! 
[21:24:23] (03PS14) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:24:38] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Dummy certs and keys for Openstack Octavia [labs/private] - 10https://gerrit.wikimedia.org/r/1146060 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:25:03] (03CR) 10Andrew Bogott: [C:03+2] Openstack config templates: move [keystone_authtoken] out of common template [puppet] - 10https://gerrit.wikimedia.org/r/1145321 (owner: 10Andrew Bogott) [21:25:40] (03CR) 10CI reject: [V:04-1] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.681s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:26:56] RESOLVED: [3x] ProbeDown: Service search-https:9243 has failed probes (http_search-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:33] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:28:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:28:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cloudvirt1070.eqiad.wmnet with OS bookworm [21:28:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm completed... [21:30:17] (03PS15) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:30:30] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:31:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1074.eqiad.wmnet with OS bullseye [21:31:05] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1074 [21:31:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1074 [21:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.681s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:31:26] (03CR) 10CI reject: [V:04-1] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:32:43] (03PS16) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:34:01] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:34:43] 10ops-codfw, 06DC-Ops: Q4:rack/setup/install sretest200N Config J 1P test host - https://phabricator.wikimedia.org/T394357 (10RobH) 03NEW [21:37:07] vriley@cumin1002 reimage (PID 2132564) is awaiting input [21:37:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [21:37:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1069.eqiad.wmnet with OS bookworm [21:37:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm completed... 
[21:37:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:39:28] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1071 [21:39:34] (03PS17) 10Andrew Bogott: Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) [21:39:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1071 [21:40:06] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1075 to cirrussearch1075 [21:40:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:40:30] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:41:41] (03PS1) 10Andrew Bogott: Octavia: Added a fake ca passphrase [labs/private] - 10https://gerrit.wikimedia.org/r/1146086 [21:42:19] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Octavia: Added a fake ca passphrase [labs/private] - 10https://gerrit.wikimedia.org/r/1146086 (owner: 10Andrew Bogott) [21:42:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:42:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:43:01] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:43:10] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1072 [21:43:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1072 [21:43:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED [21:44:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824395 (10VRiley-WMF) [21:45:36] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:37] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1074.eqiad.wmnet with reason: host reimage [21:46:34] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1073 [21:46:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1073 [21:47:21] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1075 to cirrussearch1075 - bking@cumin2002" [21:47:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:47:44] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [21:47:47] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [21:48:11] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1004.eqiad.wmnet with OS bookworm [21:48:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10824423 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm executed with errors: - 
apus-be... [21:48:56] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1075 to cirrussearch1075 - bking@cumin2002" [21:48:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:48:57] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1075 on all recursors [21:49:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1075 on all recursors [21:49:01] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1075 [21:49:20] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1074 [21:49:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1074.eqiad.wmnet with reason: host reimage [21:49:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1074 [21:50:13] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1075 [21:50:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1074.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:50:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1075 to cirrussearch1075 [21:51:43] (03CR) 10Andrew Bogott: [C:03+2] Openstack: rough in Octavia service for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1145312 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:51:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1075.eqiad.wmnet with OS bullseye [21:51:51] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1075 [21:51:51] !log bking@cumin2002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host cirrussearch1075 [21:57:35] (03PS1) 10Andrew Bogott: octavia: move octavia<->amphora key into /etc/octavia/certs [puppet] - 10https://gerrit.wikimedia.org/r/1146087 (https://phabricator.wikimedia.org/T394099) [21:57:38] 10ops-codfw, 06DC-Ops: hw troubleshooting: SSD Firmware update for frbackup2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T394359 (10Dwisehaupt) 03NEW [21:58:09] (03PS2) 10Andrew Bogott: octavia: move octavia<->amphora key into /etc/octavia/certs [puppet] - 10https://gerrit.wikimedia.org/r/1146087 (https://phabricator.wikimedia.org/T394099) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250514T2200) [22:00:58] (03PS3) 10Andrew Bogott: octavia: move octavia<->amphora key into /etc/octavia/certs [puppet] - 10https://gerrit.wikimedia.org/r/1146087 (https://phabricator.wikimedia.org/T394099) [22:02:20] (03CR) 10Andrew Bogott: [C:03+2] octavia: move octavia<->amphora key into /etc/octavia/certs [puppet] - 10https://gerrit.wikimedia.org/r/1146087 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [22:02:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:03:18] (03CR) 10Andrew Bogott: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1146087 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [22:03:24] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:04:59] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:05:46] !log vriley@cumin1002 START - Cookbook 
sre.hosts.provision for host cloudvirt1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:06:32] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1075.eqiad.wmnet with reason: host reimage [22:08:14] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:09:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:09:30] (03PS1) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) [22:09:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1075.eqiad.wmnet with reason: host reimage [22:10:55] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [22:10:58] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [22:11:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1074.eqiad.wmnet with OS bullseye [22:12:30] (03PS9) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [22:13:22] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1074.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:13:52] (03CR) 10CI reject: [V:04-1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 
(https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [22:14:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1074.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:15:36] (03PS1) 10Dduvall: aptrepo: Provide thirdparty/docker component with upstream packages [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) [22:15:41] (03PS2) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) [22:15:55] (03PS2) 10Dduvall: aptrepo: Provide thirdparty/docker component with upstream packages [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) [22:16:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [22:16:53] (03PS10) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [22:17:58] (03PS3) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) [22:18:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:19:29] (03CR) 10Stevemunene: airflow: cleanup deployment charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 
10Stevemunene) [22:20:40] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:21:03] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1074.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:21:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [22:21:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824502 (10VRiley-WMF) [22:23:43] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1072.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:47] (03PS4) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) [22:23:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:26:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:27:07] (03PS5) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 
(https://phabricator.wikimedia.org/T393783) [22:27:14] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:30:47] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:32:11] (03PS6) 10Andrew Bogott: cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) [22:32:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:33:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1075.eqiad.wmnet with OS bullseye [22:35:18] (03CR) 10Andrew Bogott: [C:03+2] cloudrabbit: get deployement-local octavia username/password [puppet] - 10https://gerrit.wikimedia.org/r/1146089 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:44:14] (03PS1) 10Andrew Bogott: octavia.conf: add database connection rules [puppet] - 10https://gerrit.wikimedia.org/r/1146092 (https://phabricator.wikimedia.org/T393783) [22:45:27] (03PS2) 10Andrew Bogott: octavia.conf: add database connection rules [puppet] - 10https://gerrit.wikimedia.org/r/1146092 (https://phabricator.wikimedia.org/T393783) [22:45:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146092 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:46:48] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_drmrs [22:47:44] (03CR) 10Andrew Bogott: [C:03+2] octavia.conf: add database connection rules [puppet] - 10https://gerrit.wikimedia.org/r/1146092 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew 
Bogott) [22:50:49] RECOVERY - Hadoop NodeManager on an-worker1192 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:51:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_drmrs [22:56:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [23:01:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [23:03:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:24:17] (03PS1) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) [23:25:28] (03CR) 
10CI reject: [V:04-1] cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [23:29:43] (03PS2) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) [23:32:30] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [23:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146103 [23:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146103 (owner: 10TrainBranchBot) [23:40:48] (03PS3) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) [23:50:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146103 (owner: 10TrainBranchBot) [23:54:30] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [23:58:11] (03PS4) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) [23:59:29] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)