[00:03:31] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1042419 (owner: TrainBranchBot)
[00:04:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P64752 and previous config saved to /var/cache/conftool/dbconfig/20240613-000430-ladsgroup.json
[00:19:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P64753 and previous config saved to /var/cache/conftool/dbconfig/20240613-001937-ladsgroup.json
[00:24:03] (PS1) NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673)
[00:25:26] (PS1) Jdlrobson: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378)
[00:25:45] (CR) Jdlrobson: [C:-1] "Cannot be deployed prior to 20th June (currently)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: Jdlrobson)
[00:29:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: NMW03)
[00:34:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T352010)', diff saved to https://phabricator.wikimedia.org/P64754 and previous config saved to /var/cache/conftool/dbconfig/20240613-003444-ladsgroup.json
[00:34:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:34:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:35:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:35:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64755 and previous config saved to /var/cache/conftool/dbconfig/20240613-003507-ladsgroup.json
[00:42:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[00:42:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[00:42:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64756 and previous config saved to /var/cache/conftool/dbconfig/20240613-004247-marostegui.json
[00:42:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[00:58:13] (PS1) Scott French: mediawiki: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978)
[01:09:01] (CR) Scott French: "Hi Janis - I think this should achieve what we talked about earlier today, as long as my understanding is not wildly off :) Thanks in adva" [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[01:16:24] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 47 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:45:00] PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:02] RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[02:10:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:26] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:24] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:24] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:55:22] ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887335 (Papaul) I create ticket # 1-235341265861 requesting Equinix to check the breaker on the feed where PEM0 is connected.
[02:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:29] (PS1) BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - https://gerrit.wikimedia.org/r/1042490
[03:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[03:54:12] (PS2) BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - https://gerrit.wikimedia.org/r/1042490
[04:10:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:15:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:23:03] ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887379 (Papaul) Technician Note Equinix Support , Jun/12/2024 22:28 The site has investigated customer equipment 2016250 Juniper in cabinet 504. All power indicators are green. The only al...
[04:24:07] ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887380 (Papaul) Reopen Note Papaul Tshibamba , Jun/12/2024 23:19 Thank you for checking this yes indeed all the power indicators are green but we are not getting enough power on PEM 0 that...
[04:25:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:27:06] (PS1) KartikMistry: Update MinT to 2024-06-12-111204-production [deployment-charts] - https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563)
[04:30:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:32:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367146
[04:32:27] T367146: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T367146
[04:32:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1183 with weight 0 T367146', diff saved to https://phabricator.wikimedia.org/P64757 and previous config saved to /var/cache/conftool/dbconfig/20240613-043239-root.json
[04:32:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367146
[04:33:42] (CR) Marostegui: [C:+2] mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1041535 (https://phabricator.wikimedia.org/T367146) (owner: Gerrit maintenance bot)
[04:34:12] (PS2) Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1041536 (https://phabricator.wikimedia.org/T367146)
[04:38:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:38:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:38:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:38:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:38:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64758 and previous config saved to /var/cache/conftool/dbconfig/20240613-043848-marostegui.json
[04:38:53] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[04:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:42:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64759 and previous config saved to /var/cache/conftool/dbconfig/20240613-044201-marostegui.json
[04:44:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:51:08] !log Starting s5 eqiad failover from db1230 to db1183 - T367146
[04:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:12] T367146: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T367146
[04:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T367146', diff saved to https://phabricator.wikimedia.org/P64760 and previous config saved to /var/cache/conftool/dbconfig/20240613-045121-root.json
[04:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1183 to s5 primary and set section read-write T367146', diff saved to https://phabricator.wikimedia.org/P64761 and previous config saved to /var/cache/conftool/dbconfig/20240613-045141-root.json
[04:52:12] (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1041536 (https://phabricator.wikimedia.org/T367146) (owner: Gerrit maintenance bot)
[04:52:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 T367146', diff saved to https://phabricator.wikimedia.org/P64762 and previous config saved to /var/cache/conftool/dbconfig/20240613-045254-root.json
[04:53:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:54:37] (PS1) Marostegui: db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1042573
[04:54:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Long schema change
[04:54:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Long schema change
[04:55:20] !log dbmaint eqiad s5 deploy schema change on db1230 T364299
[04:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:55:32] (CR) Marostegui: [C:+2] db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1042573 (owner: Marostegui)
[04:57:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64763 and previous config saved to /var/cache/conftool/dbconfig/20240613-045709-marostegui.json
[04:58:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:03:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:12:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64764 and previous config saved to /var/cache/conftool/dbconfig/20240613-051204-marostegui.json
[05:12:09] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:12:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64765 and previous config saved to /var/cache/conftool/dbconfig/20240613-051216-marostegui.json
[05:23:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64766 and previous config saved to /var/cache/conftool/dbconfig/20240613-052344-ladsgroup.json
[05:23:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:27:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P64767 and previous config saved to /var/cache/conftool/dbconfig/20240613-052711-marostegui.json
[05:27:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64768 and previous config saved to /var/cache/conftool/dbconfig/20240613-052723-marostegui.json
[05:27:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:27:29] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:27:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:27:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64769 and previous config saved to /var/cache/conftool/dbconfig/20240613-052746-marostegui.json
[05:30:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64770 and previous config saved to /var/cache/conftool/dbconfig/20240613-053052-marostegui.json
[05:31:47] (PS1) Gerrit maintenance bot: mariadb: Promote db1238 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1042595 (https://phabricator.wikimedia.org/T367378)
[05:31:51] (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378)
[05:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:37:33] (PS1) KartikMistry: Update cxserver to 2024-06-13-045621-production [deployment-charts] - https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122)
[05:38:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P64771 and previous config saved to /var/cache/conftool/dbconfig/20240613-053851-ladsgroup.json
[05:42:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P64772 and previous config saved to /var/cache/conftool/dbconfig/20240613-054218-marostegui.json
[05:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64773 and previous config saved to /var/cache/conftool/dbconfig/20240613-054600-marostegui.json
[05:47:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: Long schema change
[05:47:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: Long schema change
[05:53:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:53:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P64774 and previous config saved to /var/cache/conftool/dbconfig/20240613-055358-ladsgroup.json
[05:57:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64775 and previous config saved to /var/cache/conftool/dbconfig/20240613-055725-marostegui.json
[05:57:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:57:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:57:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:57:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64776 and previous config saved to /var/cache/conftool/dbconfig/20240613-055747-marostegui.json
[05:58:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:00:05] marostegui, Amir1, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600).
[06:01:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64777 and previous config saved to /var/cache/conftool/dbconfig/20240613-060107-marostegui.json
[06:01:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:03:30] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:47] jouncebot: now
[06:05:47] For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:05:47] For the next 0 hour(s) and 24 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:06:01] jouncebot: next
[06:06:01] In 0 hour(s) and 53 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0700)
[06:09:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64778 and previous config saved to /var/cache/conftool/dbconfig/20240613-060905-ladsgroup.json
[06:09:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[06:09:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:09:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64779 and previous config saved to /var/cache/conftool/dbconfig/20240613-060927-ladsgroup.json
[06:13:40] (PS1) Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657
[06:13:47] (CR) CI reject: [V:-1] statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657 (owner: Giuseppe Lavagetto)
[06:14:19] (PS2) Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657
[06:16:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64780 and previous config saved to /var/cache/conftool/dbconfig/20240613-061613-marostegui.json
[06:16:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[06:16:18] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:16:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[06:16:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64781 and previous config saved to /var/cache/conftool/dbconfig/20240613-061636-marostegui.json
[06:19:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64782 and previous config saved to /var/cache/conftool/dbconfig/20240613-061948-marostegui.json
[06:24:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:27:05] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[06:29:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:34:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64783 and previous config saved to /var/cache/conftool/dbconfig/20240613-063455-marostegui.json
[06:36:22] SRE-OnFire, cloud-services-team, Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9887501 (dcaro)
[06:38:11] (PS1) Marostegui: Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1042726
[06:38:35] (CR) Marostegui: [C:+2] Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1042726 (owner: Marostegui)
[06:39:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64784 and previous config saved to /var/cache/conftool/dbconfig/20240613-063934-root.json
[06:40:44] (CR) Muehlenhoff: "For production all descriptions are set via profile:base::production ::role_description, but for Cloud VPS this doesn't seem very useful: " [puppet] - https://gerrit.wikimedia.org/r/1040123 (owner: Muehlenhoff)
[06:40:58] (CR) Muehlenhoff: [C:+2] Deprecate system::role for Cloud VPS-specific roles [puppet] - https://gerrit.wikimedia.org/r/1040123 (owner: Muehlenhoff)
[06:42:13] !log rebalance ganeti clusters in eqiad following reboots
[06:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:52] (PS1) Marostegui: db1187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1042800
[06:45:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:45:20] (CR) Marostegui: [C:+2] db1187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1042800 (owner: Marostegui)
[06:47:39] (PS1) Marostegui: db1125: Typo [puppet] - https://gerrit.wikimedia.org/r/1042804
[06:49:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:49:55] (CR) Marostegui: [C:+2] db1125: Typo [puppet] - https://gerrit.wikimedia.org/r/1042804 (owner: Marostegui)
[06:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64785 and previous config saved to /var/cache/conftool/dbconfig/20240613-065002-marostegui.json
[06:50:58] ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9887525 (Marostegui) @Jhancock.wm reminder, we do not need AAAA records on these hosts.
[06:52:27] (PS1) Marostegui: site.pp: New dbproxies [puppet] - https://gerrit.wikimedia.org/r/1042820 (https://phabricator.wikimedia.org/T362824)
[06:52:43] (Abandoned) Marostegui: site.pp: New dbproxies [puppet] - https://gerrit.wikimedia.org/r/1042820 (https://phabricator.wikimedia.org/T362824) (owner: Marostegui)
[06:54:02] (PS1) Marostegui: site.pp: New dbproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1042822 (https://phabricator.wikimedia.org/T362824)
[06:54:24] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9887533 (Marostegui)
[06:54:28] (CR) Marostegui: [C:+2] site.pp: New dbproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1042822 (https://phabricator.wikimedia.org/T362824) (owner: Marostegui)
[06:54:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64786 and previous config saved to /var/cache/conftool/dbconfig/20240613-065439-root.json
[06:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:46] (CR) Muehlenhoff: "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1041735 (owner: Dzahn)
[06:59:20] (PS1) Marostegui: regex.yaml: Add dbproxy codfw [puppet] - https://gerrit.wikimedia.org/r/1042825
[07:00:04] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0700). Please do the needful.
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64787 and previous config saved to /var/cache/conftool/dbconfig/20240613-070509-marostegui.json
[07:05:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:05:14] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1042825 (owner: Marostegui)
[07:05:17] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:05:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:05:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64788 and previous config saved to /var/cache/conftool/dbconfig/20240613-070531-marostegui.json
[07:08:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64789 and previous config saved to /var/cache/conftool/dbconfig/20240613-070837-marostegui.json
[07:08:47] (CR) Marostegui: [C:+2] regex.yaml: Add dbproxy codfw [puppet] - https://gerrit.wikimedia.org/r/1042825 (owner: Marostegui)
[07:09:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64790 and previous config saved to /var/cache/conftool/dbconfig/20240613-070944-root.json
[07:09:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:13:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:14:10] (CR) Muehlenhoff: [C:+2] profile::maps::tlsproxy: Unconditionally use PKI [puppet] - https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[07:15:16] (CR) Slyngshede: [C:+2] Fix bug where SSH keys are imported incorrectly. [software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) (owner: Slyngshede)
[07:16:50] (Merged) jenkins-bot: Fix bug where SSH keys are imported incorrectly. [software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) (owner: Slyngshede)
[07:21:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[07:21:50] (PS1) Brouberol: spark-operator: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978)
[07:22:19] (Abandoned) Muehlenhoff: Make sretest1001 a Cumin node for a test [puppet] - https://gerrit.wikimedia.org/r/998930 (https://phabricator.wikimedia.org/T356174) (owner: Muehlenhoff)
[07:23:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64791 and previous config saved to /var/cache/conftool/dbconfig/20240613-072344-marostegui.json
[07:24:37] RECOVERY - Host an-worker1085 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[07:24:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64792 and previous config saved to /var/cache/conftool/dbconfig/20240613-072450-root.json
[07:25:59] marostegui: OK to deploy cxserver/MinT?
[07:26:19] kart_: go for it!
[07:26:31] cool.
[07:27:03] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-06-12-111204-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563) (owner: 10KartikMistry) [07:27:46] (03Merged) 10jenkins-bot: Update MinT to 2024-06-12-111204-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563) (owner: 10KartikMistry) [07:28:34] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [07:30:41] (03PS1) 10Muehlenhoff: tlsproxy::localssl: Remove support for cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) [07:32:03] "add securityContext to all containers" - is it OK to deploy? [07:33:28] OK. I'll wait for someone to check it then deploy mint/cxserver later. [07:34:04] kart_: let me check the commit, but it is alright [07:34:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [07:34:24] (03PS4) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [07:34:36] effie: OK. Please let me know. Seems added in all services. [07:34:38] (03CR) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [07:34:52] yes it is [07:36:55] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174#9887576 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this, the new established procedure is that all servers which get mo... 
[07:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64793 and previous config saved to /var/cache/conftool/dbconfig/20240613-073851-marostegui.json [07:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64794 and previous config saved to /var/cache/conftool/dbconfig/20240613-073955-root.json [07:43:36] (03PS2) 10Phedenskog: wmftest: Add new Graphite instance for performance test data. [dns] - 10https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669) [07:43:38] kart_: I don't see anything in the diff apart from the chart version [07:43:41] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you for the extensive comments/guide" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis) [07:44:40] kart_: shall I deploy? [07:47:15] (03CR) 10Muehlenhoff: [C:03+2] prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff) [07:47:45] (03PS3) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449 [07:49:30] kart_, effie: I think that has long been deployed to cxserver. What you might see is a chart version bump because of an updated helm test that does change the deployment [07:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [07:51:36] jayme: I saw the log etc, I am just wondering what kart_ saw [07:52:00] effie: Sorry, was bit afk. [07:52:18] jayme: because securityContext on cxserver was deployed in may [07:52:24] effie: I was looking at machinetranslation (mint) service first. 
[07:52:36] ah let me check there too, I was checking cxserver [07:52:50] effie: I have yet to merge the patch for cxserver. [07:53:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64795 and previous config saved to /var/cache/conftool/dbconfig/20240613-075358-marostegui.json [07:54:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [07:54:03] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [07:54:07] kart_: go ahead [07:54:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [07:54:14] Thanks! [07:54:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T367261)', diff saved to https://phabricator.wikimedia.org/P64796 and previous config saved to /var/cache/conftool/dbconfig/20240613-075420-marostegui.json [07:54:25] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:55:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64797 and previous config saved to /var/cache/conftool/dbconfig/20240613-075500-root.json [07:56:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [07:57:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:57:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367261)', diff saved 
to https://phabricator.wikimedia.org/P64798 and previous config saved to /var/cache/conftool/dbconfig/20240613-075727-marostegui.json [07:59:00] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:59:08] (03CR) 10Filippo Giunchedi: [C:03+2] wmftest: Add new Graphite instance for performance test data. [dns] - 10https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog) [08:02:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:02:58] (03PS3) 10Slyngshede: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 [08:03:32] (03CR) 10Slyngshede: Replace development server with uWSGI. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [08:03:52] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:03:57] (03CR) 10Majavah: [C:04-1] profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [08:04:51] (03CR) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [08:05:14] PROBLEM - MariaDB Replica SQL: s2 on db2125 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:05:38] depooling ↑ [08:06:17] (03PS4) 10Slyngshede: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 [08:06:22] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [08:06:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'index error depool db2125', diff saved to https://phabricator.wikimedia.org/P64799 and previous config saved to /var/cache/conftool/dbconfig/20240613-080624-arnaudb.json [08:06:38] (03PS1) 10Filippo Giunchedi: logstash: add auto_offset_reset to kafka input [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) [08:06:39] (03PS1) 10Filippo Giunchedi: logstash: consume k8s logs topics [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) [08:06:42] (03PS4) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449 [08:07:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:08:05] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2919/co" [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [08:08:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2125.codfw.wmnet with reason: index issue [08:08:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2125.codfw.wmnet with reason: index issue [08:09:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:10:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [08:10:14] RECOVERY - MariaDB Replica SQL: s2 on db2125 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:11:16] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [08:11:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64800 and previous config saved to /var/cache/conftool/dbconfig/20240613-081138-arnaudb.json [08:12:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:12:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64801 and previous config saved to /var/cache/conftool/dbconfig/20240613-081234-marostegui.json [08:12:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:13:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:28] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:13:44] (03CR) 10JMeybohm: [C:03+1] helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:13:54] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [08:14:01] (03CR) 10Klausman: [C:03+1] "Looks good to me. There is a bit of a question of alert routing (for k8s-ml aka LiftWing, the general SRE team isn't the first line of def" [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [08:14:42] (03PS1) 10Phedenskog: wmftest: Remove old performance team setup. [dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) [08:14:51] (03CR) 10JMeybohm: hemlfile: export admin-ng pending diff metrics hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:15:08] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [08:15:26] (03CR) 10Phedenskog: [C:04-1] "I want to wait with this until we seen that the new Graphite setup is working. When that's done, this cleanup can be done." 
[dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog) [08:15:31] (03CR) 10Brouberol: [V:03+1] helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:15:50] (03PS7) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [08:15:50] (03PS2) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) [08:19:18] (03CR) 10Majavah: [C:03+1] profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [08:20:30] (03CR) 10Slyngshede: [C:03+2] Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [08:21:28] (03CR) 10Brouberol: [C:03+2] helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:22:08] (03Merged) 10jenkins-bot: Replace development server with uWSGI. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [08:25:07] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [08:26:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64802 and previous config saved to /var/cache/conftool/dbconfig/20240613-082643-arnaudb.json [08:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:27:35] (03PS1) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) [08:27:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64803 and previous config saved to /var/cache/conftool/dbconfig/20240613-082741-marostegui.json [08:27:51] (03PS2) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) [08:29:19] !log Updated MinT to 2024-06-12-111204-production (T363563) [08:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:23] T363563: Avoid references losing their data (showing as plain-text "[1]") when added to the translation using MinT - https://phabricator.wikimedia.org/T363563 [08:29:35] jouncebot: nex [08:29:37] jouncebot: next [08:29:37] In 1 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [08:29:42] jouncebot: now [08:29:42] No deployments 
scheduled for the next 1 hour(s) and 30 minute(s) [08:30:05] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:32:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:32:55] <_joe_> effie: do you need to deploy mediawiki, or are you just doing reboots? [08:33:24] <_joe_> because if it's the latter, I will do some hacks to mw-debug [08:34:57] reboots [08:36:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:36:47] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [08:37:01] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:37:20] (03PS3) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) [08:39:36] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9887671 (10akosiaris) [08:40:42] (03CR) 10Muehlenhoff: purged: set use_pki to true in magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [08:40:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[08:41:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64804 and previous config saved to /var/cache/conftool/dbconfig/20240613-084149-arnaudb.json [08:42:28] (03PS1) 10Alexandros Kosiaris: Remove no longer used parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) [08:42:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367261)', diff saved to https://phabricator.wikimedia.org/P64805 and previous config saved to /var/cache/conftool/dbconfig/20240613-084248-marostegui.json [08:42:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:42:51] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [08:42:52] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [08:43:00] (03CR) 10Btullis: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [08:43:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [08:43:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [08:43:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64806 and previous config saved to /var/cache/conftool/dbconfig/20240613-084310-marostegui.json [08:46:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64807 and previous config saved to /var/cache/conftool/dbconfig/20240613-084615-marostegui.json [08:46:47] (03PS3) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) [08:48:20] (03CR) 10Alexandros Kosiaris: [C:03+1] "Uninformed LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis) [08:48:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:36] (03CR) 10JMeybohm: [C:04-1] kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [08:51:17] (03CR) 10Brouberol: [C:03+2] helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [08:52:13] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [08:56:03] (03CR) 10Btullis: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [08:56:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64808 and previous config saved to /var/cache/conftool/dbconfig/20240613-085654-arnaudb.json [08:57:09] (03CR) 10Brouberol: [C:03+2] Deploy calico network policy templates to all datahub charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [08:58:22] (03CR) 10Brouberol: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [08:59:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [08:59:26] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq... 
[09:01:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64809 and previous config saved to /var/cache/conftool/dbconfig/20240613-090122-marostegui.json [09:02:07] (03CR) 10Btullis: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:03:35] (03CR) 10JMeybohm: "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:04:13] (03PS1) 10Jelto: aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) [09:05:11] (03CR) 10JMeybohm: "Hm...gerrit formatted stuff." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:06:27] (03CR) 10Brouberol: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:07:40] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:07:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [09:08:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:33] (03CR) 10EoghanGaffney: [C:03+1] aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto) 
[09:08:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9887749 (10Clement_Goubert) @VRiley-WMF Do you object to us reusing that task by reopening it whenever we have a batch of servers to relabel, or would you rathe... [09:09:49] (03CR) 10Jelto: [C:03+2] aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto) [09:10:08] (03PS2) 10Jelto: aptrepo: bump gitlab-runner and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) [09:12:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64810 and previous config saved to /var/cache/conftool/dbconfig/20240613-091200-arnaudb.json [09:12:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:13:02] (03PS1) 10Klausman: golang: Add version 1.22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 [09:13:03] (03CR) 10Klausman: "Feel free to redirect to a different reviewer" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [09:14:43] (03CR) 10Klausman: "Confirmed working:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [09:15:50] (03CR) 10Brouberol: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:15:55] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1042947 
(https://phabricator.wikimedia.org/T367382) (owner: 10Jelto) [09:16:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64811 and previous config saved to /var/cache/conftool/dbconfig/20240613-091629-marostegui.json [09:16:42] (03PS7) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [09:16:42] (03PS1) 10Brouberol: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) [09:17:00] (03Abandoned) 10Brouberol: datahub: update datahubsearch hostname to use external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:17:49] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [09:18:42] (03PS8) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [09:19:44] (03PS9) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [09:22:18] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:22:34] <_joe_> jouncebot: now [09:22:34] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [09:22:40] <_joe_> jouncebot: next [09:22:41] In 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [09:22:50] <_joe_> ok I'll go a little early [09:24:15] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 
10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9887824 (10taavi) [09:26:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1006.eqiad.wmnet [09:26:57] (these kafka nodes are insetup, no worries) [09:29:57] (03CR) 10Clément Goubert: [C:03+1] Remove no longer used parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [09:31:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64812 and previous config saved to /var/cache/conftool/dbconfig/20240613-093136-marostegui.json [09:31:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [09:31:41] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:31:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [09:31:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64813 and previous config saved to /var/cache/conftool/dbconfig/20240613-093158-marostegui.json [09:32:45] (03PS2) 10Brouberol: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) [09:32:46] (03PS10) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [09:32:46] (03PS1) 10Brouberol: datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) [09:32:49] (03PS1) 10DCausse: wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - 10https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) [09:32:57] (03PS1) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 [09:33:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1006.eqiad.wmnet [09:33:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1007.eqiad.wmnet [09:33:05] (03CR) 10CI reject: [V:04-1] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [09:33:41] (03CR) 10Muehlenhoff: "parse1001 and parse2001 are still pooled for the parsoid-php service, will that cause any issues?" [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [09:34:18] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [09:34:40] (03CR) 10Btullis: [C:03+1] "Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131 (owner: 10Brouberol) [09:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:34:48] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [09:34:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64814 and previous config saved to /var/cache/conftool/dbconfig/20240613-093455-marostegui.json [09:35:03] (03CR) 10Brouberol: [C:03+2] spark-operator: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [09:35:10] (03CR) 10Brouberol: [C:03+2] rdf-streaming-updater: remove from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131 (owner: 10Brouberol) [09:35:34] (03CR) 10Kamila Součková: [C:03+1] Remove mw2289.codfw.wmnet from scap::proxies for decom [puppet] - 10https://gerrit.wikimedia.org/r/1042200 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert) [09:35:48] (03CR) 10Btullis: [C:03+1] Remove noisy monitor that brings no value [alerts] - 10https://gerrit.wikimedia.org/r/1039627 (owner: 10Brouberol) [09:35:58] jouncebot nowandnext [09:35:58] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [09:35:58] In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [09:36:15] (03CR) 10Clément Goubert: [C:03+2] Remove mw2289.codfw.wmnet from scap::proxies for decom [puppet] - 10https://gerrit.wikimedia.org/r/1042200 
(https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert) [09:36:20] (03CR) 10Kamila Součková: [C:03+1] decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042201 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert) [09:37:15] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:37:33] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:37:46] (03CR) 10Alexandros Kosiaris: [C:04-1] "Good catch. Yeah, we need to remove them first. I got a task at https://phabricator.wikimedia.org/T359387" [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris) [09:38:06] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [09:38:17] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.... 
[09:38:50] (03CR) 10Brouberol: [C:03+2] Remove noisy monitor that brings no value [alerts] - 10https://gerrit.wikimedia.org/r/1039627 (owner: 10Brouberol) [09:39:13] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2001.codfw.wmnet [09:39:29] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1007.eqiad.wmnet [09:39:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1008.eqiad.wmnet [09:40:03] (03CR) 10Brouberol: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:40:07] (03CR) 10Btullis: [C:03+1] datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:41:31] (03CR) 10Btullis: [C:03+1] "Got it. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:41:56] (03CR) 10Btullis: [C:03+1] datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:42:55] (03CR) 10Hnowlan: [C:03+1] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [09:43:35] (03CR) 10Clément Goubert: [C:03+2] decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042201 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert) [09:43:39] (03PS2) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 [09:44:51] (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [09:45:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1008.eqiad.wmnet [09:45:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1009.eqiad.wmnet [09:46:02] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl2003.codfw.wmnet [09:46:02] (03PS1) 10Hashar: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029) [09:47:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2281,2283-2286].codfw.wmnet [09:47:35] (03PS3) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 [09:48:59] (03CR) 10Kamila Součková: [C:04-1] "I messed up and will start with 2003" [dns] - 
10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková) [09:49:24] (03CR) 10Brouberol: [C:03+2] datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:49:56] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64815 and previous config saved to /var/cache/conftool/dbconfig/20240613-095002-marostegui.json [09:50:05] (03CR) 10Brouberol: [C:03+2] datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:50:18] (03Merged) 10jenkins-bot: datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:50:48] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2003.eqiad.wmnet [09:50:58] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2001.eqiad.wmnet [09:51:03] (03Merged) 10jenkins-bot: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [09:51:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:52:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1009.eqiad.wmnet [09:52:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1010.eqiad.wmnet [09:52:52] (03PS7) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) [09:53:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [09:53:52] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [09:54:19] (03PS1) 10Kamila Součková: Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980 [09:54:30] (03PS2) 10Kamila Součková: Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980 [09:56:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:58:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1010.eqiad.wmnet [09:59:26] (03PS2) 10Brouberol: hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) [09:59:43] (03CR) 10Brouberol: hemlfile: export admin-ng pending diff metrics hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:59:45] (03CR) 10Hashar: [C:03+2] Update to a 
snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029) (owner: 10Hashar) [09:59:57] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [10:00:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:00:18] (03Merged) 10jenkins-bot: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029) (owner: 10Hashar) [10:00:33] (03CR) 10Hnowlan: kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [10:00:35] (03PS3) 10Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657 [10:00:35] (03PS1) 10Giuseppe Lavagetto: modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982 [10:00:35] (03PS1) 10Giuseppe Lavagetto: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 [10:00:35] (03PS1) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 [10:01:13] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
wikikube-ctrl2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [10:01:33] (03CR) 10CI reject: [V:04-1] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto) [10:01:34] (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto) [10:01:39] The Appserver unavailable alerts are most probably my decoms [10:02:12] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:02:25] jouncebot: nowandnext [10:02:25] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [10:02:25] In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200) [10:02:26] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [10:02:46] I am going to do a quick Gerrit update, should not take more than a few minutes [10:03:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [10:03:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:03:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl2003.codfw.wmnet [10:03:53] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887946 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: 
`wikikube-ctrl2003.codfw.... [10:03:56] (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980 (owner: 10Kamila Součková) [10:04:03] !log hashar@deploy1002 Started deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 [10:04:10] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 (duration: 00m 08s) [10:04:47] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2281,2283-2286].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [10:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64816 and previous config saved to /var/cache/conftool/dbconfig/20240613-100509-marostegui.json [10:05:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2281,2283-2286].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [10:05:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:05:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2281,2283-2286].codfw.wmnet [10:06:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2287-2290].codfw.wmnet [10:07:34] (03CR) 10MVernon: [C:03+2] cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:07:54] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9887959 (10kamila) >>! 
In T366205#9880294, @Papaul wrote: > @kamila your plan works for us as well, just depool and power the fi... [10:08:03] !log hashar@deploy1002 Started deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 on gerrit1003 # T367029 T367135 [10:08:09] T367029: "Press c to comment" is placed incorrectly when using Firefox 126 and 128 on macOS - https://phabricator.wikimedia.org/T367029 [10:08:09] T367135: "Collapse" link on add/edit reviewers screen is showing weird icons - https://phabricator.wikimedia.org/T367135 [10:08:10] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 on gerrit1003 # T367029 T367135 (duration: 00m 06s) [10:09:03] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:09:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2006.codfw.wmnet [10:10:00] !log cp4037 depooled && puppet disable to profile benthos configuration (T360454) [10:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:04] T360454: Better Benthos performances - https://phabricator.wikimedia.org/T360454 [10:10:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:15:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:15:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2006.codfw.wmnet [10:15:51] !log cgoubert@cumin1002 START - Cookbook 
sre.hosts.reboot-single for host kafka-main2007.codfw.wmnet [10:16:07] The high error rates are the circuitbreaking ^ Amir1 [10:18:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:18:45] (03PS1) 10MVernon: wmflib: correct doc string to note lvs is Optional [puppet] - 10https://gerrit.wikimedia.org/r/1042986 [10:20:02] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:20:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:20:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64818 and previous config saved to /var/cache/conftool/dbconfig/20240613-102016-marostegui.json [10:20:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:20:21] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:20:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:20:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1042986 (owner: 10MVernon) [10:21:21] (03Abandoned) 10FNegri: Add DNS for ToolsDB replica host [puppet] - 10https://gerrit.wikimedia.org/r/1034042 (https://phabricator.wikimedia.org/T348407) (owner: 10FNegri) [10:21:44] hashar: can you tell 
me when the gerrit update is finished, please? [10:21:53] oh sorry [10:21:55] done [10:21:59] !log Gerrit upgrade completed [10:22:00] thanks. [10:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:06] well upgrade is a bold word really [10:22:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2007.codfw.wmnet [10:22:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2008.codfw.wmnet [10:22:16] (03CR) 10MVernon: [C:03+2] wmflib: correct doc string to note lvs is Optional [puppet] - 10https://gerrit.wikimedia.org/r/1042986 (owner: 10MVernon) [10:22:19] it is merely swapping for a version with a handful of patches applied [10:22:21] but yeah it is done [10:22:22] sorry [10:22:38] (03PS1) 10Brouberol: datahub: hotfix, remove duplicated env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042987 [10:23:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [10:23:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [10:23:40] (03CR) 10Brouberol: [C:03+2] datahub: hotfix, remove duplicated env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042987 (owner: 10Brouberol) [10:23:42] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2287-2290].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [10:23:47] (03PS2) 10Giuseppe Lavagetto: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 [10:23:47] (03PS2) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 
[10:24:48] (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto) [10:25:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:25:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:25:33] (03CR) 10Clément Goubert: [C:03+1] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto) [10:26:16] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [10:26:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:26:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2287-2290].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [10:26:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:26:49] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2287-2290].codfw.wmnet [10:26:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:27:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64819 and previous config saved to /var/cache/conftool/dbconfig/20240613-102659-marostegui.json [10:27:11] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:28:09] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: add 
service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657 (owner: 10Giuseppe Lavagetto) [10:28:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2008.codfw.wmnet [10:28:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2009.codfw.wmnet [10:28:50] (03Merged) 10jenkins-bot: statsd-exporter: add service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657 (owner: 10Giuseppe Lavagetto) [10:29:07] (03CR) 10Giuseppe Lavagetto: [C:03+2] modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982 (owner: 10Giuseppe Lavagetto) [10:29:08] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [10:29:35] (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto) [10:29:52] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [10:30:31] (03Merged) 10jenkins-bot: modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982 (owner: 10Giuseppe Lavagetto) [10:30:32] (03Merged) 10jenkins-bot: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto) [10:30:43] !log cmooney@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1003'] [10:30:50] (03PS3) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 [10:31:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1003'] [10:31:11] 
10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888054 (10Clement_Goubert) @Papaul All servers except `mw2282` decommissioned. [10:31:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64820 and previous config saved to /var/cache/conftool/dbconfig/20240613-103111-ladsgroup.json [10:31:16] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888045 (10Clement_Goubert) [10:31:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:31:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64821 and previous config saved to /var/cache/conftool/dbconfig/20240613-103120-marostegui.json [10:31:30] (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto) [10:31:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9888066 (10MoritzMuehlenhoff) [10:32:08] (03PS4) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 [10:32:16] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888074 (10MoritzMuehlenhoff) [10:32:33] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888049 (10Clement_Goubert) 
a:05Clement_Goubert→03None [10:32:42] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888076 (10MoritzMuehlenhoff) [10:33:41] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [10:34:23] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto) [10:34:24] (03PS11) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [10:34:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2009.codfw.wmnet [10:34:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2010.codfw.wmnet [10:35:24] (03Merged) 10jenkins-bot: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto) [10:36:21] (03PS12) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [10:37:31] (03CR) 10Brouberol: [C:03+2] datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol) [10:39:26] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [10:40:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet [10:41:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:41:11] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host kafka-main2010.codfw.wmnet [10:41:28] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:41:49] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [10:42:24] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [10:43:05] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [10:43:28] (03PS1) 10Giuseppe Lavagetto: base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 [10:44:26] (03CR) 10Clément Goubert: [C:03+1] base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto) [10:44:43] (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto) [10:45:26] (03Abandoned) 10Hnowlan: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: 10Daniel Kinzler) [10:45:47] (03Merged) 10jenkins-bot: base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto) [10:46:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P64822 and previous config saved to /var/cache/conftool/dbconfig/20240613-104619-ladsgroup.json [10:46:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P64823 and previous config saved to /var/cache/conftool/dbconfig/20240613-104628-marostegui.json [10:46:31] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:46:42] !log oblivian@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:46:48] (03CR) 10EoghanGaffney: [C:03+2] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:47:24] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:47:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet [10:47:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:48:03] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:48:03] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [10:48:09] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:48:44] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:49:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet [10:49:51] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:50:15] (03PS1) 10MVernon: cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621) [10:51:38] <_joe_> jouncebot: now [10:51:38] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000) [10:51:48] <_joe_> sigh I will be running a little late I fear [10:51:52] <_joe_> jouncebot: next [10:51:52] In 1 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200) [10:52:02] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [10:52:40] (03PS1) 10Brouberol: 
datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 [10:54:21] (03PS1) 10Giuseppe Lavagetto: base.statsd: remove quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042996 [10:54:58] (03CR) 10Klausman: [C:03+1] cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:55:07] PROBLEM - SSH on wikikube-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:55:59] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:56:03] (03CR) 10MVernon: [C:03+2] cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:56:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet [10:56:26] (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: remove quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042996 (owner: 10Giuseppe Lavagetto) [10:56:41] PROBLEM - Host wikikube-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:14] (03CR) 10Muehlenhoff: [C:03+2] profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [10:58:22] huh that ain't me [10:58:33] <_joe_> claime: wat [10:58:33] FIRING: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:58:54] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:59:00] I just 
rebooted kubemaster1001, didn't touch wikikube-ctrl1001 [10:59:03] <_joe_> ah that's not an active master [10:59:08] <_joe_> ctrl I mean? [10:59:20] it is [10:59:29] well it was [10:59:32] <_joe_> can't reach via ssh [10:59:34] now it's down [10:59:39] kamila_ ? [10:59:55] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:00:03] RECOVERY - SSH on wikikube-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:00:05] RECOVERY - Host wikikube-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:00:30] Huh, that wasn't me [11:00:42] FIRING: [6x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:01:05] That's probably me [11:01:08] ok [11:01:21] (03PS1) 10Superpes15: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) [11:01:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P64824 and previous config saved to /var/cache/conftool/dbconfig/20240613-110126-ladsgroup.json [11:01:31] it's actually up, I don't know why it's pinging now [11:01:32] ack thanks claime [11:01:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P64825 and previous config saved to /var/cache/conftool/dbconfig/20240613-110135-marostegui.json [11:01:39] the one that's down is ctrl1001 [11:01:40] checking too [11:01:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:02:03] I didn't touch ctrl1001 today [11:02:06] yeah recovering, I don't see pages on alerts.w.o 
[11:02:08] here [11:02:24] too late then [11:02:30] 11:02:25 up 2 min, 2 users, load average: 2.68, 1.14, 0.42 [11:02:33] it rebooted [11:02:35] wth [11:02:39] Mhm [11:03:33] RESOLVED: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:03:44] I'm currently at a doctor's appointment, I'll stare at it when I get back [11:04:26] (03CR) 10Btullis: [C:03+1] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol) [11:04:40] since the probe recovered I'm assuming we're okay claime ? [11:05:12] 2024-06-13T10:52:34.131434+00:00 wikikube-ctrl1001 systemd-logind[1069]: Power key pressed. [11:05:14] wat [11:05:25] godog: yeah [11:05:42] RESOLVED: [4x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:50] kk thanks claime, going back to lunch [11:05:57] sorry for the noise [11:06:00] enjoy lunch [11:06:12] np that's what we are here for [11:07:15] PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:07:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [11:07:30] (03PS1) 10Muehlenhoff: mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998 [11:07:36] claime: shit, that could have been me [11:07:38] * topranks checking [11:07:57] jouncebot: now [11:07:57] No deployments 
scheduled for the next 0 hour(s) and 52 minute(s) [11:08:15] RECOVERY - ensure kvm processes are running on cloudvirt1032 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:08:18] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:08:25] ugh, yeah :( [11:08:32] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:08:43] <_joe_> effie: I am finally done, all yours [11:09:12] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [11:09:20] claime: I had intended to reset wikikube-ctrl1003 to follow up on some debugging I was doing with kamila... seems I typed the url wrong [11:09:22] _joe_: tx [11:09:28] topranks: happens [11:09:34] ugh shouldn't though [11:09:34] at least it's got the new kernel now [11:09:36] :p [11:09:54] ha ok see protecting you guys from hackers :P [11:10:34] claime: with any luck 1002 was able to keep the lights on? 
[11:10:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:10:55] topranks: yeah and there was the old kubemaster aswell [11:11:12] ok ok, sorry folks I'll make sure to do better [11:12:55] (03PS1) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1042999 [11:13:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [11:13:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:41] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:14:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet [11:14:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042999 (owner: 10Muehlenhoff) [11:15:56] kamila_: Jun 13 11:13:53 puppetmaster1001 confd[7831]: 2024-06-13T11:13:53Z puppetmaster1001 /usr/bin/confd[7831]: ERROR "failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/codfw/.kubemaster793668086' with 1 (0.043032169342041016s) [invalid]: { 'host': 'wikikube-ctrl2003.codfw.wmnet', 'weight':10, 'enabled': 
True } [Errno -2] Name or service not known\n\nupdating error [11:15:58] mtime on /var/run/confd-template/_srv_config-master_pybal_codfw_kubemaster.err\n" [11:16:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64826 and previous config saved to /var/cache/conftool/dbconfig/20240613-111633-ladsgroup.json [11:16:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [11:16:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64827 and previous config saved to /var/cache/conftool/dbconfig/20240613-111642-marostegui.json [11:16:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:16:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [11:16:49] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [11:16:50] !log installing pillow security updates [11:16:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64828 and previous config saved to /var/cache/conftool/dbconfig/20240613-111655-ladsgroup.json [11:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:17:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64829 and previous config saved to 
/var/cache/conftool/dbconfig/20240613-111706-marostegui.json [11:18:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:19:10] (03CR) 10Stevemunene: [C:03+1] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol) [11:19:13] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [11:19:17] (03CR) 10JMeybohm: [C:03+1] hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:19:46] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2003.codfw.wmnet [11:20:15] (03CR) 10Volans: "question/thought inline" [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack) [11:20:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet [11:20:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[11:21:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64830 and previous config saved to /var/cache/conftool/dbconfig/20240613-112122-marostegui.json [11:22:09] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:22:42] kamila_: topranks: I set wikikube-ctrl2003.codfw.wmnet to invalid because it doesn't resolve anymore and that breaks confd [11:23:41] RESOLVED: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:23:52] claime: seems sensible, that machine shows as status "decommissioning" in netbox so it makes sense the name is not in DNS [11:23:59] (03PS1) 10Ladsgroup: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 [11:24:52] topranks: yep, but it must have references in puppet, given it's being wrestled into submission by you and k.amila_ [11:25:12] yeah, possibly those references should have been removed [11:25:23] jouncebot: nowandnext [11:25:23] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [11:25:23] In 0 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200) [11:25:31] (03PS2) 10Ladsgroup: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 [11:25:36] but also likely a brief interruption would have been fine, and what kamilla expected, but we had *problems* [11:25:40] (03CR) 10Ladsgroup: [C:03+2] Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 (owner: 
10Ladsgroup) [11:26:02] Amir1: effie is rebooting k8s nodes, it may impact the k8s pull, and potentially the redeployment of mw-on-k8s [11:26:14] Possible it won't given it's a small-ish batch [11:26:16] kamila_: let me know if I can help with wikikube-ctrl2003, right now in Netbox it looks a little non-standard [11:26:18] (03Merged) 10jenkins-bot: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 (owner: 10Ladsgroup) [11:26:40] as in it has no IP addresses assigned, but does have it's switch interface connected [11:26:59] I can tidy that up if needed once you're back and we know what next steps are [11:27:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [11:27:39] claime: noted [11:27:51] how long it's going to take? [11:27:57] Amir1: I will ping you [11:28:03] thanks! [11:28:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:28:34] (03CR) 10JMeybohm: golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [11:29:28] Amir1: I generally wanted to make it before the next window [11:29:32] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:30:05] (03CR) 10EoghanGaffney: [C:03+1] "Good catch, didn't notice those!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1042999 (owner: 10Muehlenhoff) [11:30:19] I don't think people will deploy things in the next window, I can take over there [11:31:35] (03CR) 10JMeybohm: [C:03+1] "This looks like it could be working now ;)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:32:24] (03CR) 10EoghanGaffney: [C:03+1] mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998 (owner: 10Muehlenhoff) [11:33:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:35:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [11:36:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P64831 and previous config saved to /var/cache/conftool/dbconfig/20240613-113630-marostegui.json [11:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:38:31] checking [11:39:59] (03PS1) 10Muehlenhoff: ircecho: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) [11:40:12] (03CR) 10Muehlenhoff: [C:03+2] mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998 (owner: 10Muehlenhoff) [11:40:51] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff) [11:41:31] (03PS2) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1042999 [11:48:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:49:41] claime, topranks: thank you for the help with wikikube-ctrl2003, it's scheduled to be juggled by dc-ops and they suggested that I decom it on my schedule because timezones, I suppose it's not that simple '^^ [11:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:51:09] kamila_: dc-ops are moving it? [11:51:17] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399 (10MoritzMuehlenhoff) 03NEW [11:51:26] topranks: yes, it needs to go into a 10G rack [11:51:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P64832 and previous config saved to /var/cache/conftool/dbconfig/20240613-115137-marostegui.json [11:53:03] Just decom without changing anything else seemed to work when we were doing it quickly, but async apparently gets in the way, I'm sorry [11:53:33] kamila_: but it's in a 10G rack.... hmm maybe they already moved it? 
[11:54:22] Well in that case someone is confused, most likely me :-D [11:54:39] kamila_: ah my bad, it is indeed connected to a 10/25G switch, but all the port blocks on it are set to 1G so probably it does need to move [11:54:40] ignore me [11:55:14] I'll have a task number in a sec, omw home from doctor [11:56:00] no worries, you / dc-ops are right I think it needs to move :( [11:56:36] I've just become aware of a headache I'd not fully considered before, will spare you the details but really sucks we gotta move this will be many more the same I fear :( [11:57:05] !log enabling puppet && repool cp4037 (T360454) [11:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:10] T360454: Better Benthos performances - https://phabricator.wikimedia.org/T360454 [11:57:13] (03PS2) 10Klausman: golang: Add version 1.22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 [11:57:34] (03CR) 10Klausman: golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [11:57:45] kamila_: it's all good lets wait till DC-ops do the move and confirm the new port, hopefully be straightforward after that [11:58:16] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [11:59:50] I hope so, thank you topranks <3 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200) [12:04:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:04:57] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [12:04:59] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[12:05:01] (03PS1) 10Slyngshede: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 [12:05:16] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff) [12:06:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64834 and previous config saved to /var/cache/conftool/dbconfig/20240613-120644-marostegui.json [12:06:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:06:50] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [12:06:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:07:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:07:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:07:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:07:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64835 and previous config saved to /var/cache/conftool/dbconfig/20240613-120711-marostegui.json [12:07:27] Amir1: done [12:07:52] awesome [12:07:54] thanks [12:08:25] FIRING: 
SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:36] (03PS2) 10Slyngshede: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 [12:09:22] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]] [12:11:04] (03CR) 10Krinkle: noc: fail with a 404 when the selected wiki is nonexistent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [12:11:06] (03CR) 10Peter Fischer: [C:03+2] "We handle 429 with retry, there's a test: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/blob/main/common/s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [12:11:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64836 and previous config saved to /var/cache/conftool/dbconfig/20240613-121127-marostegui.json [12:12:05] (03Merged) 10jenkins-bot: Search update pipeline: enable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [12:12:14] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:12:21] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [12:15:33] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:16:01] !log pfischer@deploy1002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [12:16:17] (03CR) 10Slyngshede: [C:03+2] Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 (owner: 10Slyngshede) [12:17:35] (03Merged) 10jenkins-bot: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 (owner: 10Slyngshede) [12:17:56] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:19:49] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:20:34] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:20:44] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888356 (10WDoranWMF) [12:21:35] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]] (duration: 12m 13s) [12:22:04] (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/1043036 (https://phabricator.wikimedia.org/T367382) [12:24:08] (03PS1) 10Muehlenhoff: udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) [12:24:33] (03PS2) 10Muehlenhoff: udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) [12:25:31] (03CR) 10Muehlenhoff: [C:03+2] ircecho: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff) [12:26:31] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage and move 
to OVS [12:26:33] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage and move to OVS [12:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64837 and previous config saved to /var/cache/conftool/dbconfig/20240613-122634-marostegui.json [12:28:13] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 10netops, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9888375 (10MatthewVernon) Just to note that per [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&va... [12:28:29] (03PS1) 10Majavah: hieradata: cloudvirt1032: Move to single NIC setup and OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457) [12:28:54] (03CR) 10Jelto: [C:03+2] gitlab: bump exporter version to v1.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/1043036 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto) [12:29:13] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 4804 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:29:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2921/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [12:30:27] (03PS1) 10Reedy: CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) [12:30:54] !log taavi@cumin1002 START - 
Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [12:30:56] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:33:53] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudvirt1032: Move to single NIC setup and OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [12:38:24] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888452 (10elukey) IIUC we are missing DHCP's option 12 from the BMC's client. On DELL's we expect something like:... [12:39:18] !log reset BIOS/BMC to factory default on sretest1001 - T365372 [12:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:22] T365372: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372 [12:39:49] (03CR) 10Brouberol: [C:03+2] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol) [12:40:00] (03CR) 10Brouberol: [C:03+2] hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:41:42] (03PS1) 10Hashar: Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050 [12:41:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64838 and previous config saved to /var/cache/conftool/dbconfig/20240613-124141-marostegui.json [12:44:47] (03CR) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad (031 comment) [dns] - 
10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack) [12:48:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:43] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [12:50:22] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408 (10cmooney) 03NEW p:05Triage→03Low [12:51:32] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [12:52:46] !log jmm@cumin1002 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [12:55:42] (03PS1) 10Majavah: prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057 [12:56:18] (03CR) 10DCausse: "aren't 429 handled as part of the normal retry mechanism? meaning that events might enter the error queue because of throttling if the num" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [12:56:31] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888510 (10elukey) I can confirm that the sretest1001's BMC sends this: ` DHCP-Message (53), length 1: Discover Hos... 
[12:56:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64839 and previous config saved to /var/cache/conftool/dbconfig/20240613-125648-marostegui.json [12:56:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [12:56:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2922/console" [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah) [12:56:53] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [12:56:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [12:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64840 and previous config saved to /var/cache/conftool/dbconfig/20240613-125700-marostegui.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1300). [13:00:04] Nemoralis, Superpes, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64841 and previous config saved to /var/cache/conftool/dbconfig/20240613-130117-marostegui.json [13:01:19] I’m in a meeting but can deploy later [13:01:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9888532 (10cmooney) 05Open→03Resolved [13:03:59] !log jmm@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [13:04:47] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah) [13:04:58] (03CR) 10Majavah: [V:03+1 C:03+2] prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah) [13:06:47] !log installing pillow security updates [13:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:07:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:08:25] FIRING: [8x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:33] PROBLEM - Check unit 
status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:16] (03PS1) 10Majavah: openstack: nova: Ensure libvirt is running when declaring secrets [puppet] - 10https://gerrit.wikimedia.org/r/1043058 [13:09:33] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:10:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P64842 and previous config saved to /var/cache/conftool/dbconfig/20240613-131006-ladsgroup.json [13:10:24] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2923/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043058 (owner: 10Majavah) [13:10:39] (03PS8) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [13:11:32] (03PS1) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) [13:12:56] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888607 (10elukey) [13:13:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:13:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:13:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:14:03] (03PS9) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [13:14:44] HI Lucas_WMDE From what time are you available? [13:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64843 and previous config saved to /var/cache/conftool/dbconfig/20240613-131625-marostegui.json [13:16:50] o/ [13:16:52] now :) [13:16:55] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888639 (10elukey) Created https://gitlab.wikimedia.org/repos/sre/python-deploy @Volans we can change the name if you want, otherwise please push the first version of the c... 
[13:17:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:17:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:17:43] no Nemoralis yet afaict [13:17:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64844 and previous config saved to /var/cache/conftool/dbconfig/20240613-131746-ladsgroup.json [13:17:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:18:39] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bookworm [13:19:15] Superpes: I’m confused by the changed fawikibooks comments in logos.php, any idea what happened there? [13:19:25] did the script change and the file wasn’t regenerated in the meantime, or something? [13:19:52] o_O also the diffConfig reports a difference to cawiki.json [13:21:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888659 (10Papaul) @Clement_Goubert thank you. 
[13:21:43] (03PS1) 10MVernon: install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) [13:21:57] (03CR) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [13:22:00] (03PS5) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [13:22:40] (03PS2) 10Superpes15: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) [13:22:59] Lucas_WMDE Maybe it's a fix? I was confused too, but tried with another project, and the same change happened... [13:23:03] (03PS2) 10MVernon: install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) [13:23:18] I rebased the change, curious what CI will say now [13:23:30] (03CR) 10Jforrester: [C:03+1] "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) (owner: 10Reedy) [13:24:51] (03PS1) 10Clément Goubert: mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) [13:25:05] (03CR) 10DCausse: [C:03+1] "thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [13:25:09] Superpes: AFAICT the cawiki change might be correct, https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ca.svg indeed has width="120" and height="14" [13:25:12] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1035852/1..2 This patch (related to the fawikibooks issue) was likely created without using tox [13:25:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64845 and previous config saved to /var/cache/conftool/dbconfig/20240613-132512-ladsgroup.json [13:25:15] still completely baffling where it comes from though [13:26:12] (03CR) 10CDanis: [C:03+1] "thanks taavi!" [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah) [13:26:15] hang on [13:26:27] oh, wait. that change is actually in logos.php [13:26:29] !log installing pillow security updates [13:26:30] I just didn’t notice it before [13:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:39] okay that explains the diffConfig at least [13:27:07] Lucas_WMDE Afaik, if you don't use tox, you'll get a -1... 
but don't know why the checks were fine in the fawikibooks patch :/ [13:27:17] (03PS1) 10Btullis: Switch the role for an-redacteddb1001 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) [13:28:05] (03CR) 10Effie Mouzeli: [C:03+1] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:28:21] jouncebot: nowandnext [13:28:21] For the next 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1300) [13:28:21] In 1 hour(s) and 31 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500) [13:28:32] will wait :) [13:28:37] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2924/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [13:28:51] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:30:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "The additional changes are confusing, but as far as I can tell, harmless (fawiktionary comments) or correct (cawiki’s logo is indeed 120x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15) [13:30:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15) [13:30:37] let’s try it [13:30:41] About cawiki, yep, it seems correct.. but I didn't run the script for cawiki! 
So it's still weird, but maybe tox fixes all the issues when run, so let's say everything is fine :D [13:30:51] !log upgrading spicerack on cumin2002 to v8.6.0 [13:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:59] * Lucas_WMDE is very reluctant to touch / run tox ^^ [13:31:10] (03CR) 10Brouberol: [C:03+1] "Nicely done" [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [13:31:11] I'll check cawiki too on WMDebug just to be sure [13:31:19] yeah, I was gonna do that too, thanks [13:31:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64846 and previous config saved to /var/cache/conftool/dbconfig/20240613-133132-marostegui.json [13:31:43] (03Merged) 10jenkins-bot: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15) [13:31:53] FWIW, the tagline at https://ca.wikipedia.org/wiki/Portada doesn’t look especially “stretched” to me at the moment [13:32:12] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]] [13:32:16] but then again, 112/13 and 120/14 is almost the same aspect ratio [13:32:17] T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247 [13:32:23] (03CR) 10Btullis: [V:03+1 C:03+2] Switch the role for an-redacteddb1001 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [13:32:25] (a bit over eight and a half) [13:32:52] I guess it will become a smidgen bigger [13:33:05] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:33:44] !log pfischer@deploy1002 helmfile [staging] 
DONE helmfile.d/services/cirrus-streaming-updater: apply [13:34:02] Yep indeed but maybe 112/13 was manually added and tox doesn't like it lmao :D [13:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:34:58] !log lucaswerkmeister-wmde@deploy1002 superpes, lucaswerkmeister-wmde: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:35:12] Superpes: please test :) [13:35:41] yeah the tagline grows a tiny bit [13:35:51] looks fine to me tbh [13:35:53] Yep and looks fine [13:36:02] * Lucas_WMDE peeks at svwiktionary [13:36:04] Yep and on svwikt too :) [13:36:21] !log lucaswerkmeister-wmde@deploy1002 superpes, lucaswerkmeister-wmde: Continuing with sync [13:36:27] (03CR) 10Hashar: [C:03+2] Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050 (owner: 10Hashar) [13:36:33] I don't like the gold color of the logo tbh :D [13:36:49] But it's not my choice lmao [13:38:04] wiki sovereignty \oi [13:38:05] * \o/ [13:38:32] (03CR) 10Hnowlan: [C:03+1] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:38:45] Lol [13:39:13] I also have a problem with another patch on a wordmark, tox sets it to 2x1px resolution, which is absurd [13:39:33] I tried to fix the svg but the situation didn't change [13:39:56] huh [13:40:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64847 and previous config saved to /var/cache/conftool/dbconfig/20240613-134017-ladsgroup.json [13:40:56] 
Furthermore, I also had to fix these svwiktionary logos because they didn't meet the resolution standards! Unfortunately the guidelines are not read, and a lot of people upload logos and wordmarks thinking they're fine like this :D [13:40:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888707 (10Jhancock.wm) rails, power, and network cables prepped for mw2282 move. [13:41:15] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888709 (10hnowlan) >>! In T361835#9712223, @SGupta-WMF wrote: > @WDoranWMF Ye... [13:41:38] (03PS1) 10Hashar: Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762) [13:42:09] 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9888712 (10cmooney) We could use these cables but the host side but we might not have enough slack to connect to servers at dif... [13:44:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9888715 (10Jhancock.wm) @Marostegui thank you for the reminder. I will be getting this racked on Friday most likely. also thank you for updating puppet files! 
[13:44:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:44:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:44:52] (03Merged) 10jenkins-bot: Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050 (owner: 10Hashar) [13:44:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T352010)', diff saved to https://phabricator.wikimedia.org/P64848 and previous config saved to /var/cache/conftool/dbconfig/20240613-134456-ladsgroup.json [13:45:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:45:08] (03CR) 10Hashar: [C:03+2] Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar) [13:45:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]] (duration: 13m 24s) [13:45:41] T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247 [13:46:12] Superpes: should be done :) [13:46:24] still no sign of Nemoralis afaict [13:46:37] (03PS3) 10Lucas Werkmeister (WMDE): Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) [13:46:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64849 and previous config saved to /var/cache/conftool/dbconfig/20240613-134639-marostegui.json [13:46:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with 
reason: Maintenance [13:46:44] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [13:46:48] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:40:00 on lsw1-f6-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f6-eqiad [13:46:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [13:47:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:40:00 on lsw1-f6-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f6-eqiad [13:47:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64850 and previous config saved to /var/cache/conftool/dbconfig/20240613-134701-marostegui.json [13:47:07] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [13:47:45] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9888730 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94b81d4d-316b-4c68-b4a9-a2d07057d180) set by cmooney... [13:48:33] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [13:48:53] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9888734 (10Eevans) >>! In T362033#9885505, @VRiley-WMF wrote: > It certainly does! I will plan for this tomorrow and start prepping a motherboard for this unit. Thanks! Standing by; Let me know! 
[13:49:03] jouncebot: next [13:49:03] In 1 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500) [13:49:55] (03CR) 10Lucas Werkmeister (WMDE): "Note: this is okay because all Test Wikidata clients have reached wmf.9; the wmf.8 backport had to be aborted, so if the train has to be r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:50:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:50:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64851 and previous config saved to /var/cache/conftool/dbconfig/20240613-135010-marostegui.json [13:50:48] (03Merged) 10jenkins-bot: Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:51:21] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]] [13:51:25] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:53:01] (03Merged) 10jenkins-bot: Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar) [13:53:55] Lucas_WMDE I can check Nemoralis patch :) [13:53:57] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:54:36] testing my own patch at the moment [13:54:50] (03PS2) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) [13:55:16] !log roll-restarting shellbox-constraints [13:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:22] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: sync [13:55:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64852 and previous config saved to /var/cache/conftool/dbconfig/20240613-135523-ladsgroup.json [13:55:28] looks good so far… [13:55:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: sync [13:55:56] (03PS3) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) [13:56:48] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:57:35] (03CR) 10JHathaway: [C:03+1] Replace development server with uWSGI. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [13:58:03] (03PS4) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) [13:58:25] FIRING: [8x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:58:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:59:14] (03PS1) 10Majavah: hieradata: Move cloudvirt1033 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043071 (https://phabricator.wikimedia.org/T364457) [13:59:33] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:59:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: sync [13:59:51] Superpes: I don’t think we’ll have time for that anyway, sorry [13:59:57] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1033.eqiad.wmnet with reason: reimage and move to OVS [14:00:09] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: sync [14:00:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1033.eqiad.wmnet with reason: reimage and move to OVS [14:00:10] Oh yep no problem :) 
Thanks for your assistance btw :P [14:00:15] (03CR) 10JMeybohm: [C:03+2] ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [14:01:12] (03Merged) 10jenkins-bot: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [14:03:07] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [14:03:17] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [14:03:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:53] (03PS3) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 [14:03:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:04:12] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1033 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043071 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [14:04:43] (03CR) 10CI reject: [V:04-1] geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack) [14:05:18] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64853 and previous config saved to /var/cache/conftool/dbconfig/20240613-140517-marostegui.json [14:05:35] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]] (duration: 14m 14s) [14:05:39] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [14:05:44] !log UTC afternoon backport+config window done [14:05:46] ping claime :) [14:05:46] (03PS4) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 [14:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:57] Thanks Lucas_WMDE :) [14:06:16] (03CR) 10Filippo Giunchedi: [C:03+2] grafana: change performance testing graphite endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1042223 (https://phabricator.wikimedia.org/T367064) (owner: 10Filippo Giunchedi) [14:06:38] (03CR) 10Muehlenhoff: [C:03+2] udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [14:06:46] (03PS3) 10Filippo Giunchedi: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto) [14:06:46] (03PS1) 10Filippo Giunchedi: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) [14:06:47] (03PS1) 10Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) [14:06:49] (03PS1) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) [14:08:49] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888788 (10dcaro) Added two pannels to the health... [14:08:53] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888795 (10dcaro) Added the discards also to the c... [14:08:57] 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888796 (10dcaro) 05Open→03Resolved [14:12:38] (03CR) 10Peter Fischer: [C:03+2] "Yes, they would be retried like any other failed request. We could make an exception here and let the HTTP client retry in case of 429. 
Th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [14:15:24] (03CR) 10Alexandros Kosiaris: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:15:25] !log cgoubert@deploy1002 Started scap: Change mwapi listener to mw-api-int - T333120 [14:15:30] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [14:16:16] (03PS1) 10Elukey: profile::docker::reporter: update exclude filter [puppet] - 10https://gerrit.wikimedia.org/r/1043082 [14:16:20] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [14:16:48] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [14:16:58] (03CR) 10Filippo Giunchedi: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:18:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64854 and previous config saved to /var/cache/conftool/dbconfig/20240613-141810-ladsgroup.json [14:18:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:18:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:51] (03CR) 10BBlack: [C:03+2] geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack) [14:19:04] (03PS5) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 
10https://gerrit.wikimedia.org/r/1042490 [14:19:46] (03Abandoned) 10BBlack: geodns: eqiad non-primary for all public users [dns] - 10https://gerrit.wikimedia.org/r/545385 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack) [14:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64855 and previous config saved to /var/cache/conftool/dbconfig/20240613-142024-marostegui.json [14:20:45] (03CR) 10EoghanGaffney: [C:03+1] rename gitlab-replica to gitlab-replica-a [dns] - 10https://gerrit.wikimedia.org/r/1042344 (owner: 10Dzahn) [14:21:00] (03CR) 10EoghanGaffney: [C:03+1] gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn) [14:21:09] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [14:21:24] !log cgoubert@deploy1002 Finished scap: Change mwapi listener to mw-api-int - T333120 (duration: 06m 47s) [14:21:28] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9888856 (10Clement_Goubert) [14:21:31] T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 [14:21:32] (03PS1) 10Hashar: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762) [14:23:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:23:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:05] (03PS2) 10Filippo Giunchedi: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) [14:24:05] (03PS2) 10Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) [14:24:05] (03PS2) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) [14:24:05] (03PS1) 10Filippo Giunchedi: shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) [14:24:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [14:24:56] hmm looking at the memcached issue [14:24:56] (03PS1) 10Jelto: sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - 10https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341) [14:25:13] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack) [14:27:04] (03PS6) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [14:27:15] (03CR) 10Hashar: [C:03+2] Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar) [14:27:19] !log authdns-update for https://gerrit.wikimedia.org/r/1042490 (remaps some Facebook ranges to codfw+eqiad) [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:27:49] bblack: neat [14:27:55] (03Merged) 10jenkins-bot: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar) [14:28:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:28:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:28:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:29:18] (03CR) 10Elukey: Allow to only report images of supported Debian versions (033 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [14:30:23] (03CR) 10Andrea Denisse: [C:03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:30:44] (03CR) 10Elukey: "I have zero context on this, it is difficult to review from the commit msg. Janis could you expand it a little to add more details?" 
[debs/debdeploy] - 10https://gerrit.wikimedia.org/r/643912 (owner: 10JMeybohm) [14:32:11] !log hashar@deploy1002 Started deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit2002 # T358762 [14:32:15] T358762: Gerrit commit message formatting does not handle angle-bracketed URLs well, adds extra semicolon - https://phabricator.wikimedia.org/T358762 [14:32:18] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit2002 # T358762 (duration: 00m 07s) [14:33:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64856 and previous config saved to /var/cache/conftool/dbconfig/20240613-143318-ladsgroup.json [14:33:56] (03PS1) 10Clément Goubert: shellbox-constraints: bump to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043087 [14:34:56] (03CR) 10Alexandros Kosiaris: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:35:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64857 and previous config saved to /var/cache/conftool/dbconfig/20240613-143531-marostegui.json [14:35:33] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9888942 (10elukey) [14:35:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [14:35:36] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [14:35:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [14:35:55] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64858 and previous config saved to /var/cache/conftool/dbconfig/20240613-143554-marostegui.json [14:37:43] (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: update exclude filter [puppet] - 10https://gerrit.wikimedia.org/r/1043082 (owner: 10Elukey) [14:38:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64859 and previous config saved to /var/cache/conftool/dbconfig/20240613-143859-marostegui.json [14:40:25] (03PS1) 10Filippo Giunchedi: zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) [14:40:27] (03PS1) 10Filippo Giunchedi: apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) [14:40:47] I am doing a quick upgrade of Gerrit again [14:40:58] !log hashar@deploy1002 Started deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit1003 # T358762 [14:41:03] T358762: Gerrit commit message formatting does not handle angle-bracketed URLs well, adds extra semicolon - https://phabricator.wikimedia.org/T358762 [14:41:03] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit1003 # T358762 (duration: 00m 05s) [14:41:31] (03CR) 10CI reject: [V:04-1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:41:32] (03CR) 10CI reject: [V:04-1] apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:43:57] (03Merged) 10jenkins-bot: shellbox-constraints: bump to 10 
replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043087 (owner: 10Clément Goubert) [14:44:06] gerrit upgraded [14:44:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [14:44:25] (03CR) 10Hashar: "recheck due to Gerrit restart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:44:29] (03CR) 10Hashar: "recheck due to Gerrit restart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:44:29] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [14:44:30] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:44:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [14:44:39] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [14:45:52] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003 [14:46:06] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [14:46:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:46:30] (03CR) 10Muehlenhoff: [C:03+2] puppetserver::git::private: Use wrapper from puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1037778 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:47:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003 [14:48:11] (03CR) 10CDanis: [C:03+1] eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:48:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64860 and previous config saved to /var/cache/conftool/dbconfig/20240613-144825-ladsgroup.json [14:48:36] (03CR) 10CDanis: [C:03+1] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:13] (03CR) 10CDanis: [C:03+1] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:15] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm [14:49:19] !log rebalance ganeti/B in eqiad following reboots [14:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:24] (03CR) 10CDanis: [C:03+1] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:30] (03CR) 10CDanis: [C:03+1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:32] (03CR) 10Alexandros Kosiaris: [C:03+1] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:39] (03CR) 10Alexandros Kosiaris: [C:03+1] wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:48] (03CR) 10Alexandros Kosiaris: [C:03+1] eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:49:50] !log elukey@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [14:49:59] (03CR) 10Alexandros Kosiaris: [C:03+1] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:50:02] (03PS1) 10Brouberol: datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) [14:50:17] (03CR) 10Alexandros Kosiaris: [C:03+1] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:50:34] (03CR) 10CDanis: [C:03+2] otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis) [14:50:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 depool ahead of T365983', diff saved to https://phabricator.wikimedia.org/P64861 and previous config saved to /var/cache/conftool/dbconfig/20240613-145035-arnaudb.json [14:50:40] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [14:50:41] (03CR) 10CI reject: [V:04-1] datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [14:50:49] (03CR) 10Alexandros Kosiaris: [C:03+1] apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:50:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1039.eqiad.wmnet with reason: T365983 [14:51:00] (03CR) 10Alexandros Kosiaris: [C:03+1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 
(https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:51:01] (03PS1) 10Filippo Giunchedi: mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) [14:51:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1039.eqiad.wmnet with reason: T365983 [14:52:43] (03PS1) 10Dbrant: Look for iPadOS in user-agent, in addition to iOS. [extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723) [14:53:23] (03PS2) 10Brouberol: datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) [14:53:37] (03Merged) 10jenkins-bot: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis) [14:53:58] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:54:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64862 and previous config saved to /var/cache/conftool/dbconfig/20240613-145406-marostegui.json [14:55:49] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:55:55] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:57:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003 [14:57:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003 [14:57:27] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:57:29] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:57:58] (03PS2) 10NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) [14:58:46] (03CR) 10Brouberol: "As it turns out, the mce/mae-consumer pods already expose JMX metrics." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol) [14:59:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:36] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003 [14:59:36] !log cdanis@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:59:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003 [15:00:04] brennen and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500). [15:00:55] !log cdanis@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:01:01] !log cdanis@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:01:10] (03PS1) 10Muehlenhoff: Cleanup puppetmaster preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1043114 [15:01:35] !log cdanis@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[15:01:57] (03PS11) 10EoghanGaffney: lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174 [15:03:24] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-f6-eqiad,lsw1-f6-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f6-eqiad [15:03:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-f6-eqiad,lsw1-f6-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f6-eqiad [15:03:32] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2925/co" [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney) [15:03:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64863 and previous config saved to /var/cache/conftool/dbconfig/20240613-150332-ladsgroup.json [15:03:36] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889146 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=891c00a3-b649-4659-b39f-5ad6b01367a9) set by cmooney... 
[15:03:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:04:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:35:00 on an-worker[1169-1171].eqiad.wmnet,es1039.eqiad.wmnet,ms-be1080.eqiad.wmnet with reason: JunOS upgrade lsw1-f6-eqiad [15:04:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on an-worker[1169-1171].eqiad.wmnet,es1039.eqiad.wmnet,ms-be1080.eqiad.wmnet with reason: JunOS upgrade lsw1-f6-eqiad [15:04:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889149 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5a6a58c5-4681-4aea-8e80-e8ba2c613022) set by cmooney... [15:04:47] !log rebooting lsw1-f6-codfw to upgrade JunOS on switch T365983 [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:05:58] !log upgrading spicerack on cumin1002 to v8.6.0 [15:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9889158 (10elukey) Hi @Jhancock.wm! I was able to tcpdump the DHCP traffic sent from the host's BMC to `install2004`, and sadly it doesn't set any valid Hostname. This i... 
[15:07:25] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:07:37] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:07:43] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:07:48] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:07:57] (03PS1) 10MVernon: apus: setup for codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) [15:07:58] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:08:07] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:08:52] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:09:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64864 and previous config saved to /var/cache/conftool/dbconfig/20240613-150913-marostegui.json [15:10:06] (03CR) 10Arnaudb: [C:03+1] apus: setup for codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:11:04] (03CR) 10Arnaudb: [C:03+1] install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:15:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:15:36] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889259 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq... [15:16:26] (03CR) 10MVernon: [C:03+2] install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:17:50] (03PS1) 10JHathaway: postfix: misc postfix mx profile fixes [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) [15:18:23] (03CR) 10JHathaway: [C:03+1] Cleanup puppetmaster preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1043114 (owner: 10Muehlenhoff) [15:18:36] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [15:19:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64865 and previous config saved to /var/cache/conftool/dbconfig/20240613-151910-marostegui.json [15:19:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:22:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889279 (10cmooney) Switch has reloaded on the new version, all looks good at first glance. ` cmooney@lsw1-f6-eqiad> show inter... 
[15:22:12] !log drop eventgate-ci docker images from the Docker Registry [15:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64866 and previous config saved to /var/cache/conftool/dbconfig/20240613-152300-arnaudb.json [15:23:04] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64867 and previous config saved to /var/cache/conftool/dbconfig/20240613-152420-marostegui.json [15:24:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [15:24:25] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [15:24:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [15:25:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm [15:26:25] !log drop mediawiki-services-parsoid docker images from the Docker Registry - T367427 [15:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:29] T367427: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427 [15:27:08] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:27:14] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.... [15:27:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [15:27:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [15:27:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64868 and previous config saved to /var/cache/conftool/dbconfig/20240613-152748-marostegui.json [15:28:06] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:28:11] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq... 
[15:28:13] !log STOPPED lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 --start '["55386869"]' 2>&1 | tee -a ~/T315510-enwiki-9; date # Ctrl+C – had slowed down, unnecessary work by this point; was at --start '["55914913"]' [15:28:13] (03CR) 10EoghanGaffney: [C:03+1] sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - 10https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341) (owner: 10Jelto) [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:29] (03PS1) 10JHathaway: postfix: mx-in role [puppet] - 10https://gerrit.wikimedia.org/r/1043124 (https://phabricator.wikimedia.org/T325406) [15:30:34] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney) [15:30:40] (03PS1) 10JMeybohm: ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310) [15:30:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64869 and previous config saved to /var/cache/conftool/dbconfig/20240613-153056-marostegui.json [15:31:02] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [15:32:24] (03CR) 10JMeybohm: [C:03+2] ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:32:27] (03PS1) 10Jforrester: Convert local function to arrow function to fix context [extensions/Echo] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366) [15:33:18] (03Merged) 10jenkins-bot: 
ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:34:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64870 and previous config saved to /var/cache/conftool/dbconfig/20240613-153417-marostegui.json [15:34:51] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [15:34:53] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [15:35:01] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/ratelimit: apply [15:35:15] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9889387 (10Papaul) @kamila no problem we can move that one. Once done we will update the task. [15:35:30] (03PS2) 10JHathaway: postfix: misc postfix mx profile fixes [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) [15:36:09] PROBLEM - Host registry2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:11] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2002.codfw.wmnet with OS bookworm [15:36:21] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm executed with errors: - moss-fe2002 (... 
[15:36:57] (03CR) 10JHathaway: [C:03+2] postfix: mx-in role [puppet] - 10https://gerrit.wikimedia.org/r/1043124 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [15:37:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm [15:37:07] PROBLEM - Host apt2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:17] PROBLEM - Host cloudidm2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [15:37:19] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm [15:37:23] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889405 (10MatthewVernon) Swift looks good, thanks. [15:37:23] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [15:37:37] PROBLEM - Host kubemaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:39] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [15:38:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64871 and previous config saved to /var/cache/conftool/dbconfig/20240613-153805-arnaudb.json [15:38:10] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:38:16] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [15:38:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:22] FIRING: [2x] ProbeDown: Service 
cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:31] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:37] !log cdobbins@cumin1002 sudo -i cookbook sre.cdn.roll-reboot --alias 'cp-upload_eqsin' --batchsize 1 --reason T366555 --task-id T366555 --grace-sleep 5400 [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:41] FIRING: [2x] ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:46] !log cdobbins@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin [15:39:14] here [15:39:33] !incidents [15:39:34] 4746 (UNACKED) [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw) [15:39:34] 4745 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [15:39:34] 4743 (RESOLVED) [2x] ProbeDown sre (probes/custom eqiad) [15:39:34] 4740 (RESOLVED) [6x] ProbeDown sre (probes/service ulsfo) [15:39:39] here [15:39:44] ^ we have slightly reduced kubemaster capacity in codfw (one of the new hw nodes is down) [15:39:46] not sure if related [15:39:56] !ack 4746 [15:39:57] 4746 (ACKED) [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw) [15:40:24] registry and cloudidm going down at the same time smells like ganeti [15:40:36] iirc all of them are vms [15:40:40] 06SRE, 10MW-on-K8s, 06serviceops, 
06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889411 (10Jdforrester-WMF) Looks like this is now done except for "some straggling traffic" for the api-gateway? {F55289507} [15:40:41] great, we have more reduced capacity! \o/ [15:41:02] I hope we are not moving the wrong server accidentally [15:41:11] (03PS1) 10EoghanGaffney: lists: Switch mailman_root for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043127 [15:41:12] don't see a spike in requests on k8s api in codfw [15:41:33] FIRING: KubernetesCalicoDown: kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubemaster2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:41:35] effie: that would seriously suck '^^ the move should be happening about now :D [15:41:42] !log cdobbins@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin [15:42:00] is kube-ctrl up ? 
[15:42:01] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage [15:42:41] (03PS1) 10Muehlenhoff: Stop syncing swift rings on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) [15:42:49] effie: wikikube-ctrl2003 is decommed [15:43:01] effie: yes, on ctrl2002 [15:43:06] the other two should be up [15:43:08] root@deploy1002:~# kubectl -n kube-system get leases.coordination.k8s.io [15:43:10] NAME HOLDER AGE [15:43:12] cert-manager-cainjector-leader-election cert-manager-cainjector-79df7c6cc8-jb6rf_9db5da6a-27c6-45ff-8aec-b1c273c06c90 478d [15:43:14] cert-manager-controller cert-manager-ff469f6b6-tt7t7-external-cert-manager-controller 478d [15:43:16] kube-controller-manager wikikube-ctrl2002_af95e93d-6681-462f-86da-75f450626107 478d [15:43:16] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2926/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [15:43:18] kube-scheduler wikikube-ctrl2002_be7d4a2d-3abf-41dd-9749-fa36d599d3a4 478d [15:43:35] uhh [15:43:39] 💙cdanis@ganeti2020.codfw.wmnet ~ 🕦☕ sudo gnt-instance list [15:43:41] it's hanging [15:43:45] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:54] FIRING: [2x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:43:54] oh, there it goes [15:44:14] (03CR) 10Dzahn: [V:03+1 C:03+1] "lgtm!
uid 46919 - https://app.betterworks.com/app/#/profile/441803" [puppet] - 10https://gerrit.wikimedia.org/r/1042331 (https://phabricator.wikimedia.org/T367053) (owner: 10Herron) [15:44:20] I think something is funky with the ganeti master in codfw? [15:44:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage [15:45:35] cdanis: is that command usually quicker? [15:45:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:46:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P64872 and previous config saved to /var/cache/conftool/dbconfig/20240613-154603-marostegui.json [15:46:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:46:19] jhathaway: I thought so? 
but I could be wrong [15:46:24] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9889457 (10Dzahn) 05Open→03In progress [15:46:44] cdanis: roger, not sure myself [15:46:57] it takes <1s on eqiad [15:47:19] --> #-sre [15:47:26] definitely feels slow [15:49:24] ganeti2028 seems common between vms that are down according to icinga [15:49:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64873 and previous config saved to /var/cache/conftool/dbconfig/20240613-154924-marostegui.json [15:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [15:50:00] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-fe2002.codfw.wmnet with OS bookworm [15:50:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm [15:50:18] and some interesting drbd messages in dmesg on ganeti2028 [15:51:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [15:52:23] !log drop mediawiki-services-restbase docker images from the Docker Registry - T367427 [15:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:28] T367427: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427 [15:53:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64874 and previous config saved to /var/cache/conftool/dbconfig/20240613-155310-arnaudb.json [15:53:16] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:53:45] RESOLVED: SystemdUnitFailed: 
netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:05] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:54:25] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889560 (10Clement_Goubert) Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters. [15:54:51] (03PS1) 10Elukey: profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - 10https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427) [15:55:40] RESOLVED: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:55:59] (03PS2) 10Ryan Kemper: wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - 10https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [15:56:35] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [15:57:06] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889580 (10cmooney) 05Open→03Resolved Thanks for checking things, all stable on our side I will close the task now. 
[15:57:55] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889584 (10hnowlan) I believe the straggling traffic here is a misnomer/a graph misunderstanding - the API gateway refers to traffic to the mediawiki API as "mwapi_cluster"... [15:58:11] (03CR) 10MVernon: "I'm in principle happy for this to go ahead, but I'm afraid I don't know enough about the puppetserver puppet code to feel confident givin" [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:58:45] FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:18] (03CR) 10JHathaway: [C:03+2] postfix: misc postfix mx profile fixes [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P64875 and previous config saved to /var/cache/conftool/dbconfig/20240613-160110-marostegui.json [16:02:39] (03CR) 10Elukey: Allow to only report images of supported Debian versions (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm) [16:04:17] PROBLEM - Host ganeti2028 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:26] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:04:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64876 and previous config saved to /var/cache/conftool/dbconfig/20240613-160431-marostegui.json [16:04:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [16:04:36] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:04:37] (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - 10https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [16:04:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [16:04:53] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:04:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64877 and previous 
config saved to /var/cache/conftool/dbconfig/20240613-160453-marostegui.json [16:04:57] (03CR) 10Herron: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1042331 (https://phabricator.wikimedia.org/T367053) (owner: 10Herron) [16:05:27] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - 10https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [16:05:34] 06SRE, 06Infrastructure-Foundations, 10netops: No IPv6 ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 (10cmooney) 03NEW p:05Triage→03High [16:05:46] FIRING: ProbeDown: Service ganeti2028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:52] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@ee5a291]: make public data from wdqs subgraph analysis readable by others [16:07:15] RECOVERY - Host registry2003 is UP: PING WARNING - Packet loss = 66%, RTA = 0.32 ms [16:07:15] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@ee5a291]: make public data from wdqs subgraph analysis readable by others (duration: 00m 22s) [16:07:35] PROBLEM - Docker registry health on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:07:37] PROBLEM - Docker registry HTTPS interface certificate expiry on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [16:08:05] PROBLEM - SSH on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:08:05] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Docker [16:08:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64878 and previous config saved to /var/cache/conftool/dbconfig/20240613-160816-arnaudb.json [16:08:20] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [16:08:25] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - 10https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [16:08:36] !log forcibly rebooted ganeti2028, drdbd hung [16:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [16:08:45] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:45] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:52] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:09:11] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:09:16] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889670 (10VRiley-WMF) 05Open→03In progress Starting the Motherboard swap now. 
[16:11:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [16:11:46] !log gnt-node failover -f ganeti2028.codfw.wmnet [16:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [16:11:54] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [16:11:56] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9889673 (10herron) 05In progress→03Resolved a:03herron Group membership has been provisioned, thanks! [16:12:10] (03PS1) 10Pppery: Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) [16:12:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery) [16:12:34] (03PS1) 10Majavah: openstack: nova-fullstack: Use g4 flavor [puppet] - 10https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458) [16:13:38] PROBLEM - Host registry2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:08] PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:21] (03CR) 10Abijeet Patro: [C:03+1] Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 
(https://phabricator.wikimedia.org/T49177) (owner: 10Pppery) [16:15:46] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64880 and previous config saved to /var/cache/conftool/dbconfig/20240613-161617-marostegui.json [16:16:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [16:16:22] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [16:16:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [16:16:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64881 and previous config saved to /var/cache/conftool/dbconfig/20240613-161641-marostegui.json [16:17:59] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033 [16:18:03] T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033 [16:18:13] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033 [16:18:21] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889731 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7d73e7a7-7fc0-4f4e-8b18-84ce78db6c6b) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r... 
[16:18:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [16:18:27] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [16:18:45] FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:18:56] 06SRE, 06Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#9889726 (10jijiki) (me too ubuntu-forum style reply) This happened again on ganeti2028: ` [Thu Jun 13 15:38:21 2024] INFO: task drbd_r_resource:1033579 blocked for more than 121... [16:18:57] jouncebot nowandnext [16:18:58] For the next 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1600) [16:18:58] In 0 hour(s) and 41 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700) [16:18:58] In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700) [16:19:36] James_F: shall i go ahead and sling out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1043126 ? [16:19:43] brennen: Sure! [16:19:52] Sorry, distracted by other things. 
[16:20:02] thanks for getting that in order! [16:20:04] RECOVERY - Host ganeti2028 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [16:20:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/Echo] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366) (owner: 10Jforrester) [16:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64882 and previous config saved to /var/cache/conftool/dbconfig/20240613-162040-marostegui.json [16:20:58] RECOVERY - SSH on registry2003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:21:00] RECOVERY - Host apt2002 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms [16:21:00] RECOVERY - Host registry2003 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [16:21:00] RECOVERY - Host cloudidm2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms [16:21:28] RECOVERY - Host kubemaster2001 is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms [16:21:32] RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Docker [16:21:36] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:21:38] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:21:56] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:22:58] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Docker [16:23:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: post T365983 
repool', diff saved to https://phabricator.wikimedia.org/P64883 and previous config saved to /var/cache/conftool/dbconfig/20240613-162321-arnaudb.json [16:23:22] RESOLVED: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:26] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [16:23:32] RECOVERY - Docker registry HTTPS interface certificate expiry on registry2003 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Fri 28 Jun 2024 08:55:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [16:23:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:23:41] RESOLVED: [2x] ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:45] FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:50] FIRING: [2x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:10] !log gitlab-replica.wikimedia.org - short downtime - renaming to gitlab-replica-a [16:24:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:15] (03CR) 10Dzahn: [C:03+2] rename gitlab-replica to gitlab-replica-a [dns] - 10https://gerrit.wikimedia.org/r/1042344 (owner: 10Dzahn) [16:25:17] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442 (10RKemper) 03NEW [16:25:19] (03PS2) 10Dzahn: rename gitlab-replica to gitlab-replica-a [dns] - 10https://gerrit.wikimedia.org/r/1042344 [16:25:34] (03CR) 10Dzahn: "not a netbox change - these are just marked as manually managed there" [dns] - 10https://gerrit.wikimedia.org/r/1042344 (owner: 10Dzahn) [16:25:42] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9889792 (10RKemper) [16:25:43] (03CR) 10Andrew Bogott: [C:03+1] openstack: nova-fullstack: Use g4 flavor [puppet] - 10https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458) (owner: 10Majavah) [16:26:33] RESOLVED: KubernetesCalicoDown: kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubemaster2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:27:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:28:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm [16:29:09] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889808 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for 
host moss-fe2002.codfw.wmnet with OS bookworm completed: - moss-fe2002 (**PASS**)... [16:29:46] FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [16:30:13] 06SRE, 06Infrastructure-Foundations, 10netops: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9889810 (10cmooney) [16:30:23] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:30:46] FIRING: [3x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:51] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9889825 (10elukey) @KartikMistry @santhosh Hi! Getting back to this task since it is getting attention from other pe... 
[16:31:57] (03CR) 10Majavah: [C:03+2] openstack: nova-fullstack: Use g4 flavor [puppet] - 10https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458) (owner: 10Majavah) [16:34:26] (03PS33) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [16:35:16] (03CR) 10Dzahn: [C:03+2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1042344 (owner: 10Dzahn) [16:35:28] (03PS1) 10JMeybohm: Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 [16:35:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64884 and previous config saved to /var/cache/conftool/dbconfig/20240613-163547-marostegui.json [16:37:02] (03CR) 10JMeybohm: golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman) [16:37:33] (03PS2) 10Dzahn: gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 [16:39:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:39:43] (03PS3) 10Dzahn: gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 [16:40:17] (03Merged) 10jenkins-bot: Convert local function to arrow function to fix context [extensions/Echo] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366) (owner: 10Jforrester) [16:40:52] !log brennen@deploy1002 Started scap: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]] [16:40:56] T367366: Failed to fetch notifications: Notifications fail to load - https://phabricator.wikimedia.org/T367366 [16:41:09] (03PS34) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 
(https://phabricator.wikimedia.org/T349069) [16:41:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002" [16:42:16] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 36 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:43:26] (03CR) 10Tacsipacsi: "Thanks for backporting this!" [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery) [16:43:29] !log brennen@deploy1002 jforrester, brennen: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:43:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002" [16:43:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:31] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:46:17] (03PS3) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) [16:46:24] (03CR) 10Dzahn: [C:03+2] gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn) [16:47:16] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 10 probes of 789 (alerts on 35) - 
https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:48:26] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889928 (10VRiley-WMF) 05In progress→03Open Motherboard has been swapped, returning ticket into open status. [16:48:46] !log brennen@deploy1002 jforrester, brennen: Continuing with sync [16:49:05] (03PS1) 10Andrew Bogott: nova policy: temporarily disable VM resizing [puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458) [16:49:19] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:50:14] (03CR) 10Majavah: [C:03+1] nova policy: temporarily disable VM resizing [puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458) (owner: 10Andrew Bogott) [16:50:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64885 and previous config saved to /var/cache/conftool/dbconfig/20240613-165055-marostegui.json [16:51:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED [16:52:01] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:52:44] (03PS2) 10JMeybohm: Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 [16:53:07] (03CR) 10Andrew Bogott: [C:03+2] nova policy: temporarily disable VM resizing 
[puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458) (owner: 10Andrew Bogott) [16:53:31] (03CR) 10Dzahn: [C:03+2] "gitlab-exporter service was temp disabled, DNS changed, config changed,, then reactivated. service is running again" [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn) [16:53:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:54:43] (03CR) 10Klausman: [C:03+1] Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 (owner: 10JMeybohm) [16:55:48] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet) [16:56:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:57:43] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]] (duration: 16m 51s) [16:57:47] T367366: Failed to fetch notifications: Notifications fail to load - https://phabricator.wikimedia.org/T367366 [16:58:00] (03CR) 10JHathaway: [C:03+1] Stop syncing swift rings on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:59:01] (03PS1) 10Andrew Bogott: Revert "nova policy: temporarily disable VM resizing" [puppet] - 
10https://gerrit.wikimedia.org/r/1043163 [17:00:04] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700) [17:00:46] nothing to do for my deploy window today. [17:01:10] (03CR) 10Majavah: [C:04-2] "not yet." [puppet] - 10https://gerrit.wikimedia.org/r/1043163 (owner: 10Andrew Bogott) [17:01:25] (03CR) 10Dzahn: [C:03+2] acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 (owner: 10Dzahn) [17:01:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:02:04] (03PS4) 10Dzahn: acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 [17:03:14] (03PS1) 10MVernon: installer/cephadm: specify a very large maximum size [puppet] - 10https://gerrit.wikimedia.org/r/1043165 (https://phabricator.wikimedia.org/T279621) [17:06:01] (03CR) 10Dzahn: [C:03+2] acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 (owner: 10Dzahn) [17:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64886 and previous config saved to /var/cache/conftool/dbconfig/20240613-170602-marostegui.json [17:06:07] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [17:13:06] (03CR) 10Dzahn: [C:03+2] move linkrecommendation service IP 
in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [17:13:09] (03PS4) 10Dzahn: move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 [17:19:18] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ using stat1009.eqiad.wmnet) [17:24:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043172 [17:25:03] (03PS4) 10Btullis: [WIP] Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) [17:25:52] (03PS1) 10Andrew Bogott: openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028) [17:25:56] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [17:26:13] (03PS2) 10Andrew Bogott: openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028) [17:26:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [17:33:50] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet [17:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:39:14] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [17:39:21] 
10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [17:41:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9890137 (10VRiley-WMF) @Clement_Goubert I believe it would be better to open a new task for any servers that need to be relabeled. [17:41:23] (03PS1) 10Dzahn: idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 [17:42:06] (03CR) 10Dzahn: "< taavi> did someone forget to cleanup the CAS config after gitlab moved from the cas protocol to OIDC?" [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn) [17:42:35] (03CR) 10Dzahn: [C:03+2] idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn) [17:44:24] (03PS10) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [17:44:27] (03PS2) 10Dzahn: idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 [17:45:25] FIRING: [3x] SystemdUnitFailed: ferm.service on mw2337:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:10] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:47:29] 10ops-eqiad, 06SRE, 06DC-Ops: hw 
troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9890151 (10VRiley-WMF) @RKemper When is there a preference on when we could schedule this? [17:47:37] (03CR) 10Dzahn: [V:03+2 C:03+2] idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn) [17:47:45] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9890165 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [17:48:41] FIRING: [2x] ProbeDown: Service wikikube-ctrl1003:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:15] o/ [17:52:34] expired downtime? [17:53:30] !incidents [17:53:30] 4747 (UNACKED) [2x] ProbeDown sre (wikikube-ctrl1003:6443 probes/custom eqiad) [17:53:30] 4746 (RESOLVED) [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw) [17:53:31] 4745 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [17:53:31] 4743 (RESOLVED) [2x] ProbeDown sre (probes/custom eqiad) [17:53:37] not sure [17:53:42] !ack 4747 [17:53:43] 4747 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1003:6443 probes/custom eqiad) [17:54:38] I see kamila_ was reimaging it earlier today [17:54:52] and it never had a positive host health, in the graphs [17:56:48] (03PS1) 10Dzahn: idp: drop gitlab-new.wikimedia.org service ID [puppet] - 10https://gerrit.wikimedia.org/r/1043181 [17:57:25] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye [17:57:34] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - 
https://phabricator.wikimedia.org/T364891#9890210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4... [17:57:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [17:58:03] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [18:00:04] brennen and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1800). [18:00:38] (03CR) 10Dzahn: [C:03+2] "no IP change, only comments and moving the entry around" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [18:01:15] (03CR) 10Dzahn: [C:03+2] "to avoid that someone uses this IP another time because the comment looks like it's free" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [18:02:36] (03CR) 10Dzahn: "should we just drop this? But it seems we still need _some_ name for "the other machine that is not a replica"." 
[puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn) [18:04:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64887 and previous config saved to /var/cache/conftool/dbconfig/20240613-180404-ladsgroup.json [18:04:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:05:15] jhathaway: yeah I think you are right, I had a look around the hots too and am inclined to leave it as-is [18:05:25] host* [18:06:12] well the host is reachable now, and has a 2+ hour uptime [18:06:34] (03CR) 10Dzahn: "certainly not "gitlab-a" and "gitlab-b" even though that would match the replicas now. but once gitlab2003.wikimedia.org is setup it will " [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn) [18:06:39] but i see this in dmesg, [18:06:41] [Thu Jun 13 15:39:39 2024] bnxt_en 0000:3b:00.1 enp59s0f1np1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit [18:07:03] puppet has some issues and a few services are broken too, looks like it's not finished with setup? not sure [18:07:40] nod [18:08:08] o/ [18:09:26] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9890246 (10KartikMistry) @elukey Yes. We can move to Swift. Is there any documentation for services using a similar... [18:10:31] was it reimaged while the puppet role was applied? probably the usual problem that it won't work on first run, only on second run [18:10:48] but then it won't work the reimage cookbook unless the prod role is temp removed [18:10:50] brennen: o/ [18:10:54] good for train deploy here? 
[18:12:58] herron: I think it is fine to leave as is, but I'll ask in serviceops, in case someone is around [18:13:36] jhathaway: thanks sgtm [18:15:25] FIRING: [3x] SystemdUnitFailed: ferm.service on mw2337:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:17] !log 1.43.0-wmf.9 train (T361403): no current blockers, rolling to group2 [18:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:25] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [18:16:48] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403) [18:16:49] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [18:17:30] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [18:17:42] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye [18:17:46] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890275 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4... 
[18:18:08] Forgot to merge the hiera config :P [18:19:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64888 and previous config saved to /var/cache/conftool/dbconfig/20240613-181911-ladsgroup.json [18:19:22] (03PS1) 10BCornwall: Set cp4038 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043185 (https://phabricator.wikimedia.org/T364891) [18:26:49] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:26:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [18:27:15] (03PS11) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [18:28:58] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [18:28:59] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [18:29:32] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.9 refs T361403 [18:29:36] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [18:34:01] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4038 is CRITICAL: connect to address 10.128.0.27 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:34:15] brennen: you seeing the "fwrite(): write of 199 bytes failed with errno=32 Broken pipe" errors? 
[18:34:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64889 and previous config saved to /var/cache/conftool/dbconfig/20240613-183417-ladsgroup.json [18:34:28] (03CR) 10CDobbins: [C:03+2] Set cp4038 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043185 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [18:35:28] oh sorry that's wmf.8. dumps related? [18:36:24] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [18:36:35] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890351 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [18:37:02] dduvall: i believe so, yeah [18:37:54] k. so much logspam this week [18:38:02] yeah, it's not quiet. 
[18:39:39] RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:45:24] (03PS35) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:45:46] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:04] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 12, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367457 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:47:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T367457 (10ops-monitoring-bot) 03NEW [18:49:00] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [18:49:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64890 and previous config saved to /var/cache/conftool/dbconfig/20240613-184924-ladsgroup.json [18:49:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:49:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:49:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:50:08] (03CR) 10Andrew Bogott: [C:03+2] openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 
(https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [19:05:12] jhathaway: I'm back, will look at the wikikube-ctrl1003 thing, sorry about that [19:05:53] kamila_: no problem at all [19:06:02] was it supposed to be up? [19:06:53] what's your definition of supposed? :D [19:07:04] I was hoping it would be, but apparently the reimage failed [19:07:27] (I was afk for a while) [19:08:13] (03PS1) 10Scott French: Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) [19:08:13] so now it's not expected to be up, I'll plop a downtime on it [19:08:22] (03CR) 10CI reject: [V:04-1] Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [19:08:50] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:08:51] kamila_: ah that makes sense, thanks [19:09:52] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:10:16] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: reimage failing [19:10:25] FIRING: [2x] SystemdUnitFailed: etcd.service on wikikube-ctrl1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: reimage failing [19:10:36] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 
10G - https://phabricator.wikimedia.org/T366204#9890462 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ffb1c0b-d404-4615-accd-65085d64f738) set by kamila@c... [19:13:50] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890464 (10CDanis) Hi all. @joanna_borun asked me to do some looking into this. I promise I skimmed the above, but... [19:20:28] (03PS2) 10Scott French: Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) [19:22:26] (03CR) 10Eevans: [C:03+2] aqs: Upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1042234 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [19:22:58] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [19:23:31] (03PS1) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) [19:24:05] (03PS1) 10Andrew Bogott: Pass --allow-releaseinfo-change when adding new openstack client apt repos [puppet] - 10https://gerrit.wikimedia.org/r/1043199 (https://phabricator.wikimedia.org/T366028) [19:27:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [19:27:18] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.... 
[19:27:21] 🎉 [19:27:30] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1013.eqiad.wmnet [19:27:30] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1013.eqiad.wmnet [19:28:32] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002 [19:28:37] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [19:34:15] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890544 (10dcaro) >>! In T348643#9890463, @CDanis wrote: > Hi all. @joanna_borun asked me to do some looking into t... [19:38:30] (03PS1) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) [19:38:40] (03PS1) 10JHathaway: postfix: mx-in hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406) [19:38:41] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890547 (10CDanis) Very helpful, thanks @dcaro and enjoy the pto! I'll be gentle, and definitely won't do any write... 
[19:38:51] (03CR) 10CI reject: [V:04-1] Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [19:39:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:39:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:29] (03PS2) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) [19:40:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:50] (03CR) 10CI reject: [V:04-1] Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [19:41:11] (03CR) 10JHathaway: [C:03+2] postfix: mx-in hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:41:49] !log removing 2 files for legal compliance [19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:20] (03Abandoned) 10Andrew Bogott: Pass --allow-releaseinfo-change when adding new openstack client apt repos [puppet] - 10https://gerrit.wikimedia.org/r/1043199 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [19:43:02] (03PS3) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) [19:46:25] (03PS1) 10Zabe: Initial 
configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) [19:47:04] (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [19:47:35] (03PS5) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) [19:51:32] !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye [19:51:36] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: -... [19:51:42] !log removing 2 files for legal compliance [19:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:51] (03PS6) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) [19:53:15] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye [19:53:21] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye [19:57:18] (03PS7) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) [19:58:14] !log kamila@cumin1002 START 
- Cookbook sre.hosts.remove-downtime for wikikube-ctrl1003.eqiad.wmnet [19:58:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1003.eqiad.wmnet [19:59:09] !log removing 2 files for legal compliance [19:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T2000) [20:00:05] dbrant and pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] here [20:00:09] (03CR) 10Jsn.sherman: [C:03+1] "looks good to me; we've tested this locally and on beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:00:12] present [20:00:23] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl1003.eqiad.wmnet [20:00:24] (03PS1) 10JHathaway: postfix: mx-in{1001,2001} change role to postfix::mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) [20:00:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:01:49] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:02:24] have a backport that I just realized I didn't put on the calendar [20:05:14] (03CR) 10JHathaway: [C:03+2] postfix: mx-in{1001,2001} change role to postfix::mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:06:29] I stuck it in there in 
hopes of tagging along at the end: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1041699 [20:10:49] do we have a deployer on hand? I could deploy if needed [20:13:24] !log removing 1 file for legal compliance [20:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:52] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [20:15:07] looks like you've volunteered [20:15:36] dbrant: getting everything setup. both of these backports look straightforward [20:17:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723) (owner: 10Dbrant) [20:17:58] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [20:18:51] Now Jenkins will take ~20 minutes to approve the patch. You could manually +2 my patch as well so the two 20-minute delays run in parallel rather than series. 
[20:19:14] Pppery: ack [20:20:09] (03CR) 10Jsn.sherman: [C:03+2] "looks good to me; giving zuul a head start in the deployment window" [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery) [20:20:40] thanks [20:26:31] (03PS1) 10JHathaway: postfix: mx-in{1001,2001} fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406) [20:26:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:28:17] (03CR) 10JHathaway: [C:03+2] postfix: mx-in{1001,2001} fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:29:46] FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [20:31:27] (03PS2) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) [20:32:07] (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [20:32:58] 06SRE, 06Infrastructure-Foundations, 10netops: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890729 (10cmooney) It seems this was an inadvertent result of the upgrade to the codfw row A/B switches, and the move there from a purely L2 switching layer to a rout... 
[20:34:27] (03PS1) 10JHathaway: mx-in: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406) [20:34:40] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:35:47] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64891 and previous config saved to /var/cache/conftool/dbconfig/20240613-203708-marostegui.json [20:37:13] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:38:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:30] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466 (10CDanis) 03NEW [20:40:55] (03CR) 10JHathaway: [C:03+2] mx-in: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:42:58] (03Merged) 10jenkins-bot: Look for iPadOS in user-agent, in addition to iOS. 
[extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723) (owner: 10Dbrant) [20:43:00] (03Merged) 10jenkins-bot: Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery) [20:43:23] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890786 (10CDanis) [20:44:03] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye [20:44:07] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890788 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**P... [20:44:13] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890789 (10CDanis) [20:45:58] Pppery: it looks like your backport comes with some other unexpected commits due to submodules [20:46:24] Sorry, I have no idea what that means [20:48:02] basically, it looks like it has a submodule update from master instead of the release branch [20:48:46] (03PS3) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) [20:49:04] I just used Gerrit's cherry-pick option in the UI. 
I had no idea that the translate extension even had submodules [20:49:28] (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [20:50:21] Subproject commit b085c3259dd6e36c16a8149767ba841b5d597d9a [20:50:32] !log cdobbins@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.eqsin.wmnet [20:51:00] (03PS4) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) [20:51:12] That doesn't make sense. b085c3259dd6e36c16a8149767ba841b5d597d9a is the hash of my commit [20:51:30] https://phabricator.wikimedia.org/rMWfe91de424bd1f20936fd48f2bc3e7321e65f46a7 [20:52:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P64892 and previous config saved to /var/cache/conftool/dbconfig/20240613-205215-marostegui.json [20:52:19] That commit updates the pointer for translate in the branch of the mediawiki/core repo from the version before my commit to the version after my commit. That looks right [20:52:48] And that's what should be deployed, right? [20:52:49] now that looks like dbrant: [20:53:02] https://phabricator.wikimedia.org/rMWa436c8f2782830b36c2244546f219a9cc964dd15 [20:53:30] Yep, that's the same submodule update for dbrant's MobileApp commit. I'm not seeing the problem here [20:53:48] I also just used the cherry-pick feature in gerrit. 
[20:54:15] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:54:51] I wonder if the deployments got put together because of gate-and-submit finishing simultaneously [20:55:47] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:55:47] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002 [20:55:47] scap is giving me a warning [20:55:48] `20:44:37 There were unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.43.0-wmf.9` [20:55:54] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [20:56:49] I think if I had scap deployed both changes together this may have been the expected result without the warning [20:57:00] but I'm not super confident about moving forward [20:59:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:59:51] cjming: could you advise? 
[21:00:20] So it sounds like commits have been merged in the wmf.9 branch of other extensions [21:00:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:38] yes, I kicked off a +2 on another patch to be backported [21:00:43] Those are also going to be deployed together with the change that you were planning to deploy, which might not have been what you expected [21:00:46] it finished in the middle of the first patch [21:00:58] It's OK to do that as long as you know that it's happening and you have the person who requested the deploy test it etc [21:01:04] which seems like the right outcome [21:01:06] hi hi - yes i've encountered that before - not sure if it's the right decision but i've plowed ahead [21:01:07] (03PS1) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [21:01:11] good good [21:01:24] thanks guys! [21:01:25] what Roan said [21:01:29] Yeah if the change that the scap tool is complaining about is one you know about and are comfortable deploying, then move forward [21:01:43] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. (T362723)]] [21:01:48] T362723: Data Validation for iOS Image Recs - https://phabricator.wikimedia.org/T362723 [21:02:01] I always have a freeze response when I see a warning about sub modules [21:02:07] same! [21:02:13] The tool has this feature to warn you in the scenario where someone +2s a patch (most commonly in mw-config) and never deploys it, then you "scap deploy" another patch, and end up deploying a completely unrelated change along with yours [21:02:20] (03CR) 10Scott French: "Thanks, Tobias! 
That's a good point about the routing." [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [21:02:31] but it looks like the extensions are submodules in the deployment repo, which makes sense [21:02:38] That 's exactly right [21:02:40] (03CR) 10Scott French: [C:03+2] kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [21:03:26] And there's a magic feature in Gerrit where, when a patch is merged in the wmf.9 branch of e.g. the Translate extension, a commit is automatically created and merged in the wmf.9 branch of core updating the submodule for Translate. https://phabricator.wikimedia.org/rMWfe91de424bd1f20936fd48f2bc3e7321e65f46a7 is one of those automagic update commits [21:03:34] !log changing BGP aggregate contribution policy / external route announcement cr2-eqord (T367439) [21:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:39] T367439: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 [21:03:51] (03Merged) 10jenkins-bot: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [21:04:01] That way the submodules in the deployment branch of core always stay in sync with the deployment branches of the extensions [21:04:04] !log changing BGP aggregate contribution policy / external route announcement cr2-eqdfw (T367439) [21:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:09] !log jsn@deploy1002 dbrant, jsn: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. 
(T362723)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:04:15] TIL [21:04:16] sorry for the delay dbrant: and Pppery: please test [21:04:19] On it [21:04:58] First of two things my patch did looks good. Still testing the second [21:05:05] mine looks good! [21:05:34] RoanKattouw: That makes sense and explains why we do reverts when we do [21:06:07] (03PS2) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [21:06:11] How does it explain reverts? [21:06:15] dbrant: thanks! These are rolling together, so we'll wait to hear from Pppery: [21:06:56] Second of two things looks good as well. Proceed [21:07:02] !log jsn@deploy1002 dbrant, jsn: Continuing with sync [21:07:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P64893 and previous config saved to /var/cache/conftool/dbconfig/20240613-210723-marostegui.json [21:08:41] If you deploy an extension change and it tests bad, you can't just stop the sync, you have to revert the change too. I know this is super obvious when I think about it, but scap abstracts things quite a bit. 
[21:09:09] Oh right yes because it's already merged in the deployment branch [21:09:21] So even if you didn't sync it and just left it there, it would be a nasty surprise for the next deployer [21:09:33] ^ [21:13:45] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:27] jouncebot next [21:15:27] In 8 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0600) [21:15:34] jouncebot now [21:15:34] No deployments scheduled for the next 8 hour(s) and 44 minute(s) [21:15:55] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. (T362723)]] (duration: 14m 11s) [21:15:59] T362723: Data Validation for iOS Image Recs - https://phabricator.wikimedia.org/T362723 [21:16:23] Pppery: & dbrant: y'all should be good; I'm going to pull in our config change too, since there's nothing else happening [21:16:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [21:17:14] thx! 
[21:17:31] (03Merged) 10jenkins-bot: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [21:17:49] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]] [21:17:53] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [21:18:03] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890880 (10cmooney) I've pushed this change to cr2-eqdfw and it seems to be doing what we need there: Codfw /48 is announced to Facebook: ` cmoo... [21:18:16] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:18:42] I'm sorry. I've attended four backport windows and every time something went uniquely wrong [21:19:44] Pppery: no worries! This was just my inexperience with scap. Nothing went wrong here. 
[21:19:55] Thanks [21:20:16] !log jsn@deploy1002 jsn, kgraessle: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:20:44] testing [21:22:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64894 and previous config saved to /var/cache/conftool/dbconfig/20240613-212230-marostegui.json [21:22:36] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:23:23] !log jsn@deploy1002 jsn, kgraessle: Continuing with sync [21:23:37] looks good, surveys live on all 4 wikis [21:23:58] (on the debug host) [21:25:47] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:28:08] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890909 (10cmooney) I'm monitoring the change in traffic levels. Right now it seems negligible, however that is not much surprise, prior to the... 
[21:29:21] (03PS1) 10JHathaway: postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) [21:29:34] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [21:30:32] (03PS1) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) [21:32:07] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]] (duration: 14m 18s) [21:32:08] (03PS2) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) [21:32:12] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [21:33:38] (03PS3) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) [21:33:53] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002 [21:33:56] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [21:34:23] (03PS2) 10JHathaway: postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) [21:34:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [21:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:35:24] (03PS3) 10Cathal 
Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [21:38:15] (03CR) 10JHathaway: [C:03+2] postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [21:39:28] (03PS1) 10Dzahn: idp: remove gitlab from the CAS protocol section [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) [21:42:15] (03PS2) 10Dzahn: idp: remove gitlab from the CAS protocol section [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) [21:42:39] (03PS4) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [21:44:19] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890956 (10cmooney) Just to note that for the same time period (since March 5th) we've not been announcing the codfw aggregates from eqord: ` cmo... 
[21:56:38] (03PS13) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) [21:59:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:00] (03PS1) 10JHathaway: mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T189655) [22:00:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T189655) (owner: 10JHathaway) [22:00:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:58] (03PS2) 10JHathaway: mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406) [22:05:28] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [22:07:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [22:07:48] (03PS1) 10JHathaway: vrts_aliases: use keyword params [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406) [22:08:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [22:10:18] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.97 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:12:33] (03CR) 10JHathaway: [C:03+2] vrts_aliases: use keyword params [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [22:12:45] (03CR) 10JHathaway: [C:03+2] mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [22:18:18] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 58.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:00] (03PS1) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043275 (https://phabricator.wikimedia.org/T363001) [22:30:47] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:45] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:28] (03PS1) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) [22:35:47] RESOLVED: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:37:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED [22:37:41] (03PS12) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [22:39:19] (03PS5) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) [22:39:33] jouncebot: nowandnext [22:39:33] No deployments scheduled for the next 7 hour(s) and 20 minute(s) [22:39:33] In 7 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0600) [22:40:17] (03PS2) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) [22:40:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [22:42:35] (03CR) 10Zabe: [C:03+2] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [22:43:15] (03Merged) 10jenkins-bot: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [22:46:38] (03PS3) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) [22:46:54] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, 
cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are m [22:46:54] n but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:46:54] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet are m [22:46:54] n but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:46:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:04] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [22:47:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [22:47:54] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK 
- All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:47:56] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:47:56] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset 0.000438846 secs https://wikitech.wikimedia.org/wiki/NTP [22:47:57] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:49:02] !log create plwiki sysop wiki # T361041 [22:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:08] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [22:49:14] (03PS2) 10EoghanGaffney: lists: Switch mailman_root for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043127 [22:50:46] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2927/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [22:51:57] RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:52:57] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:56:56] (03PS1) 10Zabe: Fully disable local uploads on 
sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) [22:57:49] (03PS2) 10Zabe: Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) [22:57:55] (03CR) 10Zabe: [C:03+2] Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [22:58:36] (03Merged) 10jenkins-bot: Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [22:59:19] !log zabe@deploy1002 Started scap: T361041 [22:59:23] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [23:01:53] !log zabe@deploy1002 zabe: T361041 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:02:39] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002 [23:02:43] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [23:06:19] !log zabe@deploy1002 Sync cancelled. 
[23:07:16] (03PS1) 10Zabe: multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041) [23:07:45] (03CR) 10Zabe: [C:03+2] multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [23:08:24] (03Merged) 10jenkins-bot: multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe) [23:08:52] !log zabe@deploy1002 Started scap: T361041 [23:08:57] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [23:10:01] (03PS1) 10Bking: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) [23:11:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking) [23:13:13] (03PS2) 10Bking: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) [23:13:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking) [23:17:15] !log removing 9 files for legal compliance [23:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:28] !log zabe@deploy1002 Finished scap: T361041 (duration: 11m 36s) [23:20:33] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [23:23:35] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=sysop_plwiki --cluster=all 2>&1 | tee /tmp/sysop_plwiki.UpdateSearchIndexConfig.log # T361041 [23:23:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043309 [23:38:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043309 (owner: 10TrainBranchBot) [23:44:05] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311 [23:44:05] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311 (owner: 10Zabe) [23:44:45] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311 (owner: 10Zabe) [23:45:26] !log zabe@deploy1002 Started scap: T361041, [[gerrit:1043311|Update interwiki cache]] [23:45:30] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041 [23:48:06] !log removing 7 files for legal compliance [23:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:00] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9891246 (10Papaul) We Will be going on site this Monday, June 17th at 11am to work with Equinix team on fixing this issue. @cmooney will be depooling the site. [23:56:33] !log zabe@deploy1002 Finished scap: T361041, [[gerrit:1043311|Update interwiki cache]] (duration: 11m 07s) [23:56:37] T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041