[00:03:31] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1042419 (owner: TrainBranchBot)
[00:04:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P64752 and previous config saved to /var/cache/conftool/dbconfig/20240613-000430-ladsgroup.json
[00:19:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P64753 and previous config saved to /var/cache/conftool/dbconfig/20240613-001937-ladsgroup.json
[00:24:03] <wikibugs>	 (PS1) NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673)
[00:25:26] <wikibugs>	 (PS1) Jdlrobson: Enable dark mode on more pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378)
[00:25:45] <wikibugs>	 (CR) Jdlrobson: [C:-1] "Cannot be deployed prior to 20th June (currently)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: Jdlrobson)
[00:29:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: NMW03)
[00:34:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T352010)', diff saved to https://phabricator.wikimedia.org/P64754 and previous config saved to /var/cache/conftool/dbconfig/20240613-003444-ladsgroup.json
[00:34:47] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:34:49] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:35:00] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[00:35:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64755 and previous config saved to /var/cache/conftool/dbconfig/20240613-003507-ladsgroup.json
[00:42:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[00:42:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[00:42:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64756 and previous config saved to /var/cache/conftool/dbconfig/20240613-004247-marostegui.json
[00:42:52] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[00:58:13] <wikibugs>	 (PS1) Scott French: mediawiki: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978)
[01:09:01] <wikibugs>	 (CR) Scott French: "Hi Janis - I think this should achieve what we talked about earlier today, as long as my understanding is not wildly off :) Thanks in adva" [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[01:16:24] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 47 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:45:00] <icinga-wm_>	 PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100%
[01:50:02] <icinga-wm_>	 RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[02:10:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:26] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:24] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:24] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:55:22] <wikibugs>	 ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887335 (Papaul) I create ticket # 1-235341265861 requesting Equinix to check the breaker on the feed where PEM0 is connected.
[02:55:46] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:29] <wikibugs>	 (PS1) BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - https://gerrit.wikimedia.org/r/1042490
[03:49:46] <jinxer-wm>	 FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[03:54:12] <wikibugs>	 (PS2) BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - https://gerrit.wikimedia.org/r/1042490
[04:10:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:15:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:23:03] <wikibugs>	 ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887379 (Papaul) Technician Note Equinix Support , Jun/12/2024 22:28 The site has investigated customer equipment 2016250 Juniper in cabinet 504. All power indicators are green. The only al...
[04:24:07] <wikibugs>	 ops-eqdfw, SRE, DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9887380 (Papaul) Reopen Note Papaul Tshibamba , Jun/12/2024 23:19 Thank you for checking this yes indeed all the power indicators are green but we are not getting enough power on PEM 0 that...
[04:25:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:27:06] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2024-06-12-111204-production [deployment-charts] - https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563)
[04:30:15] <jinxer-wm>	 RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:32:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367146
[04:32:27] <stashbot>	 T367146: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T367146
[04:32:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1183 with weight 0 T367146', diff saved to https://phabricator.wikimedia.org/P64757 and previous config saved to /var/cache/conftool/dbconfig/20240613-043239-root.json
[04:32:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T367146
[04:33:42] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1041535 (https://phabricator.wikimedia.org/T367146) (owner: Gerrit maintenance bot)
[04:34:12] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1041536 (https://phabricator.wikimedia.org/T367146)
[04:38:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:38:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:38:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:38:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:38:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64758 and previous config saved to /var/cache/conftool/dbconfig/20240613-043848-marostegui.json
[04:38:53] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[04:39:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:42:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64759 and previous config saved to /var/cache/conftool/dbconfig/20240613-044201-marostegui.json
[04:44:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:51:08] <marostegui>	 !log Starting s5 eqiad failover from db1230 to db1183 - T367146
[04:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:12] <stashbot>	 T367146: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T367146
[04:51:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T367146', diff saved to https://phabricator.wikimedia.org/P64760 and previous config saved to /var/cache/conftool/dbconfig/20240613-045121-root.json
[04:51:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1183 to s5 primary and set section read-write T367146', diff saved to https://phabricator.wikimedia.org/P64761 and previous config saved to /var/cache/conftool/dbconfig/20240613-045141-root.json
[04:52:12] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1041536 (https://phabricator.wikimedia.org/T367146) (owner: Gerrit maintenance bot)
[04:52:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 T367146', diff saved to https://phabricator.wikimedia.org/P64762 and previous config saved to /var/cache/conftool/dbconfig/20240613-045254-root.json
[04:53:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:54:37] <wikibugs>	 (PS1) Marostegui: db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1042573
[04:54:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Long schema change
[04:54:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Long schema change
[04:55:20] <marostegui>	 !log dbmaint eqiad s5 deploy schema change on db1230 T364299
[04:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:24] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:55:32] <wikibugs>	 (CR) Marostegui: [C:+2] db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1042573 (owner: Marostegui)
[04:57:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64763 and previous config saved to /var/cache/conftool/dbconfig/20240613-045709-marostegui.json
[04:58:15] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:03:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:12:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64764 and previous config saved to /var/cache/conftool/dbconfig/20240613-051204-marostegui.json
[05:12:09] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:12:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64765 and previous config saved to /var/cache/conftool/dbconfig/20240613-051216-marostegui.json
[05:23:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64766 and previous config saved to /var/cache/conftool/dbconfig/20240613-052344-ladsgroup.json
[05:23:49] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:27:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P64767 and previous config saved to /var/cache/conftool/dbconfig/20240613-052711-marostegui.json
[05:27:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367261)', diff saved to https://phabricator.wikimedia.org/P64768 and previous config saved to /var/cache/conftool/dbconfig/20240613-052723-marostegui.json
[05:27:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:27:29] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:27:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:27:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64769 and previous config saved to /var/cache/conftool/dbconfig/20240613-052746-marostegui.json
[05:30:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64770 and previous config saved to /var/cache/conftool/dbconfig/20240613-053052-marostegui.json
[05:31:47] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1238 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1042595 (https://phabricator.wikimedia.org/T367378)
[05:31:51] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378)
[05:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:37:33] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2024-06-13-045621-production [deployment-charts] - https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122)
[05:38:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P64771 and previous config saved to /var/cache/conftool/dbconfig/20240613-053851-ladsgroup.json
[05:42:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P64772 and previous config saved to /var/cache/conftool/dbconfig/20240613-054218-marostegui.json
[05:46:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64773 and previous config saved to /var/cache/conftool/dbconfig/20240613-054600-marostegui.json
[05:47:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: Long schema change
[05:47:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: Long schema change
[05:53:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:53:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P64774 and previous config saved to /var/cache/conftool/dbconfig/20240613-055358-ladsgroup.json
[05:57:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T364069)', diff saved to https://phabricator.wikimedia.org/P64775 and previous config saved to /var/cache/conftool/dbconfig/20240613-055725-marostegui.json
[05:57:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:57:32] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:57:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance
[05:57:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64776 and previous config saved to /var/cache/conftool/dbconfig/20240613-055747-marostegui.json
[05:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600).
[06:01:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64777 and previous config saved to /var/cache/conftool/dbconfig/20240613-060107-marostegui.json
[06:01:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:03:30] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:47] <effie>	 jouncebot: now
[06:05:47] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:05:47] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[06:06:01] <effie>	 jouncebot: next
[06:06:01] <jouncebot>	 In 0 hour(s) and 53 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0700)
[06:09:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64778 and previous config saved to /var/cache/conftool/dbconfig/20240613-060905-ladsgroup.json
[06:09:08] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[06:09:10] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:09:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64779 and previous config saved to /var/cache/conftool/dbconfig/20240613-060927-ladsgroup.json
[06:13:40] <wikibugs>	 (PS1) Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657
[06:13:47] <wikibugs>	 (CR) CI reject: [V:-1] statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657 (owner: Giuseppe Lavagetto)
[06:14:19] <wikibugs>	 (PS2) Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1042657
[06:16:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367261)', diff saved to https://phabricator.wikimedia.org/P64780 and previous config saved to /var/cache/conftool/dbconfig/20240613-061613-marostegui.json
[06:16:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[06:16:18] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:16:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[06:16:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64781 and previous config saved to /var/cache/conftool/dbconfig/20240613-061636-marostegui.json
[06:19:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64782 and previous config saved to /var/cache/conftool/dbconfig/20240613-061948-marostegui.json
[06:24:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:27:05] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[06:29:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:34:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64783 and previous config saved to /var/cache/conftool/dbconfig/20240613-063455-marostegui.json
[06:36:22] <wikibugs>	 SRE-OnFire, cloud-services-team, Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9887501 (dcaro)
[06:38:11] <wikibugs>	 (PS1) Marostegui: Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1042726
[06:38:35] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1042726 (owner: Marostegui)
[06:39:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64784 and previous config saved to /var/cache/conftool/dbconfig/20240613-063934-root.json
[06:40:44] <wikibugs>	 (CR) Muehlenhoff: "For production all descriptions are set via profile:base::production ::role_description, but for Cloud VPS this doesn't seem very useful: " [puppet] - https://gerrit.wikimedia.org/r/1040123 (owner: Muehlenhoff)
[06:40:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Cloud VPS-specific roles [puppet] - https://gerrit.wikimedia.org/r/1040123 (owner: Muehlenhoff)
[06:42:13] <moritzm>	 !log rebalance ganeti clusters in eqiad following reboots
[06:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:52] <wikibugs>	 (PS1) Marostegui: db1187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1042800
[06:45:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:45:20] <wikibugs>	 (CR) Marostegui: [C:+2] db1187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1042800 (owner: Marostegui)
[06:47:39] <wikibugs>	 (PS1) Marostegui: db1125: Typo [puppet] - https://gerrit.wikimedia.org/r/1042804
[06:49:30] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:49:55] <wikibugs>	 (CR) Marostegui: [C:+2] db1125: Typo [puppet] - https://gerrit.wikimedia.org/r/1042804 (owner: Marostegui)
[06:50:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64785 and previous config saved to /var/cache/conftool/dbconfig/20240613-065002-marostegui.json
[06:50:58] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9887525 (Marostegui) @Jhancock.wm reminder, we do not need AAAA records on these hosts.
[06:52:27] <wikibugs>	 (PS1) Marostegui: site.pp: New dbproxies [puppet] - https://gerrit.wikimedia.org/r/1042820 (https://phabricator.wikimedia.org/T362824)
[06:52:43] <wikibugs>	 (Abandoned) Marostegui: site.pp: New dbproxies [puppet] - https://gerrit.wikimedia.org/r/1042820 (https://phabricator.wikimedia.org/T362824) (owner: Marostegui)
[06:54:02] <wikibugs>	 (PS1) Marostegui: site.pp: New dbproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1042822 (https://phabricator.wikimedia.org/T362824)
[06:54:24] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9887533 (Marostegui)
[06:54:28] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: New dbproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1042822 (https://phabricator.wikimedia.org/T362824) (owner: Marostegui)
[06:54:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64786 and previous config saved to /var/cache/conftool/dbconfig/20240613-065439-root.json
[06:55:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:46] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1041735 (owner: 10Dzahn)
[06:59:20] <wikibugs>	 (03PS1) 10Marostegui: regex.yaml: Add dbproxy codfw [puppet] - 10https://gerrit.wikimedia.org/r/1042825
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0700). Please do the needful.
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:05:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T367261)', diff saved to https://phabricator.wikimedia.org/P64787 and previous config saved to /var/cache/conftool/dbconfig/20240613-070509-marostegui.json
[07:05:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:05:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1042825 (owner: 10Marostegui)
[07:05:17] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:05:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:05:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64788 and previous config saved to /var/cache/conftool/dbconfig/20240613-070531-marostegui.json
[07:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64789 and previous config saved to /var/cache/conftool/dbconfig/20240613-070837-marostegui.json
[07:08:47] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] regex.yaml: Add dbproxy codfw [puppet] - 10https://gerrit.wikimedia.org/r/1042825 (owner: 10Marostegui)
[07:09:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64790 and previous config saved to /var/cache/conftool/dbconfig/20240613-070944-root.json
[07:09:51] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:13:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:51] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:14:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] profile::maps::tlsproxy: Unconditionally use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[07:15:16] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede)
[07:16:50] <wikibugs>	 (03Merged) 10jenkins-bot: Fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede)
[07:21:20] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[07:21:50] <wikibugs>	 (03PS1) 10Brouberol: spark-operator: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978)
[07:22:19] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Make sretest1001 a Cumin node for a test [puppet] - 10https://gerrit.wikimedia.org/r/998930 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff)
[07:23:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64791 and previous config saved to /var/cache/conftool/dbconfig/20240613-072344-marostegui.json
[07:24:37] <icinga-wm_>	 RECOVERY - Host an-worker1085 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[07:24:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64792 and previous config saved to /var/cache/conftool/dbconfig/20240613-072450-root.json
[07:25:59] <kart_>	 marostegui: OK to deploy cxserver/MinT?
[07:26:19] <marostegui>	 kart_: go for it!
[07:26:31] <kart_>	 cool.
[07:27:03] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-06-12-111204-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563) (owner: 10KartikMistry)
[07:27:46] <wikibugs>	 (03Merged) 10jenkins-bot: Update MinT to 2024-06-12-111204-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042541 (https://phabricator.wikimedia.org/T363563) (owner: 10KartikMistry)
[07:28:34] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[07:30:41] <wikibugs>	 (03PS1) 10Muehlenhoff: tlsproxy::localssl: Remove support for cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750)
[07:32:03] <kart_>	 "add securityContext to all containers" - is it OK to deploy?
[07:33:28] <kart_>	 OK. I'll wait for someone to check it then deploy mint/cxserver later.
[07:34:04] <effie>	 kart_: let me check the commit, but it is alright
[07:34:05] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[07:34:24] <wikibugs>	 (03PS4) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587
[07:34:36] <kart_>	 effie: OK. Please let me know. Seems added in all services.
[07:34:38] <wikibugs>	 (03CR) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse)
[07:34:52] <effie>	 yes it is 
[07:36:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174#9887576 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this, the new established procedure is that all servers which get mo...
[07:38:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64793 and previous config saved to /var/cache/conftool/dbconfig/20240613-073851-marostegui.json
[07:39:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64794 and previous config saved to /var/cache/conftool/dbconfig/20240613-073955-root.json
[07:43:36] <wikibugs>	 (03PS2) 10Phedenskog: wmftest: Add new Graphite instance for performance test data. [dns] - 10https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669)
[07:43:38] <effie>	 kart_: I don't see anything in the diff apart from the chart version 
[07:43:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you for the extensive comments/guide" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis)
[07:44:40] <effie>	 kart_: shall I deploy?
[07:47:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] prometheus::blackbox_exporter: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1013074 (owner: 10Muehlenhoff)
[07:47:45] <wikibugs>	 (03PS3) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449
[07:49:30] <jayme>	 kart_, effie: I think that has long been deployed to cxserver. What you might see is a chart version bump because of an updated helm test that does change the deployment
[07:49:46] <jinxer-wm>	 FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[07:51:36] <effie>	 jayme: I saw the log etc, I am just wondering what kart_ saw 
[07:52:00] <kart_>	 effie: Sorry, was a bit afk.
[07:52:18] <effie>	 jayme: because securityContext on cxserver was deployed in May
[07:52:24] <kart_>	 effie: I was looking at machinetranslation (mint) service first.
[07:52:36] <effie>	 ah let me check there too, I was checking cxserver
[07:52:50] <kart_>	 effie: I have yet to merge the patch for cxserver.
[07:53:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64795 and previous config saved to /var/cache/conftool/dbconfig/20240613-075358-marostegui.json
[07:54:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[07:54:03] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:54:07] <effie>	 kart_: go ahead
[07:54:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[07:54:14] <kart_>	 Thanks!
[07:54:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T367261)', diff saved to https://phabricator.wikimedia.org/P64796 and previous config saved to /var/cache/conftool/dbconfig/20240613-075420-marostegui.json
[07:54:25] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:55:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64797 and previous config saved to /var/cache/conftool/dbconfig/20240613-075500-root.json
[07:56:24] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[07:57:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:57:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367261)', diff saved to https://phabricator.wikimedia.org/P64798 and previous config saved to /var/cache/conftool/dbconfig/20240613-075727-marostegui.json
[07:59:00] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:59:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] wmftest: Add new Graphite instance for performance test data. [dns] - 10https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog)
[08:02:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:02:58] <wikibugs>	 (03PS3) 10Slyngshede: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261
[08:03:32] <wikibugs>	 (03CR) 10Slyngshede: Replace development server with uWSGI. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede)
[08:03:52] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:03:57] <wikibugs>	 (03CR) 10Majavah: [C:04-1] profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[08:04:51] <wikibugs>	 (03CR) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[08:05:14] <icinga-wm_>	 PROBLEM - MariaDB Replica SQL: s2 on db2125 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:05:38] <arnaudb>	 depooling ↑
[08:06:17] <wikibugs>	 (03PS4) 10Slyngshede: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261
[08:06:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[08:06:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'index error depool db2125', diff saved to https://phabricator.wikimedia.org/P64799 and previous config saved to /var/cache/conftool/dbconfig/20240613-080624-arnaudb.json
[08:06:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logstash: add auto_offset_reset to kafka input [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710)
[08:06:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logstash: consume k8s logs topics [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710)
[08:06:42] <wikibugs>	 (03PS4) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449
[08:07:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:08:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2919/co" [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[08:08:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db2125.codfw.wmnet with reason: index issue
[08:08:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db2125.codfw.wmnet with reason: index issue
[08:09:50] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:10:00] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[08:10:14] <icinga-wm_>	 RECOVERY - MariaDB Replica SQL: s2 on db2125 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:11:16] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[08:11:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64800 and previous config saved to /var/cache/conftool/dbconfig/20240613-081138-arnaudb.json
[08:12:15] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:12:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64801 and previous config saved to /var/cache/conftool/dbconfig/20240613-081234-marostegui.json
[08:12:56] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:13:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:28] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:13:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:13:54] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[08:14:01] <wikibugs>	 (03CR) 10Klausman: [C:03+1] "Looks good to me. There is a bit of a question of alert routing (for k8s-ml aka LiftWing, the general SRE team isn't the first line of def" [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[08:14:42] <wikibugs>	 (03PS1) 10Phedenskog: wmftest: Remove old performance team setup. [dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669)
[08:14:51] <wikibugs>	 (03CR) 10JMeybohm: hemlfile: export admin-ng pending diff metrics hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:15:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[08:15:26] <wikibugs>	 (03CR) 10Phedenskog: [C:04-1] "I want to wait with this until we have seen that the new Graphite setup is working. When that's done, this cleanup can be done." [dns] - 10https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: 10Phedenskog)
[08:15:31] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:15:50] <wikibugs>	 (03PS7) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894)
[08:15:50] <wikibugs>	 (03PS2) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894)
[08:19:18] <wikibugs>	 (03CR) 10Majavah: [C:03+1] profile::openstack::base::designate::service: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[08:20:30] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede)
[08:21:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:22:08] <wikibugs>	 (03Merged) 10jenkins-bot: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede)
[08:25:07] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[08:26:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64802 and previous config saved to /var/cache/conftool/dbconfig/20240613-082643-arnaudb.json
[08:27:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:27:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636)
[08:27:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64803 and previous config saved to /var/cache/conftool/dbconfig/20240613-082741-marostegui.json
[08:27:51] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636)
[08:29:19] <kart_>	 !log Updated MinT to 2024-06-12-111204-production (T363563)
[08:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:23] <stashbot>	 T363563: Avoid references losing their data (showing as plain-text "[1]") when added to the translation using MinT - https://phabricator.wikimedia.org/T363563
[08:29:35] <effie>	 jouncebot: nex
[08:29:37] <effie>	 jouncebot: next
[08:29:37] <jouncebot>	 In 1 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[08:29:42] <effie>	 jouncebot: now
[08:29:42] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 30 minute(s)
[08:30:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:32:15] <jinxer-wm>	 RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:32:55] <_joe_>	 effie: do you need to deploy mediawiki, or are you just doing reboots?
[08:33:24] <_joe_>	 because if it's the latter, I will do some hacks to mw-debug
[08:34:57] <effie>	 reboots
[08:36:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:36:47] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[08:37:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:37:20] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636)
[08:39:36] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9887671 (10akosiaris)
[08:40:42] <wikibugs>	 (03CR) 10Muehlenhoff: purged: set use_pki to true in magru (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[08:40:46] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:41:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64804 and previous config saved to /var/cache/conftool/dbconfig/20240613-084149-arnaudb.json
[08:42:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove no longer used parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636)
[08:42:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367261)', diff saved to https://phabricator.wikimedia.org/P64805 and previous config saved to /var/cache/conftool/dbconfig/20240613-084248-marostegui.json
[08:42:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[08:42:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff)
[08:42:52] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[08:43:00] <wikibugs>	 (03CR) 10Btullis: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[08:43:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[08:43:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[08:43:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64806 and previous config saved to /var/cache/conftool/dbconfig/20240613-084310-marostegui.json
[08:46:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64807 and previous config saved to /var/cache/conftool/dbconfig/20240613-084615-marostegui.json
[08:46:47] <wikibugs>	 (03PS3) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894)
[08:48:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] "Uninformed LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis)
[08:48:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:36] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[08:51:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[08:52:13] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[08:56:03] <wikibugs>	 (03CR) 10Btullis: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[08:56:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64808 and previous config saved to /var/cache/conftool/dbconfig/20240613-085654-arnaudb.json
[08:57:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Deploy calico network policy templates to all datahub charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041676 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[08:58:22] <wikibugs>	 (03CR) 10Brouberol: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[08:59:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[08:59:26] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[09:01:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64809 and previous config saved to /var/cache/conftool/dbconfig/20240613-090122-marostegui.json
[09:02:07] <wikibugs>	 (03CR) 10Btullis: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:03:35] <wikibugs>	 (03CR) 10JMeybohm: "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:04:13] <wikibugs>	 (03PS1) 10Jelto: aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382)
[09:05:11] <wikibugs>	 (03CR) 10JMeybohm: "Hm...gerrit formatted stuff." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:06:27] <wikibugs>	 (03CR) 10Brouberol: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:07:40] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:07:48] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[09:08:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:33] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto)
[09:08:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9887749 (10Clement_Goubert) @VRiley-WMF Do you object to us reusing that task by reopening it whenever we have a batch of servers to relabel, or would you rathe...
[09:09:49] <wikibugs>	 (03CR) 10Jelto: [C:03+2] aptrepo: bump gitlab and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto)
[09:10:08] <wikibugs>	 (03PS2) 10Jelto: aptrepo: bump gitlab-runner and gitlab-ce to 16.11 [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382)
[09:12:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64810 and previous config saved to /var/cache/conftool/dbconfig/20240613-091200-arnaudb.json
[09:12:56] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:13:02] <wikibugs>	 (03PS1) 10Klausman: golang: Add version 1.22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948
[09:13:03] <wikibugs>	 (03CR) 10Klausman: "Feel free to redirect to a different reviewer" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman)
[09:14:43] <wikibugs>	 (03CR) 10Klausman: "Confirmed working:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman)
[09:15:50] <wikibugs>	 (03CR) 10Brouberol: datahub: update datahubsearch hostname to use external-services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:15:55] <wikibugs>	 (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1042947 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto)
[09:16:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64811 and previous config saved to /var/cache/conftool/dbconfig/20240613-091629-marostegui.json
[09:16:42] <wikibugs>	 (03PS7) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[09:16:42] <wikibugs>	 (03PS1) 10Brouberol: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423)
[09:17:00] <wikibugs>	 (03Abandoned) 10Brouberol: datahub: update datahubsearch hostname to use external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041671 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:17:49] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[09:18:42] <wikibugs>	 (03PS8) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[09:19:44] <wikibugs>	 (03PS9) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[09:22:18] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[09:22:34] <_joe_>	 jouncebot: now
[09:22:34] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 37 minute(s)
[09:22:40] <_joe_>	 jouncebot: next
[09:22:41] <jouncebot>	 In 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[09:22:50] <_joe_>	 ok I'll go a little early
[09:24:15] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9887824 (10taavi)
[09:26:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1006.eqiad.wmnet
[09:26:57] <claime>	 (these kafka nodes are insetup, no worries)
[09:29:57] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Remove no longer used parsoid certs [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris)
[09:31:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367261)', diff saved to https://phabricator.wikimedia.org/P64812 and previous config saved to /var/cache/conftool/dbconfig/20240613-093136-marostegui.json
[09:31:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[09:31:41] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[09:31:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[09:31:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64813 and previous config saved to /var/cache/conftool/dbconfig/20240613-093158-marostegui.json
[09:32:45] <wikibugs>	 (03PS2) 10Brouberol: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423)
[09:32:46] <wikibugs>	 (03PS10) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[09:32:46] <wikibugs>	 (03PS1) 10Brouberol: datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423)
[09:32:49] <wikibugs>	 (03PS1) 10DCausse: wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - 10https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069)
[09:32:57] <wikibugs>	 (03PS1) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966
[09:33:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1006.eqiad.wmnet
[09:33:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1007.eqiad.wmnet
[09:33:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková)
[09:33:41] <wikibugs>	 (03CR) 10Muehlenhoff: "parse1001 and parse2001 are still pooled for the parsoid-php service, will that cause any issues?" [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris)
[09:34:18] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[09:34:40] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131 (owner: 10Brouberol)
[09:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:34:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete mwmaint.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1042922 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff)
[09:34:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64814 and previous config saved to /var/cache/conftool/dbconfig/20240613-093455-marostegui.json
[09:35:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] spark-operator: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042838 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[09:35:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] rdf-streaming-updater: remove from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131 (owner: 10Brouberol)
[09:35:34] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Remove mw2289.codfw.wmnet from scap::proxies for decom [puppet] - 10https://gerrit.wikimedia.org/r/1042200 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert)
[09:35:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Remove noisy monitor that brings no value [alerts] - 10https://gerrit.wikimedia.org/r/1039627 (owner: 10Brouberol)
[09:35:58] <claime>	 jouncebot nowandnext
[09:35:58] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 24 minute(s)
[09:35:58] <jouncebot>	 In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[09:36:15] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Remove mw2289.codfw.wmnet from scap::proxies for decom [puppet] - 10https://gerrit.wikimedia.org/r/1042200 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert)
[09:36:20] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042201 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert)
[09:37:15] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:37:33] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:37:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "Good catch. Yeah, we need to remove them first. I got a task at https://phabricator.wikimedia.org/T359387" [puppet] - 10https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: 10Alexandros Kosiaris)
[09:38:06] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[09:38:17] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[09:38:50] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Remove noisy monitor that brings no value [alerts] - 10https://gerrit.wikimedia.org/r/1039627 (owner: 10Brouberol)
[09:39:13] <logmsgbot>	 !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2001.codfw.wmnet
[09:39:29] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1007.eqiad.wmnet
[09:39:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1008.eqiad.wmnet
[09:40:03] <wikibugs>	 (03CR) 10Brouberol: datahub: replace IPs by Services in network policies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:40:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:41:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Got it. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:41:56] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:42:55] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková)
[09:43:35] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042201 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert)
[09:43:39] <wikibugs>	 (03PS2) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966
[09:44:51] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková)
[09:45:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1008.eqiad.wmnet
[09:45:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1009.eqiad.wmnet
[09:46:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl2003.codfw.wmnet
[09:46:02] <wikibugs>	 (03PS1) 10Hashar: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029)
[09:47:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2281,2283-2286].codfw.wmnet
[09:47:35] <wikibugs>	 (03PS3) 10Kamila Součková: Revert "Add wikikube-ctrl2001 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042966
[09:48:59] <wikibugs>	 (03CR) 10Kamila Součková: [C:04-1] "I messed up and will start with 2003" [dns] - 10https://gerrit.wikimedia.org/r/1042966 (owner: 10Kamila Součková)
[09:49:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:49:56] <icinga-wm_>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:50:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64815 and previous config saved to /var/cache/conftool/dbconfig/20240613-095002-marostegui.json
[09:50:05] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:50:18] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: fix label matching beetween pods and networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042964 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:50:48] <logmsgbot>	 !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2003.eqiad.wmnet
[09:50:58] <logmsgbot>	 !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2001.eqiad.wmnet
[09:51:03] <wikibugs>	 (03Merged) 10jenkins-bot: datahub-next: restore IP-based networkpolicy to datahubsearch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042952 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[09:51:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:52:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1009.eqiad.wmnet
[09:52:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main1010.eqiad.wmnet
[09:52:52] <wikibugs>	 (03PS7) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996)
[09:53:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[09:53:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[09:54:19] <wikibugs>	 (03PS1) 10Kamila Součková: Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980
[09:54:30] <wikibugs>	 (03PS2) 10Kamila Součková: Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980
[09:56:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:58:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main1010.eqiad.wmnet
[09:59:26] <wikibugs>	 (03PS2) 10Brouberol: hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894)
[09:59:43] <wikibugs>	 (03CR) 10Brouberol: hemlfile: export admin-ng pending diff metrics hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[09:59:45] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029) (owner: 10Hashar)
[09:59:57] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[10:00:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[10:00:18] <wikibugs>	 (03Merged) 10jenkins-bot: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042976 (https://phabricator.wikimedia.org/T367029) (owner: 10Hashar)
[10:00:33] <wikibugs>	 (03CR) 10Hnowlan: kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[10:00:35] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: statsd-exporter: add service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657
[10:00:35] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982
[10:00:35] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983
[10:00:35] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984
[10:01:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[10:01:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto)
[10:01:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto)
[10:01:39] <claime>	 The "Appserver unavailable" alerts are most probably my decoms
[10:02:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:02:25] <hashar>	 jouncebot: nowandnext
[10:02:25] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[10:02:25] <jouncebot>	 In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200)
[10:02:26] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:02:46] <hashar>	 I am going to do a quick Gerrit update, should not take more than a few minutes
[10:03:46] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[10:03:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:03:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl2003.codfw.wmnet
[10:03:53] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9887946 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl2003.codfw....
[10:03:56] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Revert "Add wikikube-ctrl2003 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1042980 (owner: 10Kamila Součková)
[10:04:03] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1
[10:04:10] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 (duration: 00m 08s)
[10:04:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2281,2283-2286].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[10:05:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64816 and previous config saved to /var/cache/conftool/dbconfig/20240613-100509-marostegui.json
[10:05:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2281,2283-2286].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[10:05:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:05:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2281,2283-2286].codfw.wmnet
[10:06:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2287-2290].codfw.wmnet
[10:07:34] <wikibugs>	 (03CR) 10MVernon: [C:03+2] cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[10:07:54] <wikibugs>	 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9887959 (10kamila) >>! In T366205#9880294, @Papaul wrote: > @kamila  your plan works for us as well, just depool and power the fi...
[10:08:03] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 on gerrit1003 # T367029 T367135
[10:08:09] <stashbot>	 T367029: "Press c to comment" is placed incorrectly when using Firefox 126 and 128 on macOS - https://phabricator.wikimedia.org/T367029
[10:08:09] <stashbot>	 T367135: "Collapse" link on add/edit reviewers screen is showing weird icons - https://phabricator.wikimedia.org/T367135
[10:08:10] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@ee8252a]: Gerrit to snapshot version 3.9.5-21-g553ea468a1 on gerrit1003 # T367029 T367135 (duration: 00m 06s)
[10:09:03] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:09:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2006.codfw.wmnet
[10:10:00] <fabfur>	 !log cp4037 depooled && puppet disable to profile benthos configuration (T360454)
[10:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:04] <stashbot>	 T360454: Better Benthos performances - https://phabricator.wikimedia.org/T360454
[10:10:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:15:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:15:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2006.codfw.wmnet
[10:15:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2007.codfw.wmnet
[10:16:07] <claime>	 The high error rates are the circuit breaking ^ Amir1
[10:18:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[10:18:45] <wikibugs>	 (03PS1) 10MVernon: wmflib: correct doc string to note lvs is Optional [puppet] - 10https://gerrit.wikimedia.org/r/1042986
[10:20:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:20:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[10:20:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367261)', diff saved to https://phabricator.wikimedia.org/P64818 and previous config saved to /var/cache/conftool/dbconfig/20240613-102016-marostegui.json
[10:20:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[10:20:21] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[10:20:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[10:20:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1042986 (owner: 10MVernon)
[10:21:21] <wikibugs>	 (03Abandoned) 10FNegri: Add DNS for ToolsDB replica host [puppet] - 10https://gerrit.wikimedia.org/r/1034042 (https://phabricator.wikimedia.org/T348407) (owner: 10FNegri)
[10:21:44] <Emperor>	 hashar: can you tell me when the gerrit update is finished, please?
[10:21:53] <hashar>	 oh sorry
[10:21:55] <hashar>	 done
[10:21:59] <hashar>	 !log Gerrit upgrade completed
[10:22:00] <Emperor>	 thanks.
[10:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:06] <hashar>	 well upgrade is a bold word really
[10:22:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2007.codfw.wmnet
[10:22:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2008.codfw.wmnet
[10:22:16] <wikibugs>	 (03CR) 10MVernon: [C:03+2] wmflib: correct doc string to note lvs is Optional [puppet] - 10https://gerrit.wikimedia.org/r/1042986 (owner: 10MVernon)
[10:22:19] <hashar>	 it is merely swapping for a version with a handful of patches applied
[10:22:21] <hashar>	 but yeah it is done
[10:22:22] <hashar>	 sorry
[10:22:38] <wikibugs>	 (03PS1) 10Brouberol: datahub: hotfix, remove duplicated env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042987
[10:23:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[10:23:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[10:23:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: hotfix, remove duplicated env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042987 (owner: 10Brouberol)
[10:23:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2287-2290].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[10:23:47] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983
[10:23:47] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984
[10:24:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto)
[10:25:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:25:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:25:33] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto)
[10:26:16] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:26:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[10:26:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2287-2290].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[10:26:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:26:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2287-2290].codfw.wmnet
[10:26:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[10:27:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64819 and previous config saved to /var/cache/conftool/dbconfig/20240613-102659-marostegui.json
[10:27:11] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[10:28:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: add service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657 (owner: 10Giuseppe Lavagetto)
[10:28:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2008.codfw.wmnet
[10:28:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2009.codfw.wmnet
[10:28:50] <wikibugs>	 (03Merged) 10jenkins-bot: statsd-exporter: add service port to ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042657 (owner: 10Giuseppe Lavagetto)
[10:29:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982 (owner: 10Giuseppe Lavagetto)
[10:29:08] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:29:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto)
[10:29:52] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:30:31] <wikibugs>	 (03Merged) 10jenkins-bot: modules: add base.statsd new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042982 (owner: 10Giuseppe Lavagetto)
[10:30:32] <wikibugs>	 (03Merged) 10jenkins-bot: base.statsd: allow binding to ipv4 for statsd collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042983 (owner: 10Giuseppe Lavagetto)
[10:30:43] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1003']
[10:30:50] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984
[10:31:10] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1003']
[10:31:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888054 (10Clement_Goubert) @Papaul All servers except `mw2282` decommissioned.
[10:31:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64820 and previous config saved to /var/cache/conftool/dbconfig/20240613-103111-ladsgroup.json
[10:31:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888045 (10Clement_Goubert)
[10:31:17] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:31:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64821 and previous config saved to /var/cache/conftool/dbconfig/20240613-103120-marostegui.json
[10:31:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto)
[10:31:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9888066 (10MoritzMuehlenhoff)
[10:32:08] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984
[10:32:16] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888074 (10MoritzMuehlenhoff)
[10:32:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275#9888049 (10Clement_Goubert) a:05Clement_Goubert→03None
[10:32:42] <wikibugs>	 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9888076 (10MoritzMuehlenhoff)
[10:33:41] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:34:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto)
[10:34:24] <wikibugs>	 (03PS11) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[10:34:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2009.codfw.wmnet
[10:34:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-main2010.codfw.wmnet
[10:35:24] <wikibugs>	 (03Merged) 10jenkins-bot: statsd-exporter: update base.statsd to 1.0.3, switch to ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042984 (owner: 10Giuseppe Lavagetto)
[10:36:21] <wikibugs>	 (03PS12) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[10:37:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) (owner: 10Brouberol)
[10:39:26] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[10:40:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[10:41:00] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:41:11] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2010.codfw.wmnet
[10:41:28] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:41:49] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[10:42:24] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[10:43:05] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[10:43:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988
[10:44:26] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto)
[10:44:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto)
[10:45:26] <wikibugs>	 (03Abandoned) 10Hnowlan: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: 10Daniel Kinzler)
[10:45:47] <wikibugs>	 (03Merged) 10jenkins-bot: base.statsd: fix port name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042988 (owner: 10Giuseppe Lavagetto)
[10:46:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P64822 and previous config saved to /var/cache/conftool/dbconfig/20240613-104619-ladsgroup.json
[10:46:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P64823 and previous config saved to /var/cache/conftool/dbconfig/20240613-104628-marostegui.json
[10:46:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:46:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:46:48] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:47:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:47:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet
[10:47:29] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:48:03] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:48:03] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[10:48:09] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:48:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:49:38] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[10:49:51] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:50:15] <wikibugs>	 (03PS1) 10MVernon: cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621)
[10:51:38] <_joe_>	 jouncebot: now
[10:51:38] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1000)
[10:51:48] <_joe_>	 sigh I will be running a little late I fear
[10:51:52] <_joe_>	 jouncebot: next
[10:51:52] <jouncebot>	 In 1 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200)
[10:52:02] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[10:52:40] <wikibugs>	 (03PS1) 10Brouberol: datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995
[10:54:21] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base.statsd: remove quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042996
[10:54:58] <wikibugs>	 (03CR) 10Klausman: [C:03+1] cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[10:55:07] <icinga-wm_>	 PROBLEM - SSH on wikikube-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:55:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[10:56:03] <wikibugs>	 (03CR) 10MVernon: [C:03+2] cephadm::controller - escape split argument [puppet] - 10https://gerrit.wikimedia.org/r/1042991 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[10:56:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet
[10:56:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] base.statsd: remove quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042996 (owner: 10Giuseppe Lavagetto)
[10:56:41] <icinga-wm_>	 PROBLEM - Host wikikube-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:58:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff)
[10:58:22] <claime>	 huh that ain't me
[10:58:33] <_joe_>	 claime: wat
[10:58:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:58:54] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:59:00] <claime>	 I just rebooted kubemaster1001, didn't touch wikikube-ctrl1001
[10:59:03] <_joe_>	 ah that's not an active master
[10:59:08] <_joe_>	 ctrl I mean?
[10:59:20] <claime>	 it is
[10:59:29] <claime>	 well it was
[10:59:32] <_joe_>	 can't reach via ssh
[10:59:34] <claime>	 now it's down
[10:59:39] <claime>	 kamila_ ?
[10:59:55] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:00:03] <icinga-wm_>	 RECOVERY - SSH on wikikube-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:00:05] <icinga-wm_>	 RECOVERY - Host wikikube-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[11:00:30] <kamila_>	 Huh, that wasn't me
[11:00:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:01:05] <claime>	 That's probably me
[11:01:08] <akosiaris>	 ok
[11:01:21] <wikibugs>	 (03PS1) 10Superpes15: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247)
[11:01:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P64824 and previous config saved to /var/cache/conftool/dbconfig/20240613-110126-ladsgroup.json
[11:01:31] <claime>	 it's actually up, I don't know why it's pinging now
[11:01:32] <godog>	 ack thanks claime 
[11:01:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P64825 and previous config saved to /var/cache/conftool/dbconfig/20240613-110135-marostegui.json
[11:01:39] <claime>	 the one that's down is ctrl1001
[11:01:40] <godog>	 checking too
[11:01:57] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:02:03] <kamila_>	 I didn't touch ctrl1001 today 
[11:02:06] <godog>	 yeah recovering, I don't see pages on alerts.w.o
[11:02:08] <Amir1>	 here
[11:02:24] <Amir1>	 too late then
[11:02:30] <claime>	  11:02:25 up 2 min,  2 users,  load average: 2.68, 1.14, 0.42
[11:02:33] <claime>	 it rebooted
[11:02:35] <claime>	 wth
[11:02:39] <kamila_>	 Mhm 
[11:03:33] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: wikikube-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:03:44] <kamila_>	 I'm currently at a doctor's appointment, I'll stare at it when I get back 
[11:04:26] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol)
[11:04:40] <godog>	 since the probe recovered I'm assuming we're okay claime  ?
[11:05:12] <claime>	 2024-06-13T10:52:34.131434+00:00 wikikube-ctrl1001 systemd-logind[1069]: Power key pressed.
[11:05:14] <claime>	 wat
[11:05:25] <claime>	 godog: yeah
[11:05:42] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:05:50] <godog>	 kk thanks claime, going back to lunch
[11:05:57] <claime>	 sorry for the noise
[11:06:00] <claime>	 enjoy lunch
[11:06:12] <godog>	 np that's what we are here for
[11:07:15] <icinga-wm_>	 PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:07:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet
[11:07:30] <wikibugs>	 (03PS1) 10Muehlenhoff: mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998
[11:07:36] <topranks>	 claime: shit, that could have been me 
[11:07:38] * topranks checking 
[11:07:57] <effie>	 jouncebot: now
[11:07:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[11:08:15] <icinga-wm_>	 RECOVERY - ensure kvm processes are running on cloudvirt1032 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:08:18] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:08:25] <topranks>	 ugh, yeah :( 
[11:08:32] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:08:43] <_joe_>	 effie: I am finally done, all yours
[11:09:12] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[11:09:20] <topranks>	 claime: I had intended to reset wikikube-ctrl1003 to follow up on some debugging I was doing with kamila... seems I typed the url wrong 
[11:09:22] <effie>	 _joe_: tx 
[11:09:28] <claime>	 topranks: happens
[11:09:34] <topranks>	 ugh shouldn't though 
[11:09:34] <claime>	 at least it's got the new kernel now
[11:09:36] <claime>	 :p
[11:09:54] <topranks>	 ha ok see protecting you guys from hackers :P
[11:10:34] <topranks>	 claime: with any luck 1002 was able to keep the lights on?
[11:10:51] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:10:55] <claime>	 topranks: yeah and there was the old kubemaster as well
[11:11:12] <topranks>	 ok ok, sorry folks I'll make sure to do better 
[11:12:55] <wikibugs>	 (03PS1) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1042999
[11:13:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE))
[11:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:13:41] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:14:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet
[11:14:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042999 (owner: 10Muehlenhoff)
[11:15:56] <claime>	 kamila_: Jun 13 11:13:53 puppetmaster1001 confd[7831]: 2024-06-13T11:13:53Z puppetmaster1001 /usr/bin/confd[7831]: ERROR "failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/codfw/.kubemaster793668086' with 1 (0.043032169342041016s) [invalid]: { 'host': 'wikikube-ctrl2003.codfw.wmnet', 'weight':10, 'enabled': True } [Errno -2] Name or service not known\n\nupdating error
[11:15:58] <claime>	 mtime on /var/run/confd-template/_srv_config-master_pybal_codfw_kubemaster.err\n"
[11:16:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T352010)', diff saved to https://phabricator.wikimedia.org/P64826 and previous config saved to /var/cache/conftool/dbconfig/20240613-111633-ladsgroup.json
[11:16:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[11:16:37] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:16:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367261)', diff saved to https://phabricator.wikimedia.org/P64827 and previous config saved to /var/cache/conftool/dbconfig/20240613-111642-marostegui.json
[11:16:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:16:48] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[11:16:49] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[11:16:50] <moritzm>	 !log installing pillow security updates
[11:16:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64828 and previous config saved to /var/cache/conftool/dbconfig/20240613-111655-ladsgroup.json
[11:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[11:17:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64829 and previous config saved to /var/cache/conftool/dbconfig/20240613-111706-marostegui.json
[11:18:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:41] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:19:10] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol)
[11:19:13] <icinga-wm_>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:19:17] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[11:19:46] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl2003.codfw.wmnet
[11:20:15] <wikibugs>	 (03CR) 10Volans: "question/thought inline" [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack)
[11:20:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet
[11:20:51] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:21:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64830 and previous config saved to /var/cache/conftool/dbconfig/20240613-112122-marostegui.json
[11:22:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[11:22:42] <claime>	 kamila_: topranks: I set wikikube-ctrl2003.codfw.wmnet to inactive because it doesn't resolve anymore and that breaks confd
[11:23:41] <jinxer-wm>	 RESOLVED: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kubemaster.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:23:52] <topranks>	 claime: seems sensible, that machine shows as status "decommissioning" in netbox so it makes sense the name is not in DNS 
[11:23:59] <wikibugs>	 (03PS1) 10Ladsgroup: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006
[11:24:52] <claime>	 topranks: yep, but it must have references in puppet, given it's being wrestled into submission by you and k.amila_
[11:25:12] <topranks>	 yeah, possibly those references should have been removed 
[11:25:23] <Amir1>	 jouncebot: nowandnext
[11:25:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 34 minute(s)
[11:25:23] <jouncebot>	 In 0 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200)
[11:25:31] <wikibugs>	 (03PS2) 10Ladsgroup: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006
[11:25:36] <topranks>	 but also likely a brief interruption would have been fine, and what kamila expected, but we had *problems*
[11:25:40] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 (owner: 10Ladsgroup)
[11:26:02] <claime>	 Amir1: effie is rebooting k8s nodes, it may impact the k8s pull, and potentially the redeployment of mw-on-k8s
[11:26:14] <claime>	 Possible it won't given it's a small-ish batch
[11:26:16] <topranks>	 kamila_: let me know if I can help with wikikube-ctrl2003, right now in Netbox it looks a little non-standard 
[11:26:18] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily bump circuit breaking threshold to 350 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043006 (owner: 10Ladsgroup)
[11:26:40] <topranks>	 as in it has no IP addresses assigned, but does have its switch interface connected
[11:26:59] <topranks>	 I can tidy that up if needed once you're back and we know what next steps are 
[11:27:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet
[11:27:39] <Amir1>	 claime: noted
[11:27:51] <Amir1>	 how long is it going to take?
[11:27:57] <effie>	 Amir1: I will ping you 
[11:28:03] <Amir1>	 thanks!
[11:28:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version
[11:28:34] <wikibugs>	 (03CR) 10JMeybohm: golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman)
[11:29:28] <effie>	 Amir1: I generally wanted to make it before the next window
[11:29:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[11:30:05] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] "Good catch, didn't notice those!" [puppet] - 10https://gerrit.wikimedia.org/r/1042999 (owner: 10Muehlenhoff)
[11:30:19] <Amir1>	 I don't think people will deploy things in the next window, I can take over there
[11:31:35] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "This looks like it could be working now ;)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:32:24] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998 (owner: 10Muehlenhoff)
[11:33:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:35:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[11:36:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P64831 and previous config saved to /var/cache/conftool/dbconfig/20240613-113630-marostegui.json
[11:36:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:38:31] <godog>	 checking
[11:39:59] <wikibugs>	 (03PS1) 10Muehlenhoff: ircecho: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615)
[11:40:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] mailman: Remove ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/1042998 (owner: 10Muehlenhoff)
[11:40:51] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff)
[11:41:31] <wikibugs>	 (03PS2) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1042999
[11:48:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:49:41] <kamila_>	 claime, topranks: thank you for the help with wikikube-ctrl2003, it's scheduled to be juggled by dc-ops and they suggested that I decom it on my schedule because timezones, I suppose it's not that simple '^^
[11:49:46] <jinxer-wm>	 FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[11:51:09] <topranks>	 kamila_: dc-ops are moving it?
[11:51:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399 (10MoritzMuehlenhoff) 03NEW
[11:51:26] <kamila_>	 topranks: yes, it needs to go into a 10G rack
[11:51:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P64832 and previous config saved to /var/cache/conftool/dbconfig/20240613-115137-marostegui.json
[11:53:03] <kamila_>	 Just decom without changing anything else seemed to work when we were doing it quickly, but async apparently gets in the way, I'm sorry 
[11:53:33] <topranks>	 kamila_: but it's in a 10G rack.... hmm maybe they already moved it?
[11:54:22] <kamila_>	 Well in that case someone is confused, most likely me :-D
[11:54:39] <topranks>	 kamila_: ah my bad, it is indeed connected to a 10/25G switch, but all the port blocks on it are set to 1G so probably it does need to move
[11:54:40] <topranks>	 ignore me 
[11:55:14] <kamila_>	 I'll have a task number in a sec, omw home from doctor 
[11:56:00] <topranks>	 no worries, you / dc-ops are right I think it needs to move :( 
[11:56:36] <topranks>	 I've just become aware of a headache I'd not fully considered before, will spare you the details, but it really sucks we gotta move this; there will be many more of the same, I fear :(
[11:57:05] <fabfur>	 !log enabling puppet && repool cp4037 (T360454)
[11:57:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:10] <stashbot>	 T360454: Better Benthos performances - https://phabricator.wikimedia.org/T360454
[11:57:13] <wikibugs>	 (03PS2) 10Klausman: golang: Add version 1.22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948
[11:57:34] <wikibugs>	 (03CR) 10Klausman: golang: Add version 1.22 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042948 (owner: 10Klausman)
[11:57:45] <topranks>	 kamila_: it's all good, let's wait till DC-ops do the move and confirm the new port, hopefully it'll be straightforward after that 
[11:58:16] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[11:59:50] <kamila_>	 I hope so, thank you topranks <3 
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1200)
[12:04:51] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:04:57] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[12:04:59] <stashbot>	 jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[12:05:01] <wikibugs>	 (03PS1) 10Slyngshede: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026
[12:05:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff)
[12:06:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367261)', diff saved to https://phabricator.wikimedia.org/P64834 and previous config saved to /var/cache/conftool/dbconfig/20240613-120644-marostegui.json
[12:06:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:06:50] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[12:06:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:07:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[12:07:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:07:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:07:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64835 and previous config saved to /var/cache/conftool/dbconfig/20240613-120711-marostegui.json
[12:07:27] <effie>	 Amir1: done
[12:07:52] <Amir1>	 awesome
[12:07:54] <Amir1>	 thanks
[12:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:36] <wikibugs>	 (03PS2) 10Slyngshede: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026
[12:09:22] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]]
[12:11:04] <wikibugs>	 (03CR) 10Krinkle: noc: fail with a 404 when the selected wiki is nonexistent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse)
[12:11:06] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] "We handle 429 with retry, there's a test: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/blob/main/common/s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[12:11:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64836 and previous config saved to /var/cache/conftool/dbconfig/20240613-121127-marostegui.json
[12:12:05] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: enable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[12:12:14] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:12:21] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[12:15:33] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:16:01] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:16:17] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 (owner: 10Slyngshede)
[12:17:35] <wikibugs>	 (03Merged) 10jenkins-bot: Add setting for database engine to Docker image. [software/bitu] - 10https://gerrit.wikimedia.org/r/1043026 (owner: 10Slyngshede)
[12:17:56] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:19:49] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:20:34] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:20:44] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888356 (10WDoranWMF)
[12:21:35] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1043006|Temporarily bump circuit breaking threshold to 350]] (duration: 12m 13s)
[12:22:04] <wikibugs>	 (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/1043036 (https://phabricator.wikimedia.org/T367382)
[12:24:08] <wikibugs>	 (03PS1) 10Muehlenhoff: udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702)
[12:24:33] <wikibugs>	 (03PS2) 10Muehlenhoff: udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702)
[12:25:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] ircecho: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1043018 (https://phabricator.wikimedia.org/T333615) (owner: 10Muehlenhoff)
[12:26:31] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage and move to OVS
[12:26:33] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage and move to OVS
[12:26:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64837 and previous config saved to /var/cache/conftool/dbconfig/20240613-122634-marostegui.json
[12:28:13] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 10netops, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9888375 (10MatthewVernon) Just to note that per [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&va...
[12:28:29] <wikibugs>	 (03PS1) 10Majavah: hieradata: cloudvirt1032: Move to single NIC setup and OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457)
[12:28:54] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: bump exporter version to v1.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/1043036 (https://phabricator.wikimedia.org/T367382) (owner: 10Jelto)
[12:29:13] <icinga-wm_>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 4804 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[12:29:45] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2921/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah)
[12:30:27] <wikibugs>	 (03PS1) 10Reedy: CommonSettings: Mark REL1_42 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850)
[12:30:54] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm
[12:30:56] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[12:33:53] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudvirt1032: Move to single NIC setup and OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043042 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah)
[12:38:24] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888452 (10elukey) IIUC we are missing DHCP's option 12 from the BMC's client. On DELL's we expect something like:...
[12:39:18] <elukey>	 !log reset BIOS/BMC to factory default on sretest1001 - T365372
[12:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:22] <stashbot>	 T365372: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372
[12:39:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub-next: add missing network policy to the mce-consumer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042995 (owner: 10Brouberol)
[12:40:00] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[12:41:42] <wikibugs>	 (03PS1) 10Hashar: Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050
[12:41:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P64838 and previous config saved to /var/cache/conftool/dbconfig/20240613-124141-marostegui.json
[12:44:47] <wikibugs>	 (03CR) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack)
[12:48:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:43] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage
[12:50:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408 (10cmooney) 03NEW p:05Triage→03Low
[12:51:32] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage
[12:52:46] <logmsgbot>	 !log jmm@cumin1002 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[12:55:42] <wikibugs>	 (03PS1) 10Majavah: prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057
[12:56:18] <wikibugs>	 (03CR) 10DCausse: "aren't 429 handled as part of the normal retry mechanism? meaning that events might enter the error queue because of throttling if the num" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[12:56:31] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9888510 (10elukey) I can confirm that the sretest1001's BMC sends this:  ` DHCP-Message (53), length 1: Discover Hos...
[12:56:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367261)', diff saved to https://phabricator.wikimedia.org/P64839 and previous config saved to /var/cache/conftool/dbconfig/20240613-125648-marostegui.json
[12:56:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[12:56:51] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2922/console" [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah)
[12:56:53] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[12:56:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance
[12:57:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64840 and previous config saved to /var/cache/conftool/dbconfig/20240613-125700-marostegui.json
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1300).
[13:00:04] <jouncebot>	 Nemoralis, Superpes, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:01:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64841 and previous config saved to /var/cache/conftool/dbconfig/20240613-130117-marostegui.json
[13:01:19] <Lucas_WMDE>	 I’m in a meeting but can deploy later
[13:01:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9888532 (10cmooney) 05Open→03Resolved
[13:03:59] <logmsgbot>	 !log jmm@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[13:04:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah)
[13:04:58] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] prometheus: nic_saturation_exporter: Depend on node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah)
[13:06:47] <moritzm>	 !log installing pillow security updates
[13:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:13] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:07:26] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:08:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:33] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:08:33] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:08:33] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:16] <wikibugs>	 (03PS1) 10Majavah: openstack: nova: Ensure libvirt is running when declaring secrets [puppet] - 10https://gerrit.wikimedia.org/r/1043058
[13:09:33] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:10:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P64842 and previous config saved to /var/cache/conftool/dbconfig/20240613-131006-ladsgroup.json
[13:10:24] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2923/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043058 (owner: 10Majavah)
[13:10:39] <wikibugs>	 (03PS8) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[13:11:32] <wikibugs>	 (03PS1) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310)
[13:12:56] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888607 (10elukey)
[13:13:43] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:13:55] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:13:57] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:14:03] <wikibugs>	 (03PS9) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[13:14:44] <Superpes>	 Hi Lucas_WMDE, from what time are you available?
[13:16:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64843 and previous config saved to /var/cache/conftool/dbconfig/20240613-131625-marostegui.json
[13:16:50] <Lucas_WMDE>	 o/
[13:16:52] <Lucas_WMDE>	 now :)
[13:16:55] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-deploy repository - https://phabricator.wikimedia.org/T367410#9888639 (10elukey) Created https://gitlab.wikimedia.org/repos/sre/python-deploy  @Volans we can change the name if you want, otherwise please push the first version of the c...
[13:17:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[13:17:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[13:17:43] <Lucas_WMDE>	 no Nemoralis yet afaict
[13:17:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64844 and previous config saved to /var/cache/conftool/dbconfig/20240613-131746-ladsgroup.json
[13:17:50] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:18:39] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bookworm
[13:19:15] <Lucas_WMDE>	 Superpes: I’m confused by the changed fawikibooks comments in logos.php, any idea what happened there?
[13:19:25] <Lucas_WMDE>	 did the script change and the file wasn’t regenerated in the meantime, or something?
[13:19:52] <Lucas_WMDE>	 o_O also the diffConfig reports a difference to cawiki.json
[13:21:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888659 (10Papaul) @Clement_Goubert thank you.
[13:21:43] <wikibugs>	 (03PS1) 10MVernon: install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621)
[13:21:57] <wikibugs>	 (03CR) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse)
[13:22:00] <wikibugs>	 (03PS5) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587
[13:22:40] <wikibugs>	 (03PS2) 10Superpes15: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247)
[13:22:59] <Superpes>	 Lucas_WMDE Maybe it's a fix? I was confused too, but tried with another project, and the same change happened...
[13:23:03] <wikibugs>	 (03PS2) 10MVernon: install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621)
[13:23:18] <Lucas_WMDE>	 I rebased the change, curious what CI will say now
[13:23:30] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043043 (https://phabricator.wikimedia.org/T359850) (owner: 10Reedy)
[13:24:51] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120)
[13:25:05] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[13:25:09] <Lucas_WMDE>	 Superpes: AFAICT the cawiki change might be correct, https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-ca.svg indeed has width="120" and height="14"
[13:25:12] <Superpes>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1035852/1..2 This patch (related to the fawikibooks issue) was likely created without using tox
[13:25:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64845 and previous config saved to /var/cache/conftool/dbconfig/20240613-132512-ladsgroup.json
[13:25:15] <Lucas_WMDE>	 still completely baffling where it comes from though
[13:26:12] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "thanks taavi!" [puppet] - 10https://gerrit.wikimedia.org/r/1043057 (owner: 10Majavah)
[13:26:15] <Lucas_WMDE>	 hang on
[13:26:27] <Lucas_WMDE>	 oh, wait. that change is actually in logos.php
[13:26:29] <moritzm>	 !log installing pillow security updates
[13:26:30] <Lucas_WMDE>	 I just didn’t notice it before
[13:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:39] <Lucas_WMDE>	 okay that explains the diffConfig at least
[13:27:07] <Superpes>	 Lucas_WMDE Afaik, if you don't use tox, you'll get a -1... but don't know why the checks were fine in the fawikibooks patch :/
[13:27:17] <wikibugs>	 (03PS1) 10Btullis: Switch the role for an-redacteddb1001 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453)
[13:28:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[13:28:21] <claime>	 jouncebot: nowandnext
[13:28:21] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1300)
[13:28:21] <jouncebot>	 In 1 hour(s) and 31 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500)
[13:28:32] <claime>	 will wait :)
[13:28:37] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2924/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[13:28:51] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[13:30:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "The additional changes are confusing, but as far as I can tell, harmless (fawiktionary comments) or correct (cawiki’s logo is indeed 120x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15)
[13:30:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15)
[13:30:37] <Lucas_WMDE>	 let’s try it
[13:30:41] <Superpes>	 About cawiki, yep, it seems correct.. but I didn't run the script for cawiki! So it's still weird, but maybe tox fixes all the issues when run, so let's say everything is fine :D
[13:30:51] <volans>	 !log upgrading spicerack on cumin2002 to v8.6.0
[13:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:59] * Lucas_WMDE is very reluctant to touch / run tox ^^
[13:31:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done" [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[13:31:11] <Superpes>	 I'll check cawiki too on WMDebug just to be sure
[13:31:19] <Lucas_WMDE>	 yeah, I was gonna do that too, thanks
[13:31:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P64846 and previous config saved to /var/cache/conftool/dbconfig/20240613-133132-marostegui.json
[13:31:43] <wikibugs>	 (03Merged) 10jenkins-bot: [svwikt] Add a temporary logo for the 100.000 pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042997 (https://phabricator.wikimedia.org/T364247) (owner: 10Superpes15)
[13:31:53] <Lucas_WMDE>	 FWIW, the tagline at https://ca.wikipedia.org/wiki/Portada doesn’t look especially “stretched” to me at the moment
[13:32:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]]
[13:32:16] <Lucas_WMDE>	 but then again, 112/13 and 120/14 is almost the same aspect ratio
[13:32:17] <stashbot>	 T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247
[13:32:23] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Switch the role for an-redacteddb1001 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1043063 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis)
[13:32:25] <Lucas_WMDE>	 (a bit over eight and a half)
[13:32:52] <Lucas_WMDE>	 I guess it will become a smidgen bigger
[13:33:05] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:33:44] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:34:02] <Superpes>	 Yep indeed but maybe 112/13 was manually added and tox doesn't like it lmao :D
[13:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:34:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 superpes, lucaswerkmeister-wmde: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:35:12] <Lucas_WMDE>	 Superpes: please test :)
[13:35:41] <Lucas_WMDE>	 yeah the tagline grows a tiny bit
[13:35:51] <Lucas_WMDE>	 looks fine to me tbh
[13:35:53] <Superpes>	 Yep and looks fine
[13:36:02] * Lucas_WMDE peeks at svwiktionary
[13:36:04] <Superpes>	 Yep and on svwikt too :)
[13:36:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 superpes, lucaswerkmeister-wmde: Continuing with sync
[13:36:27] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050 (owner: 10Hashar)
[13:36:33] <Superpes>	 I don't like the gold color of the logo tbh :D
[13:36:49] <Superpes>	 But it's not my choice lmao
[13:38:04] <Lucas_WMDE>	 wiki sovereignty \oi
[13:38:05] <Lucas_WMDE>	 * \o/
[13:38:32] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[13:38:45] <Superpes>	 Lol 
[13:39:13] <Superpes>	 I also have a problem with another patch on a wordmark, tox sets it to 2x1px resolution, which is absurd
[13:39:33] <Superpes>	 I tried to fix the svg but the situation didn't change
[13:39:56] <Lucas_WMDE>	 huh
[13:40:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64847 and previous config saved to /var/cache/conftool/dbconfig/20240613-134017-ladsgroup.json
[13:40:56] <Superpes>	 Furthermore, I also had to fix these svwiktionary logos because they didn't meet the resolution standards! Unfortunately the guidelines are not read, and a lot of people upload logos and wordmarks thinking they're fine like this :D
[13:40:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9888707 (10Jhancock.wm) rails, power, and network cables prepped for mw2282 move.
[13:41:15] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9888709 (10hnowlan) >>! In T361835#9712223, @SGupta-WMF wrote: > @WDoranWMF Ye...
[13:41:38] <wikibugs>	 (03PS1) 10Hashar: Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762)
[13:42:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades? - https://phabricator.wikimedia.org/T367408#9888712 (10cmooney) We could use these cables but the host side but we might not have enough slack to connect to servers at dif...
[13:44:15] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9888715 (10Jhancock.wm) @Marostegui thank you for the reminder. I will be getting this racked on Friday most likely. also thank you for updating puppet files!
[13:44:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[13:44:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[13:44:52] <wikibugs>	 (03Merged) 10jenkins-bot: Merge commit 'stable-3.9@553ea468a1' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043050 (owner: 10Hashar)
[13:44:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T352010)', diff saved to https://phabricator.wikimedia.org/P64848 and previous config saved to /var/cache/conftool/dbconfig/20240613-134456-ladsgroup.json
[13:45:02] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:45:08] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar)
[13:45:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042997|[svwikt] Add a temporary logo for the 100.000 pages (T364247)]] (duration: 13m 24s)
[13:45:41] <stashbot>	 T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247
[13:46:12] <Lucas_WMDE>	 Superpes: should be done :)
[13:46:24] <Lucas_WMDE>	 still no sign of Nemoralis afaict
[13:46:37] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153)
[13:46:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367261)', diff saved to https://phabricator.wikimedia.org/P64849 and previous config saved to /var/cache/conftool/dbconfig/20240613-134639-marostegui.json
[13:46:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[13:46:44] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[13:46:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:40:00 on lsw1-f6-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f6-eqiad
[13:46:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance
[13:47:02] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:40:00 on lsw1-f6-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f6-eqiad
[13:47:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64850 and previous config saved to /var/cache/conftool/dbconfig/20240613-134701-marostegui.json
[13:47:07] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[13:47:45] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9888730 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94b81d4d-316b-4c68-b4a9-a2d07057d180) set by cmooney...
[13:48:33] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[13:48:53] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9888734 (10Eevans) >>! In T362033#9885505, @VRiley-WMF wrote: > It certainly does! I will plan for this tomorrow and start prepping a motherboard for this unit. Thanks!  Standing by; Let me know!
[13:49:03] <Lucas_WMDE>	 jouncebot: next
[13:49:03] <jouncebot>	 In 1 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500)
[13:49:55] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Note: this is okay because all Test Wikidata clients have reached wmf.9; the wmf.8 backport had to be aborted, so if the train has to be r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE))
[13:50:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE))
[13:50:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64851 and previous config saved to /var/cache/conftool/dbconfig/20240613-135010-marostegui.json
[13:50:48] <wikibugs>	 (03Merged) 10jenkins-bot: Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE))
[13:51:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]]
[13:51:25] <stashbot>	 T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153
[13:53:01] <wikibugs>	 (03Merged) 10jenkins-bot: Merge commit 'stable-3.9@7380128525' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043066 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar)
[13:53:55] <Superpes>	 Lucas_WMDE I can check Nemoralis patch :)
[13:53:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:54:36] <Lucas_WMDE>	 testing my own patch at the moment
[13:54:50] <wikibugs>	 (03PS2) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310)
[13:55:16] <claime>	 !log roll-restarting shellbox-constraints
[13:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:22] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: sync
[13:55:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64852 and previous config saved to /var/cache/conftool/dbconfig/20240613-135523-ladsgroup.json
[13:55:28] <Lucas_WMDE>	 looks good so far…
[13:55:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: sync
[13:55:56] <wikibugs>	 (03PS3) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310)
[13:56:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[13:57:35] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede)
[13:58:03] <wikibugs>	 (03PS4) 10JMeybohm: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310)
[13:58:25] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:58:37] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:58:37] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:59:14] <wikibugs>	 (03PS1) 10Majavah: hieradata: Move cloudvirt1033 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043071 (https://phabricator.wikimedia.org/T364457)
[13:59:33] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:59:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: sync
[13:59:51] <Lucas_WMDE>	 Superpes: I don’t think we’ll have time for that anyway, sorry
[13:59:57] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1033.eqiad.wmnet with reason: reimage and move to OVS
[14:00:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: sync
[14:00:09] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1033.eqiad.wmnet with reason: reimage and move to OVS
[14:00:10] <Superpes>	 Oh yep no problem :) Thanks for your assistance btw :P
[14:00:15] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[14:01:12] <wikibugs>	 (03Merged) 10jenkins-bot: ratelimit: Use LOG_LEVEL warn by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043059 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[14:03:07] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm
[14:03:17] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mediawiki: Switch backend calls to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1043062 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[14:03:43] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:03:53] <wikibugs>	 (03PS3) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490
[14:03:55] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:03:57] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:04:12] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1033 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043071 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah)
[14:04:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack)
[14:05:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64853 and previous config saved to /var/cache/conftool/dbconfig/20240613-140517-marostegui.json
[14:05:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042208|Load EntitySchema on Test Wikidata clients (T363153)]] (duration: 14m 14s)
[14:05:39] <stashbot>	 T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153
[14:05:44] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:05:46] <Lucas_WMDE>	 ping claime :)
[14:05:46] <wikibugs>	 (03PS4) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490
[14:05:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:57] <claime>	 Thanks Lucas_WMDE :)
[14:06:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] grafana: change performance testing graphite endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1042223 (https://phabricator.wikimedia.org/T367064) (owner: 10Filippo Giunchedi)
[14:06:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] udpmxircecho: One more Python 2 -> Python 3 fix [puppet] - 10https://gerrit.wikimedia.org/r/1043038 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[14:06:46] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto)
[14:06:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563)
[14:06:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563)
[14:06:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563)
[14:08:49] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888788 (10dcaro) Added two panels to the health...
[14:08:53] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888795 (10dcaro) Added the discards also to the c...
[14:08:57] <wikibugs>	 06SRE-OnFire, 06cloud-services-team, 10Cloud-VPS, 10Sustainability (Incident Followup): [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9888796 (10dcaro) 05Open→03Resolved
[14:12:38] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] "Yes, they would be retried like any other failed request. We could make an exception here and let the HTTP client retry in case of 429. Th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[14:15:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:15:25] <logmsgbot>	 !log cgoubert@deploy1002 Started scap: Change mwapi listener to mw-api-int - T333120
[14:15:30] <stashbot>	 T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120
[14:16:16] <wikibugs>	 (03PS1) 10Elukey: profile::docker::reporter: update exclude filter [puppet] - 10https://gerrit.wikimedia.org/r/1043082
[14:16:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply
[14:16:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[14:16:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:18:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64854 and previous config saved to /var/cache/conftool/dbconfig/20240613-141810-ladsgroup.json
[14:18:16] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[14:18:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:51] <wikibugs>	 (03CR) 10BBlack: [C:03+2] geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack)
[14:19:04] <wikibugs>	 (03PS5) 10BBlack: geo-maps: Add more FB ranges, differentiate eqiad [dns] - 10https://gerrit.wikimedia.org/r/1042490
[14:19:46] <wikibugs>	 (03Abandoned) 10BBlack: geodns: eqiad non-primary for all public users [dns] - 10https://gerrit.wikimedia.org/r/545385 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack)
[14:20:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P64855 and previous config saved to /var/cache/conftool/dbconfig/20240613-142024-marostegui.json
[14:20:45] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] rename gitlab-replica to gitlab-replica-a [dns] - 10https://gerrit.wikimedia.org/r/1042344 (owner: 10Dzahn)
[14:21:00] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn)
[14:21:09] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[14:21:24] <logmsgbot>	 !log cgoubert@deploy1002 Finished scap: Change mwapi listener to mw-api-int - T333120 (duration: 06m 47s)
[14:21:28] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9888856 (10Clement_Goubert)
[14:21:31] <stashbot>	 T333120: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120
[14:21:32] <wikibugs>	 (03PS1) 10Hashar: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762)
[14:23:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:23:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563)
[14:24:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563)
[14:24:05] <wikibugs>	 (03PS2) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563)
[14:24:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563)
[14:24:09] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[14:24:56] <claime>	 hmm looking at the memcached issue
[14:24:56] <wikibugs>	 (03PS1) 10Jelto: sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - 10https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341)
[14:25:13] <wikibugs>	 (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1042490 (owner: 10BBlack)
[14:27:04] <wikibugs>	 (03PS6) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342)
[14:27:15] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar)
[14:27:19] <bblack>	 !log authdns-update for https://gerrit.wikimedia.org/r/1042490 (remaps some Facebook ranges to codfw+eqiad)
[14:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:49] <cdanis>	 bblack: neat
[14:27:55] <wikibugs>	 (03Merged) 10jenkins-bot: Update to a snapshot of Gerrit 3.9.6 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1043084 (https://phabricator.wikimedia.org/T358762) (owner: 10Hashar)
[14:28:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[14:28:35] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:28:55] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[14:29:18] <wikibugs>	 (03CR) 10Elukey: Allow to only report images of supported Debian versions (033 comments) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: 10JMeybohm)
[14:30:23] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[14:30:44] <wikibugs>	 (03CR) 10Elukey: "I have zero context on this, it is difficult to review from the commit msg. Janis could you expand it a little to add more details?" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/643912 (owner: 10JMeybohm)
[14:32:11] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit2002 # T358762
[14:32:15] <stashbot>	 T358762: Gerrit commit message formatting does not handle angle-bracketed URLs well, adds extra semicolon - https://phabricator.wikimedia.org/T358762
[14:32:18] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit2002 # T358762 (duration: 00m 07s)
[14:33:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64856 and previous config saved to /var/cache/conftool/dbconfig/20240613-143318-ladsgroup.json
[14:33:56] <wikibugs>	 (03PS1) 10Clément Goubert: shellbox-constraints: bump to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043087
[14:34:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:35:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367261)', diff saved to https://phabricator.wikimedia.org/P64857 and previous config saved to /var/cache/conftool/dbconfig/20240613-143531-marostegui.json
[14:35:33] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Create the python-release repository - https://phabricator.wikimedia.org/T367410#9888942 (10elukey)
[14:35:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance
[14:35:36] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[14:35:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance
[14:35:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64858 and previous config saved to /var/cache/conftool/dbconfig/20240613-143554-marostegui.json
[14:37:43] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: update exclude filter [puppet] - 10https://gerrit.wikimedia.org/r/1043082 (owner: 10Elukey)
[14:38:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64859 and previous config saved to /var/cache/conftool/dbconfig/20240613-143859-marostegui.json
[14:40:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563)
[14:40:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563)
[14:40:47] <hashar>	 I am doing a quick upgrade of Gerrit again
[14:40:58] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit1003 # T358762
[14:41:03] <stashbot>	 T358762: Gerrit commit message formatting does not handle angle-bracketed URLs well, adds extra semicolon - https://phabricator.wikimedia.org/T358762
[14:41:03] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@89042ad]: Gerrit to snapshot version 3.9.5-22-g7380128525 on gerrit1003 # T358762 (duration: 00m 05s)
[14:41:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:41:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:43:57] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-constraints: bump to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043087 (owner: 10Clément Goubert)
[14:44:06] <hashar>	 gerrit upgraded
[14:44:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[14:44:25] <wikibugs>	 (03CR) 10Hashar: "recheck due to Gerrit restart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:44:29] <wikibugs>	 (03CR) 10Hashar: "recheck due to Gerrit restart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:44:29] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[14:44:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:44:34] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[14:44:39] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[14:45:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003
[14:46:06] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[14:46:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] puppetserver::git::private: Use wrapper from puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1037778 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[14:47:11] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003
[14:48:11] <wikibugs>	 (03CR) 10CDanis: [C:03+1] eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:48:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64860 and previous config saved to /var/cache/conftool/dbconfig/20240613-144825-ladsgroup.json
[14:48:36] <wikibugs>	 (03CR) 10CDanis: [C:03+1] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:13] <wikibugs>	 (03CR) 10CDanis: [C:03+1] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:15] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm
[14:49:19] <moritzm>	 !log rebalance ganeti/B in eqiad following reboots
[14:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:24] <wikibugs>	 (03CR) 10CDanis: [C:03+1] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:30] <wikibugs>	 (03CR) 10CDanis: [C:03+1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] wikifeeds: enable mesh tracing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:49:50] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[14:49:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:50:02] <wikibugs>	 (03PS1) 10Brouberol: datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603)
[14:50:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:50:34] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis)
[14:50:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 depool ahead of T365983', diff saved to https://phabricator.wikimedia.org/P64861 and previous config saved to /var/cache/conftool/dbconfig/20240613-145035-arnaudb.json
[14:50:40] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[14:50:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol)
[14:50:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:50:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1039.eqiad.wmnet with reason: T365983
[14:51:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:51:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563)
[14:51:09] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1039.eqiad.wmnet with reason: T365983
[14:52:43] <wikibugs>	 (03PS1) 10Dbrant: Look for iPadOS in user-agent, in addition to iOS. [extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723)
[14:53:23] <wikibugs>	 (03PS2) 10Brouberol: datahub-gms: enable prometheus scraping of metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603)
[14:53:37] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) (owner: 10CDanis)
[14:53:58] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:54:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64862 and previous config saved to /var/cache/conftool/dbconfig/20240613-145406-marostegui.json
[14:55:49] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:55:55] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:57:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003
[14:57:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003
[14:57:27] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:57:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:57:58] <wikibugs>	 (03PS2) 10NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673)
[14:58:46] <wikibugs>	 (03CR) 10Brouberol: "As it turns out, the mce/mae-consumer pods already expose JMX metrics." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043106 (https://phabricator.wikimedia.org/T366603) (owner: 10Brouberol)
[14:59:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:59:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003
[14:59:36] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:59:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003
[15:00:04] <jouncebot>	 brennen and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1500).
[15:00:55] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:01:01] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:01:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Cleanup puppetmaster preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1043114
[15:01:35] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:01:57] <wikibugs>	 (03PS11) 10EoghanGaffney: lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174
[15:03:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-f6-eqiad,lsw1-f6-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f6-eqiad
[15:03:30] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-f6-eqiad,lsw1-f6-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f6-eqiad
[15:03:32] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2925/co" [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney)
[15:03:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64863 and previous config saved to /var/cache/conftool/dbconfig/20240613-150332-ladsgroup.json
[15:03:36] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889146 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=891c00a3-b649-4659-b39f-5ad6b01367a9) set by cmooney...
[15:03:37] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:04:16] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:35:00 on an-worker[1169-1171].eqiad.wmnet,es1039.eqiad.wmnet,ms-be1080.eqiad.wmnet with reason: JunOS upgrade lsw1-f6-eqiad
[15:04:33] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on an-worker[1169-1171].eqiad.wmnet,es1039.eqiad.wmnet,ms-be1080.eqiad.wmnet with reason: JunOS upgrade lsw1-f6-eqiad
[15:04:46] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889149 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5a6a58c5-4681-4aea-8e80-e8ba2c613022) set by cmooney...
[15:04:47] <topranks>	 !log rebooting lsw1-f6-eqiad to upgrade JunOS on switch T365983
[15:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:51] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:05:58] <volans>	 !log upgrading spicerack on cumin1002 to v8.6.0
[15:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9889158 (10elukey) Hi @Jhancock.wm! I was able to tcpdump the DHCP traffic sent from the host's BMC to `install2004`, and sadly it doesn't set any valid Hostname. This i...
[15:07:25] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:07:37] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:07:43] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:07:48] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:07:57] <wikibugs>	 (03PS1) 10MVernon: apus: setup for codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621)
[15:07:58] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:08:07] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:08:52] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:09:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P64864 and previous config saved to /var/cache/conftool/dbconfig/20240613-150913-marostegui.json
[15:10:06] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] apus: setup for codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:11:04] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:15:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[15:15:36] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[15:16:26] <wikibugs>	 (03CR) 10MVernon: [C:03+2] install_server: new partitioning scheme for cephadm nodes [puppet] - 10https://gerrit.wikimedia.org/r/1043061 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:17:50] <wikibugs>	 (03PS1) 10JHathaway: postfix: misc postfix mx profile fixes [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406)
[15:18:23] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Cleanup puppetmaster preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1043114 (owner: 10Muehlenhoff)
[15:18:36] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[15:19:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64865 and previous config saved to /var/cache/conftool/dbconfig/20240613-151910-marostegui.json
[15:19:15] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:22:06] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889279 (10cmooney) Switch has reloaded on the new version, all looks good at first glance.  ` cmooney@lsw1-f6-eqiad> show inter...
[15:22:12] <elukey>	 !log drop eventgate-ci docker images from the Docker Registry
[15:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64866 and previous config saved to /var/cache/conftool/dbconfig/20240613-152300-arnaudb.json
[15:23:04] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:24:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367261)', diff saved to https://phabricator.wikimedia.org/P64867 and previous config saved to /var/cache/conftool/dbconfig/20240613-152420-marostegui.json
[15:24:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance
[15:24:25] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[15:24:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance
[15:25:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm
[15:26:25] <elukey>	 !log drop mediawiki-services-parsoid docker images from the Docker Registry - T367427
[15:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:29] <stashbot>	 T367427: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427
[15:27:08] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[15:27:14] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889307 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[15:27:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance
[15:27:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance
[15:27:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64868 and previous config saved to /var/cache/conftool/dbconfig/20240613-152748-marostegui.json
[15:28:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[15:28:11] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9889319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq...
[15:28:13] <Lucas_WMDE>	 !log STOPPED lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 --start '["55386869"]' 2>&1 | tee -a ~/T315510-enwiki-9; date # Ctrl+C – had slowed down, unnecessary work by this point; was at --start '["55914913"]'
[15:28:13] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - 10https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341) (owner: 10Jelto)
[15:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:29] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-in role [puppet] - 10https://gerrit.wikimedia.org/r/1043124 (https://phabricator.wikimedia.org/T325406)
[15:30:34] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174 (owner: 10EoghanGaffney)
[15:30:40] <wikibugs>	 (03PS1) 10JMeybohm: ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310)
[15:30:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64869 and previous config saved to /var/cache/conftool/dbconfig/20240613-153056-marostegui.json
[15:31:02] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[15:32:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[15:32:27] <wikibugs>	 (03PS1) 10Jforrester: Convert local function to arrow function to fix context [extensions/Echo] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366)
[15:33:18] <wikibugs>	 (03Merged) 10jenkins-bot: ratelimit: Increase CPU limit and set GOMAXPROCS everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043125 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm)
[15:34:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64870 and previous config saved to /var/cache/conftool/dbconfig/20240613-153417-marostegui.json
[15:34:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply
[15:34:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[15:35:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[15:35:15] <wikibugs>	 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9889387 (10Papaul) @kamila no problem we can move that one. Once done we will update the task.
[15:35:30] <wikibugs>	 (03PS2) 10JHathaway: postfix: misc postfix mx profile fixes [puppet] - 10https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406)
[15:36:09] <icinga-wm_>	 PROBLEM - Host registry2003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:11] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2002.codfw.wmnet with OS bookworm
[15:36:21] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm executed with errors: - moss-fe2002 (...
[15:36:57] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-in role [puppet] - 10https://gerrit.wikimedia.org/r/1043124 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[15:37:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm
[15:37:07] <icinga-wm_>	 PROBLEM - Host apt2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:17] <icinga-wm_>	 PROBLEM - Host cloudidm2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:19] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm
[15:37:23] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889405 (10MatthewVernon) Swift looks good, thanks.
[15:37:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[15:37:37] <icinga-wm_>	 PROBLEM - Host kubemaster2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[15:38:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64871 and previous config saved to /var/cache/conftool/dbconfig/20240613-153805-arnaudb.json
[15:38:10] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:38:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[15:38:21] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:22] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:38:31] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:37] <ChrisDobbins901_>	 !log cdobbins@cumin1002 sudo -i cookbook sre.cdn.roll-reboot --alias 'cp-upload_eqsin' --batchsize 1 --reason T366555 --task-id T366555 --grace-sleep 5400
[15:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:41] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:38:46] <logmsgbot>	 !log cdobbins@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin
[15:39:14] <Amir1>	 here
[15:39:33] <herron>	 !incidents
[15:39:34] <sirenbot>	 4746 (UNACKED)  [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw)
[15:39:34] <sirenbot>	 4745 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams)
[15:39:34] <sirenbot>	 4743 (RESOLVED)  [2x] ProbeDown sre (probes/custom eqiad)
[15:39:34] <sirenbot>	 4740 (RESOLVED)  [6x] ProbeDown sre (probes/service ulsfo)
[15:39:39] <jhathaway>	 here
[15:39:44] <kamila_>	 ^ we have slightly reduced kubemaster capacity in codfw (one of the new hw nodes is down)
[15:39:46] <kamila_>	 not sure if related
[15:39:56] <herron>	 !ack 4746
[15:39:57] <sirenbot>	 4746 (ACKED)  [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw)
[15:40:24] <claime>	 registry and cloudidm going down at the same time smells like ganeti
[15:40:36] <claime>	 iirc all of them are vms
[15:40:40] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889411 (Jdforrester-WMF) Looks like this is now done except for "some straggling traffic" for the api-gateway?  {F55289507}
[15:40:41] <kamila_>	 great, we have more reduced capacity! \o/
[15:41:02] <effie>	 I hope we are not moving the wrong server accidentally 
[15:41:11] <wikibugs>	 (PS1) EoghanGaffney: lists: Switch mailman_root for new hosts [puppet] - https://gerrit.wikimedia.org/r/1043127
[15:41:12] <hnowlan>	 don't see a spike in requests on k8s api in codfw 
[15:41:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubemaster2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:41:35] <kamila_>	 effie: that would seriously suck '^^ the move should be happening about now :D 
[15:41:42] <logmsgbot>	 !log cdobbins@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin
[15:42:00] <effie>	 is kube-ctrl up ?
[15:42:01] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage
[15:42:41] <wikibugs>	 (PS1) Muehlenhoff: Stop syncing swift rings on Puppet 5 frontends [puppet] - https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798)
[15:42:49] <kamila_>	 effie: wikikube-ctrl2003 is decommed
[15:43:01] <claime>	 effie: yes, on ctrl2002
[15:43:06] <kamila_>	 the other two should be up
[15:43:08] <claime>	 root@deploy1002:~# kubectl -n kube-system get leases.coordination.k8s.io
[15:43:10] <claime>	 NAME                                      HOLDER                                                                          AGE
[15:43:12] <claime>	 cert-manager-cainjector-leader-election   cert-manager-cainjector-79df7c6cc8-jb6rf_9db5da6a-27c6-45ff-8aec-b1c273c06c90   478d
[15:43:14] <claime>	 cert-manager-controller                   cert-manager-ff469f6b6-tt7t7-external-cert-manager-controller                   478d
[15:43:16] <claime>	 kube-controller-manager                   wikikube-ctrl2002_af95e93d-6681-462f-86da-75f450626107                          478d
[15:43:16] <wikibugs>	 (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2926/co" [puppet] - https://gerrit.wikimedia.org/r/1043127 (owner: EoghanGaffney)
[15:43:18] <claime>	 kube-scheduler                            wikikube-ctrl2002_be7d4a2d-3abf-41dd-9749-fa36d599d3a4                          478d
[15:43:35] <cdanis>	 uhh
[15:43:39] <cdanis>	 💙cdanis@ganeti2020.codfw.wmnet ~ 🕦☕ sudo gnt-instance list
[15:43:41] <cdanis>	 it's hanging
[15:43:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:54] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:43:54] <cdanis>	 oh, there it goes
[15:44:14] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "lgtm! uid 46919 - https://app.betterworks.com/app/#/profile/441803" [puppet] - https://gerrit.wikimedia.org/r/1042331 (https://phabricator.wikimedia.org/T367053) (owner: Herron)
[15:44:20] <cdanis>	 I think something is funky with the ganeti master in codfw?
[15:44:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: host reimage
[15:45:35] <jhathaway>	 cdanis: is that command usually quicker?
[15:45:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:46:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P64872 and previous config saved to /var/cache/conftool/dbconfig/20240613-154603-marostegui.json
[15:46:16] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[15:46:19] <cdanis>	 jhathaway: I thought so?  but I could be wrong
[15:46:24] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9889457 (Dzahn) Open→In progress
[15:46:44] <jhathaway>	 cdanis: roger, not sure myself
[15:46:57] <cdanis>	 it takes <1s on eqiad
[15:47:19] <cdanis>	 --> #-sre
[15:47:26] <herron>	 definitely feels slow
[15:49:24] <herron>	 ganeti2028 seems common between vms that are down according to icinga
[15:49:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64873 and previous config saved to /var/cache/conftool/dbconfig/20240613-154924-marostegui.json
[15:49:46] <jinxer-wm>	 FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[15:50:00] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-fe2002.codfw.wmnet with OS bookworm
[15:50:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm
[15:50:18] <herron>	 and some interesting drbd messages in dmesg on ganeti2028
[15:51:20] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[15:52:23] <elukey>	 !log drop mediawiki-services-restbase docker images from the Docker Registry - T367427
[15:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:28] <stashbot>	 T367427: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427
[15:53:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64874 and previous config saved to /var/cache/conftool/dbconfig/20240613-155310-arnaudb.json
[15:53:16] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:53:45] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:54:05] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[15:54:25] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889560 (Clement_Goubert) Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.
[15:54:51] <wikibugs>	 (PS1) Elukey: profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427)
[15:55:40] <jinxer-wm>	 RESOLVED: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2013:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:55:59] <wikibugs>	 (PS2) Ryan Kemper: wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[15:56:35] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[15:57:06] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9889580 (cmooney) Open→Resolved Thanks for checking things, all stable on our side I will close the task now.
[15:57:55] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9889584 (hnowlan) I believe the straggling traffic here is a misnomer/a graph misunderstanding - the API gateway refers to traffic to the mediawiki API as "mwapi_cluster"...
[15:58:11] <wikibugs>	 (CR) MVernon: "I'm in principle happy for this to go ahead, but I'm afraid I don't know enough about the puppetserver puppet code to feel confident givin" [puppet] - https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[15:58:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:18] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: misc postfix mx profile fixes [puppet] - https://gerrit.wikimedia.org/r/1043123 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[16:01:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P64875 and previous config saved to /var/cache/conftool/dbconfig/20240613-160110-marostegui.json
[16:02:39] <wikibugs>	 (CR) Elukey: Allow to only report images of supported Debian versions (1 comment) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/966200 (https://phabricator.wikimedia.org/T348876) (owner: JMeybohm)
[16:04:17] <icinga-wm_>	 PROBLEM - Host ganeti2028 is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:26] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:04:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T364069)', diff saved to https://phabricator.wikimedia.org/P64876 and previous config saved to /var/cache/conftool/dbconfig/20240613-160431-marostegui.json
[16:04:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance
[16:04:36] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[16:04:37] <wikibugs>	 (CR) Clément Goubert: [C:+1] profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427) (owner: Elukey)
[16:04:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance
[16:04:53] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:04:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64877 and previous config saved to /var/cache/conftool/dbconfig/20240613-160453-marostegui.json
[16:04:57] <wikibugs>	 (CR) Herron: [C:+2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1042331 (https://phabricator.wikimedia.org/T367053) (owner: Herron)
[16:05:27] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: remove wdqs2023 from the public cluster and enable the updaters [puppet] - https://gerrit.wikimedia.org/r/1042965 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[16:05:34] <wikibugs>	 SRE, Infrastructure-Foundations, netops: No IPv6 ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439 (cmooney) NEW p:Triage→High
[16:05:46] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti2028:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:06:52] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@ee5a291]: make public data from wdqs subgraph analysis readable by others
[16:07:15] <icinga-wm_>	 RECOVERY - Host registry2003 is UP: PING WARNING - Packet loss = 66%, RTA = 0.32 ms
[16:07:15] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@ee5a291]: make public data from wdqs subgraph analysis readable by others (duration: 00m 22s)
[16:07:35] <icinga-wm_>	 PROBLEM - Docker registry health on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[16:07:37] <icinga-wm_>	 PROBLEM - Docker registry HTTPS interface certificate expiry on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[16:08:05] <icinga-wm_>	 PROBLEM - SSH on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:08:05] <icinga-wm_>	 PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[16:08:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64878 and previous config saved to /var/cache/conftool/dbconfig/20240613-160816-arnaudb.json
[16:08:20] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[16:08:25] <wikibugs>	 (CR) Elukey: [C:+2] profile::docker::reporter: update k8s_rules.ini exclude list [puppet] - https://gerrit.wikimedia.org/r/1043131 (https://phabricator.wikimedia.org/T367427) (owner: Elukey)
[16:08:36] <cdanis>	 !log forcibly rebooted ganeti2028, drbd hung
[16:08:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[16:08:45] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:08:45] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:52] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:09:11] <logmsgbot>	 !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:09:16] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889670 (VRiley-WMF) Open→In progress Starting the Motherboard swap now.
[16:11:33] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[16:11:46] <cdanis>	 !log gnt-node failover -f ganeti2028.codfw.wmnet
[16:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:50] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[16:11:54] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[16:11:56] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9889673 (herron) In progress→Resolved a:herron Group membership has been provisioned, thanks!
[16:12:10] <wikibugs>	 (PS1) Pppery: Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177)
[16:12:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Translate] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: Pppery)
[16:12:34] <wikibugs>	 (PS1) Majavah: openstack: nova-fullstack: Use g4 flavor [puppet] - https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458)
[16:13:38] <icinga-wm_>	 PROBLEM - Host registry2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:14:08] <icinga-wm_>	 PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[16:15:21] <wikibugs>	 (CR) Abijeet Patro: [C:+1] Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: Pppery)
[16:15:46] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:16:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367261)', diff saved to https://phabricator.wikimedia.org/P64880 and previous config saved to /var/cache/conftool/dbconfig/20240613-161617-marostegui.json
[16:16:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance
[16:16:22] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[16:16:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance
[16:16:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64881 and previous config saved to /var/cache/conftool/dbconfig/20240613-161641-marostegui.json
[16:17:59] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033
[16:18:03] <stashbot>	 T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033
[16:18:13] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033
[16:18:21] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889731 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7d73e7a7-7fc0-4f4e-8b18-84ce78db6c6b) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r...
[16:18:22] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[16:18:27] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[16:18:45] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:18:46] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:18:56] <wikibugs>	 SRE, Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#9889726 (jijiki) (me too ubuntu-forum style reply)  This happened again on ganeti2028:  ` [Thu Jun 13 15:38:21 2024] INFO: task drbd_r_resource:1033579 blocked for more than 121...
[16:18:57] <brennen>	 jouncebot nowandnext
[16:18:58] <jouncebot>	 For the next 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1600)
[16:18:58] <jouncebot>	 In 0 hour(s) and 41 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700)
[16:18:58] <jouncebot>	 In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700)
[16:19:36] <brennen>	 James_F: shall i go ahead and sling out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1043126 ?
[16:19:43] <James_F>	 brennen: Sure!
[16:19:52] <James_F>	 Sorry, distracted by other things.
[16:20:02] <brennen>	 thanks for getting that in order!
[16:20:04] <icinga-wm_>	 RECOVERY - Host ganeti2028 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms
[16:20:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by brennen@deploy1002 using scap backport" [extensions/Echo] (wmf/1.43.0-wmf.9) - https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366) (owner: Jforrester)
[16:20:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64882 and previous config saved to /var/cache/conftool/dbconfig/20240613-162040-marostegui.json
[16:20:58] <icinga-wm_>	 RECOVERY - SSH on registry2003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:21:00] <icinga-wm_>	 RECOVERY - Host apt2002 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[16:21:00] <icinga-wm_>	 RECOVERY - Host registry2003 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms
[16:21:00] <icinga-wm_>	 RECOVERY - Host cloudidm2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms
[16:21:28] <icinga-wm_>	 RECOVERY - Host kubemaster2001 is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms
[16:21:32] <icinga-wm_>	 RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Docker
[16:21:36] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:21:38] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:21:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:22:58] <icinga-wm_>	 RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Docker
[16:23:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P64883 and previous config saved to /var/cache/conftool/dbconfig/20240613-162321-arnaudb.json
[16:23:22] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:26] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[16:23:32] <icinga-wm_>	 RECOVERY - Docker registry HTTPS interface certificate expiry on registry2003 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Fri 28 Jun 2024 08:55:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker
[16:23:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:23:41] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service kubemaster2001:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:45] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:23:50] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:10] <mutante>	 !log gitlab-replica.wikimedia.org - short downtime - renaming to gitlab-replica-a
[16:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:15] <wikibugs>	 (CR) Dzahn: [C:+2] rename gitlab-replica to gitlab-replica-a [dns] - https://gerrit.wikimedia.org/r/1042344 (owner: Dzahn)
[16:25:17] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442 (RKemper) NEW
[16:25:19] <wikibugs>	 (PS2) Dzahn: rename gitlab-replica to gitlab-replica-a [dns] - https://gerrit.wikimedia.org/r/1042344
[16:25:34] <wikibugs>	 (CR) Dzahn: "not a netbox change - these are just marked as manually managed there" [dns] - https://gerrit.wikimedia.org/r/1042344 (owner: Dzahn)
[16:25:42] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9889792 (RKemper)
[16:25:43] <wikibugs>	 (CR) Andrew Bogott: [C:+1] openstack: nova-fullstack: Use g4 flavor [puppet] - https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[16:26:33] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubemaster2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubemaster2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:27:49] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:28:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm
[16:29:09] <wikibugs>	 SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9889808 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm completed: - moss-fe2002 (**PASS**)...
[16:29:46] <jinxer-wm>	 FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[16:30:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9889810 (cmooney)
[16:30:23] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:30:46] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job docker-registry in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:31:51] <wikibugs>	 SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9889825 (elukey) @KartikMistry @santhosh Hi! Getting back to this task since it is getting attention from other pe...
[16:31:57] <wikibugs>	 (CR) Majavah: [C:+2] openstack: nova-fullstack: Use g4 flavor [puppet] - https://gerrit.wikimedia.org/r/1043142 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[16:34:26] <wikibugs>	 (PS33) Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[16:35:16] <wikibugs>	 (CR) Dzahn: [C:+2] "recheck" [dns] - https://gerrit.wikimedia.org/r/1042344 (owner: Dzahn)
[16:35:28] <wikibugs>	 (PS1) JMeybohm: Call the test with the image name including tag [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043155
[16:35:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64884 and previous config saved to /var/cache/conftool/dbconfig/20240613-163547-marostegui.json
[16:37:02] <wikibugs>	 (CR) JMeybohm: golang: Add version 1.22 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1042948 (owner: Klausman)
[16:37:33] <wikibugs>	 (PS2) Dzahn: gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - https://gerrit.wikimedia.org/r/1041767
[16:39:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:39:43] <wikibugs>	 (03PS3) 10Dzahn: gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767
[16:40:17] <wikibugs>	 (03Merged) 10jenkins-bot: Convert local function to arrow function to fix context [extensions/Echo] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043126 (https://phabricator.wikimedia.org/T367366) (owner: 10Jforrester)
[16:40:52] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]]
[16:40:56] <stashbot>	 T367366: Failed to fetch notifications: Notifications fail to load - https://phabricator.wikimedia.org/T367366
[16:41:09] <wikibugs>	 (03PS34) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[16:41:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002"
[16:42:16] <icinga-wm_>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 36 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:43:26] <wikibugs>	 (03CR) 10Tacsipacsi: "Thanks for backporting this!" [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery)
[16:43:29] <logmsgbot>	 !log brennen@deploy1002 jforrester, brennen: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:43:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS info - pt1979@cumin2002"
[16:43:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:45:31] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:46:17] <wikibugs>	 (03PS3) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162)
[16:46:24] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gitlab: rename gitlab-replica to gitlab-replica-a [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn)
[16:47:16] <icinga-wm_>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 10 probes of 789 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:48:26] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9889928 (10VRiley-WMF) 05In progress→03Open Motherboard has been swapped, returning ticket into open status.
[16:48:46] <logmsgbot>	 !log brennen@deploy1002 jforrester, brennen: Continuing with sync
[16:49:05] <wikibugs>	 (03PS1) 10Andrew Bogott: nova policy: temporarily disable VM resizing [puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458)
[16:49:19] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:50:14] <wikibugs>	 (03CR) 10Majavah: [C:03+1] nova policy: temporarily disable VM resizing [puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458) (owner: 10Andrew Bogott)
[16:50:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P64885 and previous config saved to /var/cache/conftool/dbconfig/20240613-165055-marostegui.json
[16:51:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[16:52:01] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:52:44] <wikibugs>	 (03PS2) 10JMeybohm: Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155
[16:53:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] nova policy: temporarily disable VM resizing [puppet] - 10https://gerrit.wikimedia.org/r/1043161 (https://phabricator.wikimedia.org/T364458) (owner: 10Andrew Bogott)
[16:53:31] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "gitlab-exporter service was temp disabled, DNS changed, config changed,, then reactivated. service is running again" [puppet] - 10https://gerrit.wikimedia.org/r/1041767 (owner: 10Dzahn)
[16:53:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:54:43] <wikibugs>	 (03CR) 10Klausman: [C:03+1] Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 (owner: 10JMeybohm)
[16:55:48] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 using stat1009.eqiad.wmnet)
[16:56:28] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:57:43] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:1043126|Convert local function to arrow function to fix context (T367366)]] (duration: 16m 51s)
[16:57:47] <stashbot>	 T367366: Failed to fetch notifications: Notifications fail to load - https://phabricator.wikimedia.org/T367366
[16:58:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Stop syncing swift rings on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1043128 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[16:59:01] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "nova policy: temporarily disable VM resizing" [puppet] - 10https://gerrit.wikimedia.org/r/1043163
[17:00:04] <jouncebot>	 bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1700)
[17:00:46] <bd808>	 nothing to do for my deploy window today.
[17:01:10] <wikibugs>	 (03CR) 10Majavah: [C:04-2] "not yet." [puppet] - 10https://gerrit.wikimedia.org/r/1043163 (owner: 10Andrew Bogott)
[17:01:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 (owner: 10Dzahn)
[17:01:30] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:02:04] <wikibugs>	 (03PS4) 10Dzahn: acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750
[17:03:14] <wikibugs>	 (03PS1) 10MVernon: installer/cephadm: specify a very large maximum size [puppet] - 10https://gerrit.wikimedia.org/r/1043165 (https://phabricator.wikimedia.org/T279621)
[17:06:01] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] acme_chief/idp/gitlab: remove "old" and "new" service names [puppet] - 10https://gerrit.wikimedia.org/r/1041750 (owner: 10Dzahn)
[17:06:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367261)', diff saved to https://phabricator.wikimedia.org/P64886 and previous config saved to /var/cache/conftool/dbconfig/20240613-170602-marostegui.json
[17:06:07] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[17:13:06] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn)
[17:13:09] <wikibugs>	 (03PS4) 10Dzahn: move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260
[17:19:18] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ using stat1009.eqiad.wmnet)
[17:24:50] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043172
[17:25:03] <wikibugs>	 (03PS4) 10Btullis: [WIP] Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472)
[17:25:52] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028)
[17:25:56] <wikibugs>	 (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn)
[17:26:13] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028)
[17:26:33] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott)
[17:33:50] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet
[17:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:39:14] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[17:39:21] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[17:41:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9890137 (10VRiley-WMF) @Clement_Goubert I believe it would be better to open a new task for any servers that need to be relabeled.
[17:41:23] <wikibugs>	 (03PS1) 10Dzahn: idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174
[17:42:06] <wikibugs>	 (03CR) 10Dzahn: "< taavi> did someone forget to cleanup the CAS config after gitlab moved from the cas protocol to OIDC?" [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn)
[17:42:35] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn)
[17:44:24] <wikibugs>	 (03PS10) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[17:44:27] <wikibugs>	 (03PS2) 10Dzahn: idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174
[17:45:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on mw2337:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:46:10] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl1003 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:47:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9890151 (10VRiley-WMF) @RKemper When is there a preference on when we could schedule this?
[17:47:37] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] idp: update renamed gitlab-replica OIDC service IDs [puppet] - 10https://gerrit.wikimedia.org/r/1043174 (owner: 10Dzahn)
[17:47:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9890165 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF
[17:48:41] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wikikube-ctrl1003:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:51:15] <jhathaway>	 o/
[17:52:34] <jhathaway>	 expired downtime?
[17:53:30] <herron>	 !incidents
[17:53:30] <sirenbot>	 4747 (UNACKED)  [2x] ProbeDown sre (wikikube-ctrl1003:6443 probes/custom eqiad)
[17:53:30] <sirenbot>	 4746 (RESOLVED)  [2x] ProbeDown sre (kubemaster2001:6443 probes/custom codfw)
[17:53:31] <sirenbot>	 4745 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams)
[17:53:31] <sirenbot>	 4743 (RESOLVED)  [2x] ProbeDown sre (probes/custom eqiad)
[17:53:37] <herron>	 not sure
[17:53:42] <herron>	 !ack 4747
[17:53:43] <sirenbot>	 4747 (ACKED)  [2x] ProbeDown sre (wikikube-ctrl1003:6443 probes/custom eqiad)
[17:54:38] <jhathaway>	 I see kamila_ was reimaging it earlier today
[17:54:52] <jhathaway>	 and it never had a positive host health, in the graphs
[17:56:48] <wikibugs>	 (03PS1) 10Dzahn: idp: drop gitlab-new.wikimedia.org service ID [puppet] - 10https://gerrit.wikimedia.org/r/1043181
[17:57:25] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye
[17:57:34] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4...
[17:57:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[17:58:03] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[18:00:04] <jouncebot>	 brennen and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T1800).
[18:00:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "no IP change, only comments and moving the entry around" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn)
[18:01:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "to avoid that someone uses this IP another time because the comment looks like it's free" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn)
[18:02:36] <wikibugs>	 (03CR) 10Dzahn: "should we just drop this?  But it seems we still need _some_ name for "the other machine that is not a replica"." [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn)
[18:04:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64887 and previous config saved to /var/cache/conftool/dbconfig/20240613-180404-ladsgroup.json
[18:04:09] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:05:15] <herron>	 jhathaway: yeah I think you are right, I had a look around the hots too and am inclined to leave it as-is
[18:05:25] <herron>	 host*
[18:06:12] <jhathaway>	 well the host is reachable now, and has a 2+ hour uptime
[18:06:34] <wikibugs>	 (03CR) 10Dzahn: "certainly not "gitlab-a" and "gitlab-b" even though that would match the replicas now. but once gitlab2003.wikimedia.org is setup it will " [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn)
[18:06:39] <jhathaway>	 but i see this in dmesg,
[18:06:41] <jhathaway>	 [Thu Jun 13 15:39:39 2024] bnxt_en 0000:3b:00.1 enp59s0f1np1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[18:07:03] <herron>	 puppet has some issues and a few services are broken too, looks like it's not finished with setup?  not sure
[18:07:40] <jhathaway>	 nod
[18:08:08] <brennen>	 o/
[18:09:26] <wikibugs>	 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9890246 (10KartikMistry) @elukey Yes. We can move to Swift. Is there any documentation for services using a similar...
[18:10:31] <mutante>	 was it reimaged while the puppet role was applied? probably the usual problem that it won't work on first run, only on second run
[18:10:48] <mutante>	 but then the reimage cookbook won't work unless the prod role is temp removed
[18:10:50] <dduvall>	 brennen: o/
[18:10:54] <brennen>	 good for train deploy here?
[18:12:58] <jhathaway>	 herron: I think it is fine to leave as is, but I'll ask in serviceops, in case someone is around
[18:13:36] <herron>	 jhathaway: thanks sgtm
[18:15:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on mw2337:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:17] <brennen>	 !log 1.43.0-wmf.9 train (T361403): no current blockers, rolling to group2
[18:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:25] <stashbot>	 T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403
[18:16:48] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403)
[18:16:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot)
[18:17:30] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043184 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot)
[18:17:42] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye
[18:17:46] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890275 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: - cp4...
[18:18:08] <brett>	 Forgot to merge the hiera config :P
[18:19:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64888 and previous config saved to /var/cache/conftool/dbconfig/20240613-181911-ladsgroup.json
[18:19:22] <wikibugs>	 (03PS1) 10BCornwall: Set cp4038 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043185 (https://phabricator.wikimedia.org/T364891)
[18:26:49] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[18:26:50] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[18:27:15] <wikibugs>	 (03PS11) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[18:28:58] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[18:28:59] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[18:29:32] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.9  refs T361403
[18:29:36] <stashbot>	 T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403
[18:34:01] <icinga-wm_>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4038 is CRITICAL: connect to address 10.128.0.27 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:34:15] <dduvall>	 brennen: you seeing the "fwrite(): write of 199 bytes failed with errno=32 Broken pipe" errors?
[18:34:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P64889 and previous config saved to /var/cache/conftool/dbconfig/20240613-183417-ladsgroup.json
[18:34:28] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] Set cp4038 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043185 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[18:35:28] <dduvall>	 oh sorry that's wmf.8. dumps related?
[18:36:24] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[18:36:35] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890351 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[18:37:02] <brennen>	 dduvall: i believe so, yeah
[18:37:54] <dduvall>	 k. so much logspam this week
[18:38:02] <brennen>	 yeah, it's not quiet.
[18:39:39] <icinga-wm_>	 RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[18:45:24] <wikibugs>	 (03PS35) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:45:46] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:47:04] <icinga-wm_>	 ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 12, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367457 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:47:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T367457 (10ops-monitoring-bot) 03NEW
[18:49:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[18:49:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T352010)', diff saved to https://phabricator.wikimedia.org/P64890 and previous config saved to /var/cache/conftool/dbconfig/20240613-184924-ladsgroup.json
[18:49:27] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:49:30] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:49:30] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:50:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] openstack::clientpackages::vms::bobcat::bullseye: install 'zed' packages [puppet] - 10https://gerrit.wikimedia.org/r/1043173 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott)
[19:05:12] <kamila_>	 jhathaway: I'm back, will look at the wikikube-ctrl1003 thing, sorry about that 
[19:05:53] <jhathaway>	 kamila_: no problem at all
[19:06:02] <jhathaway>	 was it supposed to be up?
[19:06:53] <kamila_>	 what's your definition of supposed? :D 
[19:07:04] <kamila_>	 I was hoping it would be, but apparently the reimage failed
[19:07:27] <kamila_>	 (I was afk for a while) 
[19:08:13] <wikibugs>	 (03PS1) 10Scott French: Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851)
[19:08:13] <kamila_>	 so now it's not expected to be up, I'll plop a downtime on it 
[19:08:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[19:08:50] <icinga-wm_>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:08:51] <jhathaway>	 kamila_: ah that makes sense, thanks
[19:09:52] <icinga-wm_>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:10:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: reimage failing
[19:10:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: etcd.service on wikikube-ctrl1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:10:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-ctrl1003.eqiad.wmnet with reason: reimage failing
[19:10:36] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890462 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ffb1c0b-d404-4615-accd-65085d64f738) set by kamila@c...
[19:13:50] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890464 (10CDanis) Hi all.  @joanna_borun asked me to do some looking into this.  I promise I skimmed the above, but...
[19:20:28] <wikibugs>	 (03PS2) 10Scott French: Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851)
[19:22:26] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs: Upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1042234 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans)
[19:22:58] <icinga-wm_>	 PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100%
[19:23:31] <wikibugs>	 (03PS1) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772)
[19:24:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Pass --allow-releaseinfo-change when adding new openstack client apt repos [puppet] - 10https://gerrit.wikimedia.org/r/1043199 (https://phabricator.wikimedia.org/T366028)
[19:27:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye
[19:27:18] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9890504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad....
[19:27:21] <kamila_>	 🎉
[19:27:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1013.eqiad.wmnet
[19:27:30] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1013.eqiad.wmnet
[19:28:32] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002
[19:28:37] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[19:34:15] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890544 (10dcaro) >>! In T348643#9890463, @CDanis wrote: > Hi all.  @joanna_borun asked me to do some looking into t...
[19:38:30] <wikibugs>	 (03PS1) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028)
[19:38:40] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-in hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406)
[19:38:41] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9890547 (10CDanis) Very helpful, thanks @dcaro and enjoy the pto!  I'll be gentle, and definitely won't do any write...
[19:38:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott)
[19:39:07] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:39:42] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:40:29] <wikibugs>	 (03PS2) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028)
[19:40:34] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:40:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott)
[19:41:11] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-in hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043204 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:41:49] <foks>	 !log removing 2 files for legal compliance
[19:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:20] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Pass --allow-releaseinfo-change when adding new openstack client apt repos [puppet] - 10https://gerrit.wikimedia.org/r/1043199 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott)
[19:43:02] <wikibugs>	 (03PS3) 10Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028)
[19:46:25] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041)
[19:47:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[19:47:35] <wikibugs>	 (03PS5) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969)
[19:51:32] <logmsgbot>	 !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS bullseye
[19:51:36] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors: -...
[19:51:42] <foks>	 !log removing 2 files for legal compliance
[19:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:51] <wikibugs>	 (03PS6) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969)
[19:53:15] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS bullseye
[19:53:21] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye
[19:57:18] <wikibugs>	 (03PS7) 10Kgraessle: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969)
[19:58:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl1003.eqiad.wmnet
[19:58:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl1003.eqiad.wmnet
[19:59:09] <foks>	 !log removing 2 files for legal compliance
[19:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T2000)
[20:00:05] <jouncebot>	 dbrant and pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <Pppery>	 here
[20:00:09] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+1] "looks good to me; we've tested this locally and on beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle)
[20:00:12] <dbrant>	 present
[20:00:23] <logmsgbot>	 !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl1003.eqiad.wmnet
[20:00:24] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-in{1001,2001} change role to postfix::mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406)
[20:00:38] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:01:49] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:02:24] <JSherman>	 have a backport that I just realized I didn't put on the calendar
[20:05:14] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-in{1001,2001} change role to postfix::mx_in [puppet] - 10https://gerrit.wikimedia.org/r/1043214 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:06:29] <JSherman>	 I stuck it in there in hopes of tagging along at the end: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1041699
[20:10:49] <JSherman>	 do we have a deployer on hand? I could deploy if needed
[20:13:24] <foks>	 !log removing 1 file for legal compliance
[20:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:52] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[20:15:07] <dbrant>	 looks like you've volunteered
[20:15:36] <JSherman>	 dbrant: getting everything set up. both of these backports look straightforward
[20:17:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723) (owner: 10Dbrant)
[20:17:58] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage
[20:18:51] <Pppery>	 Now Jenkins will take ~20 minutes to approve the patch. You could manually +2 my patch as well so the two 20-minute delays run in parallel rather than series.
[20:19:14] <JSherman>	 Pppery: ack
[20:20:09] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+2] "looks good to me; giving zuul a head start in the deployment window" [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery)
[20:20:40] <Pppery>	 thanks
[20:26:31] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-in{1001,2001} fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406)
[20:26:52] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:28:17] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-in{1001,2001} fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1043222 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:29:46] <jinxer-wm>	 FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50%   - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[20:31:27] <wikibugs>	 (03PS2) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041)
[20:32:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[20:32:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890729 (10cmooney) It seems this was an inadvertent result of the upgrade to the codfw row A/B switches, and the move there from a purely L2 switching layer to a rout...
[20:34:27] <wikibugs>	 (03PS1) 10JHathaway: mx-in: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406)
[20:34:40] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:35:47] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:37:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64891 and previous config saved to /var/cache/conftool/dbconfig/20240613-203708-marostegui.json
[20:37:13] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[20:38:45] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:30] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466 (10CDanis) 03NEW
[20:40:55] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] mx-in: acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/1043224 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[20:42:58] <wikibugs>	 (03Merged) 10jenkins-bot: Look for iPadOS in user-agent, in addition to iOS. [extensions/MobileApp] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043110 (https://phabricator.wikimedia.org/T362723) (owner: 10Dbrant)
[20:43:00] <wikibugs>	 (03Merged) 10jenkins-bot: Fix logging bugs in unfuzzy handling [extensions/Translate] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043141 (https://phabricator.wikimedia.org/T49177) (owner: 10Pppery)
[20:43:23] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890786 (10CDanis)
[20:44:03] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS bullseye
[20:44:07] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9890788 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed: - cp4038 (**P...
[20:44:13] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9890789 (10CDanis)
[20:45:58] <JSherman>	 Pppery: it looks like your backport comes with some other unexpected commits due to submodules
[20:46:24] <Pppery>	 Sorry, I have no idea what that means
[20:48:02] <JSherman>	 basically, it looks like it has a submodule update from master instead of the release branch
[20:48:46] <wikibugs>	 (03PS3) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041)
[20:49:04] <Pppery>	 I just used Gerrit's cherry-pick option in the UI. I had no idea that the translate extension even had submodules
[20:49:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[20:50:21] <JSherman>	 Subproject commit b085c3259dd6e36c16a8149767ba841b5d597d9a
[20:50:32] <logmsgbot>	 !log cdobbins@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4038.eqsin.wmnet
[20:51:00] <wikibugs>	 (03PS4) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041)
[20:51:12] <Pppery>	 That doesn't make sense. b085c3259dd6e36c16a8149767ba841b5d597d9a is the hash of my commit
[20:51:30] <JSherman>	 https://phabricator.wikimedia.org/rMWfe91de424bd1f20936fd48f2bc3e7321e65f46a7
[20:52:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P64892 and previous config saved to /var/cache/conftool/dbconfig/20240613-205215-marostegui.json
[20:52:19] <Pppery>	 That commit updates the pointer for translate in the branch of the mediawiki/core repo from the version before my commit to the version after my commit. That looks right
[20:52:48] <Pppery>	 And that's what should be deployed, right?
[20:52:49] <JSherman>	 now that looks like dbrant:
[20:53:02] <JSherman>	 https://phabricator.wikimedia.org/rMWa436c8f2782830b36c2244546f219a9cc964dd15
[20:53:30] <Pppery>	 Yep, that's the same submodule update for dbrant's MobileApp commit. I'm not seeing the problem here
[20:53:48] <dbrant>	 I also just used the cherry-pick feature in gerrit.
[20:54:15] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:54:51] <JSherman>	 I wonder if the deployments got put together because of gate-and-submit finishing simultaneously
[20:55:47] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:55:47] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002
[20:55:47] <JSherman>	 scap is giving me a warning
[20:55:48] <JSherman>	 `20:44:37 There were unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.43.0-wmf.9`
[20:55:54] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[20:56:49] <JSherman>	 I think if I had had scap deploy both changes together, this may have been the expected result without the warning
[20:57:00] <JSherman>	 but I'm not super confident about moving forward
[20:59:36] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:59:51] <JSherman>	 cjming: could you advise?
[21:00:20] <RoanKattouw>	 So it sounds like commits have been merged in the wmf.9 branch of other extensions
[21:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:38] <JSherman>	 yes, I kicked off a +2 on another patch to be backported
[21:00:43] <RoanKattouw>	 Those are also going to be deployed together with the change that you were planning to deploy, which might not have been what you expected
[21:00:46] <JSherman>	 it finished in the middle of the first patch
[21:00:58] <RoanKattouw>	 It's OK to do that as long as you know that it's happening and you have the person who requested the deploy test it etc
[21:01:04] <JSherman>	 which seems like the right outcome
[21:01:06] <cjming>	 hi hi - yes i've encountered that before - not sure if it's the right decision but i've plowed ahead
[21:01:07] <wikibugs>	 (03PS1) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439)
[21:01:11] <JSherman>	 good good
[21:01:24] <JSherman>	 thanks guys!
[21:01:25] <cjming>	 what Roan said
[21:01:29] <RoanKattouw>	 Yeah if the change that the scap tool is complaining about is one you know about and are comfortable deploying, then move forward
[21:01:43] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. (T362723)]]
[21:01:48] <stashbot>	 T362723: Data Validation for iOS Image Recs - https://phabricator.wikimedia.org/T362723
[21:02:01] <JSherman>	 I always have a freeze response when I see a warning about submodules
[21:02:07] <cjming>	 same!
[21:02:13] <RoanKattouw>	 The tool has this feature to warn you in the scenario where someone +2s a patch (most commonly in mw-config) and never deploys it, then you "scap deploy" another patch, and end up deploying a completely unrelated change along with yours
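The check RoanKattouw describes can be pictured as a set difference between what was pulled and what the deployer asked for. The following is a minimal sketch of that idea only, not scap's actual implementation; the function name `unexpected_commits` is hypothetical, and the short hashes are abbreviations of the two submodule-update commits linked in this log (fe91de4 for Translate, a436c8f for MobileApp).

```python
def unexpected_commits(pulled, expected):
    """Return pulled commits the deployer did not explicitly ask for, in pull order.

    Hypothetical helper for illustration; scap's real check may differ.
    """
    expected_set = set(expected)
    return [commit for commit in pulled if commit not in expected_set]


# The deployer asked only for the MobileApp backport (a436c8f), but the
# Translate submodule update (fe91de4) had also landed in the branch.
pulled = ["fe91de4", "a436c8f"]
expected = ["a436c8f"]

extra = unexpected_commits(pulled, expected)
if extra:
    print("unexpected commits pulled from origin:", ", ".join(extra))
```

In this scenario the extra commit is one the deployer already knows about, which is the case where it is fine to continue.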
[21:02:20] <wikibugs>	 (03CR) 10Scott French: "Thanks, Tobias! That's a good point about the routing." [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[21:02:31] <JSherman>	 but it looks like the extensions are submodules in the deployment repo, which makes sense
[21:02:38] <RoanKattouw>	 That's exactly right
[21:02:40] <wikibugs>	 (03CR) 10Scott French: [C:03+2] kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[21:03:26] <RoanKattouw>	 And there's a magic feature in Gerrit where, when a patch is merged in the wmf.9 branch of e.g. the Translate extension, a commit is automatically created and merged in the wmf.9 branch of core updating the submodule for Translate. https://phabricator.wikimedia.org/rMWfe91de424bd1f20936fd48f2bc3e7321e65f46a7 is one of those automagic update commits
[21:03:34] <topranks>	 !log changing BGP aggregate contribution policy / external route announcement cr2-eqord (T367439)
[21:03:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:39] <stashbot>	 T367439: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439
[21:03:51] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[21:04:01] <RoanKattouw>	 That way the submodules in the deployment branch of core always stay in sync with the deployment branches of the extensions
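The automatic pointer update described above can be modelled with a toy data structure. This is only an illustrative sketch, assuming a simplified view in which the core deployment branch stores one pinned commit per extension; the class `Superproject`, the method `on_submodule_merge`, and the placeholder values such as `old-translate` are hypothetical and are not Gerrit's API. The hash b085c32 abbreviates the Translate commit mentioned in this log.

```python
from dataclasses import dataclass, field


@dataclass
class Superproject:
    """Toy stand-in for the deployment branch of core (hypothetical model)."""

    pointers: dict = field(default_factory=dict)  # extension name -> pinned commit
    history: list = field(default_factory=list)   # auto-generated update commits

    def on_submodule_merge(self, name, new_commit):
        """Record an automatic commit moving the pointer for one extension."""
        old_commit = self.pointers.get(name)
        self.pointers[name] = new_commit
        self.history.append(f"Update {name}: {old_commit} -> {new_commit}")


# Placeholder starting pointers; the real values are not given in the log.
core = Superproject(pointers={"Translate": "old-translate", "MobileApp": "old-mobileapp"})

# Two backports merge in their extension branches during the same window.
core.on_submodule_merge("MobileApp", "new-mobileapp")
core.on_submodule_merge("Translate", "b085c32")

for entry in core.history:
    print(entry)
```

Each merge in an extension branch adds one entry to the core history, which is why two backports approved in parallel show up as two separate update commits when the deployer pulls.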
[21:04:04] <topranks>	 !log changing BGP aggregate contribution policy / external route announcement cr2-eqdfw (T367439)
[21:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:09] <logmsgbot>	 !log jsn@deploy1002 dbrant, jsn: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. (T362723)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:04:15] <dbrant>	 TIL
[21:04:16] <JSherman>	 sorry for the delay dbrant: and Pppery: please test
[21:04:19] <Pppery>	 On it
[21:04:58] <Pppery>	 First of two things my patch did looks good. Still testing the second
[21:05:05] <dbrant>	 mine looks good!
[21:05:34] <JSherman>	 RoanKattouw: That makes sense and explains why we do reverts when we do
[21:06:07] <wikibugs>	 (03PS2) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439)
[21:06:11] <RoanKattouw>	 How does it explain reverts?
[21:06:15] <JSherman>	 dbrant: thanks! These are rolling together, so we'll wait to hear from Pppery:
[21:06:56] <Pppery>	 Second of two things looks good as well. Proceed
[21:07:02] <logmsgbot>	 !log jsn@deploy1002 dbrant, jsn: Continuing with sync
[21:07:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P64893 and previous config saved to /var/cache/conftool/dbconfig/20240613-210723-marostegui.json
[21:08:41] <JSherman>	 If you deploy an extension change and it tests bad, you can't just stop the sync, you have to revert the change too. I know this is super obvious when I think about it, but scap abstracts things quite a bit.
[21:09:09] <RoanKattouw>	 Oh right yes because it's already merged in the deployment branch
[21:09:21] <RoanKattouw>	 So even if you didn't sync it and just left it there, it would be a nasty surprise for the next deployer
[21:09:33] <JSherman>	 ^
[21:13:45] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:15:27] <JSherman>	 jouncebot next
[21:15:27] <jouncebot>	 In 8 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0600)
[21:15:34] <JSherman>	 jouncebot now
[21:15:34] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 44 minute(s)
[21:15:55] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1043110|Look for iPadOS in user-agent, in addition to iOS. (T362723)]] (duration: 14m 11s)
[21:15:59] <stashbot>	 T362723: Data Validation for iOS Image Recs - https://phabricator.wikimedia.org/T362723
[21:16:23] <JSherman>	 Pppery: & dbrant: y'all should be good; I'm going to pull in our config change too, since there's nothing else happening
[21:16:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle)
[21:17:14] <dbrant>	 thx!
[21:17:31] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041699 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle)
[21:17:49] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]]
[21:17:53] <stashbot>	 T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969
[21:18:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890880 (10cmooney) I've pushed this change to cr2-eqdfw and it seems to be doing what we need there:  Codfw /48 is announced to Facebook: ` cmoo...
[21:18:16] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:18:42] <Pppery>	 I'm sorry. I've attended four backport windows and every time something went uniquely wrong
[21:19:44] <JSherman>	 Pppery: no worries! This was just my inexperience with scap. Nothing went wrong here.
[21:19:55] <Pppery>	 Thanks
[21:20:16] <logmsgbot>	 !log jsn@deploy1002 jsn, kgraessle: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:20:44] <JSherman>	 testing
[21:22:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T364069)', diff saved to https://phabricator.wikimedia.org/P64894 and previous config saved to /var/cache/conftool/dbconfig/20240613-212230-marostegui.json
[21:22:36] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:23:23] <logmsgbot>	 !log jsn@deploy1002 jsn, kgraessle: Continuing with sync
[21:23:37] <JSherman>	 looks good, surveys live on all 4 wikis
[21:23:58] <JSherman>	 (on the debug host)
[21:25:47] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:28:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890909 (10cmooney) I'm monitoring the change in traffic levels.  Right now it seems negligible, however that is not much surprise, prior to the...
[21:29:21] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406)
[21:29:34] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[21:30:32] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581)
[21:32:07] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1041699|Deploy QuickSurvey for Automoderator patroller workstream survey (T362969)]] (duration: 14m 18s)
[21:32:08] <wikibugs>	 (03PS2) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581)
[21:32:12] <stashbot>	 T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969
[21:33:38] <wikibugs>	 (03PS3) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581)
[21:33:53] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002
[21:33:56] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[21:34:23] <wikibugs>	 (03PS2) 10JHathaway: postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406)
[21:34:33] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[21:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:35:24] <wikibugs>	 (03PS3) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439)
[21:38:15] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-in add missing next hop [puppet] - 10https://gerrit.wikimedia.org/r/1043245 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[21:39:28] <wikibugs>	 (03PS1) 10Dzahn: idp: remove gitlab from the CAS protocol section [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390)
[21:42:15] <wikibugs>	 (03PS2) 10Dzahn: idp: remove gitlab from the CAS protocol section [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390)
[21:42:39] <wikibugs>	 (03PS4) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439)
[21:44:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9890956 (10cmooney) Just to note that for the same time period (since March 5th) we've not been announcing the codfw aggregates from eqord: ` cmo...
[21:56:38] <wikibugs>	 (03PS13) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772)
[21:59:36] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:00:00] <wikibugs>	 (03PS1) 10JHathaway: mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T189655)
[22:00:22] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T189655) (owner: 10JHathaway)
[22:00:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:04:58] <wikibugs>	 (03PS2) 10JHathaway: mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406)
[22:05:28] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[22:07:16] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:07:48] <wikibugs>	 (03PS1) 10JHathaway: vrts_aliases: use keyword params [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406)
[22:08:10] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:10:18] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:12:33] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] vrts_aliases: use keyword params [puppet] - 10https://gerrit.wikimedia.org/r/1043261 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:12:45] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] mariadb::ferm_misc add mx-in{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1043252 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:18:18] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 58.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:25:00] <wikibugs>	 (03PS1) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043275 (https://phabricator.wikimedia.org/T363001)
[22:30:47] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:33:45] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:28] <wikibugs>	 (03PS1) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001)
[22:35:47] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:34] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[22:37:41] <wikibugs>	 (03PS12) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[22:39:19] <wikibugs>	 (03PS5) 10Zabe: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041)
[22:39:33] <zabe>	 jouncebot: nowandnext
[22:39:33] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 20 minute(s)
[22:39:33] <jouncebot>	 In 7 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0600)
[22:40:17] <wikibugs>	 (03PS2) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001)
[22:40:43] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:42:35] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[22:43:15] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043210 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[22:46:38] <wikibugs>	 (03PS3) 10Bking: dse-k8s: harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001)
[22:46:54] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:46:54] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:46:57] <jinxer-wm>	 FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:47:04] <icinga-wm_>	 PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP
[22:47:14] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:47:54] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:47:56] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:47:56] <icinga-wm_>	 RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset 0.000438846 secs https://wikitech.wikimedia.org/wiki/NTP
[22:47:57] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:49:02] <zabe>	 !log create plwiki sysop wiki # T361041
[22:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:08] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[22:49:14] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Switch mailman_root for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043127
[22:50:46] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2927/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney)
[22:51:57] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:52:57] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[22:56:56] <wikibugs>	 (03PS1) 10Zabe: Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041)
[22:57:49] <wikibugs>	 (03PS2) 10Zabe: Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041)
[22:57:55] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[22:58:36] <wikibugs>	 (03Merged) 10jenkins-bot: Fully disable local uploads on sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043289 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[22:59:19] <logmsgbot>	 !log zabe@deploy1002 Started scap: T361041
[22:59:23] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[23:01:53] <logmsgbot>	 !log zabe@deploy1002 zabe: T361041 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:02:39] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002
[23:02:43] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[23:06:19] <logmsgbot>	 !log zabe@deploy1002 Sync cancelled.
[23:07:16] <wikibugs>	 (03PS1) 10Zabe: multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041)
[23:07:45] <wikibugs>	 (03CR) 10Zabe: [C:03+2] multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[23:08:24] <wikibugs>	 (03Merged) 10jenkins-bot: multiversion: Fix sysop_plwiki mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043298 (https://phabricator.wikimedia.org/T361041) (owner: 10Zabe)
[23:08:52] <logmsgbot>	 !log zabe@deploy1002 Started scap: T361041
[23:08:57] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[23:10:01] <wikibugs>	 (03PS1) 10Bking: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616)
[23:11:40] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking)
[23:13:13] <wikibugs>	 (03PS2) 10Bking: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616)
[23:13:35] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T365616) (owner: 10Bking)
[23:17:15] <foks>	 !log removing 9 files for legal compliance
[23:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:28] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T361041 (duration: 11m 36s)
[23:20:33] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[23:23:35] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=sysop_plwiki --cluster=all 2>&1 | tee /tmp/sysop_plwiki.UpdateSearchIndexConfig.log # T361041
[23:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:20] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043309
[23:38:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043309 (owner: 10TrainBranchBot)
[23:44:05] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311
[23:44:05] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311 (owner: 10Zabe)
[23:44:45] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043311 (owner: 10Zabe)
[23:45:26] <logmsgbot>	 !log zabe@deploy1002 Started scap: T361041, [[gerrit:1043311|Update interwiki cache]]
[23:45:30] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041
[23:48:06] <foks>	 !log removing 7 files for legal compliance
[23:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:00] <wikibugs>	 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9891246 (10Papaul) We will be going on site this Monday, June 17th at 11am to work with the Equinix team on fixing this issue. @cmooney will be depooling the site.
[23:56:33] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T361041, [[gerrit:1043311|Update interwiki cache]] (duration: 11m 07s)
[23:56:37] <stashbot>	 T361041: Create wikipedia-pl-sysop.wikimedia.org (was: sysop-pl.wikipedia.org) - https://phabricator.wikimedia.org/T361041