[00:38:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933673
[00:38:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933673 (owner: TrainBranchBot)
[01:00:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/933673 (owner: TrainBranchBot)
[01:11:01] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:50] <wikibugs>	 (PS1) RLazarus: opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117)
[01:25:47] <wikibugs>	 (CR) RLazarus: [C: +2] opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[01:26:46] <wikibugs>	 (Merged) jenkins-bot: opentelemetry-collector: Vendor 0.61.0 as 0.61.0 [deployment-charts] - https://gerrit.wikimedia.org/r/934026 (https://phabricator.wikimedia.org/T324117) (owner: RLazarus)
[01:32:55] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[01:33:02] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[01:33:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[02:00:11] <icinga-wm>	 RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:19] <icinga-wm>	 PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:57] <icinga-wm>	 PROBLEM - snapshot of s4 in eqiad on backupmon1001 is CRITICAL: snapshot for s4 at eqiad (db1145) taken more than 3 days ago: Most recent backup 2023-06-26 02:50:16 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:16:07] <wikibugs>	 (PS1) Marostegui: dbproxy1027: Remove comments [puppet] - https://gerrit.wikimedia.org/r/934037 (https://phabricator.wikimedia.org/T337812)
[05:26:55] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1027: Remove comments [puppet] - https://gerrit.wikimedia.org/r/934037 (https://phabricator.wikimedia.org/T337812) (owner: Marostegui)
[05:29:17] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Marostegui)
[05:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:33:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[05:48:15] <icinga-wm>	 PROBLEM - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: snapshot for s5 at eqiad (db1145) taken more than 3 days ago: Most recent backup 2023-06-26 05:19:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0600).
[06:32:51] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:45:07] <wikibugs>	 (PS1) Andrea Denisse: alert: Add the alert (icinga + alertmanager) hosts Bookworm node definitions [puppet] - https://gerrit.wikimedia.org/r/934245 (https://phabricator.wikimedia.org/T333615)
[06:45:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[06:49:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[06:56:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[06:56:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[06:59:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0700).
[07:01:08] <apergos>	 morning! there are no trainees signed up today to learn all the fine ins and outs of deployment
[07:01:27] <apergos>	 and that's a nice thing because there are no patches scheduled for deployment that they could watch or try their hand at
[07:01:41] <apergos>	 so... have a nice quiet Friday and a good weekend everybody, see you next time!
[07:02:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[07:04:03] <wikibugs>	 (PS1) Muehlenhoff: Extend access for edtadros [puppet] - https://gerrit.wikimedia.org/r/934246
[07:05:48] <RhinosF1>	 apergos: you said nice and quiet
[07:06:04] <RhinosF1>	 I think we should touch wood
[07:08:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for edtadros [puppet] - https://gerrit.wikimedia.org/r/934246 (owner: Muehlenhoff)
[07:08:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[07:08:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[07:08:49] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[07:10:52] <apergos>	 RhinosF1: I leave that to you ;-)
[07:15:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[07:16:55] <icinga-wm>	 PROBLEM - Host kubernetes2005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:31] <icinga-wm>	 PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:35] <icinga-wm>	 PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:35] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:20:31] <icinga-wm>	 RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms
[07:20:33] <icinga-wm>	 RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 32.18 ms
[07:20:35] <icinga-wm>	 RECOVERY - Host kubernetes2005 is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms
[07:21:07] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:22:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[07:22:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[07:22:41] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[07:22:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[07:22:59] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[07:24:13] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 326147 seconds left:Certificate wikipedia.fi valid until 2023-09-06 10:30:20 +0000 (expires in 69 days) https://wikitech.wikimedia.org/wiki/Ncredir
[07:24:29] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 272130 seconds left:Certificate *.wikimania.com valid until 2023-07-30 13:28:28 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir
[07:24:31] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 534928 seconds left:Certificate *.wikispecies.net valid until 2023-07-30 11:29:53 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/Ncredir
[07:29:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:29] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:49] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:29] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.drain-vm: Sync DRBD after reboot [cookbooks] - https://gerrit.wikimedia.org/r/934248 (https://phabricator.wikimedia.org/T203964)
[07:52:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.ganeti.drain-vm: Sync DRBD after reboot [cookbooks] - https://gerrit.wikimedia.org/r/934248 (https://phabricator.wikimedia.org/T203964) (owner: Muehlenhoff)
[07:53:14] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/933976 (owner: Ottomata)
[07:59:27] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Change s4 and s5 eqiad backup sources to db1150 and db1216 [puppet] - https://gerrit.wikimedia.org/r/933895 (https://phabricator.wikimedia.org/T340610) (owner: Jcrespo)
[08:00:05] <jouncebot>	 brennen and jnuche: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T0800).
[08:01:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet
[08:01:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[08:08:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet
[08:08:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet
[08:10:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[08:12:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[08:14:59] <icinga-wm>	 PROBLEM - mysqld processes on db1216 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:15:30] <marostegui>	 ^ checking
[08:16:12] <marostegui>	 it seems down
[08:16:36] <marostegui>	 ah no
[08:17:43] <marostegui>	 jynus: ^ some alerting might need to be adjusted as we are going from 2 backups to 3 backups I guess?
[08:17:55] <jynus>	 mmm, must be a race condition
[08:18:00] <jynus>	 it should have been disabled
[08:18:07] <marostegui>	 ah ok :)
[08:18:09] <jynus>	 but I haven't touched it
[08:18:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[08:18:17] <marostegui>	 I'd leave it to you then if it is "expected"
[08:18:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[08:18:41] <jynus>	 well, the alert is unexpected, but yeah probably my fault
[08:19:26] <jynus>	 yeah, it is a race condition because puppet hasn't run on icinga
[08:19:53] <jynus>	 and it should have 3 instances now
[08:21:10] <wikibugs>	 (PS1) Hashar: rake_modules: mute Ruby 2.7 Pathname deprecation [puppet] - https://gerrit.wikimedia.org/r/934255
[08:21:28] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero)
[08:22:07] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[08:22:17] <wikibugs>	 (CR) Hashar: "Looks like that mutes the deprecation warning when using Ruby 2.7 :]" [puppet] - https://gerrit.wikimedia.org/r/934255 (owner: Hashar)
[08:22:25] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, cloud-services-team, and 2 others: cloudservices2005-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338779 (aborrero) Open→Resolved I'm closing this ticket as I believe the reimage is completed and the server is mostly working, barring...
[08:25:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[08:30:28] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:32:56] <wikibugs>	 (CR) Jbond: [C: +2] README.release: add additional instructions [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/933971 (owner: Jbond)
[08:33:07] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[08:33:15] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[08:33:24] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: Clément Goubert)
[08:34:54] <wikibugs>	 (CR) Jbond: Enforce using a node regex without the wmnet tld (2 comments) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/933920 (owner: JHathaway)
[08:35:30] <wikibugs>	 (CR) Jbond: "other than the nits, looks good to me, and thanks for adding the fix stanzas 😊" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/933920 (owner: JHathaway)
[08:36:41] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF) a:SLyngshede-WMF
[08:37:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[08:38:55] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Implement "source" field on user objects - https://phabricator.wikimedia.org/T340717 (SLyngshede-WMF)
[08:39:27] <wikibugs>	 (PS1) Muehlenhoff: Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/934257
[08:41:06] <wikibugs>	 (CR) Btullis: [C: +2] Update the hadoop-worker-canary cumin alias [puppet] - https://gerrit.wikimedia.org/r/933432 (https://phabricator.wikimedia.org/T338227) (owner: Btullis)
[08:44:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[08:44:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[08:44:56] <wikibugs>	 (CR) David Caro: site.pp: Drop wmnet domain and always use regexes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[08:47:13] <wikibugs>	 (PS1) Btullis: datahub: Use new image and fix elasticsearch setup [deployment-charts] - https://gerrit.wikimedia.org/r/934259 (https://phabricator.wikimedia.org/T329514)
[08:48:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/934257 (owner: Muehlenhoff)
[08:49:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[08:51:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[08:52:11] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: Clément Goubert)
[08:52:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: Clément Goubert)
[08:52:46] <wikibugs>	 (CR) Btullis: [C: +2] datahub: Use new image and fix elasticsearch setup [deployment-charts] - https://gerrit.wikimedia.org/r/934259 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:53:33] <wikibugs>	 (Merged) jenkins-bot: datahub: Use new image and fix elasticsearch setup [deployment-charts] - https://gerrit.wikimedia.org/r/934259 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[08:53:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[08:58:33] <wikibugs>	 (CR) Majavah: [C: -1] "Few minor things inline." [puppet] - https://gerrit.wikimedia.org/r/933973 (owner: David Caro)
[08:58:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[08:59:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[08:59:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[09:00:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[09:02:36] <wikibugs>	 (PS1) Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251)
[09:03:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[09:06:22] <wikibugs>	 (PS6) Hashar: contint: build dev-images with a system user [puppet] - https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277)
[09:06:45] <wikibugs>	 (CR) CI reject: [V: -1] contint: build dev-images with a system user [puppet] - https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:06:56] <wikibugs>	 (PS1) Slyngshede: Credit logo artist. [software/bitu] - https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828)
[09:07:59] <wikibugs>	 (CR) Slyngshede: "Thank you for pointing out the missing license." [software/bitu] - https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: Slyngshede)
[09:08:07] <wikibugs>	 (PS7) Hashar: contint: build dev-images with a system user [puppet] - https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277)
[09:10:12] <wikibugs>	 (CR) Clément Goubert: noc: Pass ports without ferm-specific service constants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931581 (owner: Muehlenhoff)
[09:11:00] <wikibugs>	 (PS1) Hashar: contint: rm git safe.directory for dev-images [puppet] - https://gerrit.wikimedia.org/r/934268 (https://phabricator.wikimedia.org/T335354)
[09:15:34] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: .gitignore: ignore nano swp file [alerts] - https://gerrit.wikimedia.org/r/934270
[09:19:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[09:21:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp
[09:22:17] <icinga-wm>	 PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:22:20] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
[09:25:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "These are real hw hosts, and I don't think we have the replacements so we'll have to upgrade in place/reimage" [puppet] - https://gerrit.wikimedia.org/r/934245 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:25:25] <icinga-wm>	 RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms
[09:26:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[09:26:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
[09:27:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
[09:28:54] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: team-wmcs: refresh openstack_apis_response.yaml [alerts] - https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152)
[09:30:51] <moritzm>	 !log installing libx11 security updates
[09:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:43] <wikibugs>	 (CR) FNegri: "Could this go in a global .gitignore? I have '*.swp' in ~/.gitignore for example, and refer to it in ~/.gitconfig" [alerts] - https://gerrit.wikimedia.org/r/934270 (owner: Arturo Borrero Gonzalez)
[09:34:35] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: Arturo Borrero Gonzalez)
[09:34:54] <wikibugs>	 (CR) David Caro: .gitignore: ignore nano swp file (1 comment) [alerts] - https://gerrit.wikimedia.org/r/934270 (owner: Arturo Borrero Gonzalez)
[09:36:07] <moritzm>	 !log restarting FPM on mw canaries to pick up libx11 updates
[09:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[09:37:02] <wikibugs>	 (CR) David Caro: [C: +1] "lgtm" [alerts] - https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: Arturo Borrero Gonzalez)
[09:38:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] team-wmcs: refresh openstack_apis_response.yaml [alerts] - https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152) (owner: Arturo Borrero Gonzalez)
[09:38:09] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/934255 (owner: Hashar)
[09:38:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[09:38:39] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: team-wmcs: refresh openstack_apis_response.yaml [alerts] - https://gerrit.wikimedia.org/r/934272 (https://phabricator.wikimedia.org/T339152)
[09:38:54] <wikibugs>	 (CR) Awight: [C: +1] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: Func)
[09:39:35] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - https://gerrit.wikimedia.org/r/933915 (https://phabricator.wikimedia.org/T340635) (owner: Jbond)
[09:43:21] <wikibugs>	 (PS1) Elukey: ml-services: update docker image for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/934274
[09:43:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[09:43:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
[09:43:46] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
[09:44:25] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: ores legacy root redirect [deployment-charts] - https://gerrit.wikimedia.org/r/934275
[09:45:07] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: ores legacy root redirect [deployment-charts] - https://gerrit.wikimedia.org/r/934275 (owner: Ilias Sarantopoulos)
[09:45:13] <wikibugs>	 (Abandoned) Elukey: ml-services: update docker image for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/934274 (owner: Elukey)
[09:46:16] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: ores legacy root redirect [deployment-charts] - https://gerrit.wikimedia.org/r/934275 (owner: Ilias Sarantopoulos)
[09:46:37] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp
[09:47:16] <wikibugs>	 (Merged) jenkins-bot: ml-services: ores legacy root redirect [deployment-charts] - https://gerrit.wikimedia.org/r/934275 (owner: Ilias Sarantopoulos)
[09:47:26] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver::git: add operations/private [puppet] - https://gerrit.wikimedia.org/r/933909 (https://phabricator.wikimedia.org/T340635) (owner: Jbond)
[09:50:36] <wikibugs>	 (CR) Hashar: "Happy I have managed to figure out something with my limited knowledge of ruby :]" [puppet] - https://gerrit.wikimedia.org/r/934255 (owner: Hashar)
[09:50:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[09:52:36] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:52:49] <wikibugs>	 (PS1) Jbond: puppetserver: init private repo [puppet] - https://gerrit.wikimedia.org/r/934277 (https://phabricator.wikimedia.org/T340635)
[09:53:29] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:53:36] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver: init private repo [puppet] - https://gerrit.wikimedia.org/r/934277 (https://phabricator.wikimedia.org/T340635) (owner: Jbond)
[09:53:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[09:57:02] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:57:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for wikikube-staging masters - jiji@cumin1001"
[09:58:01] <wikibugs>	 (CR) Arturo Borrero Gonzalez: .gitignore: ignore nano swp file (1 comment) [alerts] - https://gerrit.wikimedia.org/r/934270 (owner: Arturo Borrero Gonzalez)
[09:58:03] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:58:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for wikikube-staging masters - jiji@cumin1001"
[09:58:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:58:13] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: .gitignore: ignore nano swp file [alerts] - 10https://gerrit.wikimedia.org/r/934270 (owner: 10Arturo Borrero Gonzalez)
[09:59:56] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp
[10:00:03] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp
[10:00:05] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1000).
[10:00:05] <jouncebot>	 claime: May I have your attention please! MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1000)
[10:00:42] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF)
[10:00:44] <claime>	 Let's go
[10:00:46] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF)
[10:01:11] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF)
[10:01:15] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (10SLyngshede-WMF)
[10:02:04] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Validate managers for permission approval - https://phabricator.wikimedia.org/T335484 (10SLyngshede-WMF)
[10:02:08] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF)
[10:02:14] <claime>	 !log Redirect www.mediawiki.org to mw-on-k8s - T337490
[10:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:19] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:02:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933469 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[10:02:48] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) p:05Triage→03Medium
[10:03:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: Add new puppet server to block [puppet] - 10https://gerrit.wikimedia.org/r/933639 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:03:15] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[10:03:31] <claime>	 !log Running puppet on cp-text trafficservers - T337490
[10:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:07] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10SLyngshede-WMF)
[10:05:09] <wikibugs>	 (03PS1) 10Jbond: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/934281
[10:05:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/934281 (owner: 10Jbond)
[10:05:34] <wikibugs>	 (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro)
[10:06:00] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1002.eqiad.wmnet
[10:06:01] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[10:07:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[10:08:19] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] replica_cnf_api: refactor to use multiple backends (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (owner: 10David Caro)
[10:08:22] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: Correct config path [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635)
[10:08:38] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:08:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Jclark-ctr) @akosiaris Server Ran in cpu stress test for in total 5 days with no errors prior to running firmware was updated.   At this t...
[10:08:48] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster1002.eqiad.wmnet
[10:08:57] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement a staging setup for the IDM - https://phabricator.wikimedia.org/T320795 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF
[10:08:59] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF)
[10:09:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: Correct config path [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:09:45] <claime>	 !log www.mediawiki.org now hosted on mw-on-k8s - T337490
[10:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:49] <jbond>	 !log puppetserver1001 added back to puppet-merge
[10:09:49] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:53] <claime>	 And ExtensionDistributor works :p
[10:10:05] <taavi>	 claime: did you test after purging its cache?
[10:10:05] <icinga-wm>	 PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:10:06] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF)
[10:10:11] <icinga-wm>	 PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:10:33] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:10:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:10:44] <claime>	 taavi: ExtensionDistributor's cache?
[10:10:49] <claime>	 Give me a sec
[10:11:13] <taavi>	 https://phabricator.wikimedia.org/T340483#8965645 to purge the root Special:ExtensionDistributor page cache, ftr
[10:11:27] <claime>	 yep, on it, ty <3
[10:11:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42122/console" [puppet] - 10https://gerrit.wikimedia.org/r/934282 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:11:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) a:05Jclark-ctr→03akosiaris >>! In T339340#8975564, @Jclark-ctr wrote: > @akosiaris Server Ran in cpu stress test for in tot...
[10:12:04] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Update IDM servers to Bookworm - https://phabricator.wikimedia.org/T340722 (10SLyngshede-WMF)
[10:12:20] <claime>	 taavi: confirmed working after cache purge
[10:12:28] <taavi>	 awesome, thank you
[10:12:32] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Update IDM servers to Bookworm - https://phabricator.wikimedia.org/T340722 (10SLyngshede-WMF) p:05Triage→03Medium
[10:12:33] <Reedy>	 wheee
[10:13:51] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert)
[10:14:09] <wikibugs>	 (03PS1) 10Jbond: Revert "puppetserver: add ssh known_hosts entries for new puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/934014 (https://phabricator.wikimedia.org/T340635)
[10:14:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[10:15:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[10:15:16] <wikibugs>	 (03PS2) 10Clément Goubert: mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490)
[10:15:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "puppetserver: add ssh known_hosts entries for new puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/934014 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:15:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[10:15:33] <icinga-wm>	 RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 33.95 ms
[10:15:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:15:43] <icinga-wm>	 RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms
[10:17:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[10:17:54] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp
[10:17:58] <claime>	 !log Redirect office.wikimedia.org to mw-on-k8s - T337490
[10:18:00] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933467 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert)
[10:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:02] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:18:32] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678)
[10:18:52] <claime>	 !log Running puppet on cp-text trafficservers - T337490
[10:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan)
[10:19:43] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS buster
[10:19:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host parse1002.eqiad.wmnet with OS buster
[10:19:56] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Delete records created by accident - jiji@cumin1001"
[10:20:00] <vgutierrez>	 claime: ^^ puppet might be disabled on some cp nodes..
[10:20:06] <vgutierrez>	 fabfur: how's the update going?
[10:20:15] <claime>	 vgutierrez: ack
[10:20:36] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Delete records created by accident - jiji@cumin1001"
[10:20:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:20:58] <fabfur>	 failed on cp1079.eqiad.wmnet, investigating if puppet is disabled
[10:21:04] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster1002.eqiad.wmnet
[10:21:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[10:22:05] <claime>	 vgutierrez: Didn't have errors on my first run, I'm running with your suggested query 'A:cp-text and P{P:trafficserver::backend}', 15 hosts batch
[10:22:26] <vgutierrez>	 claime: ack
[10:23:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001"
[10:23:47] <wikibugs>	 (03PS1) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635)
[10:23:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001"
[10:23:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:23:54] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster1002.eqiad.wmnet on all recursors
[10:23:57] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster1002.eqiad.wmnet on all recursors
[10:24:23] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001"
[10:25:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster1002.eqiad.wmnet - jiji@cumin1001"
[10:25:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:25:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
[10:25:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster1002.eqiad.wmnet with OS bullseye
[10:25:50] <claime>	 !log office.wikimedia.org now hosted on mw-on-k8s - T337490
[10:25:53] <wikibugs>	 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye
[10:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:57] <stashbot>	 T337490: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490
[10:28:19] <icinga-wm>	 PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:28:37] <wikibugs>	 (03PS2) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635)
[10:29:29] <wikibugs>	 (03PS1) 10Btullis: datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286)
[10:30:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:30:29] <icinga-wm>	 RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.03 ms
[10:30:33] <icinga-wm>	 PROBLEM - etcd service on kubestagetcd2001 is CRITICAL: CRITICAL - Expecting active but unit etcd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:31:21] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis)
[10:31:29] <wikibugs>	 (03PS3) 10Jbond: puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635)
[10:31:33] <icinga-wm>	 RECOVERY - etcd service on kubestagetcd2001 is OK: OK - etcd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:31:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
[10:31:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[10:32:05] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: Retain the setup jobs after execution to help with debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934286 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis)
[10:32:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:31] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[10:32:52] <claime>	 !log Redirect vrt-wiki.wikimedia.org to mw-on-k8s - T340549
[10:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:56] <stashbot>	 T340549: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549
[10:32:57] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/933474 (https://phabricator.wikimedia.org/T340549) (owner: 10Clément Goubert)
[10:34:19] <claime>	 !log Running puppet on cp-text trafficservers - T340549
[10:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42126/console" [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:35:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet
[10:35:35] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[10:37:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:30] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp
[10:39:01] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[10:40:53] <claime>	 !log vrt-wiki.wikimedia.org now hosted on mw-on-k8s - T340549
[10:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:57] <stashbot>	 T340549: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549
[10:41:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[10:43:59] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:44:36] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:44:48] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group0 to Kubernetes - https://phabricator.wikimedia.org/T337490 (10Clement_Goubert) 05In progress→03Resolved
[10:46:31] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert)
[10:46:43] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:46:55] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) 05Open→03In progress
[10:47:41] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet
[10:48:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add ssh known_hosts entries for new puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/934285 (https://phabricator.wikimedia.org/T340635) (owner: 10Jbond)
[10:48:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[10:48:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet
[10:49:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[10:50:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: add lua script for gateway routing [puppet] - 10https://gerrit.wikimedia.org/r/934015 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan)
[10:51:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:52:21] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[10:53:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol: introduce cloud-private support for memcached [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488)
[10:56:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:57:25] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:57:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:58:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:58:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:59:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:59:46] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "Forgot my username" feature - https://phabricator.wikimedia.org/T340636 (10SLyngshede-WMF) p:05Triage→03Low
[10:59:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:59:56] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) p:05Triage→03Medium
[11:00:08] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "source" field on user objects - https://phabricator.wikimedia.org/T340717 (10SLyngshede-WMF) p:05Triage→03Low
[11:00:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:00:36] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10SLyngshede-WMF) p:05Triage→03Medium
[11:01:01] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructure - https://phabricator.wikimedia.org/T340557 (10jbond) 05Open→03In progress p:05Triage→03Medium
[11:01:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:01:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppet-merge: add new puppetservers to puppet merge - https://phabricator.wikimedia.org/T340635 (10jbond) 05Open→03Resolved a:03jbond puppet-merge and the private repo post-commit hooks are bot...
[11:01:22] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) 05Open→03In progress
[11:01:24] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF)
[11:02:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
[11:02:36] <moritzm>	 !log installing Java 8 security updates 
[11:02:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:55] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1001"
[11:04:10] <wikibugs>	 (03PS1) 10Jbond: pki::multiroot: update the client auth file to include new puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557)
[11:05:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42127/console" [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond)
[11:06:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: pick up Java 8 sec updates - jmm@cumin2002
[11:08:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multiroot: update the client auth file to include new puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/934291 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond)
[11:08:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:09:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:09:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:10:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:10:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
[11:11:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[11:12:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[11:13:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[11:18:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[11:19:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[11:19:42] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func)
[11:20:53] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubestagemaster1002.eqiad.wmnet with OS bullseye
[11:20:53] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster1002.eqiad.wmnet
[11:20:59] <wikibugs>	 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster1002.eqiad.wmnet with OS bullseye executed with errors: - kubestagemaster1002...
[11:21:04] <wikibugs>	 (03PS1) 10Hnowlan: Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020
[11:21:26] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020 (owner: 10Hnowlan)
[11:21:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudservices2005-dev.wikimedia.org
[11:21:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudservices2005-dev.wikimedia.org
[11:21:49] <icinga-wm>	 PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:21:59] <icinga-wm>	 PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:22:42] <wikibugs>	 (03PS2) 10Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251)
[11:22:49] <wikibugs>	 (03CR) 10Func: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func)
[11:24:04] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933675
[11:25:01] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: add lua script for gateway routing" [puppet] - 10https://gerrit.wikimedia.org/r/934020 (owner: 10Hnowlan)
[11:25:33] <icinga-wm>	 RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms
[11:25:35] <icinga-wm>	 RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms
[11:28:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[11:28:07] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/934302
[11:28:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet
[11:28:18] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Disable PC writes for parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: 10Daniel Kinzler)
[11:28:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet
[11:31:25] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet
[11:34:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/934288/42129/" [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez)
[11:39:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet
[11:41:09] <wikibugs>	 (03PS1) 10Matthias Mullie: Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021
[11:42:10] <wikibugs>	 (03PS1) 10Jbond: pki::client: use the wmf-ca-certificats bundle for ca auth [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557)
[11:44:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42130/console" [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond)
[11:44:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::client: use the wmf-ca-certificats bundle for ca auth [puppet] - 10https://gerrit.wikimedia.org/r/934307 (https://phabricator.wikimedia.org/T340557) (owner: 10Jbond)
[11:46:53] <wikibugs>	 (03PS1) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[11:47:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri)
[11:47:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet
[11:47:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet
[11:47:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: profile::memcached::instance: allow to specify srange [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488)
[11:48:19] <wikibugs>	 (03PS2) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[11:48:32] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05In progress→03Resolved a:03jbond This has now been corrected systems on the new puppet infrastruct...
[11:48:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[11:48:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri)
[11:49:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet
[11:50:42] <wikibugs>	 (03PS3) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[11:51:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri)
[11:52:02] <wikibugs>	 (03PS4) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[11:52:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC is a NOOP for core resources https://puppet-compiler.wmflabs.org/output/934310/42131/" [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez)
[11:52:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/934302 (owner: 10Muehlenhoff)
[11:52:49] <wikibugs>	 (03PS1) 10QChris: Add .gitreview [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934311
[11:52:51] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934311 (owner: 10QChris)
[11:53:32] <wikibugs>	 (03PS1) 10Jbond: puppetserver: move default db back to 443 [puppet] - 10https://gerrit.wikimedia.org/r/934312
[11:53:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: move default db back to 443 [puppet] - 10https://gerrit.wikimedia.org/r/934312 (owner: 10Jbond)
[11:53:58] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: tighten proton group capture criteria [deployment-charts] - 10https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611)
[11:54:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[11:54:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri)
[11:57:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie)
[11:59:24] <wikibugs>	 (03PS5) 10FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[11:59:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet::agent: set manage_puppet_ca_file false [puppet] - 10https://gerrit.wikimedia.org/r/933968 (owner: 10Jbond)
[12:00:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[12:00:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet
[12:01:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri)
[12:03:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet
[12:03:59] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[12:04:19] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5551 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[12:08:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
[12:10:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:11:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove urldownloader role from old buster servers [puppet] - 10https://gerrit.wikimedia.org/r/933904 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff)
[12:13:26] <wikibugs>	 (03PS1) 10Jbond: puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811)
[12:15:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:15:40] <wikibugs>	 (03PS2) 10Jbond: puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811)
[12:15:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
[12:15:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet
[12:16:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet
[12:16:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42133/console" [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[12:17:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppedb1003: migrate to new infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/934315 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[12:18:53] <wikibugs>	 10SRE, 10ops-eqiad: Decom cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T338459 (10Jclark-ctr) 05Open→03Resolved Disconnected and removed from Rack updated netbox
[12:19:21] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Jclark-ctr)
[12:21:13] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10Jclark-ctr)
[12:21:20] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10decommission-hardware: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (10Jclark-ctr) 05Open→03Resolved
[12:22:01] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Jclark-ctr) 05Open→03Resolved
[12:22:06] <wikibugs>	 10SRE, 10Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Jclark-ctr)
[12:23:36] <wikibugs>	 (03CR) 10Matthias Mullie: "recheck" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie)
[12:24:37] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10Jclark-ctr)
[12:24:49] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10Jclark-ctr) 05Open→03Resolved
[12:28:11] <wikibugs>	 (03PS3) 10David Caro: replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691)
[12:28:13] <wikibugs>	 (03CR) 10David Caro: replica_cnf_api: refactor to use multiple backends (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:31:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: refactor to use multiple backends [puppet] - 10https://gerrit.wikimedia.org/r/933973 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[12:32:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
[12:33:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Creat cookbook to migrate serveres from the puppetmnasters to puppetservers - https://phabricator.wikimedia.org/T340739 (10jbond)
[12:34:39] <icinga-wm>	 PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:34:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: pick up Java 8 sec updates - jmm@cumin2002
[12:34:43] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:36:13] <wikibugs>	 (03PS2) 10D3r1ck01: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630)
[12:38:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
[12:39:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet
[12:39:21] <wikibugs>	 (03PS3) 10D3r1ck01: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630)
[12:39:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) @Dwisehaupt  i verified this server they are connected to port 24 on both switches
[12:39:47] <Reedy>	 jouncebot: nowandnext
[12:39:48] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[12:39:48] <jouncebot>	 In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300)
[12:39:48] <jouncebot>	 In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300)
[12:40:04] <wikibugs>	 (03PS1) 10Jbond: puppetdb::bookworm: pouplate hiera config [puppet] - 10https://gerrit.wikimedia.org/r/934323 (https://phabricator.wikimedia.org/T338811)
[12:40:24] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01)
[12:40:33] <icinga-wm>	 RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.59 ms
[12:40:33] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms
[12:41:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb::bookworm: pouplate hiera config [puppet] - 10https://gerrit.wikimedia.org/r/934323 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[12:42:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet']
[12:42:23] <wikibugs>	 (03PS1) 10Elukey: ml-services: update ores-legacy's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934324
[12:42:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet']
[12:43:13] <wikibugs>	 (03PS1) 10Jbond: puppetdb: on reflection lets just use the puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/934325 (https://phabricator.wikimedia.org/T338811)
[12:43:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet']
[12:43:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update ores-legacy's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934324 (owner: 10Elukey)
[12:44:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: on reflection lets just use the puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/934325 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond)
[12:45:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet
[12:46:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:46:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:47:58] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/929156 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[12:48:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:48:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond)
[12:49:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[12:50:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: pick up Java 8 sec updates - jmm@cumin2002
[12:53:22] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-test-worker1003.eqiad.wmnet']
[12:53:45] <wikibugs>	 (03CR) 10Jforrester: "Aha, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func)
[12:54:15] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func)
[12:54:40] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42134/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[12:55:07] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1001"
[12:55:07] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS buster
[12:55:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host parse1002.eqiad.wmnet with OS buster comp...
[12:56:09] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet
[12:57:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[12:57:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet
[12:58:05] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=parse1002.eqiad.wmnet
[12:59:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10akosiaris) 05Open→03Resolved I 've just re-imaged the server and set it back in conftool as active. Made sure to scap pull too. I am g...
[13:00:05] <jouncebot>	 xSavitar: May I have your attention please! Mobileapps/RESTBase/Wikifeeds. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1300).
[13:00:05] <jouncebot>	 Func, matthiasmullie, and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <matthiasmullie>	 o/
[13:00:10] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[13:00:12] <Func>	 o/
[13:00:13] <taavi>	 o/
[13:00:27] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[13:00:39] <xSavitar>	 o/
[13:00:50] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings-labs: Remove unconditional $wgKartographerNearby flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933674 (https://phabricator.wikimedia.org/T340251) (owner: 10Func)
[13:00:57] <duesen>	 o/
[13:01:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet
[13:01:09] <taavi>	 xSavitar: hmm I don't see any patches from you listed?
[13:01:12] <wikibugs>	 (03CR) 10D3r1ck01: [C: 03+2] proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01)
[13:01:18] <taavi>	 James_F: are you deploying already?
[13:01:42] <taavi>	 xSavitar: oh right you have a separate window at the same time. sorry for the confusion
[13:01:55] <xSavitar>	 taavi there is a RB deploy window going on currently - https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_June_29
[13:02:10] <xSavitar>	 taavi, no worries!
[13:02:13] <James_F>	 taavi: I just pushed out a beta-cluster-only patch, no deploy.
[13:02:25] <wikibugs>	 (03Merged) 10jenkins-bot: proton: Deploy latest Proton image - 2023-06-29-120130-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/934316 (https://phabricator.wikimedia.org/T340630) (owner: 10D3r1ck01)
[13:02:27] * duesen wibbles
[13:02:28] <taavi>	 aha
[13:02:39] <James_F>	 (And CI took 10 minutes not 30 seconds.)
[13:02:47] <taavi>	 :/
[13:03:03] <taavi>	 duesen: iirc your patches need some monitoring afterwards, so ok if I let you self-deploy after the other patches are done?
[13:03:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:03:32] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] "deploying" [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie)
[13:04:05] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[13:04:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) (owner: 10Func)
[13:04:47] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: profile::memcached::instance: allow to specify srange [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488)
[13:04:57] <wikibugs>	 (03Merged) 10jenkins-bot: Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/933671 (https://phabricator.wikimedia.org/T340697) (owner: 10Func)
[13:05:16] <duesen>	 taavi: ok. Ping me.
[13:05:30] <taavi>	 will do!
[13:05:41] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[13:05:43] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]]
[13:05:48] <stashbot>	 T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697
[13:06:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
[13:06:05] <logmsgbot>	 !log derick@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[13:07:15] <logmsgbot>	 !log derick@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[13:07:31] <logmsgbot>	 !log taavi@deploy1002 taavi and func: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:07:45] <Func>	 testing...
[13:07:55] <logmsgbot>	 !log derick@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[13:08:25] <icinga-wm>	 PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:08:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/934310/42135/" [puppet] - 10https://gerrit.wikimedia.org/r/934310 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez)
[13:08:51] <Func>	 taavi: looks good
[13:09:07] <taavi>	 Func: thanks, the logs look clean too so syncing
[13:09:30] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: cloudcontrol: introduce cloud-private support for memcached [puppet] - 10https://gerrit.wikimedia.org/r/934288 (https://phabricator.wikimedia.org/T340488) (owner: 10Arturo Borrero Gonzalez)
[13:10:15] <logmsgbot>	 !log derick@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[13:10:46] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:11:07] <icinga-wm>	 RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[13:12:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
[13:12:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet
[13:13:02] <wikibugs>	 (03PS1) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557)
[13:14:48] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:933671|Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace (T340697)]] (duration: 09m 05s)
[13:14:53] <stashbot>	 T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697
[13:14:57] <taavi>	 Func: and done!
[13:15:04] <Func>	 thanks
[13:15:05] <taavi>	 matthiasmullie: yours is up next, just waiting for the CI on that
[13:15:19] <matthiasmullie>	 taavi: thanks; mine can skip mwdebug testing - there's nothing to test, it only affects a maint script
[13:15:37] <taavi>	 ack
[13:16:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet
[13:16:50] <wikibugs>	 (03CR) 10Vgutierrez: "you should remove the associated 13-tls-redirect.vtc test as well" [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur)
[13:18:37] <Func>	 taavi: hum, I noticed something went wrong, subpage counts become [INVALID] when I am not on mwdebug hosts: https://zh.wikibooks.org/w/index.php?title=MediaWiki:Anontalkpagetext&action=info&uselang=en
[13:19:21] <wikibugs>	 (03Merged) 10jenkins-bot: Only send 1 suggestion per section [extensions/ImageSuggestions] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934021 (owner: 10Matthias Mullie)
[13:19:59] <taavi>	 Func: where do you see that? even when logged in (so edge caches should not be a problem) it says 'Number of subpages of this page: 6 (0 redirects; 6 non-redirects)'
[13:20:04] <wikibugs>	 (03PS2) 10Fabfur: varnish: Remove http/https redirection [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557)
[13:20:19] <wikibugs>	 (03CR) 10Fabfur: varnish: Remove http/https redirection (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur)
[13:20:28] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:934021|Only send 1 suggestion per section]]
[13:20:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:21:07] <wikibugs>	 (03PS1) 10Reedy: Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568)
[13:21:44] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 04-1] "Precautionary -1 since the script seems to lead to different results in my tests:" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[13:22:06] <logmsgbot>	 !log taavi@deploy1002 mlitn and taavi: Backport for [[gerrit:934021|Only send 1 suggestion per section]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:22:13] <taavi>	 syncing
[13:22:17] <Reedy>	 taavi: ^^ want me to +2 it as it's gonna take ages to merge
[13:23:01] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934330
[13:23:05] <taavi>	 duesen: do you see a reasonable risk of having to revert your patch?
[13:23:48] <taavi>	 your turn is after the currently-syncing patch, btw
[13:24:08] <duesen>	 taavi: unlikely, we already tested the same change for enwiki+dewiki+frwiki. If we do have to revert, we will probably not find out for a couple of hours
[13:24:22] <taavi>	 perfect, thanks
[13:24:23] <Func>	 taavi: weird, I saw `Number of subpages of this page	[INVALID] ([INVALID] redirects; [INVALID] non-redirects)` how can I know which server is serving my request?
[13:24:29] <taavi>	 Reedy: go for it
[13:24:35] <duesen>	 ok
[13:24:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
[13:24:40] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568) (owner: 10Reedy)
[13:24:58] <taavi>	 Func: `X-Cache` and `Server` response headers
[13:25:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: Daniel Kinzler)
[13:25:36] <wikibugs>	 (PS2) Daniel Kinzler: Disable PC writes for parsoid endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867)
[13:25:40] <wikibugs>	 (CR) TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: Daniel Kinzler)
[13:25:49] <duesen>	 waiting for it to merge
[13:26:01] <duesen>	 claime, effie: deploying now
[13:26:15] <claime>	 ack
[13:26:28] <taavi>	 my deploy is still running..
[13:26:58] <wikibugs>	 (CR) AikoChou: [C: +2] changeprop: update page_change_kind for outlink stream [deployment-charts] - https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[13:27:32] <wikibugs>	 (Merged) jenkins-bot: Disable PC writes for parsoid endpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/933453 (https://phabricator.wikimedia.org/T339867) (owner: Daniel Kinzler)
[13:27:37] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:934021|Only send 1 suggestion per section]] (duration: 07m 08s)
[13:27:41] <taavi>	 aaand I'm done. duesen: floor is yours
[13:27:59] <wikibugs>	 (Merged) jenkins-bot: changeprop: update page_change_kind for outlink stream [deployment-charts] - https://gerrit.wikimedia.org/r/933060 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[13:28:02] <logmsgbot>	 !log daniel@deploy1002 Started scap: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]]
[13:28:06] <stashbot>	 T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867
[13:28:15] <matthiasmullie>	 taavi: thanks!
[13:28:28] <duesen>	 taavi: sorry, I misread your message earlier. I thought you had told me to go ahead. But you were talking to Reedy...
[13:29:32] <logmsgbot>	 !log daniel@deploy1002 daniel: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:30:03] <taavi>	 yeah, no worries. scap's locking system would have stopped you from syncing at the same time I think
[13:30:42] <Func>	 taavi: Did you see something like `Invalid parameter for message "{msgkey}": {param}` or `Invalid list type for message` in `Bug58676` log channel? I searched the core and found this may be relevant.
[13:30:59] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42138/console" [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[13:31:07] <wikibugs>	 (CR) Vgutierrez: [C: -1] "varnish upload tests are happy:" [puppet] - https://gerrit.wikimedia.org/r/934328 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[13:31:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
[13:31:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet
[13:31:54] <wikibugs>	 (CR) Jgiannelos: [C: +2] mobileapps: bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/934330 (owner: Jgiannelos)
[13:32:06] <wikibugs>	 (PS6) FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[13:32:13] <moritzm>	 !log failover ganeti master in codfw to ganeti2020
[13:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:36] <wikibugs>	 (Merged) jenkins-bot: mobileapps: bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/934330 (owner: Jgiannelos)
[13:33:20] <taavi>	 Func: I didn't see those when testing, but after the merge there have been ~100 on zhwikibooks
[13:33:54] <Func>	 yeah, mwdebug hosts served me correctly
[13:35:03] <wikibugs>	 (PS1) Btullis: Bump the version of the datahub image in use [deployment-charts] - https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514)
[13:35:10] <logmsgbot>	 !log daniel@deploy1002 Finished scap: Backport for [[gerrit:933453|Disable PC writes for parsoid endpoints (T339867)]] (duration: 07m 07s)
[13:35:13] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:35:15] <wikibugs>	 (CR) JHathaway: Enforce using a node regex without the wmnet tld (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/933920 (owner: JHathaway)
[13:35:17] <stashbot>	 T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints - https://phabricator.wikimedia.org/T339867
[13:35:49] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:35:51] <wikibugs>	 (CR) Ottomata: [C: +1] page-content-change: fix error sink stream name. [deployment-charts] - https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: Gmodena)
[13:35:56] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:36:17] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[13:36:25] <wikibugs>	 (CR) Gmodena: [C: +2] page-content-change: fix error sink stream name. [deployment-charts] - https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: Gmodena)
[13:36:33] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:36:55] <wikibugs>	 (CR) Btullis: [C: +2] Bump the version of the datahub image in use [deployment-charts] - https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:37:24] <wikibugs>	 (Merged) jenkins-bot: page-content-change: fix error sink stream name. [deployment-charts] - https://gerrit.wikimedia.org/r/933914 (https://phabricator.wikimedia.org/T338233) (owner: Gmodena)
[13:37:46] <wikibugs>	 (Merged) jenkins-bot: Bump the version of the datahub image in use [deployment-charts] - https://gerrit.wikimedia.org/r/934332 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:38:12] <moritzm>	 !log installing bind9 security updates (tools/libs only)
[13:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:17] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:38:21] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:39:20] <duesen>	 claime, effie: ok, parsoid pc cache writes are now disabled for parsoid endpoints, we are fully relying on the background jobs to do the parsing.
[13:39:35] <duesen>	 Let's see how this goes for a couple of days
[13:39:57] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:39:58] <claime>	 duesen: I'm preparing the patches and reimage to add more jobrunners to be prudent
[13:40:05] <effie>	 duesen: so what are the parsoid* servers now left with?
[13:40:23] <duesen>	 claime: thank you!
[13:40:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:40:52] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:40:55] <taavi>	 Reedy: when your patch merges, will you deploy it or do you want me to do it?
[13:40:59] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:41:10] <duesen>	 effie: at the moment, they still parse, they just don't cache. Eventually, we will turn off pre-generation in restbase, at that point, the parsoid cluster will no longer be hit
[13:41:32] <duesen>	 The current experiment just makes sure that when we do that, the jobrunners don't fall over.
[13:41:47] <effie>	 duesen: we need to decide to which channel we will sync :)
[13:42:01] <claime>	 we should move to -serviceops tbh
[13:42:10] <effie>	 sure sure 
[13:42:41] <duesen>	 joined.
[13:42:44] <Reedy>	 taavi: Any chance you could do it please? I've gotta go out for an appointment soon... MatmaRex should be around to test it
[13:42:56] <taavi>	 sure, not a problem
[13:43:01] <wikibugs>	 (CR) Btullis: [C: +2] Upgrade the analytics airflow instance to 2.6.1 [puppet] - https://gerrit.wikimedia.org/r/933087 (https://phabricator.wikimedia.org/T336286) (owner: Btullis)
[13:43:03] <MatmaRex>	 (hi)
[13:43:09] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:43:17] <Reedy>	 cheers
[13:44:02] <duesen>	 taavi: i'm done
[13:44:05] <taavi>	 thx
[13:44:40] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:44:50] <Func>	 taavi: It seems the InfoAction::pageCounts() method's cache didn't take the config change into account; it's now accessing unset array keys. Should we revert my patch for now?
[13:44:54] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:44:58] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:45:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (JArguello-WMF)
[13:45:44] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (JArguello-WMF)
[13:46:04] <wikibugs>	 SRE, API Platform, Anti-Harassment, Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (JArguello-WMF)
[13:46:33] <wikibugs>	 (PS7) FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[13:46:50] <taavi>	 Func: that sounds like the safest option to me
[13:46:58] <wikibugs>	 (CR) CI reject: [V: -1] Allow cloudcumin hosts to connect to wm-bot [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[13:47:14] <wikibugs>	 (Merged) jenkins-bot: Fix trying to get a PageRecord for a non-existent page [extensions/VisualEditor] (wmf/1.41.0-wmf.15) - https://gerrit.wikimedia.org/r/934023 (https://phabricator.wikimedia.org/T340568) (owner: Reedy)
[13:47:33] <wikibugs>	 (PS1) Majavah: Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/934024
[13:47:47] <wikibugs>	 (CR) Majavah: [C: +2] Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/934024 (owner: Majavah)
[13:47:58] <taavi>	 aand the VE change merged. I'll sync both at the same time
[13:48:56] <wikibugs>	 (Merged) jenkins-bot: Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/934024 (owner: Majavah)
[13:49:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/934024 (owner: Majavah)
[13:49:27] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]]
[13:49:32] <stashbot>	 T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568
[13:50:20] <wikibugs>	 (PS8) FNegri: Allow cloudcumin hosts to connect to wm-bot [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756)
[13:50:58] <logmsgbot>	 !log taavi@deploy1002 taavi and reedy: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:51:10] <taavi>	 MatmaRex: please test
[13:51:19] <taavi>	 and Func, if you can test the revert
[13:51:59] <MatmaRex>	 looking
[13:52:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:53:03] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: FNegri)
[13:53:35] <Func>	 tavvi: confirmed restored to no subpage count cell
[13:54:14] <Func>	 oh sorry
[13:55:27] <MatmaRex>	 taavi: somehow i can't reproduce the bug this is supposed to fix (when not on mwdebug), but i can say that at least nothing gets worse
[13:55:38] <taavi>	 ok
[13:55:57] <icinga-wm>	 RECOVERY - snapshot of s5 in eqiad on backupmon1001 is OK: Last snapshot for s5 at eqiad (db1216) taken on 2023-06-29 12:10:16 (571 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:56:31] <MatmaRex>	 oh, never mind, i can reproduce it
[13:56:43] <MatmaRex>	 and the patch does fix it
[13:56:44] <jynus>	 ^ s4 should recover soon too, as I reran them
[13:56:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet
[13:57:08] <MatmaRex>	 you have to be creating a new article, add an external URL, and get a CAPTCHA. previously you'd get an error instead of the CAPTCHA
[14:00:00] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-worker1003.eqiad.wmnet with OS bullseye
[14:00:32] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:00:59] <wikibugs>	 (PS1) JMeybohm: envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - https://gerrit.wikimedia.org/r/934335 (https://phabricator.wikimedia.org/T300324)
[14:01:28] <wikibugs>	 (PS1) Elukey: role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/934336
[14:01:29] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:934023|Fix trying to get a PageRecord for a non-existent page (T340568)]], [[gerrit:934024|Revert "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"]] (duration: 12m 01s)
[14:01:33] <stashbot>	 T340568: Wikimedia\Assert\PreconditionException: Precondition failed: This Title instance does not represent an existing page: Property Simplify - https://phabricator.wikimedia.org/T340568
[14:01:46] <taavi>	 ok done
[14:02:02] <taavi>	 !log UTC afternoon backports done
[14:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:12] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[14:02:49] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[14:03:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[14:03:37] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[14:04:03] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/934336 (owner: Elukey)
[14:04:06] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[14:04:09] <jayme>	 !log imported envoyproxy 1.26.1 to component/envoy-future in buster-wikimedia - T300324
[14:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:13] <stashbot>	 T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324
[14:04:36] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: Clément Goubert)
[14:07:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2002.codfw.wmnet
[14:07:07] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[14:07:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:07] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[14:10:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[14:10:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[14:10:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:10:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors
[14:10:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors
[14:11:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[14:12:26] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: tighten proton group capture criteria [deployment-charts] - https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[14:12:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:13:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:13:15] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: tighten proton group capture criteria [deployment-charts] - https://gerrit.wikimedia.org/r/934313 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[14:14:17] <wikibugs>	 (CR) Vgutierrez: [C: +1] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/934336 (owner: Elukey)
[14:16:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[14:16:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet
[14:16:53] <wikibugs>	 (CR) Elukey: [C: +2] role::cache::text: set pass for ores-legacy.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/934336 (owner: Elukey)
[14:17:27] <wikibugs>	 (PS1) Jcrespo: mariadb: Reenable db1216 notifications [puppet] - https://gerrit.wikimedia.org/r/934338 (https://phabricator.wikimedia.org/T340610)
[14:17:34] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: Helm release datahub/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=datahub - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:17:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:15] <wikibugs>	 (Abandoned) MVernon: swift: roll object_expirer into cluster_info [puppet] - https://gerrit.wikimedia.org/r/933478 (https://phabricator.wikimedia.org/T229584) (owner: MVernon)
[14:18:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[14:18:54] <wikibugs>	 (CR) MVernon: "[this is the version puppet office hours thought was better, but please do review :)]" [puppet] - https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: MVernon)
[14:19:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[14:19:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:19:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors
[14:19:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors
[14:19:25] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubestagemaster2002.codfw.wmnet
[14:20:01] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] conftool: Add more servers to jobrunner cluster [puppet] - https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: Clément Goubert)
[14:20:55] <claime>	 !log Depooling mw148[2-6].eqiad.wmnet from api_appserver to move them to jobrunners - T329366
[14:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:59] <stashbot>	 T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366
[14:21:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: pick up Java 8 sec updates - jmm@cumin2002
[14:21:40] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw148[2-6].eqiad.wmnet
[14:22:06] <wikibugs>	 (PS1) Btullis: Disable the datahub upgrade job [deployment-charts] - https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514)
[14:22:22] <wikibugs>	 (CR) Clément Goubert: [C: +2] conftool: Add more servers to jobrunner cluster [puppet] - https://gerrit.wikimedia.org/r/934337 (https://phabricator.wikimedia.org/T329366) (owner: Clément Goubert)
[14:25:36] <wikibugs>	 (CR) Btullis: [C: +2] Disable the datahub upgrade job [deployment-charts] - https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[14:26:21] <wikibugs>	 (Merged) jenkins-bot: Disable the datahub upgrade job [deployment-charts] - https://gerrit.wikimedia.org/r/934339 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[14:28:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:30:13] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:31:14] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1482.eqiad.wmnet with OS buster
[14:31:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1483.eqiad.wmnet with OS buster
[14:31:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS buster
[14:31:30] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1484.eqiad.wmnet with OS buster
[14:31:33] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1485.eqiad.wmnet with OS buster
[14:31:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host mw1486.eqiad.wmnet with OS buster
[14:31:47] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw1484.eqiad.wmnet with OS buster
[14:37:54] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - https://gerrit.wikimedia.org/r/934335 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[14:40:58] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Reenable db1216 notifications [puppet] - https://gerrit.wikimedia.org/r/934338 (https://phabricator.wikimedia.org/T340610) (owner: Jcrespo)
[14:41:39] <wikibugs>	 (PS1) JMeybohm: envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - https://gerrit.wikimedia.org/r/934340 (https://phabricator.wikimedia.org/T300324)
[14:41:58] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] envoy-future: Update to 1.26.1 and add draining script [docker-images/production-images] - https://gerrit.wikimedia.org/r/934340 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[14:44:03] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage
[14:44:10] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage
[14:44:24] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage
[14:44:26] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage
[14:44:31] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage
[14:45:08] <wikibugs>	 (PS1) Jcrespo: mariadb: Reenable db1145 notifications [puppet] - https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610)
[14:45:25] <wikibugs>	 (CR) Jcrespo: [C: -1] "Not ready." [puppet] - https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: Jcrespo)
[14:45:31] <wikibugs>	 (CR) CI reject: [V: -1] mariadb: Reenable db1145 notifications [puppet] - https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: Jcrespo)
[14:46:43] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage
[14:46:59] <jayme>	 !log published image docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324
[14:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:03] <stashbot>	 T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324
[14:49:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1483.eqiad.wmnet with reason: host reimage
[14:51:25] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1485.eqiad.wmnet with reason: host reimage
[14:51:36] <icinga-wm>	 RECOVERY - snapshot of s4 in eqiad on backupmon1001 is OK: Last snapshot for s4 at eqiad (db1150) taken on 2023-06-29 10:52:47 (1851 GiB, -2.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[14:52:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:52:35] <wikibugs>	 SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (Urbanecm)
[14:52:37] <wikibugs>	 SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340753 (Urbanecm)
[14:53:55] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1484.eqiad.wmnet with reason: host reimage
[14:54:26] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage
[14:55:27] <wikibugs>	 SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (Urbanecm)
[14:55:29] <wikibugs>	 SRE-Access-Requests: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340753 (Urbanecm)
[14:58:53] <wikibugs>	 (CR) Dzahn: [C: +2] logspam-watch: Add a fox emoji [puppet] - https://gerrit.wikimedia.org/r/921050 (owner: Samtar)
[14:59:56] <wikibugs>	 (CR) Dzahn: "Eoghan isn't here at the moment to deploy this." [puppet] - https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[15:00:05] <jouncebot>	 Daimona: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Create new tables for the CampaignEvents extension deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1500).
[15:04:07] <wikibugs>	 (CR) Dzahn: "I would like to keep using just actual host names and not be forced to use regexes even for a single host." [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[15:05:09] <wikibugs>	 (CR) Dzahn: "wait, are you saying the site.pp is now used in cloud??" [puppet] - https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[15:06:26] <Daimona>	 !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T340000
[15:06:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:31] <stashbot>	 T340000: Create the tables for participant questions in prod - https://phabricator.wikimedia.org/T340000
[15:07:16] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:09:06] <wikibugs>	 (03PS1) 10Urbanecm: admin: Add SSH key for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/934366 (https://phabricator.wikimedia.org/T340752)
[15:11:15] <wikibugs>	 (03CR) 10Dzahn: "let's merge then?" [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25)
[15:11:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: lower certificate_expiry_days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/932181 (https://phabricator.wikimedia.org/T339862) (owner: 10Jelto)
[15:13:06] <wikibugs>	 (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T337591 is resolved. so is this not needed anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/923631 (https://phabricator.wikimedia.org/T337591) (owner: 10RhinosF1)
[15:13:09] <wikibugs>	 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) Key verification: Uploaded the key as `bast3006.wikimedia.org:T340752_key.txt`. Happy to also confirm in a different way if needed.
[15:14:32] <wikibugs>	 (03Abandoned) 10RhinosF1: admin: rename neilpquinn-wmf to nshahquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/923631 (https://phabricator.wikimedia.org/T337591) (owner: 10RhinosF1)
[15:14:54] <RhinosF1>	 mutante: ty for reminder
[15:15:03] <mutante>	 yw
[15:16:14] <moritzm>	 !log installing Java 8 security updates on sessionstore/codfw
[15:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:28] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1482.eqiad.wmnet with OS buster
[15:16:41] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm)
[15:18:02] <wikibugs>	 (03CR) 10Dzahn: "This change is open since 2020 and has not received a single comment from reviewers. what can we do to fix this problem?" [puppet] - 10https://gerrit.wikimedia.org/r/631888 (https://phabricator.wikimedia.org/T109950) (owner: 10Dereckson)
[15:19:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1483.eqiad.wmnet with OS buster
[15:19:15] <wikibugs>	 (03CR) 10Dzahn: "is this still needed? What was it for btw?" [puppet] - 10https://gerrit.wikimedia.org/r/895722 (owner: 10Jbond)
[15:19:39] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar)
[15:19:51] <wikibugs>	 (03CR) 10Dzahn: "How can we get cluster redirects reviewed?" [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix)
[15:21:19] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1485.eqiad.wmnet with OS buster
[15:23:43] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff)
[15:23:53] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 4 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff)
[15:23:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1484.eqiad.wmnet with OS buster
[15:24:14] <wikibugs>	 10SRE, 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, and 2 others: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10MoritzMuehlenhoff) 05Open→03Resolved This has been implemented with the new sre.ganeti.drain-node cookbook, which I've use...
[15:24:59] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/weight=10; selector: name=mw148[2-6].eqiad.wmnet
[15:25:08] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=mw148[2-6].eqiad.wmnet
[15:26:33] <wikibugs>	 (03PS4) 10Dzahn: durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:27:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001"
[15:27:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "it's been a while. and only comments. good to go!?" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:28:48] <wikibugs>	 (03PS2) 10Dzahn: wikidough: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:29:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:29:09] <wikibugs>	 (03CR) 10Ssingh: "This is on me for delaying it, thanks for the reminder. I will merge." [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:29:43] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw148[2-6].eqiad.wmnet,cluster=jobrunner
[15:30:59] <claime>	 !log Pooled mw148[2-6].eqiad.wmnet as jobrunners - T329366
[15:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:05] <stashbot>	 T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366
[15:32:39] <wikibugs>	 (03CR) 10Dzahn: "needs manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm)
[15:33:44] <wikibugs>	 (03PS2) 10Dzahn: extdist: Remove pre-bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm)
[15:34:32] <wikibugs>	 (03CR) 10Dzahn: "rebased.. which shows most of this was already done" [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm)
[15:34:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1482.eqiad.wmnet
[15:34:42] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1482.eqiad.wmnet
[15:34:43] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1483.eqiad.wmnet
[15:34:43] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1483.eqiad.wmnet
[15:34:45] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1484.eqiad.wmnet
[15:34:45] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1484.eqiad.wmnet
[15:34:47] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1485.eqiad.wmnet
[15:34:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1485.eqiad.wmnet
[15:34:48] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1486.eqiad.wmnet
[15:34:49] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1486.eqiad.wmnet
[15:35:03] <wikibugs>	 (03PS1) 10Elukey: ml-services: update the ores-legacy Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934370
[15:35:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/842884/42144/" [puppet] - 10https://gerrit.wikimedia.org/r/842884 (owner: 10Legoktm)
[15:35:50] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863294/42145/durum1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:35:53] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:37:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] "Thanks moritz for the patch and dzahn for the reminder!" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:37:19] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "looks like it can be abandoned. old and there is only node /^cloudnet100[5-6]\.eqiad\.wmnet$/  now" [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[15:37:40] <wikibugs>	 (03PS2) 10Effie Mouzeli: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm)
[15:39:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update the ores-legacy Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/934370 (owner: 10Elukey)
[15:39:26] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::java: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/934371
[15:39:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloudnet1003: decom host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[15:39:34] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: cloudnet1003: decom host [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[15:41:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] etherpad: Add a link to CoC in the defaultPadText [puppet] - 10https://gerrit.wikimedia.org/r/827512 (https://phabricator.wikimedia.org/T136744) (owner: 10Alexandros Kosiaris)
[15:42:47] <wikibugs>	 (03CR) 10Dzahn: "thanks! just going through old open puppet patches to clean up a bit" [puppet] - 10https://gerrit.wikimedia.org/r/838793 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[15:43:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, note inline" [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm)
[15:46:12] <wikibugs>	 (03CR) 10Dzahn: "@Legoktm How do you feel about this in 2023" [puppet] - 10https://gerrit.wikimedia.org/r/806473 (https://phabricator.wikimedia.org/T67270) (owner: 10Thcipriani)
[15:46:51] <wikibugs>	 (03CR) 10Dzahn: "are we still doing this or is it meanwhile "move to alertmanager" anyways?" [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond)
[15:47:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:48:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] profile::java: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/934371 (owner: 10Muehlenhoff)
[15:48:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[15:48:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:49:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] codesearch: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826864 (owner: 10Muehlenhoff)
[15:49:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:49:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Can we do prometheus checks please? 🙃" [puppet] - 10https://gerrit.wikimedia.org/r/836775 (owner: 10Muehlenhoff)
[15:49:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp
[15:49:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp
[15:50:04] <wikibugs>	 (03PS3) 10Effie Mouzeli: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm)
[15:54:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm)
[15:55:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) (owner: 10JMeybohm)
[15:59:31] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2002.codfw.wmnet
[15:59:33] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[15:59:36] <rzl>	 dancy: 👋 the change lgtm, are you going to want a puppet run anywhere?
[16:00:04] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230629T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:06] <dancy>	 yes.. lemme find out which hosts...
[16:00:13] <rzl>	 👍
[16:00:43] <dancy>	 so.. the WMCS gitlab-runners...
[16:01:37] <dancy>	 runner-1029.gitlab-runners.eqiad1.wikimedia.cloud and friends.
[16:02:10] <dancy>	 looks like that's runner-1021 through runner-1030
[16:03:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[16:03:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[16:03:47] <logmsgbot>	 !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:04:02] <logmsgbot>	 !log klausman@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:04:21] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:04:41] <logmsgbot>	 !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[16:10:03] <sukhe>	 !log systemctl restart bird.service on doh2002
[16:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:15] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp
[16:12:55] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp
[16:13:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (4) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[16:14:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10JArguello-WMF)
[16:15:03] <sukhe>	 2620:0:860:2:208:80:153:38 is doh2002, restarted bird
[16:15:30] <sukhe>	 (recovered)
[16:15:39] <wikibugs>	 (03Abandoned) 10Dzahn: reload apache after config change [puppet] - 10https://gerrit.wikimedia.org/r/263745 (owner: 10JanZerebecki)
[16:16:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[16:16:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:16:11] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.wipe-cache kubestagemaster2002.codfw.wmnet on all recursors
[16:16:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2002.codfw.wmnet on all recursors
[16:16:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[16:17:01] <wikibugs>	 (03CR) 10Bking: [C: 03+2] query_service: align all hiera configuration to the same order [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel)
[16:17:24] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2002.codfw.wmnet - jiji@cumin1001"
[16:18:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster2002.codfw.wmnet with OS bullseye
[16:18:19] <wikibugs>	 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye
[16:18:41] <mutante>	 !log releases1003 - re-enabling puppet after recent webserver debugging
[16:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[16:20:17] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (10jbond) 05Resolved→03Open Unfortunately i closed this too soon.  things work fine on the puppetserver but now...
[16:20:48] <wikibugs>	 (03PS1) 10Ssingh: Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377
[16:21:49] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:22:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:23:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks! there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/863295/1" [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[16:27:01] <wikibugs>	 (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:27:54] <wikibugs>	 (03CR) 10Ssingh: "Adding CI for this repo in I8a37327241230b3af4c19cefd3900fee52c5dabf" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh)
[16:28:26] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: remove nsfw model [deployment-charts] - 10https://gerrit.wikimedia.org/r/934380 (https://phabricator.wikimedia.org/T331416)
[16:30:31] <wikibugs>	 (03CR) 10Dzahn: "Sorry, you should mostly ignore my comment then. I have no context about dev environment or how that works. Is there like a master / track" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:32:31] <wikibugs>	 (03CR) 10Ssingh: "Thanks for the reminder! This will require restarting dnsdist as the conf file will change, so I will do that on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[16:32:34] <wikibugs>	 (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:32:36] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863295/42148/doh1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[20:13:30] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] apt: Ensure sources.list is updated before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah)
[20:13:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye
[20:15:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T340788" [puppet] - 10https://gerrit.wikimedia.org/r/932435 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[20:18:30] <jinxer-wm>	 (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[20:22:36] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:23:43] <wikibugs>	 10SRE, 10Add-Link, 10Growth-Team, 10GrowthExperiments-NewcomerTasks, 10serviceops: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Marostegui) Thank you!!
[20:24:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Dzahn) @Superpes15 @SLyngshede-WMF @MatthewVernon If i read the ticket right then access to NDA is done and access to deployment is postpone...
[20:28:58] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn)
[20:35:01] <wikibugs>	 (03PS4) 10JHathaway: site.pp: Drop wmnet domain and always use regexes [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972)
[20:37:19] <wikibugs>	 (03PS3) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920
[20:37:41] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) Yes, the overlap in people is small, and at first it does seem to make sense to merge them.  But the groups have prett...
[20:37:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway)
[20:39:09] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Dzahn) a:03Arnoldokoth +1 - uploaded key owned by urbanecm, matches gerrit
[20:39:34] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] swift: roll object_expirer into cluster_info (remove profile) [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon)
[20:39:48] <wikibugs>	 (03PS4) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920
[20:40:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn)
[20:40:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Dzahn) 05Open→03In progress a:03Arnoldokoth
[20:41:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) 05Open→03In progress a:03CCoxwell-WMF Access for @gengh is handled by T340614. For @CCoxwell-WMF we stil...
[20:41:40] <wikibugs>	 (03CR) 10JHathaway: "@dcaro, @jbond, @mutante, I think this is ready for review. I don't think it is a perfect solution, but I think it is worth trying." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[20:42:25] <wikibugs>	 (03CR) 10JHathaway: Enforce using a node regex without the wmnet tld (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway)
[20:43:08] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[20:45:46] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10RhinosF1) If deploy-service is used for k8s deploys, surely everyone in 'deployment' needs it with MediaWiki moving to k8s....
[20:46:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) Since we are not sure how much longer it will take for T329360 and because this ticket would now sit in "stalled" and be checked by a different perso...
[20:50:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) basically "deployment" is "mediawiki scap deployers" and deploy-service is "any service k8 deployers" and started as "...
[20:50:45] <wikibugs>	 (03CR) 10Hashar: "On Bullseye the issue comes from cloud-init `/etc/cloud/templates/sources.list.debian.tmpl` file which has:" [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah)
[20:53:55] <wikibugs>	 (03CR) 10JHathaway: "Rob I would love your review as well" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[20:54:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) Hi, this ticket seems resolved. Is it?
[20:54:56] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) a:05BTullis→03cjming Clare, does "it" work and we can close this?
[20:55:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Dzahn) a:03MatthewVernon
[20:56:05] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) a:03ItamarWMDE
[20:59:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10Dzahn) This ticket seems technically resolved. What, if any, would be the next step here? Is it still in discussion?
[21:00:23] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10taavi) >>! In T340165#8978137, @Dzahn wrote: > basically "deployment" is "mediawiki scap deployers" and deploy-service is "an...
[21:06:34] <wikibugs>	 (03PS2) 10Samtar: IS: Phonos, reorder and enable for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763)
[21:07:12] <TheresNoTime>	 jouncebot: nowandnext
[21:07:12] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 52 minute(s)
[21:07:12] <jouncebot>	 In 8 hour(s) and 52 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230630T0600)
[21:08:51] <TheresNoTime>	 Going to deploy a config change, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/934391
[21:09:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[21:09:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10taavi) > Should we drop this group? From my count[0] there are only three users in that group and not deployment, so it's jus...
[21:09:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (28) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:53] <wikibugs>	 (03PS1) 10Majavah: Drop deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165)
[21:09:56] <wikibugs>	 (03Merged) 10jenkins-bot: IS: Phonos, reorder and enable for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934391 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[21:10:12] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]]
[21:10:17] <stashbot>	 T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763
[21:10:22] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Dzahn) per ` role/common/deployment_server/kubernetes.yaml`  ` profile::admin::groups:   - deployment...
[21:11:40] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:11:53] * TheresNoTime testing
[21:13:10] * TheresNoTime syncing
[21:13:32] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) The only thing that changed since 2015 is the numbers are now higher :)   `   ~/repos/puppet$ git grep "ensure *=> *\([a-z]\+\)" | wc -l 3027...
[21:14:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (28) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:15:31] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2020 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:17:36] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:18:38] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:934391|IS: Phonos, reorder and enable for mediawikiwiki (T336763)]] (duration: 08m 26s)
[21:18:43] <stashbot>	 T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763
[21:19:08] * TheresNoTime done
[21:22:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:23:03] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:23:09] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:23:29] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:23:41] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:45] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:24:13] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2020 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:24:36] <TheresNoTime>	 hmm
[21:25:05] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[21:26:19] <inflatador>	 ryankemper: looks like the cookbook must be removing downtimes? I guess we should add a flag for that, maybe
[21:28:06] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage
[21:28:40] <mutante>	 same for 2020 and 2021?
[21:30:46] <wikibugs>	 (03CR) 10Dzahn: "This should be reviewed by serviceops team (Giuseppe/Alex team)" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah)
[21:37:40] <ryankemper>	 mutante: we did a data transfer from 2022 to 2020, so the cookbook would have removed both of those downtimes. 2021 was getting reimaged by inflatador
[21:37:58] <ryankemper>	 inflatador: yeah, a --no-remove-downtime flag would be nice
[21:45:59] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10Jdlrobson) Fine with me to be removed from that group.
[21:48:19] <wikibugs>	 10SRE-OnFire, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10JArguello-WMF)
[21:50:37] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10JArguello-WMF)
[21:51:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10cjming) hi @Dzahn - yes, all is good - thanks!
[21:52:13] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops-radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10JArguello-WMF)
[21:52:37] <mutante>	 ryankemper: ah! gotcha, thanks
[21:53:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Kerberos for cjming - https://phabricator.wikimedia.org/T340491 (10Dzahn) 05Open→03Resolved great :)
[21:58:30] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:00:02] <wikibugs>	 10SRE, 10Data Pipelines, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10JArguello-WMF)
[22:03:08] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:03:10] <wikibugs>	 10SRE, 10Data Pipelines, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10JArguello-WMF)
[22:14:07] <wikibugs>	 (03CR) 10Dzahn: "I feel like this should maybe be a post to an SRE mailing list." [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[22:57:36] <wikibugs>	 10SRE, 10Data-Engineering, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10JArguello-WMF)
[23:15:53] <wikibugs>	 (03PS1) 10RLazarus: opentelemetry-collector: Switch off unused default receivers and ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/934420 (https://phabricator.wikimedia.org/T320564)
[23:28:18] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service,wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:30] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:28:38] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:28:54] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 398 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:29:26] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:29:26] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:29:26] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook