[00:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:12:45] <wikibugs>	 (PS1) Cathal Mooney: Change router advertisement template to set description correctly [homer/public] - https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191)
[00:14:05] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Change router advertisement template to set description correctly [homer/public] - https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[00:14:40] <wikibugs>	 (Merged) jenkins-bot: Change router advertisement template to set description correctly [homer/public] - https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[00:16:07] <wikibugs>	 (PS1) Cathal Mooney: Remove DHCP relay config for codfw row a/b public vlans [homer/public] - https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191)
[00:17:06] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Remove DHCP relay config for codfw row a/b public vlans [homer/public] - https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[00:17:44] <wikibugs>	 (Merged) jenkins-bot: Remove DHCP relay config for codfw row a/b public vlans [homer/public] - https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[00:18:32] <wikibugs>	 (CR) Krinkle: Enable $wgStatsTarget for requests to kube-mw-debug (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: Cwhite)
[00:26:16] <wikibugs>	 (PS4) Tim Starling: Add LoginNotify cron job [puppet] - https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989)
[00:26:53] <wikibugs>	 (PS6) Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989)
[00:32:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157']
[00:39:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1157']
[00:39:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974636
[00:39:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974636 (owner: TrainBranchBot)
[00:50:24] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye
[00:50:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[00:55:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[00:59:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974636 (owner: TrainBranchBot)
[01:00:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1158']
[01:12:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye
[01:12:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[01:14:49] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) P53530
[01:25:11] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) Here's the change in errors on /dev/sdj since the 31st.   ` 4c4 < (1) cloudcephosd1024.eqiad.wmnet 198 Of...
[01:58:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:21] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:21] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:24:40] <wikibugs>	 (PS1) MPGuy2824: Disable PageTriage's extended features on beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635)
[03:59:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53531 and previous config saved to /var/cache/conftool/dbconfig/20231117-035924-arnaudb.json
[03:59:32] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[04:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:14:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P53532 and previous config saved to /var/cache/conftool/dbconfig/20231117-041430-arnaudb.json
[04:29:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P53533 and previous config saved to /var/cache/conftool/dbconfig/20231117-042937-arnaudb.json
[04:44:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53534 and previous config saved to /var/cache/conftool/dbconfig/20231117-044443-arnaudb.json
[04:44:45] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[04:44:48] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[04:44:58] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[04:45:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53535 and previous config saved to /var/cache/conftool/dbconfig/20231117-044504-arnaudb.json
[04:53:15] <wikibugs>	 (CR) Zoranzoki21: [C: -1] ""groups/Phabricator/Phabricator.yaml" in translatewiki.net repository has to be updated as well" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[05:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.838285390921219s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.9822750948484793s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:58:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:21] <logmsgbot>	 !log mabualruz@deploy2002 Backport cancelled.
[06:48:28] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1119 to m2 [puppet] - https://gerrit.wikimedia.org/r/975117 (https://phabricator.wikimedia.org/T351386)
[06:49:38] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1119 to m2 [puppet] - https://gerrit.wikimedia.org/r/975117 (https://phabricator.wikimedia.org/T351386) (owner: Marostegui)
[06:54:30] <wikibugs>	 (PS1) Marostegui: db2133: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/975118 (https://phabricator.wikimedia.org/T351386)
[06:55:32] <wikibugs>	 (CR) Marostegui: [C: +2] db2133: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/975118 (https://phabricator.wikimedia.org/T351386) (owner: Marostegui)
[06:55:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2133.codfw.wmnet with OS bookworm
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0700)
[07:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:12:30] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:12:37] <marostegui>	 ^ expected
[07:12:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:13:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2133.codfw.wmnet with reason: host reimage
[07:16:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2133.codfw.wmnet with reason: host reimage
[07:19:33] <mo_abualruz>	 https://www.irccloud.com/pastebin/Zl0lzsiz
[07:20:44] <mo_abualruz>	 Good morning, I'm having some trouble backporting; details are in the snippet above
[07:30:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2133.codfw.wmnet with OS bookworm
[07:31:15] <RhinosF1>	 jouncebot: nowandnext
[07:31:15] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0700)
[07:31:15] <jouncebot>	 In 0 hour(s) and 28 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0800)
[07:31:26] <RhinosF1>	 mo_abualruz: why are you backporting?
[07:31:38] <RhinosF1>	 It's Friday. Has approval for an emergency deploy been sought?
[07:34:43] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report
[07:34:44] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[07:34:49] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report
[07:34:49] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[07:35:16] <wikibugs>	 (PS1) Muehlenhoff: Adapt VM name [puppet] - https://gerrit.wikimedia.org/r/975122 (https://phabricator.wikimedia.org/T349402)
[07:38:14] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Adapt VM name [puppet] - https://gerrit.wikimedia.org/r/975122 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[07:41:54] <wikibugs>	 (PS1) Muehlenhoff: Switch moss nodes to role::insetup::buster [puppet] - https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619)
[07:43:53] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:44:19] <mo_abualruz>	 RhinosF1: I'm not sure about the workflow on Fridays. My team has asked me to backport it; there is a high number of front-end errors because of it
[07:45:11] <wikibugs>	 (PS4) Slyngshede: P:url_downloader add blackbox exporter. [puppet] - https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694)
[07:47:11] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/532/console" [puppet] - https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:47:28] <wikibugs>	 (PS1) Muehlenhoff: Switch debmonitor2003 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619)
[07:49:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host debmonitor2003.codfw.wmnet
[07:50:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch debmonitor2003 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:50:16] <wikibugs>	 (PS2) Muehlenhoff: Switch debmonitor2003 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619)
[07:57:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host debmonitor2003.codfw.wmnet
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0800)
[08:00:48] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:01:21] <RhinosF1>	 mo_abualruz: you still need SRE/Releng approval
[08:01:51] <RhinosF1>	 moritzm: you seem to be around? We've got a request for an emergency deploy
[08:02:21] <mo_abualruz>	 Sure, where do I submit a request?
[08:02:48] <RhinosF1>	 mo_abualruz: you ask in here and -releng, I've done that
[08:03:15] <RhinosF1>	 Hopefully releng can also look at the error you got
[08:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:03:37] <mo_abualruz>	 thanks a lot
[08:04:22] <moritzm>	 I'm around, but let's rather wait for one of the releng folks to be around (hashar or jnuche), not sure on which basis those exceptions are handled
[08:05:10] <RhinosF1>	 moritzm: I've pinged both in -releng, someone from SRE is supposed to say it's ok too I believe. We'll need them for the fact mo_abualruz couldn't work scap either.
[08:05:33] <RhinosF1>	 mo_abualruz: it will be at least an hour for Jamie, not sure about has.har
[08:05:49] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report
[08:05:50] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[08:06:34] <mo_abualruz>	 No worries, I will wait. Thanks a lot, RhinosF1
[08:06:58] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests, Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff) p:Triage→Medium
[08:07:31] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests, Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff)
[08:08:03] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests, Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff)
[08:09:59] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host crm2001.codfw.wmnet
[08:10:00] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.dns.netbox
[08:13:27] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM crm2001.codfw.wmnet - jmm@cumin1001"
[08:14:18] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM crm2001.codfw.wmnet - jmm@cumin1001"
[08:14:18] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:14:18] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.dns.wipe-cache crm2001.codfw.wmnet on all recursors
[08:14:21] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) crm2001.codfw.wmnet on all recursors
[08:14:48] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM crm2001.codfw.wmnet - jmm@cumin1001"
[08:15:39] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM crm2001.codfw.wmnet - jmm@cumin1001"
[08:18:29] <wikibugs>	 (PS1) Muehlenhoff: Configure crm2001 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975202 (https://phabricator.wikimedia.org/T349402)
[08:19:16] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2055 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Configure crm2001 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/975202 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[08:25:18] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host crm2001.codfw.wmnet with OS bookworm
[08:25:30] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests, Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host crm2001.codfw.wmnet with OS...
[08:29:00] <wikibugs>	 (CR) Stevemunene: [C: +2] Add dummy keytabs for new druid101[0-1] [labs/private] - https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[08:29:11] <wikibugs>	 (CR) Stevemunene: [V: +2 C: +2] Add dummy keytabs for new druid101[0-1] [labs/private] - https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[08:30:13] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[08:36:32] <jynus>	 !disable puppet on dbprov2001 for testing T351491
[08:36:32] <stashbot>	 T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup - https://phabricator.wikimedia.org/T351491
[08:42:42] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on crm2001.codfw.wmnet with reason: host reimage
[08:45:41] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on crm2001.codfw.wmnet with reason: host reimage
[08:48:29] <wikibugs>	 (CR) WMDE-Fisch: "> 18:51:32 map-bmswiki is referenced for wgPopupsConflictingNavPopupsGadgetName, but it isn't either a wiki or a dblist" [mediawiki-config] - https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: WMDE-Fisch)
[08:57:42] <hashar>	 o/
[08:58:04] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551)
[08:58:26] <hashar>	 mo_abualruz: sorry I only opened IRC a couple minutes ago
[08:59:16] <hashar>	 > I try scap backport 975096 I get ```fatal: cannot change to '/srv/mediawiki-staging/php-master': No such file or directory
[08:59:16] <hashar>	 > 07:16:21 backport failed: <CalledProcessError> Command '['git', '-C', '/srv/mediawiki-staging/php-master', 'rev-list', 'origin/master', '--regexp-ignore-case', '--grep', 'Change-Id: Ie417d62484192f1b9ac270b1e619ec783da89d9d']' returned non-zero exit status 128.
[08:59:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:00:01] <hashar>	 that `php-master` link is for the beta cluster which runs mediawiki out of the master branches
[09:00:10] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:00:11] <hashar>	 there is ZERO reason for it to exist on production
[09:01:17] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host crm2001.codfw.wmnet with OS bookworm
[09:01:17] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host crm2001.codfw.wmnet
[09:01:22] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host crm2001.codfw.wmnet with OS bookworm completed: - crm...
[09:01:37] <hashar>	 my bet is something got changed in puppet/config which feeds the wrong value
[09:02:09] <taavi>	 hashar: mo_abualruz: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096/ is against the master branch, not a wmf release branch
[09:02:29] <hashar>	 that is the point of running `scap-backport`
[09:04:37] <taavi>	 no, scap backport has never created the cherry-picks automatically
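The failure quoted above comes from running `git rev-list` against a checkout directory that does not exist on the host. As a minimal sketch only (this is not scap's actual code; the function names `build_rev_list_cmd` and `find_change` are hypothetical), the same lookup can be written so that a missing checkout produces a readable error instead of git's exit status 128:

```python
import subprocess
from pathlib import Path


def build_rev_list_cmd(repo_dir: str, branch: str, change_id: str) -> list[str]:
    """Build the same git invocation that appears in the quoted traceback."""
    return [
        "git", "-C", repo_dir, "rev-list", branch,
        "--regexp-ignore-case", "--grep", f"Change-Id: {change_id}",
    ]


def find_change(repo_dir: str, branch: str, change_id: str) -> list[str]:
    """Return commit hashes on `branch` whose message mentions `change_id`.

    Hypothetical helper: it checks for the checkout first, so a missing
    directory raises FileNotFoundError rather than a CalledProcessError.
    """
    if not Path(repo_dir).is_dir():
        raise FileNotFoundError(f"checkout not found: {repo_dir}")
    result = subprocess.run(
        build_rev_list_cmd(repo_dir, branch, change_id),
        capture_output=True,
        text=True,
        check=True,  # still raise if git itself fails for another reason
    )
    return result.stdout.split()
```

An empty list from `find_change` would mean the Change-Id is not on that branch, which matches taavi's point: a change made only against master will not be found on a wmf release branch until it is cherry-picked there.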
[09:04:58] <moritzm>	 !log imported php-memcached  3.1.5+2.2.0-5+deb11u1+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[09:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:37] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[09:16:22] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:18] <wikibugs>	 (PS1) Brouberol: Replace an-druid1001 by an-druid1001 in druid connection strings [puppet] - https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604)
[09:18:50] <wikibugs>	 (PS2) Brouberol: Replace an-druid1001 by an-druid1002 in druid connection strings [puppet] - https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604)
[09:22:51] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with new runners
[09:24:03] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with new runners
[09:25:24] <wikibugs>	 (CR) Brouberol: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) (owner: Btullis)
[09:26:56] <wikibugs>	 (CR) Brouberol: [C: +1] Set a non-default mapreduce file committer algorithm for spark [puppet] - https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[09:31:14] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1003.eqiad.wmnet
[09:32:35] <wikibugs>	 (PS2) Brouberol: Configure Matomo's TagManager to write to existing tmpdir [puppet] - https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[09:34:18] <wikibugs>	 (CR) Brouberol: "I fixed the tests" [puppet] - https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[09:34:20] <wikibugs>	 (CR) Klausman: [C: +2] hiera: migrate ml-etcd1003.eqiad.wmnet to Puppet v7 [puppet] - https://gerrit.wikimedia.org/r/975208 (https://phabricator.wikimedia.org/T349619) (owner: Klausman)
[09:34:46] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol)
[09:36:40] <wikibugs>	 (CR) Btullis: Configure Matomo's TagManager to write to existing tmpdir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[09:38:05] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1003.eqiad.wmnet
[09:39:24] <wikibugs>	 (CR) Brouberol: [C: +2] Replace an-druid1001 by an-druid1002 in druid connection strings [puppet] - https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol)
[09:41:02] <wikibugs>	 (PS1) Muehlenhoff: Create a new initial role for crm hosts [puppet] - https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402)
[09:41:31] <wikibugs>	 (CR) CI reject: [V: -1] Create a new initial role for crm hosts [puppet] - https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[09:43:17] <wikibugs>	 (PS2) Muehlenhoff: Create a new initial role for crm hosts [puppet] - https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402)
[09:44:10] <wikibugs>	 (PS1) Volans: remote: add RemoteHost.get_subset() method [software/spicerack] - https://gerrit.wikimedia.org/r/975211
[09:44:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53537 and previous config saved to /var/cache/conftool/dbconfig/20231117-094412-arnaudb.json
[09:44:17] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:45:01] <wikibugs>	 (CR) Volans: [C: -1] "I've sent a separate CR that should allow to simplify a bit this one and be a bit less hacky ;)" [software/spicerack] - https://gerrit.wikimedia.org/r/974995 (owner: Jbond)
[09:50:59] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/535/con" [puppet] - https://gerrit.wikimedia.org/r/975210 (https://phabricator.wikimedia.org/T349619) (owner: Klausman)
[09:51:51] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1002.eqiad.wmnet
[09:55:30] <wikibugs>	 (PS1) Elukey: Remove ORES roles and configs [puppet] - https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278)
[09:55:32] <wikibugs>	 (PS1) Elukey: profile::logstash: remove ORES configs [puppet] - https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278)
[09:55:34] <wikibugs>	 (PS1) Elukey: Remove ORES deployment settings [puppet] - https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278)
[09:55:36] <wikibugs>	 (PS1) Elukey: Remove ORES configs and clusters [puppet] - https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278)
[09:55:38] <wikibugs>	 (PS1) Elukey: profile::prometheus::ops: remove ORES Redis configs [puppet] - https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278)
[09:55:40] <wikibugs>	 (PS1) Elukey: cloud: Remove ores-beta ATS settings [puppet] - https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278)
[09:55:42] <wikibugs>	 (PS1) Elukey: admin: remove ores-admins group [puppet] - https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278)
[09:55:44] <wikibugs>	 (PS1) Elukey: contactgroups: remove old team-scoring [puppet] - https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278)
[09:58:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:59:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P53539 and previous config saved to /var/cache/conftool/dbconfig/20231117-095918-arnaudb.json
[10:01:30] <wikibugs>	 (PS1) Majavah: team-wmcs: restrict alerts to eqiad for now [alerts] - https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010)
[10:03:39] <wikibugs>	 (PS1) JMeybohm: Normalize config/sites.yaml to be machine editable [homer/public] - https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074)
[10:03:43] <wikibugs>	 (PS1) JMeybohm: Move mw appservers to kubernetes workers [homer/public] - https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074)
[10:03:47] <wikibugs>	 (PS1) JMeybohm: Normalize conftool-data/node/{eqiad,codfw}.yaml to be machine editable [puppet] - https://gerrit.wikimedia.org/r/975227 (https://phabricator.wikimedia.org/T351074)
[10:03:49] <wikibugs>	 (PS1) JMeybohm: Move mw appservers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074)
[10:03:56] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[10:04:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (MatthewVernon)
[10:04:33] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] hiera: migrate ml-etcd*.eqiad.wmnet to Puppet v7 [puppet] - https://gerrit.wikimedia.org/r/975210 (https://phabricator.wikimedia.org/T349619) (owner: Klausman)
[10:05:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Create a new initial role for crm hosts [puppet] - https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[10:05:55] <wikibugs>	 (CR) Elukey: [C: +1] "Should we also set the repository in read-only and/or archive? No idea what is the procedure from gerrit to gitlab, if the old gerrit repo" [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (owner: BCornwall)
[10:07:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (MatthewVernon) @thcipriani you're the approver for the `deployment` group, can you approve (or otherwise) this request, please?
[10:08:40] <wikibugs>	 (PS1) Muehlenhoff: Apply crm role to crm2001 [puppet] - https://gerrit.wikimedia.org/r/975229 (https://phabricator.wikimedia.org/T349402)
[10:08:59] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1002.eqiad.wmnet
[10:09:44] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1001.eqiad.wmnet
[10:10:14] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon)
[10:11:12] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon) ssh pubkey confirmed OOB; this just needs group approval.
[10:12:46] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[10:12:48] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new crm VM - jmm@cumin1001 - T349402"
[10:12:54] <stashbot>	 T349402: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402
[10:13:17] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders)
[10:13:52] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] "I'll deploy this one now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders)
[10:14:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P53540 and previous config saved to /var/cache/conftool/dbconfig/20231117-101425-arnaudb.json
[10:14:54] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491)
[10:15:00] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders)
[10:15:24] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232
[10:15:26] <wikibugs>	 (03CR) 10Tchanders: ipoid: Disable the daily updates job and schedule an import (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders)
[10:15:56] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:16:01] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:16:13] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:16:17] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:17:51] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new crm VM - jmm@cumin1001 - T349402"
[10:17:55] <stashbot>	 T349402: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402
[10:18:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Apply crm role to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/975229 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff)
[10:18:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/975211 (owner: 10Volans)
[10:18:44] <wikibugs>	 (03CR) 10Jcrespo: "The puppet side of the change for the ca update." [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo)
[10:19:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:19:24] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:19:27] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:19:54] <wikibugs>	 (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-serve-ctrl*.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975233 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[10:19:59] <wikibugs>	 (03PS3) 10Jbond: puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995
[10:20:14] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo)
[10:20:30] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve-ctrl1002.eqiad.wmnet
[10:22:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend Wmflib::Team type with Fundraising Tech [puppet] - 10https://gerrit.wikimedia.org/r/975234 (https://phabricator.wikimedia.org/T349402)
[10:23:14] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet
[10:25:29] <wikibugs>	 (03PS1) 10Punith.nyk: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975235
[10:27:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond)
[10:28:10] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve-ctrl1001.eqiad.wmnet
[10:29:27] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo)
[10:29:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53541 and previous config saved to /var/cache/conftool/dbconfig/20231117-102931-arnaudb.json
[10:29:33] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[10:29:36] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:29:47] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[10:29:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53542 and previous config saved to /var/cache/conftool/dbconfig/20231117-102952-arnaudb.json
[10:31:24] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet
[10:31:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend Wmflib::Team type with Fundraising Tech [puppet] - 10https://gerrit.wikimedia.org/r/975234 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff)
[10:32:25] <wikibugs>	 (03CR) 10Jcrespo: "Should work ok, although cloud hiera is showing the trivial (non existent) output: https://puppet-compiler.wmflabs.org/output/975231/536/b" [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo)
[10:32:38] <mo_abualruz>	 hashar: Thanks, I will cherry-pick it into a patch against the release branch
[10:34:03] <wikibugs>	 (03PS1) 10Mabualruz: Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362)
[10:43:54] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): Update the list of ReferenceTooltip gadget names (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch)
[10:44:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:48:53] <wikibugs>	 (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975238 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[10:50:00] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449)
[10:50:08] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449) (owner: 10Kosta Harlan)
[10:50:59] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449) (owner: 10Kosta Harlan)
[10:51:17] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1008.eqiad.wmnet
[10:52:19] <wikibugs>	 (03CR) 10Klausman: "This should be coordinated with the data persistence team (of which I am not a member). You can find them on IRC (Libera) in #wikimedia-da" [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk)
[10:52:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz)
[10:52:43] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:53:02] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:53:28] <wikibugs>	 (03CR) 10Mabualruz: "recheck" [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz)
[10:54:22] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:54:32] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1008.eqiad.wmnet
[10:54:47] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:55:40] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Remove ORES roles and configs [puppet] - 10https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:56:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402)
[10:56:33] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:56:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff)
[10:57:12] <wikibugs>	 (03CR) 10Klausman: "I presume the prod-site data (current state) will be automagically removed when this is submitted?" [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:58:13] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:58:28] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:59:55] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430)
[11:00:41] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz)
[11:00:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402)
[11:03:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove ORES roles and configs [puppet] - 10https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[11:04:18] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:04:24] <wikibugs>	 (03CR) 10Jcrespo: "Hello, Punith, thank you for your contribution, but deploying a change that can affect TLS on MySQL production servers is something that c" [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk)
[11:04:38] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert)
[11:06:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:06:38] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert)
[11:07:01] <claime>	 !log Redeploying mw-on-k8s for T350430
[11:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:07] <stashbot>	 T350430: php-fpm logs from Kubernetes lack 'message' and 'normalized_message' - https://phabricator.wikimedia.org/T350430
[11:08:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:08:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:08:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:08:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:08:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:08:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:08:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:08:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:08:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:08:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:08:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:08:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:08:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:08:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:08:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[11:08:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[11:10:04] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo)
[11:10:08] <jynus>	 !log running schema change on backup1-eqiad (mediabackups) T191804
[11:10:10] <wikibugs>	 (03Abandoned) 10Punith.nyk: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk)
[11:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:12] <stashbot>	 T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804
[11:10:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:11:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:11:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:11:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff crm2001.codfw.wmnet has been created and configured to al...
[11:11:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:11:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:12:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:12:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:12:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:13:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:13:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:13:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:13:29] <mo_abualruz>	 I will start the deployment for 975037 now that the change is against the release branch. Thanks for the directions and the approval
[11:13:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mabualruz@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz)
[11:14:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:14:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[11:14:52] <wikibugs>	 (03PS5) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823
[11:14:53] <hashar>	 :)
[11:14:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[11:14:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[11:15:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[11:15:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[11:15:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[11:15:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[11:15:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:16:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[11:16:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[11:16:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:16:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[11:16:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[11:16:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[11:16:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[11:16:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[11:17:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[11:17:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[11:17:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[11:17:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[11:20:18] <wikibugs>	 (03CR) 10MVernon: swift: migrate one node to envoy for TLS termination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:20:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond)
[11:20:45] <wikibugs>	 (03PS1) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248
[11:20:46] <jynus>	 !log running schema change on backup1-codfw (mediabackups) T191804
[11:20:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:51] <stashbot>	 T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804
[11:27:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:28:21] <wikibugs>	 (03Merged) 10jenkins-bot: Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz)
[11:28:36] <logmsgbot>	 !log mabualruz@deploy2002 Started scap: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]]
[11:28:40] <stashbot>	 T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362
[11:29:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249
[11:29:55] <logmsgbot>	 !log mabualruz@deploy2002 mabualruz: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:30:18] <logmsgbot>	 !log mabualruz@deploy2002 mabualruz: Continuing with sync
[11:36:08] <logmsgbot>	 !log mabualruz@deploy2002 Finished scap: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]] (duration: 07m 32s)
[11:36:12] <stashbot>	 T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362
[11:36:16] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:37:12] <mo_abualruz>	 Thanks a lot, the deployment was successful
[11:39:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Hiera setting on an-worker1111 [puppet] - 10https://gerrit.wikimedia.org/r/975251
[11:39:18] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:42:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (10cmooney) 05Open→03Resolved Patches to support this have been merged and it's working for the codfw row A/B public vlans, closing task.
[11:46:31] <wikibugs>	 (03PS5) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[11:47:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[11:47:37] <wikibugs>	 (03CR) 10MVernon: [C: 04-1] "Hi," [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo)
[11:47:47] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove Hiera setting on an-worker1111 [puppet] - 10https://gerrit.wikimedia.org/r/975251
[11:48:24] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney)
[11:48:55] <wikibugs>	 (03PS6) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[11:49:41] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney)
[11:53:21] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney)
[11:54:05] <wikibugs>	 (03PS2) 10Kosta Harlan: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500)
[11:54:10] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan)
[11:54:20] <wikibugs>	 (03PS7) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[11:54:31] <wikibugs>	 (03CR) 10MVernon: "What practical effect does this have?" [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:54:38] <hashar>	 mo_abualruz: congratulations :)
[11:54:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/975252
[11:55:01] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan)
[11:55:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond)
[11:55:16] <mo_abualruz>	 hashar thanks
[11:55:24] <wikibugs>	 (03PS8) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[11:55:31] * hashar lunches
[11:59:14] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond)
[11:59:40] <wikibugs>	 (03PS4) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[11:59:42] <wikibugs>	 (03PS5) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796)
[12:01:48] <wikibugs>	 (03PS2) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043)
[12:02:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene)
[12:03:08] <wikibugs>	 (03PS3) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043)
[12:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:04:44] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) p:05Triage→03Medium
[12:06:19] <wikibugs>	 (03CR) 10Muehlenhoff: Switch moss nodes to role::insetup::buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:06:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[12:07:11] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] Switch moss nodes to role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:09:46] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[12:09:54] <wikibugs>	 (03PS5) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[12:10:08] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[12:10:10] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) The public1-a-codfw and public1-b-codfw gateways have been migrated to the new setup.    **Problems**  Unfortu...
[12:10:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Also configure acmechief hosts for initially migrated roles [puppet] - 10https://gerrit.wikimedia.org/r/975254
[12:10:56] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] "Deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan)
[12:11:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch moss nodes to role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:15:19] <wikibugs>	 (03PS6) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[12:18:58] <wikibugs>	 (03PS1) 10Majavah: O:puppetserver: create role for per-project puppet server [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452)
[12:19:00] <wikibugs>	 (03PS1) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257
[12:19:15] <wikibugs>	 (03PS7) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[12:20:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/546/console" [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[12:23:48] <wikibugs>	 (03PS2) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257
[12:23:50] <wikibugs>	 (03PS1) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/975258
[12:24:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[12:24:20] <wikibugs>	 (03Abandoned) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/974615 (owner: 10JMeybohm)
[12:24:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[12:25:02] <wikibugs>	 (03PS8) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[12:25:18] <wikibugs>	 (03PS3) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257
[12:25:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[12:27:17] <wikibugs>	 (03PS4) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257
[12:27:44] <wikibugs>	 (03CR) 10JMeybohm: "I've tested this in staging-codfw. The node is created with the proper taint and unschedulable: true flag. Both of which are not reset on " [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm)
[12:27:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[12:29:03] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/550/con" [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm)
[12:30:23] <wikibugs>	 (03PS5) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257
[12:31:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[12:34:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249 (owner: 10Muehlenhoff)
[12:34:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 (https://phabricator.wikimedia.org/T349619)
[12:35:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249 (owner: 10Muehlenhoff)
[12:39:52] <wikibugs>	 (03PS9) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690)
[12:40:15] <wikibugs>	 (03CR) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[12:42:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw1001.wikimedia.org
[12:45:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:45:54] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 (https://phabricator.wikimedia.org/T349619)
[12:45:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:47:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm)
[12:47:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:49:46] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene)
[12:50:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Yep anything that helps!" [homer/public] - 10https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm)
[12:50:51] <logmsgbot>	 !log joal@deploy2002 Started deploy [airflow-dags/analytics@a5e5ddc]: Airflow HOTFIX [airflow-dags/analytics@a5e5ddca]
[12:50:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:51:19] <logmsgbot>	 !log joal@deploy2002 Finished deploy [airflow-dags/analytics@a5e5ddc]: Airflow HOTFIX [airflow-dags/analytics@a5e5ddca] (duration: 00m 28s)
[12:52:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ldap-rw1001.wikimedia.org
[12:52:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw2001.wikimedia.org
[12:52:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:52:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:53:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ldap-rw2001.wikimedia.org
[12:53:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw2001.wikimedia.org
[12:54:18] <taavi>	 elukey: puppet on deploy nodes is failing with /Stage[main]/Profile::Httpbb/Httpbb::Test_suite[ores/test_ores.yaml]/File[/srv/deployment/httpbb-tests/ores/test_ores.yaml] Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/httpbb/ores/test_ores.yaml
[12:54:20] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ldap-rw2001.wikimedia.org
[12:55:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Temporarily revert change for ldap-rw2001 [puppet] - 10https://gerrit.wikimedia.org/r/975262
[12:56:00] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "the directory should be created by g10k and owned by root" [puppet] - 10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: 10Andrew Bogott)
[12:58:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Temporarily revert change for ldap-rw2001 [puppet] - 10https://gerrit.wikimedia.org/r/975262 (owner: 10Muehlenhoff)
[13:03:29] <wikibugs>	 (03CR) 10Jbond: "I have tested this using the script at https://phabricator.wikimedia.org/P53543 and it produced the following results (I manually updated t" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[13:06:05] <wikibugs>	 (03CR) 10Jbond: puppet: update gat_ca_server to also support srv discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond)
[13:07:18] <wikibugs>	 (03PS4) 10D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[13:08:46] <wikibugs>	 (03PS1) 10Jbond: Revert "Temporarily revert change for ldap-rw2001" [puppet] - 10https://gerrit.wikimedia.org/r/975038
[13:08:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Temporarily revert change for ldap-rw2001" [puppet] - 10https://gerrit.wikimedia.org/r/975038 (owner: 10Jbond)
[13:11:14] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah)
[13:13:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:13:45] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah)
[13:13:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:15:02] <wikibugs>	 (03Merged) 10jenkins-bot: team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah)
[13:16:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975254 (owner: 10Muehlenhoff)
[13:16:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Cleanup obsolete Hiera files [puppet] - 10https://gerrit.wikimedia.org/r/975265
[13:18:24] <wikibugs>	 (03CR) 10Vgutierrez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[13:20:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah)
[13:20:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, cosmetic comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[13:22:51] <wikibugs>	 (03PS2) 10Elukey: profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278)
[13:22:53] <wikibugs>	 (03PS2) 10Elukey: Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278)
[13:22:55] <wikibugs>	 (03PS2) 10Elukey: Remove ORES configs and clusters [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278)
[13:22:57] <wikibugs>	 (03PS2) 10Elukey: profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278)
[13:22:59] <wikibugs>	 (03PS2) 10Elukey: cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278)
[13:23:01] <wikibugs>	 (03PS2) 10Elukey: admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278)
[13:23:03] <wikibugs>	 (03PS2) 10Elukey: contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278)
[13:23:05] <wikibugs>	 (03PS1) 10Elukey: profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278)
[13:23:33] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:24:32] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1007.eqiad.wmnet
[13:26:12] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm)
[13:26:51] <wikibugs>	 (03PS1) 10Vgutierrez: wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/975268 (https://phabricator.wikimedia.org/T351069)
[13:26:53] <wikibugs>	 (03Abandoned) 10Btullis: Set a non-default mapreduce file committer algorithm for spark [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis)
[13:26:56] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1007.eqiad.wmnet
[13:27:44] <wikibugs>	 (03Abandoned) 10Vgutierrez: wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/975268 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[13:28:29] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1006.eqiad.wmnet
[13:28:49] <wikibugs>	 (03CR) 10Vgutierrez: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[13:29:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:30:16] <wikibugs>	 (03CR) 10Vgutierrez: "PS2 PCC https://puppet-compiler.wmflabs.org/output/974623/491/ shows a working example for ncredir6001 after setting ipip_encapsulation: t" [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[13:30:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:30:58] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1006.eqiad.wmnet
[13:31:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:32:48] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1005.eqiad.wmnet
[13:33:43] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos)
[13:33:49] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1004.eqiad.wmnet
[13:35:05] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Remove ORES configs and clusters [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:35:31] <wikibugs>	 (03PS9) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[13:35:36] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1005.eqiad.wmnet
[13:35:36] <wikibugs>	 (03CR) 10Btullis: Configure the analytics prometheus instance to start scraping airflow (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[13:35:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:35:59] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos)
[13:36:06] <wikibugs>	 (03PS1) 10Kosta Harlan: [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299)
[13:36:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:36:26] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1004.eqiad.wmnet
[13:37:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:39:45] <wikibugs>	 (03CR) 10Majavah: O:puppetserver: create role for per-project puppet server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah)
[13:39:48] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] O:puppetserver: create role for per-project puppet server [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah)
[13:41:25] <wikibugs>	 (03PS1) 10Jbond: puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272
[13:42:24] <wikibugs>	 (03PS3) 10Elukey: admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278)
[13:42:26] <wikibugs>	 (03PS3) 10Elukey: contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278)
[13:42:28] <wikibugs>	 (03PS3) 10Elukey: cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278)
[13:42:46] <moritzm>	 !log imported php-luasandbox 4.0.2-3+wmf2+bullseye1 to component/php74 for bullseye-wikimedia
[13:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:20] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "thanks for the patch but I'm not sure this is the best approach. I sent another fix in" [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[13:44:00] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet
[13:44:04] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1002.eqiad.wmnet
[13:44:06] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[13:44:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:44:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond)
[13:44:20] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1003.eqiad.wmnet
[13:44:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:45:35] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet
[13:45:36] <logmsgbot>	 !log klausman@cumin1001 END (ERROR) - Cookbook sre.puppet.migrate-host (exit_code=97) for host ml-serve1003.eqiad.wmnet
[13:45:50] <wikibugs>	 (03PS7) 10Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069)
[13:45:50] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet
[13:46:02] <klausman>	 argh, ^C in wrong window...
[13:46:22] <wikibugs>	 (03PS1) 10Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069)
[13:46:42] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1003.eqiad.wmnet
[13:47:04] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1001.eqiad.wmnet
[13:47:21] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1002.eqiad.wmnet
[13:47:24] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1001.eqiad.wmnet
[13:48:08] <jynus>	 !log reenable puppet on dbprov2001 T351491
[13:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:13] <stashbot>	 T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup - https://phabricator.wikimedia.org/T351491
[13:49:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[13:51:37] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/975275 (owner: 10Klausman)
[13:52:56] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[13:54:48] <wikibugs>	 (03PS1) 10Klausman: Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7" [puppet] - 10https://gerrit.wikimedia.org/r/975039
[13:55:22] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7" [puppet] - 10https://gerrit.wikimedia.org/r/975039 (owner: 10Klausman)
[13:55:59] <wikibugs>	 (03PS1) 10Majavah: puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277
[13:56:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[13:56:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah)
[13:57:51] <wikibugs>	 (03PS2) 10Majavah: puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277
[13:57:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:58:06] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:59:16] <wikibugs>	 (03PS2) 10Jbond: puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272
[13:59:48] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/552/con" [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah)
[14:00:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975265 (owner: 10Muehlenhoff)
[14:00:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/553/con" [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond)
[14:00:51] <wikibugs>	 (03Abandoned) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah)
[14:00:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo)
[14:00:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:02:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond)
[14:05:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah)
[14:06:51] <wikibugs>	 (03CR) 10Jbond: service: Add ipip_encapsulation field to ServiceLVS (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[14:08:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Cleanup obsolete Hiera files [puppet] - https://gerrit.wikimedia.org/r/975265 (owner: Muehlenhoff)
[14:11:11] <wikibugs>	 (CR) Btullis: [C: +2] Configure Matomo's TagManager to write to existing tmpdir [puppet] - https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[14:11:21] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] puppetserver: make rsync config more flexible [puppet] - https://gerrit.wikimedia.org/r/975277 (owner: Majavah)
[14:13:25] <wikibugs>	 (CR) Hnowlan: "This change is probably superseded by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/973362 as there's a slightly more in" [deployment-charts] - https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) (owner: Mvolz)
[14:13:59] <wikibugs>	 (PS1) Arnaudb: mariadb: prepare copy of db1142 to db1242 [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036)
[14:15:33] <wikibugs>	 (PS1) Btullis: Fix the location of the matomo config override file [puppet] - https://gerrit.wikimedia.org/r/975283
[14:16:20] <wikibugs>	 (PS1) Elukey: Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - https://gerrit.wikimedia.org/r/975040
[14:16:38] <wikibugs>	 (CR) Btullis: [C: +2] Fix the location of the matomo config override file [puppet] - https://gerrit.wikimedia.org/r/975283 (owner: Btullis)
[14:17:11] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:14] <wikibugs>	 (CR) Klausman: [C: +1] Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - https://gerrit.wikimedia.org/r/975040 (owner: Elukey)
[14:17:19] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - https://gerrit.wikimedia.org/r/975040 (owner: Elukey)
[14:17:59] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:09] <wikibugs>	 (PS1) Klausman: Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - https://gerrit.wikimedia.org/r/975041
[14:18:20] <wikibugs>	 (PS2) Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069)
[14:18:38] <wikibugs>	 (CR) Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[14:18:55] <wikibugs>	 (CR) Elukey: [C: +1] Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - https://gerrit.wikimedia.org/r/975041 (owner: Klausman)
[14:19:19] <wikibugs>	 (CR) Klausman: [C: +2] Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - https://gerrit.wikimedia.org/r/975041 (owner: Klausman)
[14:20:00] <wikibugs>	 (CR) Marostegui: mariadb: prepare copy of db1142 to db1242 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:20:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1001.eqiad.wmnet
[14:20:44] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1001.eqiad.wmnet
[14:22:03] <wikibugs>	 (PS2) Arnaudb: mariadb: prepare copy of db1142 to db1242 [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036)
[14:24:54] <wikibugs>	 (CR) Marostegui: mariadb: prepare copy of db1142 to db1242 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:25:49] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:24] <wikibugs>	 (PS3) Arnaudb: mariadb: prepare copy of db1142 to db1242 [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036)
[14:26:34] <wikibugs>	 (PS1) Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931)
[14:27:23] <wikibugs>	 (CR) CI reject: [V: -1] icinga: add alert audit via puppetdb [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: Filippo Giunchedi)
[14:27:53] <wikibugs>	 (PS2) Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931)
[14:28:47] <wikibugs>	 (CR) CI reject: [V: -1] icinga: add alert audit via puppetdb [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: Filippo Giunchedi)
[14:30:50] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: prepare copy of db1142 to db1242 [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:31:17] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[14:31:48] <wikibugs>	 (PS3) Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931)
[14:32:27] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[14:33:21] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: prepare copy of db1142 to db1242 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[14:35:04] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:25] <wikibugs>	 (CR) Btullis: Send metrics from Airflow analytics test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[14:38:21] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:58] <wikibugs>	 (PS1) Elukey: Clean up ores configs not used anymore [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278)
[14:39:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036
[14:39:37] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[14:39:46] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036
[14:39:50] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036
[14:40:01] <wikibugs>	 (CR) Elukey: "Added Moritz and Filippo for the specific bits (bullseye/buster tracking and graphite)" [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:40:04] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036
[14:41:08] <wikibugs>	 (CR) Muehlenhoff: Clean up ores configs not used anymore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:41:30] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:48] <wikibugs>	 (PS2) JMeybohm: Move mw appservers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074)
[14:42:14] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:22] <wikibugs>	 (CR) Klausman: "LGTM for everything except what Moritz noted." [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:42:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1142 in db1242 for T344036', diff saved to https://phabricator.wikimedia.org/P53547 and previous config saved to /var/cache/conftool/dbconfig/20231117-144234-arnaudb.json
[14:42:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:16] <wikibugs>	 (PS2) Elukey: Clean up ores configs not used anymore [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278)
[14:44:18] <wikibugs>	 (CR) Elukey: Clean up ores configs not used anymore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:45:02] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1142.eqiad.wmnet onto db1242.eqiad.wmnet
[14:45:20] <wikibugs>	 (CR) Klausman: team-ml: add alert for memory spike in inf services (1 comment) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[14:45:33] <wikibugs>	 (CR) Clément Goubert: [C: +1] Move mw appservers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: JMeybohm)
[14:45:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:47:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Clean up ores configs not used anymore [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:48:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 (cmooney) p:Triage→Medium
[14:48:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:31] <wikibugs>	 (Abandoned) Elukey: hiera/modules: remove references to ORES roles [puppet] - https://gerrit.wikimedia.org/r/963683 (owner: Klausman)
[14:48:36] <wikibugs>	 (CR) Cathal Mooney: Add BGP to the contributing protocols for aggregate routes on CRs (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: Cathal Mooney)
[14:48:53] <wikibugs>	 (CR) Elukey: [C: +2] Clean up ores configs not used anymore [puppet] - https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[14:50:20] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:54] <wikibugs>	 (PS1) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095)
[14:52:10] <wikibugs>	 (CR) Filippo Giunchedi: "See https://phabricator.wikimedia.org/T320931#9340698 for a sample output" [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: Filippo Giunchedi)
[14:53:22] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:57:56] <icinga-wm>	 PROBLEM - Check systemd state on ml-cache2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:12] <wikibugs>	 (CR) JHathaway: puppetserver::g10k: Ensure the control repo exists before we run g10k (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975272 (owner: Jbond)
[14:58:29] <wikibugs>	 (CR) DCausse: rdf-streaming-updater: update values for application mode (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[15:00:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:43] <XioNoX>	 !log cr1-esams> request chassis fpc slot 1 online  - T351304
[15:05:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:02] <stashbot>	 T351304: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304
[15:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:10:42] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:10:54] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:14:02] <wikibugs>	 (PS3) Jbond: puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - https://gerrit.wikimedia.org/r/975272
[15:14:16] <wikibugs>	 (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/975272 (owner: Jbond)
[15:18:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 (cmooney)
[15:18:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi) Open→Resolved Replaced.
[15:24:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (MatthewVernon) a:MatthewVernon→RobH Hi @RobH. I think: hostnames: ms-be1076-1082 racking: no more than 1 server per rack, please (but they can go in racks that alread...
[15:25:14] <wikibugs>	 (CR) Herron: [C: +1] "Nice!  Looks great." [puppet] - https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: Filippo Giunchedi)
[15:25:46] <wikibugs>	 (PS4) Vgutierrez: interface: Allow creating IPIP interfaces w/o an endpoint [puppet] - https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069)
[15:25:48] <wikibugs>	 (PS8) Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069)
[15:27:38] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/554/console" [puppet] - https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:30:25] <sukhe>	 --/win 14
[15:30:34] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[15:31:14] <wikibugs>	 (PS1) Majavah: sslcert: use concat to generate trusted_ca [puppet] - https://gerrit.wikimedia.org/r/975299
[15:32:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53549 and previous config saved to /var/cache/conftool/dbconfig/20231117-153225-arnaudb.json
[15:32:31] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:32:48] <wikibugs>	 (PS9) Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069)
[15:32:53] <wikibugs>	 (PS2) Majavah: sslcert: use concat to generate trusted_ca [puppet] - https://gerrit.wikimedia.org/r/975299
[15:33:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (MatthewVernon) a:MatthewVernon→RobH Hi. I think: hostnames: ms-be20[74-80] racking: not more than 1 per rack, please, though they can share with existing nodes (e.g....
[15:33:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:36:37] <wikibugs>	 (PS54) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[15:37:59] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/975299 (owner: Majavah)
[15:38:08] <wikibugs>	 (PS1) AikoChou: ml-services: update revertrisk-la image and model binary [deployment-charts] - https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550)
[15:38:47] <wikibugs>	 (CR) Majavah: [C: +2] sslcert: use concat to generate trusted_ca [puppet] - https://gerrit.wikimedia.org/r/975299 (owner: Majavah)
[15:38:51] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[15:38:57] <logmsgbot>	 !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:40:44] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:57] <wikibugs>	 (CR) BBlack: [C: +1] "Seems right to me, nice work!" [puppet] - https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:47:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P53550 and previous config saved to /var/cache/conftool/dbconfig/20231117-154731-arnaudb.json
[15:50:24] <wikibugs>	 (CR) Jbond: puppet: update gat_ca_server to also support srv discovery (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: Jbond)
[15:50:30] <wikibugs>	 (PS9) Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[15:51:04] <wikibugs>	 (CR) Bking: [C: +2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[15:52:03] <wikibugs>	 (CR) Bking: [C: +1] "self-merging, as change this was already approved by ServiceOps in I318e7557c72b71587dafc0d039e0c64493f865d1" [deployment-charts] - https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[15:52:34] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: -1] Update the list of NavigationPopups gadget names (13 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: WMDE-Fisch)
[15:52:59] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) a:Xqt
[15:53:00] <wikibugs>	 (CR) Ssingh: [V: +1] "Revising a bit after discussion with bblack and how the etcd paths should look like." [puppet] - https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:54:39] <wikibugs>	 (CR) Bking: [C: +2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[15:55:07] <wikibugs>	 (PS2) Ssingh: conftool: introduce schema and host file for dnsboxes [puppet] - https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054)
[15:55:08] <icinga-wm>	 RECOVERY - Check systemd state on ml-cache2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:56:03] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:56:11] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:56:18] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:57:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.888546458538797s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:57:46] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:57:55] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:58:02] <logmsgbot>	 !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:58:03] <wikibugs>	 (CR) CI reject: [V: -1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: Jbond)
[15:58:08] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good. Many thanks." [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[15:58:09] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:58:15] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:59:59] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "response time was incredible:)  thanks. also noop in compiler: https://puppet-compiler.wmflabs.org/output/975093/557/" [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[16:01:01] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "I see puppet is disabled on matomo1002 - unrelated work?" [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[16:01:33] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM from a logical perspective" [puppet] - https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[16:02:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 3.0904484311989004s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:02:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P53551 and previous config saved to /var/cache/conftool/dbconfig/20231117-160238-arnaudb.json
[16:02:46] <wikibugs>	 (CR) Jbond: "lgtm suggestion inline" [puppet] - https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: JHathaway)
[16:03:08] <wikibugs>	 (PS1) Btullis: Fix an issue with the matomo TagManager configuration [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910)
[16:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:03:52] <wikibugs>	 (PS2) Btullis: Fix an issue with the matomo TagManager configuration [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910)
[16:04:07] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[16:05:40] <wikibugs>	 (CR) Btullis: [C: +1] piwik: avoid hardcoded PHP version string (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[16:06:56] <wikibugs>	 (PS3) Btullis: Fix an issue with the matomo TagManager configuration [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910)
[16:08:17] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/558/con" [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[16:08:50] <wikibugs>	 (CR) Dreamy Jazz: [C: +1] "This config change makes sense to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: Kosta Harlan)
[16:09:12] <wikibugs>	 (CR) Majavah: [C: +1] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - https://gerrit.wikimedia.org/r/975272 (owner: Jbond)
[16:10:05] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] piwik: avoid hardcoded PHP version string (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[16:11:12] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Fix an issue with the matomo TagManager configuration [puppet] - https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[16:11:14] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - https://gerrit.wikimedia.org/r/975320
[16:11:16] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - https://gerrit.wikimedia.org/r/975321
[16:12:47] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - https://gerrit.wikimedia.org/r/975320 (owner: Ebernhardson)
[16:13:40] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - https://gerrit.wikimedia.org/r/975320 (owner: Ebernhardson)
[16:14:10] <wikibugs>	 (PS1) Vgutierrez: interface: Add a clsact helper [puppet] - https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069)
[16:16:23] <wikibugs>	 (PS6) Bking: query_service: add monitoring for ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[16:16:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:30] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: -1] Update the list of NavigationPopups gadget names (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: WMDE-Fisch)
[16:17:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53552 and previous config saved to /var/cache/conftool/dbconfig/20231117-161744-arnaudb.json
[16:17:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[16:17:49] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:18:00] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[16:18:02] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] ml-services: update revertrisk-la image and model binary [deployment-charts] - https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[16:18:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53553 and previous config saved to /var/cache/conftool/dbconfig/20231117-161806-arnaudb.json
[16:18:59] <wikibugs>	 (PS2) Vgutierrez: interface: Add a clsact helper [puppet] - https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069)
[16:20:37] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/559/console" [puppet] - https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:21:02] <wikibugs>	 (CR) Kosta Harlan: [betalabs] ReportIncident: Relax rate limiting for reportincident action (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: Kosta Harlan)
[16:24:11] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/974660/560/" [puppet] - https://gerrit.wikimedia.org/r/974660 (https://phabricator.wikimedia.org/T351333) (owner: Dzahn)
[16:26:14] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:26:29] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:27:18] <wikibugs>	 (CR) Jbond: [C: +2] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - https://gerrit.wikimedia.org/r/975272 (owner: Jbond)
[16:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.136149810350958s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:29:43] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[16:33:45] <wikibugs>	 (PS7) Bking: query_service: add monitoring for ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[16:36:40] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[16:38:31] <wikibugs>	 (PS10) Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496)
[16:39:54] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1012 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:40:23] <wikibugs>	 (Abandoned) Mvolz: rest-gateway: fix citoid regex [deployment-charts] - https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) (owner: Mvolz)
[16:40:36] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.32.145:7000 on aqs1012 is OK: SSL OK - Certificate aqs1012-b valid until 2024-05-19 08:40:12 +0000 (expires in 183 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:41:00] <wikibugs>	 (PS2) Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - https://gerrit.wikimedia.org/r/975321
[16:41:02] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/975333
[16:41:10] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Remove consumer start time override [deployment-charts] - https://gerrit.wikimedia.org/r/975321 (owner: Ebernhardson)
[16:41:12] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/975333 (owner: Ebernhardson)
[16:43:14] <wikibugs>	 (PS2) Ebernhardson: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/975333
[16:43:23] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/975333 (owner: Ebernhardson)
[16:44:15] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/975333 (owner: Ebernhardson)
[16:46:44] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:46:59] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.2174800845539s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:11:49] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1142.eqiad.wmnet onto db1242.eqiad.wmnet
[17:12:44] <wikibugs>	 (CR) AikoChou: [C: +2] ml-services: update revertrisk-la image and model binary [deployment-charts] - https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[17:13:10] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:53] <wikibugs>	 (Merged) jenkins-bot: ml-services: update revertrisk-la image and model binary [deployment-charts] - https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[17:29:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti  for eqiad - https://phabricator.wikimedia.org/T349925 (VRiley-WMF)
[17:32:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (spatton) Hi @MatthewVernon, this is approved from my side, thanks much!
[17:34:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti  for eqiad - https://phabricator.wikimedia.org/T349925 (VRiley-WMF) ganeti1035.eqiad.wmnet Service Tag: 6DN8PZ3  Asset: WMF11370 Express Service Code: 13885792383 Rack: A2 Position: U33 Port: 41 Cableid: 230304500230...
[17:40:45] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1035
[17:42:11] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1035
[17:43:24] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1036
[17:45:18] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1037
[17:45:58] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1036
[17:46:27] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1037
[17:46:38] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1038
[17:47:31] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[17:47:45] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1038
[17:48:36] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED
[17:49:15] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED
[17:50:27] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED
[17:56:19] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[17:58:43] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[17:59:05] <wikibugs>	 (PS8) Bking: query_service: add monitoring for ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[17:59:48] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED
[17:59:54] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED
[18:01:10] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED
[18:02:22] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:04:28] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (RobH)
[18:04:59] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[18:05:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (RobH) a:RobH→None Updated task description with updated racking details and removing myself as assignee.  Once these arrive on-site, one of our #ops-codfw engineers w...
[18:05:43] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[18:13:16] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:16:56] <brett>	 VRiley: You forget to commit changes while working on ganeti?
[18:17:54] <VRiley>	 Oh, I test them before I commit them. But I'll double check 
[18:18:23] <robh>	 i often fire the dns cookbook
[18:18:28] <robh>	 swap tabs, and it's just sitting there waiting for me to confirm
[18:18:38] <robh>	 when i have that dns error that is why 99% of the time ; D
[18:19:22] <brett>	 A variety of ganeti and wmf a/aaaa records are ready for committing
[18:19:42] <brett>	 (and ptr)
[18:33:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (RobH)
[18:34:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (RobH) a:RobH→VRiley-WMF
[18:53:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (MoritzMuehlenhoff) Please also enable virtualisation for these in the BIOS, they will serve as virt servers.
[19:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:08:36] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:10:46] <wikibugs>	 (CR) Dzahn: "nothing seems to use it per openstack-browser, but simplelamp2 is used and was basically the same change and noop" [puppet] - https://gerrit.wikimedia.org/r/975094 (owner: Dzahn)
[19:11:29] <wikibugs>	 (PS3) Krinkle: Set new $wgMicroStashType setting to "mcrouter-primary-dc" [mediawiki-config] - https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[19:11:51] <wikibugs>	 (CR) Krinkle: [C: +1] "Approved for deployment at your earliest convenience." [mediawiki-config] - https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[19:16:46] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (cmooney) I believe that the bug that caused this has been fixed in 21.4R3-S5 for EX4300 devices.
[19:18:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:43:34] <wikibugs>	 (PS6) Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[19:51:14] <bvibber>	 !log brion regenerating .m3u8 streaming manifests for all video files on mwmaint2002 (cleanup for T350996)
[19:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:19] <stashbot>	 T350996: HLS meta playlist .m3u8 includes not-yet-made transcodes - https://phabricator.wikimedia.org/T350996
[19:52:29] <wikibugs>	 (PS7) Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[19:53:02] <wikibugs>	 (PS1) Jdlrobson: Revert "mw.notify: Limit width of overlay to max-width-page-container" [skins/Vector] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975366 (https://phabricator.wikimedia.org/T349622)
[19:57:26] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "I got it to be a noop now. could merge it like this without a change: https://puppet-compiler.wmflabs.org/output/964176/564/planet1002.eqi" [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[20:03:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:10:46] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:13:26] <wikibugs>	 (PS1) Dduvall: gitlab_runner: Allow rsyncd access to zuul.devtools.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329)
[20:17:59] <wikibugs>	 (CR) Dduvall: "I cherry picked this to puppetmaster-1001.devtools.eqiad1.wikimedia.cloud, applied, and was able to rsync from zuul.devtools.wmcloud.org w" [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329) (owner: Dduvall)
[20:22:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (RobH) a:RobH→None
[20:30:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[20:38:30] <wikibugs>	 SRE: molly-guard does not apply to `systemctl reboot` - https://phabricator.wikimedia.org/T351570 (taavi)
[20:41:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-wikifunctions (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:46:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-wikifunctions (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:59:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:59:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:10:17] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[21:13:27] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "decided to just merge to move forward. we can always build on top of it now. complete noop on existing servers confirmed." [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[21:14:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53556 and previous config saved to /var/cache/conftool/dbconfig/20231117-211428-arnaudb.json
[21:14:35] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:14:50] <wikibugs>	 (CR) Greg Grossmeier: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[21:29:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P53557 and previous config saved to /var/cache/conftool/dbconfig/20231117-212935-arnaudb.json
[21:30:00] <wikibugs>	 (CR) Dwisehaupt: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[21:44:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P53558 and previous config saved to /var/cache/conftool/dbconfig/20231117-214441-arnaudb.json
[21:53:42] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab_runner: Allow rsyncd access to zuul.devtools.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329) (owner: Dduvall)
[21:59:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53559 and previous config saved to /var/cache/conftool/dbconfig/20231117-215947-arnaudb.json
[21:59:50] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[21:59:53] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:00:04] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[22:13:42] <wikibugs>	 (CR) BCornwall: readme: Update repo location of varnishkafka (1 comment) [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (owner: BCornwall)
[22:13:55] <wikibugs>	 (PS2) BCornwall: readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623)
[22:14:39] <wikibugs>	 (CR) BCornwall: [C: +2] readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[22:14:41] <wikibugs>	 (CR) BCornwall: [V: +2 C: +2] readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[22:30:16] <wikibugs>	 (CR) RhinosF1: [C: +1] simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094 (owner: Dzahn)
[22:31:00] <wikibugs>	 (PS2) Krinkle: [BETA HACK] confd: Fix confd hostname [puppet] - https://gerrit.wikimedia.org/r/941478
[22:31:23] <wikibugs>	 (Abandoned) Krinkle: [BETA HACK] Attempt to secure Puppet DB better [puppet] - https://gerrit.wikimedia.org/r/941476 (owner: Krinkle)
[23:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:18:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:34:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:38:29] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[23:39:13] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED