[00:19:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:24] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926552
[00:39:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926552 (owner: 10TrainBranchBot)
[00:44:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[00:57:00] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/926552 (owner: 10TrainBranchBot)
[01:33:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:33:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:52:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:00:31] <wikibugs>	 (03PS1) 10KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669)
[05:01:07] <wikibugs>	 (03PS2) 10KartikMistry: Use direct Parsoid in Small and Medium Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922)
[05:06:13] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:07] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:47] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:43] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:25] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50135 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:57] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:29:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:57:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:15:31] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 60427
[06:16:41] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 60427
[06:18:14] <_joe_>	 !log killing a pod with consistently high haproxy queue for thumbor in codfw
[06:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:22] <wikibugs>	 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Joe) >>! In T329491#8898529, @Ladsgroup wrote: > So I looked at categorylinks tables everywhere. There are the top ten biggest ones: > ` > root@clouddb1021:/srv# ls -Ssh sqldata.s*/*/categorylinks.ibd | head...
[06:32:47] <wikibugs>	 (03CR) 10Elukey: "I like the approach! Left some ideas since the output of the template is not 100% correct at the moment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[06:39:00] <wikibugs>	 (03CR) 10Elukey: java: ensure wmf-certificates is installed, when required (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[06:52:13] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] [SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: 10Matthias Mullie)
[06:58:11] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Use the parsoid memory limit everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980)
[06:58:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Load and enable parsoid everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T0700)
[07:00:05] <jouncebot>	 kart_ and matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:14] <matthiasmullie>	 o/
[07:00:23] <taavi>	 o/ I can deploy
[07:00:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: 10Matthias Mullie)
[07:02:05] <wikibugs>	 (03Merged) 10jenkins-bot: [SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: 10Matthias Mullie)
[07:02:50] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]]
[07:02:54] <stashbot>	 T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:03:07] <kart_>	 Sorry, late :/
[07:03:15] <kart_>	 taavi: Let me know when done.
[07:03:20] <taavi>	 kart_: sure, will do
[07:09:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[07:10:28] <elukey>	 moritzm: o/ I was checking krb1001, there seem to be a lot of krb5-related log files in (deleted) state
[07:12:04] <logmsgbot>	 !log taavi@deploy1002 mlitn and taavi: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[07:12:08] <taavi>	 matthiasmullie: please test
[07:12:10] <stashbot>	 T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:12:13] <icinga-wm>	 RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:14] <matthiasmullie>	 checking
[07:12:59] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:10] <moritzm>	 elukey: krb5kdc or something else? might be the rotation didn't kick in correctly or left some behind? https://phabricator.wikimedia.org/T337906
[07:13:59] <moritzm>	 df showed it as full, but then the actual file system usage was only like 7G, will keep an eye on it after the reboot
[07:14:38] <matthiasmullie>	 taavi: LGTM!
[07:14:40] <moritzm>	 the disk ran full some days ago and then we set up the increased rotation/compression for krb5kdc, so possibly it was still in a wedged state from the initial full disk
[07:14:45] <taavi>	 thx, syncing
[07:15:19] <icinga-wm>	 RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[07:15:21] <icinga-wm>	 RECOVERY - puppet last run on krb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:15:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[07:17:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kserve-inference: use dict instead of lists for inference_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/925844 (owner: 10Elukey)
[07:18:20] <elukey>	 moritzm: I left a comment in https://phabricator.wikimedia.org/T337906#8901375, I think it may be related to the new logrotate rule
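
Note on the krb1001 thread above: df reporting a full disk while visible usage is only about 7G is consistent with log files that were unlinked but are still held open by a process, since their blocks stay allocated until the last descriptor is closed. A minimal sketch of how such files could be listed on Linux follows. It assumes the usual /proc layout, where the link target of an unlinked open file ends in " (deleted)"; the function and type names are hypothetical and are not taken from the task or from any Wikimedia tooling.

```python
import os
from typing import List, NamedTuple, Optional


class DeletedOpenFile(NamedTuple):
    pid: int
    fd: int
    target: str          # link target exactly as the kernel reports it
    size: Optional[int]  # bytes still allocated, or None if stat failed


def find_deleted_open_files(proc_root: str = "/proc") -> List[DeletedOpenFile]:
    """List open file descriptors whose backing file has been unlinked.

    Hypothetical helper: walks <proc_root>/<pid>/fd and keeps every
    descriptor whose symlink target ends in " (deleted)".
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            if not fd.isdigit():
                continue
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue  # descriptor was closed while we were scanning
            if not target.endswith(" (deleted)"):
                continue
            try:
                # Under a real /proc, stat on the fd link reaches the open
                # file itself, so the size is available after the unlink.
                size = os.stat(link).st_size
            except OSError:
                size = None
            results.append(DeletedOpenFile(int(entry), int(fd), target, size))
    return results


if __name__ == "__main__":
    found = find_deleted_open_files()
    for item in sorted(found, key=lambda f: f.size or 0, reverse=True):
        print(f"{item.pid}\t{item.fd}\t{item.size}\t{item.target}")
    total = sum(f.size or 0 for f in found)
    print(f"total bytes held by deleted files: {total}")
```

Run as root to see descriptors of every process. If the total is close to the gap between df and the visible usage, restarting or reloading the process that holds the descriptors (or rebooting, as was done here) releases the space.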
[07:20:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:21:17] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]] (duration: 18m 27s)
[07:21:21] <stashbot>	 T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:21:21] <matthiasmullie>	 taavi: thanks!
[07:21:26] <taavi>	 yw
[07:22:07] <taavi>	 kart_: I'm done, feel free to go ahead
[07:23:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:23:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:23:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:24:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:25:14] <kart_>	 taavi: thanks
[07:25:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:25:45] <wikibugs>	 (03PS2) 10KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669)
[07:27:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669) (owner: 10KartikMistry)
[07:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:28:06] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669) (owner: 10KartikMistry)
[07:28:23] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]]
[07:28:26] <stashbot>	 T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669
[07:30:02] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:32:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:37:50] <wikibugs>	 (03PS7) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825)
[07:38:05] <wikibugs>	 (03PS11) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825)
[07:38:20] <wikibugs>	 (03PS7) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825)
[07:38:22] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]] (duration: 09m 58s)
[07:38:25] <stashbot>	 T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669
[07:38:36] <wikibugs>	 (03PS12) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825)
[07:38:43] <wikibugs>	 (03PS8) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825)
[07:42:08] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Fix profile::gitlab::active_host and profile::gitlab::passive_hosts for devtools [puppet] - 10https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) (owner: 10Ahmon Dancy)
[07:47:23] <wikibugs>	 (03CR) 10Muehlenhoff: gdnsd: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926509 (owner: 10Muehlenhoff)
[07:50:09] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:50] <moritzm>	 !log installing containerd security updates
[07:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:08] <wikibugs>	 (03CR) 10Hashar: "Amending to replace `::facts` by `$facts['networking']['fqdn']`  and I will rebase this change since Gerrit flags it as being in merge con" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[07:58:56] <wikibugs>	 (03CR) 10Hashar: "Addressing the few comments in next patchset. I am also rebasing the whole chain since at least the parent is marked as being as having a " [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[07:59:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: 10Cparle)
[08:01:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch_dashboards: remove alerting and observability plugins [puppet] - 10https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite)
[08:02:32] <wikibugs>	 (03CR) 10Hashar: contint: set Jenkins agent username from hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:02:48] <wikibugs>	 (03PS2) 10Hashar: contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646)
[08:02:50] <wikibugs>	 (03PS6) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:03:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:04:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi)
[08:05:15] <hashar>	 AH puppet and tox fails 
[08:05:16] <hashar>	 00:00:08.252 /usr/lib/python3/dist-packages/tox/config/__init__.py:579: UserWarning: conflicting basepython version (set 27, should be 2) for env 'py2-pep8';resolve conflict or set ignore_basepython_conflict
[08:05:16] <hashar>	 00:00:08.252   proposed_version, implied_version, testenv_config.envname
[08:05:17] <hashar>	 :)
[08:05:28] <hashar>	 00:00:17.814 KeyError: key not found: "PARALLEL_PID_FILE" :D
[08:05:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: Remove remnants of coal and coal-web [puppet] - 10https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle)
[08:11:41] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata)
[08:13:11] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata)
[08:13:33] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] <wikibugs>	 (03PS3) 10Hashar: contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646)
[08:17:26] <wikibugs>	 (03PS7) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:17:28] <wikibugs>	 (03PS5) 10Hashar: contint: set Jenkins agent username from hiera [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646)
[08:22:27] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:28:50] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:29:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add external swagger checks to all sites [puppet] - 10https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite)
[08:30:27] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:30:36] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:30:43] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:31:23] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Stop cadvisor from collecting extra metrics from docker - https://phabricator.wikimedia.org/T337856 (10fgiunchedi)
[08:31:36] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi)
[08:31:38] <wikibugs>	 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Stop cadvisor from collecting extra metrics from docker - https://phabricator.wikimedia.org/T337856 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done!
[08:31:44] <wikibugs>	 (03PS2) 10Muehlenhoff: Cloud: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/925721
[08:32:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/926543 (https://phabricator.wikimedia.org/T337766) (owner: 10Cathal Mooney)
[08:34:39] <wikibugs>	 (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/922515/1889/" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:37:03] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "This one breaks on PCC https://puppet-compiler.wmflabs.org/output/922555/1888/ with:" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:38:07] <wikibugs>	 (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/922554/1890/" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:40:07] <hashar>	 jbond: jnuche: sorry for the spam on that series of changes, I think I screwed up the rebase :/
[08:40:53] <claime>	 !log power-cycling restbase1027 - T338122
[08:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:56] <stashbot>	 T338122: restbase1027.eqiad.wmnet down - https://phabricator.wikimedia.org/T338122
[08:41:08] <wikibugs>	 (03Abandoned) 10Klein Muçi: Content Translation: Set MT threshold to 90% for Albanian WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925574 (owner: 10Klein Muçi)
[08:42:48] <wikibugs>	 (03PS1) 10Jcrespo: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - 10https://gerrit.wikimedia.org/r/927119
[08:43:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - 10https://gerrit.wikimedia.org/r/927119 (owner: 10Jcrespo)
[08:44:36] <jnuche>	 hashar: ah, no worries :)
[08:44:37] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:44:39] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:44:57] <wikibugs>	 (03PS6) 10Hashar: contint: set Jenkins agent username from hiera [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646)
[08:44:59] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[08:45:03] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:45:09] <icinga-wm>	 RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:45:16] <hashar>	 I screwed up the rebase somehow
[08:45:24] <wikibugs>	 (03PS2) 10Jcrespo: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - 10https://gerrit.wikimedia.org/r/927119
[08:45:25] <icinga-wm>	 RECOVERY - Restbase root url on restbase1027 is OK: HTTP OK: HTTP/1.1 200 - 17613 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[08:45:39] <wikibugs>	 (03PS8) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:45:49] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:45:54] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:46:03] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:46:03] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:46:15] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[08:47:31] <wikibugs>	 (03PS3) 10Jcrespo: backup: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - 10https://gerrit.wikimedia.org/r/927119
[08:47:35] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:47:35] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:48:11] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1027 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:49:18] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: trafficserver: also match mobile domains in mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/924080
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.184 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.185 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.186 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2025-02-21 18:43:53 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2025-02-21 18:43:51 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:49:43] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-c valid until 2025-02-21 18:43:55 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:56:55] <wikibugs>	 (03PS1) 10Btullis: Update the abuse filter wikireplica view rules [puppet] - 10https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426)
[09:02:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point codfw URL downloader to new bullseye host [dns] - 10https://gerrit.wikimedia.org/r/926421 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff)
[09:03:44] <wikibugs>	 (03CR) 10Hashar: "Sorry for the spam, I screwed the order of my changes when rebasing :/" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:04:03] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:04:59] <jinxer-wm>	 (PuppetDisabled) resolved: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:06:12] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] backup: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - 10https://gerrit.wikimedia.org/r/927119 (owner: 10Jcrespo)
[09:06:33] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Add outreachwiki to wikidataclient.dblis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789976 (https://phabricator.wikimedia.org/T171140) (owner: 10Stang)
[09:07:42] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "PCC https://puppet-compiler.wmflabs.org/output/922555/1892/" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:07:55] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:09:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:11:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:14:47] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene)
[09:17:11] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: 10Stang)
[09:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:17:30] <wikibugs>	 (03CR) 10Jbond: contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:21:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] contint: Jenkins slave > agent [puppet] - 10https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:21:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] contint: set Jenkins agent username from hiera [puppet] - 10https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:22:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:25:40] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:25:52] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Make outreachwiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140)
[09:25:55] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Add new stat1009 to the stat servers rsync hosts_allow [puppet] - 10https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036) (owner: 10Stevemunene)
[09:26:54] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: 10Stang)
[09:27:12] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase languageLinkSiteGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789979 (https://phabricator.wikimedia.org/T171140) (owner: 10Stang)
[09:27:15] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: 10Stang)
[09:27:35] <wikibugs>	 (03PS2) 10Muehlenhoff: bookworm: Change to deb822 format for sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[09:27:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[09:29:14] <wikibugs>	 (03CR) 10Jbond: "lgtm once the file is renamed (feel free to assume a +1)" [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[09:30:07] <wikibugs>	 (03CR) 10Muehlenhoff: "(Rebased in PS2, since PCC failed after afb46a8742c4afe2a344790319e096e88dd36d57 was merged)" [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[09:31:25] <xover>	 _joe_: I think T337649 needs some form of escalation. From a user perspective and for the relevant use case / user group it looks, in practical effect, as if "Commons is down".
[09:31:26] <stashbot>	 T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649
[09:31:47] <_joe_>	 xover: sorry, I disagree on that last statement.
[09:32:02] <xover>	 Ok?
[09:32:12] <_joe_>	 but while there is no reason to overdramatize, the situation is indeed not acceptable
[09:32:23] <_joe_>	 sadly there's only one SRE team right now on the hook for thumbor
[09:33:01] <_joe_>	 I was actually discussing this right now
[09:33:14] <_joe_>	 so I'm not sure how much more escalation we can do
[09:33:29] <_joe_>	 I mean how much escalation my team and I can do
[09:33:52] <xover>	 (not overdramatizing: I'm just saying that the way the symptoms are presenting, that is what it will *look like* to those users that are affected and for the kinds of files that are affected)
[09:34:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[09:34:57] <wikibugs>	 (03CR) 10Hashar: "Puppet fails on releases1003.eqiad.wmnet with:" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy)
[09:35:20] <xover>	 Yeah, I realise manpower is an issue. That's why I'm trying to wave the red flag. 
[09:36:04] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[09:37:51] <_joe_>	 xover: to be clear - I am fully sympathetic with the issues you're encountering, and we're trying to get at least some stopgaps in place
[09:37:57] <claime>	 !log roll-restart thumbor in codfw - T337649
[09:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:01] <stashbot>	 T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649
[09:38:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[09:38:56] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=thumbor.*
[09:39:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[09:39:16] <wikibugs>	 (03PS1) 10Btullis: "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859
[09:39:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (owner: 10Btullis)
[09:39:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good, please proceed with caution ;P" [puppet] - 10https://gerrit.wikimedia.org/r/924080 (owner: 10Giuseppe Lavagetto)
[09:39:53] <claime>	 !log roll-restart thumbor in eqiad - T337649
[09:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:56] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[09:40:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10SalimJah) Hi Aklapper. Thanks for coming back to us.   We are actually working to complete a research project that leverages 10 years worth of en:wiki data, documented here:...
[09:41:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[09:44:08] <wikibugs>	 (03PS2) 10Btullis: "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951)
[09:44:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[09:45:48] <wikibugs>	 (03PS3) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951)
[09:46:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[09:48:11] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016
[09:48:13] <wikibugs>	 (03PS10) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718
[09:48:15] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920213
[09:50:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Point codfw URL downloader to new bullseye host" [dns] - 10https://gerrit.wikimedia.org/r/927129
[09:50:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove option to manage sources.list [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562)
[09:51:53] <wikibugs>	 (03CR) 10Jelto: "For me two topics are not yet resolved:" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy)
[09:51:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Revert "Point codfw URL downloader to new bullseye host" [dns] - 10https://gerrit.wikimedia.org/r/927129 (owner: 10Muehlenhoff)
[09:52:15] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:55:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1000)
[10:00:59] <wikibugs>	 (03PS4) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951)
[10:02:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff)
[10:05:20] <_joe_>	 xover: can you confirm things are "better" now?
[10:05:32] <xover>	 _joe_: Will test.
[10:06:50] <godog>	 !log truncate xff.log and JobExecutor.log on mwlog1002 to reclaim space - T338127
[10:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:53] <stashbot>	 T338127: log rotation stopped on mwlog for all files but "api.log" - https://phabricator.wikimedia.org/T338127
[10:06:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:07:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:08:26] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[10:08:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[10:08:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:09:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:09:41] <icinga-wm>	 RECOVERY - Disk space on mwlog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwlog1002&var-datasource=eqiad+prometheus/ops
[10:09:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[10:11:11] <wikibugs>	 (03CR) 10Jbond: contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar)
[10:11:31] <moritzm>	 !log installing openssl security updates on Bullseye
[10:11:31] <xover>	 _joe_: Seems better. Retesting previously tested files: no thumbs failed, and thumbs loaded in about 10s total. Testing files not previously tested showed an intermittent 429 (first thumb requested for the file, at 500px) and ~15s load time for subsequent thumbs. I only tested a very limited number of files.
[10:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:55] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:12:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff)
[10:13:03] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts - aborrero@cumin1001"
[10:13:27] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:14:06] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts - aborrero@cumin1001"
[10:14:06] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:15:01] <wikibugs>	 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Ladsgroup) Good question: Commons, arwiki, use the default and the rest don't `  'enwiki' => 'uca-default-u-kn', // T136150  'ruwiktionary' => 'uca-ru',  'frwiki' => 'uca-fr-u-kn', // T56680, T146675  'fawik...
[10:15:49] <wikibugs>	 (03CR) 10Hashar: releases: clone repos/releng/release from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy)
[10:17:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "Another backup failing ^" [puppet] - 10https://gerrit.wikimedia.org/r/927119 (owner: 10Jcrespo)
[10:22:19] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:23:51] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:26:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM see nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez)
[10:26:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:28:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:28:53] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:29:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas
[10:30:27] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:31:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas
[10:31:43] <wikibugs>	 (03PS15) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683)
[10:33:54] <wikibugs>	 (03PS16) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683)
[10:33:56] <wikibugs>	 (03PS5) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683)
[10:33:58] <wikibugs>	 (03PS15) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061
[10:34:00] <wikibugs>	 (03PS17) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683)
[10:34:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:34:39] <wikibugs>	 (03CR) 10Jbond: profile::base::firewall: move to profile::firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[10:35:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[10:36:03] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:38:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[10:42:31] <wikibugs>	 (03PS1) 10Ladsgroup: Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698)
[10:45:24] <Amir1>	 jouncebot: nowandnext
[10:45:24] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1000)
[10:45:24] <jouncebot>	 In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300)
[10:46:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: also match mobile domains in mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/924080 (owner: 10Giuseppe Lavagetto)
[10:47:51] <wikibugs>	 (03PS1) 10Jelto: gitlab: run four backups per day [puppet] - 10https://gerrit.wikimedia.org/r/927139 (https://phabricator.wikimedia.org/T316935)
[10:48:06] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > What kind of secret do we need to add to private puppet for the new OIDC GitLab client?  you need to copy the secret from   `...
[10:49:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[10:55:13] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:58:17] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:00:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125)
[11:06:19] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:07:46] <wikibugs>	 (03PS1) 10Jbond: wmcs::firewall: use profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927143 (https://phabricator.wikimedia.org/T279683)
[11:07:53] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:08:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir
[11:11:24] <wikibugs>	 (03CR) 10Gmodena: "The change LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[11:11:51] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:13] <moritzm>	 !log bounced ferm on ml-serve2006 (race caused by firewall profile change)
[11:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:27] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:50] <jbond>	 thanks moritzm
[11:14:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::firewall: use profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927143 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[11:15:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir
[11:16:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[11:19:17] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (10Pereibri) is there a way to have this map hosted somewhere else like google?  http://maps.wikimedia.org/osm-intl/%7Bz%7D/%7Bx%7D/%7By%7D.png
[11:21:02] <moritzm>	 !log restarting Exim on MXes to pick up OpenSSL updates
[11:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[11:22:48] <wikibugs>	 (03PS1) 10KartikMistry: Update MinT to 2023-06-05-111431-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337708)
[11:29:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[11:29:11] <wikibugs>	 (03PS3) 10Jbond: service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490)
[11:31:22] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF)
[11:32:12] <wikibugs>	 (03PS2) 10Jbond: puppetboard-next: add a new name for the puppet7 migration [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490)
[11:32:43] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:32:53] <wikibugs>	 (03CR) 10Muehlenhoff: firewall: add basic firewall class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond)
[11:37:46] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:37:58] <jbond>	 this is me ^^
[11:38:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: api: cleanup unused network constant [puppet] - 10https://gerrit.wikimedia.org/r/927161
[11:39:11] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard-next
[11:42:46] <jinxer-wm>	 (ConfdResourceFailed) firing: (6) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:44:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/927161/41533/" [puppet] - 10https://gerrit.wikimedia.org/r/927161 (owner: 10Arturo Borrero Gonzalez)
[11:45:17] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125)
[11:45:20] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125)
[11:45:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge
[11:46:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge
[11:47:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard-next: add a new name for the puppet7 migration [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[11:49:20] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[11:52:51] <wikibugs>	 (03PS1) 10Jbond: firewall: update copmments to mention profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927163
[11:55:28] <wikibugs>	 (03PS1) 10Jbond: conftool-data: also add service to conftool [puppet] - 10https://gerrit.wikimedia.org/r/927164 (https://phabricator.wikimedia.org/T330490)
[11:55:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] firewall: update copmments to mention profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927163 (owner: 10Jbond)
[11:59:31] <wikibugs>	 (03PS1) 10Jbond: profile::firewall: add missing keys [puppet] - 10https://gerrit.wikimedia.org/r/927165
[12:00:02] <Lucas_WMDE>	 jouncebot: next
[12:00:02] <jouncebot>	 In 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300)
[12:00:05] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125)
[12:00:07] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125)
[12:00:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::firewall: add missing keys [puppet] - 10https://gerrit.wikimedia.org/r/927165 (owner: 10Jbond)
[12:01:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41535/console" [puppet] - 10https://gerrit.wikimedia.org/r/927165 (owner: 10Jbond)
[12:04:47] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:05:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:06:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:07:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.413 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:07:56] <wikibugs>	 (03PS1) 10Jbond: drop globale acl's from cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/927167
[12:08:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] drop globale acl's from cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/927167 (owner: 10Jbond)
[12:08:13] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50136 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:08:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:09:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] conftool-data: also add service to conftool [puppet] - 10https://gerrit.wikimedia.org/r/927164 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[12:10:01] <bblack>	 jouncebot: nowandnext
[12:10:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[12:10:01] <jouncebot>	 In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300)
[12:10:25] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:11:59] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:12:46] <jinxer-wm>	 (ConfdResourceFailed) resolved: (6) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:15:06] <logmsgbot>	 !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard-next
[12:15:55] <bblack>	 !log lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703
[12:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:58] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[12:17:05] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[12:17:08] <jynus>	 !log creating a copy of db1157 binlogs on dbprov1004 T338128
[12:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:10] <stashbot>	 T338128: Recovery text table in a couple of wikis - https://phabricator.wikimedia.org/T338128
[12:18:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "FYI this results in a relaxation of the firewall. But I don't think is very relevant. We control all IP addresses in the supernet." [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez)
[12:19:51] <wikibugs>	 (03PS2) 10Btullis: Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765)
[12:20:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis)
[12:22:03] <Amir1>	 jynus: don't know if it's related but all of s3 seems to be lagged now
[12:22:25] <Amir1>	 https://orchestrator.wikimedia.org/web/cluster/alias/s3
[12:22:34] <jynus>	 ok, I was going to kill the transfer, but it finished
[12:22:58] <jynus>	 it seems it caused some network issues
[12:23:02] <jynus>	 should be gone now
[12:23:15] <Amir1>	 ah, okay
[12:23:18] <Amir1>	 thanks
[12:23:39] <wikibugs>	 (03PS1) 10BBlack: pybal: configure advertised_instrumentation_ips [puppet] - 10https://gerrit.wikimedia.org/r/927168 (https://phabricator.wikimedia.org/T334703)
[12:23:41] <jynus>	 I will check those hosts, maybe their replication connection broke
[12:25:19] <jynus>	 a spike of errors for a couple of minutes
[12:25:31] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:07] <bblack>	 is puppet compiler borked? https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41538/console
[12:27:17] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] pybal: configure advertised_instrumentation_ips [puppet] - 10https://gerrit.wikimedia.org/r/927168 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[12:30:07] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: interface::route: add persist option [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758)
[12:30:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez)
[12:31:26] <Amir1>	 jouncebot: nowandnext
[12:31:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 28 minute(s)
[12:31:26] <jouncebot>	 In 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300)
[12:32:30] <wikibugs>	 (03PS3) 10Btullis: Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765)
[12:32:43] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[12:35:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[12:35:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:36:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:38:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[12:38:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[12:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:39:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[12:39:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48708 and previous config saved to /var/cache/conftool/dbconfig/20230605-123915-ladsgroup.json
[12:39:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10WMF-General-or-Unknown, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10MatthewVernon)
[12:39:19] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[12:41:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48709 and previous config saved to /var/cache/conftool/dbconfig/20230605-124124-ladsgroup.json
[12:41:29] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) Thank you for the supporting links.   Having discussed this internally with other Foundation staff, there seems to...
[12:42:22] <wikibugs>	 (03PS1) 10Jbond: trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490)
[12:42:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[12:42:50] <wikibugs>	 (03PS2) 10Jbond: trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490)
[12:43:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[12:44:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[12:44:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[12:44:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48710 and previous config saved to /var/cache/conftool/dbconfig/20230605-124444-ladsgroup.json
[12:44:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[12:49:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[12:51:46] <Amir1>	 !log killed prioritizeFilesWithTemplate.php, stopping depool maint.
[12:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:59] <Amir1>	 cc matthiasmullie and cormacparle  ^
[12:52:06] <Amir1>	 please add waitForReplication
[12:52:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[12:52:39] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083 (10MatthewVernon) 05Open→03Declined I'm afraid that "Wikimedia Maps may not be used by third-party services outside of the Wikimedia projects." (see [[ https://foundation.wikimedia.or...
[12:54:53] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:25] <matthiasmullie>	 ehh
[12:56:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P48711 and previous config saved to /var/cache/conftool/dbconfig/20230605-125630-ladsgroup.json
[12:56:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10MatthewVernon) @Isaac I'm afraid I'm still a bit confused as to what access is needed here (and/or which data set is being referred to); can you help, since I gather you dire...
[12:56:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[12:57:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48712 and previous config saved to /var/cache/conftool/dbconfig/20230605-125754-ladsgroup.json
[12:59:46] <matthiasmullie>	 Amir1: ok, caught up; that's a cron script in MachineVision - will look into it
[13:00:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10Ladsgroup) I sponsored Zabe for production access and I can sponsor him for access to the analytics private data (without kerberos) as well. In reality it doesn't change any ac...
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300).
[13:00:05] <jouncebot>	 Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:17] <taavi>	 o/
[13:00:23] <taavi>	 Lucas_WMDE: I assume you will self-deploy?
[13:00:26] <Amir1>	 thanks. It needs a "$this->waitForReplication()" somewhere
[13:00:26] <Lucas_WMDE>	 o/
[13:00:27] <Lucas_WMDE>	 yup :)
[13:02:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[13:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[13:02:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48713 and previous config saved to /var/cache/conftool/dbconfig/20230605-130228-ladsgroup.json
[13:02:31] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[13:02:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140) (owner: 10Lucas Werkmeister (WMDE))
[13:03:01] <Lucas_WMDE>	 let’s see if it works as expected
[13:03:44] <wikibugs>	 (03Merged) 10jenkins-bot: Make outreachwiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140) (owner: 10Lucas Werkmeister (WMDE))
[13:04:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]]
[13:04:03] <stashbot>	 T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140
[13:05:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:05:32] <Lucas_WMDE>	 let’s see
[13:07:19] <Lucas_WMDE>	 https://www.wikidata.org/w/index.php?title=Q4115189&diff=prev&oldid=1908813124 works
[13:07:44] <Lucas_WMDE>	 https://outreach.wikimedia.org/w/index.php?title=Wikimedia:Sandbox&diff=prev&oldid=250097 also works
[13:07:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48714 and previous config saved to /var/cache/conftool/dbconfig/20230605-130753-ladsgroup.json
[13:07:57] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[13:08:23] <Lucas_WMDE>	 language links on the outreachwiki page also look correct to me
[13:08:27] <Lucas_WMDE>	 good to go, I’ll sync
[13:09:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10Ottomata) K, thanks @ladsgroup!  Approved.  In this case I think its okay to skip the MOU/expiry, since Zabe has shell access for other  reasons anyway, and doesn't have an exp...
[13:09:45] <bblack>	 !log lvs4* (ulsfo) - restart pybal for T334703 IPs
[13:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:48] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[13:10:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin alias for new cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/927177
[13:11:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P48715 and previous config saved to /var/cache/conftool/dbconfig/20230605-131136-ladsgroup.json
[13:13:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P48716 and previous config saved to /var/cache/conftool/dbconfig/20230605-131301-ladsgroup.json
[13:13:41] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]] (duration: 10m 06s)
[13:14:10] <stashbot>	 T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140
[13:14:15] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:14:31] <Lucas_WMDE>	 anything else to deploy?
[13:14:53] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:15:05] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Many thanks." [puppet] - 10https://gerrit.wikimedia.org/r/927177 (owner: 10Muehlenhoff)
[13:15:29] <bblack>	 !log lvs6* (drmrs) - restart pybal for T334703 IPs
[13:15:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:32] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[13:17:23] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for new cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/927177 (owner: 10Muehlenhoff)
[13:19:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180
[13:19:21] <bblack>	 !log lvs5* (eqsin) - restart pybal for T334703 IPs
[13:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff)
[13:21:13] <wikibugs>	 (03PS1) 10Hashar: rake_modules: apply early monkey patches earlier [puppet] - 10https://gerrit.wikimedia.org/r/927181
[13:21:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rake_modules: apply early monkey patches earlier [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar)
[13:21:44] <hashar>	 ...
[13:22:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix role name in alias [puppet] - 10https://gerrit.wikimedia.org/r/927184
[13:22:29] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata)
[13:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P48717 and previous config saved to /var/cache/conftool/dbconfig/20230605-132259-ladsgroup.json
[13:23:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:23:14] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180
[13:23:27] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata)
[13:25:00] <bblack>	 !log lvs3* (esams) - restart pybal for T334703 IPs
[13:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:04] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[13:25:54] <hashar>	 !log Restarted Zuul due to stalled SSH connection # T309376
[13:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:57] <stashbot>	 T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376
[13:26:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix role name in alias [puppet] - 10https://gerrit.wikimedia.org/r/927184 (owner: 10Muehlenhoff)
[13:26:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48718 and previous config saved to /var/cache/conftool/dbconfig/20230605-132642-ladsgroup.json
[13:26:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:26:45] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[13:26:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:27:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48719 and previous config saved to /var/cache/conftool/dbconfig/20230605-132703-ladsgroup.json
[13:27:56] <wikibugs>	 (03CR) 10Hashar: "recheck due to T309376" [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar)
[13:28:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:28:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P48720 and previous config saved to /var/cache/conftool/dbconfig/20230605-132807-ladsgroup.json
[13:28:53] <wikibugs>	 (03CR) 10Hashar: "recheck cause I had to restart Zuul due to T309376" [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff)
[13:29:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48721 and previous config saved to /var/cache/conftool/dbconfig/20230605-132911-ladsgroup.json
[13:29:27] <logmsgbot>	 !log bblack@deploy1002 Locking from deployment [ALL REPOSITORIES]: temporary lock for LVS restarts in core DCs
[13:29:56] <bblack>	 !log lvs2* (codfw) - restart pybal for T334703 IPs
[13:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hieradata: remove leftover role hieradata [puppet] - 10https://gerrit.wikimedia.org/r/909714 (owner: 10Majavah)
[13:32:31] <bblack>	 !log lvs1* (eqiad) - restart pybal for T334703 IPs
[13:32:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:36] <stashbot>	 T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703
[13:32:44] <sukhe>	 jouncebot: nowandnext
[13:32:44] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300)
[13:32:44] <jouncebot>	 In 0 hour(s) and 27 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1400)
[13:33:13] <wikibugs>	 (03CR) 10Hashar: "And another follow up since `require 'puppet'` invokes `URI.escape` which triggers a ruby warning:" [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison)
[13:33:39] <wikibugs>	 (03CR) 10Hashar: "And another follow up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/889990 since `require 'puppet'` invokes `URI.escape` which t" [puppet] - 10https://gerrit.wikimedia.org/r/922565 (owner: 10Hashar)
[13:33:47] <RhinosF1>	 sukhe: Lucas_WMDE has finished the backport window as an fyi
[13:34:01] <sukhe>	 RhinosF1: thanks! bblack has the lock already so I will take it from him
[13:34:13] <sukhe>	 with the changes he is rolling out, no more locking required for LVS work anyway, so that's good :)
[13:35:21] <logmsgbot>	 !log bblack@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: temporary lock for LVS restarts in core DCs (duration: 05m 54s)
[13:35:21] <logmsgbot>	 !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937
[13:35:25] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[13:35:52] <sukhe>	 hmm older message, let's be correct
[13:35:53] <logmsgbot>	 !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 (duration: 01m 06s)
[13:36:02] <logmsgbot>	 !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T326767
[13:36:05] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[13:36:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] interface::route: add persist option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez)
[13:38:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P48722 and previous config saved to /var/cache/conftool/dbconfig/20230605-133805-ladsgroup.json
[13:38:11] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) @elukey  i am available any day this week except Thursday if you are available
[13:39:33] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) @Jclark-ctr Thanks! I have time today and tomorrow in my afternoon, lemme know what time works best for you!
[13:39:57] <wikibugs>	 (03CR) 10Hashar: "I have no idea what the monkey patch is exactly doing or what kind of side effect this can have, but that surely mutes the warning when do" [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar)
[13:40:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:41:08] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:41:16] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:43:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48723 and previous config saved to /var/cache/conftool/dbconfig/20230605-134313-ladsgroup.json
[13:43:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Host under maintenance
[13:44:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Host under maintenance
[13:44:14] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b4799674-ad70-4117-a653-cdeaad02c246) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenanc...
[13:44:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P48724 and previous config saved to /var/cache/conftool/dbconfig/20230605-134418-ladsgroup.json
[13:44:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance
[13:45:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance
[13:45:13] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43b4a369-edbc-4df6-b931-f35757b38bf1) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenanc...
[13:45:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:46:17] <moritzm>	 !log installing python-ipaddress security updates
[13:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:41] <wikibugs>	 (03PS1) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127)
[13:48:01] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:48:52] <wikibugs>	 (03PS2) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127)
[13:49:27] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:50:12] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Isaac) I think my name is being brought up based on [[https://lists.wikimedia.org/hyperkitty/list/wiki-research-l@lists.wikimedia.org/thread/MWXIGG3F7UXIWXYJWH3X47NWWQLGSJWF/...
[13:52:28] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks @jbond for the help! I added the secret to `profile::gitlab::omniauth_providers` in private puppet. After that puppet cre...
[13:53:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48725 and previous config saved to /var/cache/conftool/dbconfig/20230605-135311-ladsgroup.json
[13:53:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:53:16] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[13:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[13:53:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48726 and previous config saved to /var/cache/conftool/dbconfig/20230605-135332-ladsgroup.json
[13:53:36] <wikibugs>	 (03PS1) 10Hashar: rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194
[13:54:08] <wikibugs>	 (03PS1) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031)
[13:54:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194 (owner: 10Hashar)
[13:55:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: exclude kubelet hosts from cadvisor rollout [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027)
[13:55:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41539/console" [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey)
[13:56:08] <wikibugs>	 (03PS2) 10Hashar: rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194
[13:56:10] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[13:56:56] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:57:07] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:57:18] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "CCing Andrew since this change will impact eventgate-analytics-external." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[13:57:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41540/console" [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi)
[13:57:48] <wikibugs>	 (03PS2) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031)
[13:58:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Untested but LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron)
[13:58:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48727 and previous config saved to /var/cache/conftool/dbconfig/20230605-135859-ladsgroup.json
[13:59:03] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[13:59:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P48728 and previous config saved to /var/cache/conftool/dbconfig/20230605-135924-ladsgroup.json
[13:59:49] <wikibugs>	 (CR) Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: Clare Ming)
[13:59:51] <wikibugs>	 (03PS1) 10BBlack: wikidata maxlag maint script: use new pybal VIPs [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703)
[14:00:05] <jouncebot>	 sukhe: It is that lovely time of the day again! You are hereby commanded to deploy LVS maintenance. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1400).
[14:01:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:01:25] <wikibugs>	 (03PS3) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031)
[14:02:41] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41543/console" [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey)
[14:03:01] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) Removed gpu from dse-k8s-worker1002 installed gpu into ml-serve1001
[14:04:17] <wikibugs>	 (CR) Gmodena: [C: +1] Add initial stream configs for Android article events using Metrics Platform Java client library (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: Clare Ming)
[14:05:33] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[14:05:40] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[14:05:49] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:06:36] <sukhe>	 BGP and Pybal alerts in codfw expected
[14:06:41] <claime>	 ack
[14:07:07] <wikibugs>	 (03CR) 10Btullis: "This looks like it would work, but I wonder if it wouldn't be cleaner to use a systemd target to group all of the instances together, as o" [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[14:07:37] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:07:51] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey)
[14:08:42] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:08:55] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:09:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff)
[14:09:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10SalimJah) @Isaac: thanks for your reply. You are correct.   We are reacting to this suggestion in the thread you mention, which we thought looked very efficient for our purpo...
[14:10:05] <wikibugs>	 (CR) Elukey: varnishkafka: add catch all systemd unit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[14:10:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:10:25] <icinga-wm>	 PROBLEM - pybal on lvs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:10:47] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:13] <sukhe>	 ^ expected
[14:11:24] <wikibugs>	 (03PS4) 10AikoChou: changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899)
[14:14:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P48729 and previous config saved to /var/cache/conftool/dbconfig/20230605-141405-ladsgroup.json
[14:14:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48730 and previous config saved to /var/cache/conftool/dbconfig/20230605-141430-ladsgroup.json
[14:14:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[14:14:33] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[14:14:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[14:14:49] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal
[14:14:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48731 and previous config saved to /var/cache/conftool/dbconfig/20230605-141451-ladsgroup.json
[14:15:25] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:15:33] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:15:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48732 and previous config saved to /var/cache/conftool/dbconfig/20230605-141559-ladsgroup.json
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:41] <wikibugs>	 (03PS3) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127)
[14:18:00] <wikibugs>	 (CR) Herron: mwlog: fix mw-log logrotate glob (1 comment) [puppet] - https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: Herron)
[14:18:33] <wikibugs>	 (03PS1) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204
[14:18:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron)
[14:19:06] <wikibugs>	 (03CR) 10Herron: [C: 03+2] "thx for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron)
[14:19:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond)
[14:20:16] <wikibugs>	 (CR) AikoChou: changeprop: allow match_not in match_config for liftwing (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[14:20:20] <wikibugs>	 (03PS1) 10Ssingh: lvs2009: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/927206 (https://phabricator.wikimedia.org/T335777)
[14:20:49] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff)
[14:21:17] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/927208 (https://phabricator.wikimedia.org/T335777)
[14:21:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] interface::route: add persist option [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez)
[14:22:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[14:24:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Added Kamila since Hugh is out of the office :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[14:25:38] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey)
[14:27:03] <wikibugs>	 (03PS3) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742
[14:27:30] <wikibugs>	 (03PS2) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204
[14:27:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: persist static routes [puppet] - 10https://gerrit.wikimedia.org/r/927210 (https://phabricator.wikimedia.org/T337758)
[14:27:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond)
[14:28:53] <sukhe>	 !log codfw low-traffic LVS: set routing-options static route 10.2.1.0/24 next-hop 10.192.49.7
[14:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Cloud: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/925721 (owner: 10Muehlenhoff)
[14:29:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P48733 and previous config saved to /var/cache/conftool/dbconfig/20230605-142911-ladsgroup.json
[14:30:15] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::worker: set nodes as k8s nodes for the gpu profile [puppet] - 10https://gerrit.wikimedia.org/r/927213
[14:30:35] <wikibugs>	 (03PS3) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204
[14:31:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P48734 and previous config saved to /var/cache/conftool/dbconfig/20230605-143105-ladsgroup.json
[14:31:24] <wikibugs>	 (03PS1) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/927214 (https://phabricator.wikimedia.org/T335777)
[14:32:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: set nodes as k8s nodes for the gpu profile [puppet] - 10https://gerrit.wikimedia.org/r/927213 (owner: 10Elukey)
[14:32:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2009.codfw.wmnet
[14:33:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond)
[14:33:22] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Allow HTTP PATCH requests on "beta" sites [puppet] - 10https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: 10WMDE-leszek)
[14:35:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove more cloud stretch support [puppet] - 10https://gerrit.wikimedia.org/r/927215
[14:40:32] <wikibugs>	 (03PS1) 10Klausman: Add rate limiting class for high-traffic internal services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121)
[14:41:32] <wikibugs>	 (03PS1) 10Ottomata: mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303)
[14:41:55] <wikibugs>	 (03PS2) 10Ottomata: mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303)
[14:42:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:43:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "We need to double check that this existing client modules/profile/manifests/cloudceph/osd.pp doesn't rely on this default semantic of not " [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond)
[14:44:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48735 and previous config saved to /var/cache/conftool/dbconfig/20230605-144417-ladsgroup.json
[14:44:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[14:44:21] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[14:44:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[14:44:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48736 and previous config saved to /var/cache/conftool/dbconfig/20230605-144438-ladsgroup.json
[14:44:58] <wikibugs>	 (03PS1) 10Ottomata: Remove dse mediawiki-page-content-change-enrichment and stream-enrichment-poc ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/927224 (https://phabricator.wikimedia.org/T325303)
[14:45:03] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:46:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P48737 and previous config saved to /var/cache/conftool/dbconfig/20230605-144611-ladsgroup.json
[14:47:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:47:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:47:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2009.codfw.wmnet
[14:47:13] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2009.codfw.wmnet` - lvs2009.codfw.wmnet (**WARN**)   - Downtimed ho...
[14:47:27] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/927208 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[14:48:09] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs2009: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/927206 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[14:48:55] <sukhe>	 !log homer "cr*-codfw*" commit "Gerrit: 927208 remove decommissioned host lvs2009": T335777
[14:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:58] <stashbot>	 T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777
[14:49:14] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata)
[14:49:45] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:50:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48738 and previous config saved to /var/cache/conftool/dbconfig/20230605-145003-ladsgroup.json
[14:50:07] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[14:50:17] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:50:30] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:51:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove wmflib hack for logoutd scripts on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/927227
[14:52:05] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:52:18] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:54:21] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:55:05] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:55:12] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:55:17] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:57:02] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder)
[14:58:25] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:58:43] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:59:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, I don't think we need to cleanup those files manually either." [puppet] - 10https://gerrit.wikimedia.org/r/927227 (owner: 10Muehlenhoff)
[14:59:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:00:33] <wikibugs>	 (CR) Muehlenhoff: Remove wmflib hack for logoutd scripts on Stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/927227 (owner: Muehlenhoff)
[15:01:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove wmflib hack for logoutd scripts on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/927227 (owner: 10Muehlenhoff)
[15:01:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48739 and previous config saved to /var/cache/conftool/dbconfig/20230605-150117-ladsgroup.json
[15:01:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[15:01:22] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:01:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[15:01:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48740 and previous config saved to /var/cache/conftool/dbconfig/20230605-150138-ladsgroup.json
[15:02:36] <icinga-wm>	 PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48741 and previous config saved to /var/cache/conftool/dbconfig/20230605-150347-ladsgroup.json
[15:04:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48742 and previous config saved to /var/cache/conftool/dbconfig/20230605-150509-ladsgroup.json
[15:05:16] <moritzm>	 !log installing avahi security updates
[15:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:51] <wikibugs>	 (CR) Filippo Giunchedi: pyrra: initial packaging for v0.6.2 (1 comment) [debs/pyrra] - https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[15:06:01] <wikibugs>	 (03PS2) 10BBlack: wikidata maxlag maint script: use new pybal VIPs [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703)
[15:06:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Setup DNS for lvs2013 - pt1979@cumin2002"
[15:07:12] <wikibugs>	 (03PS1) 10Ayounsi: Netbox/Netbox-next: disable public /metrics [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703)
[15:07:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Setup DNS for lvs2013 - pt1979@cumin2002"
[15:07:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:07:42] <wikibugs>	 (03CR) 10Ayounsi: "Manually tested on netbox-next." [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi)
[15:07:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2013.mgmt.codfw.wmnet with reboot policy FORCED
[15:12:11] <wikibugs>	 (03CR) 10BBlack: "PCC looks right (although it's a little confusing right now - normally there's 2x --lb here for 1019 + 2009, but 2009 is currently being d" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[15:12:24] <wikibugs>	 (03CR) 10BBlack: "https://puppet-compiler.wmflabs.org/output/927200/41548/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[15:12:33] <wikibugs>	 (03CR) 10BCornwall: "Looking at some older, similar commits shows that text_envoy.yaml and text_haproxy.yaml was also updated for these things. Indeed, it look" [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[15:14:12] <icinga-wm>	 PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:44] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] "Pushing this now, because currently with the ongoing LVS replacement in T326767 , the wikidata maxlag calculation doesn't work at all beca" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[15:16:53] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[15:18:49] <logmsgbot>	 !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T326767 (duration: 102m 46s)
[15:18:49] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@674ec0a]: (no justification provided)
[15:18:52] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[15:18:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P48744 and previous config saved to /var/cache/conftool/dbconfig/20230605-151853-ladsgroup.json
[15:19:01] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@674ec0a]: (no justification provided) (duration: 00m 17s)
[15:19:26] <moritzm>	 !log installing debian-archive-keyring updates on bullseye hosts
[15:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48745 and previous config saved to /var/cache/conftool/dbconfig/20230605-152015-ladsgroup.json
[15:24:23] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans)
[15:24:35] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Also LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans)
[15:24:50] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "...likewise :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans)
[15:26:24] <wikibugs>	 (03PS3) 10JHathaway: java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972)
[15:27:01] <wikibugs>	 (03CR) 10JHathaway: java: ensure wmf-certificates is installed, when required (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:27:12] <Amir1>	 !log on s3 master: update `text` set old_text = 'O:18:"historyblobcurstub":1:{s:6:"mCurId";i:5532;}', old_flags = 'object' where old_id= 14484; (T337700)
[15:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:16] <stashbot>	 T337700: Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T337700
[15:29:28] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: hiera type defs [puppet] - 10https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:29:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff)
[15:30:04] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1530).
[15:30:24] <wikibugs>	 (03CR) 10SBassett: [C: 03+1] "Looks correct in relation to what's on the bug." [puppet] - 10https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis)
[15:30:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2013.mgmt.codfw.wmnet with reboot policy FORCED
[15:31:49] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::worker: add more gpu settings [puppet] - 10https://gerrit.wikimedia.org/r/927232
[15:32:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: add more gpu settings [puppet] - 10https://gerrit.wikimedia.org/r/927232 (owner: 10Elukey)
[15:33:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance
[15:33:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance
[15:33:24] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2ef51d27-4384-414f-9fdf-8fe7b4c93b00) set by elukey@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: Host under maintenanc...
[15:33:50] <wikibugs>	 (03PS4) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204
[15:33:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P48746 and previous config saved to /var/cache/conftool/dbconfig/20230605-153359-ladsgroup.json
[15:34:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:35:09] <wikibugs>	 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10Gehel)
[15:35:14] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:35:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48747 and previous config saved to /var/cache/conftool/dbconfig/20230605-153521-ladsgroup.json
[15:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:35:25] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:35:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:35:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48748 and previous config saved to /var/cache/conftool/dbconfig/20230605-153542-ladsgroup.json
[15:35:50] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] interface::route: Make interface mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond)
[15:36:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2013']
[15:36:43] <wikibugs>	 (03CR) 10Reedy: [C: 04-1] "This should be good to go when 1.41.0-wmf.11 is out and stable..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy)
[15:37:00] <wikibugs>	 (03PS1) 10Eigyan: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728)
[15:37:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2013']
[15:37:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[15:37:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2013']
[15:40:35] <wikibugs>	 (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[15:41:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48749 and previous config saved to /var/cache/conftool/dbconfig/20230605-154110-ladsgroup.json
[15:41:14] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:41:18] <wikibugs>	 (03PS2) 10Krinkle: Fix oversample naming to match schema. [extensions/NavigationTiming] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917736
[15:41:36] <wikibugs>	 (03Abandoned) 10Krinkle: Fix oversample naming to match schema. [extensions/NavigationTiming] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917736 (owner: 10Krinkle)
[15:44:54] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) 05Open→03Resolved I can confirm that the GPUs are working on ml-serve1001, thanks!
[15:44:58] <wikibugs>	 (03PS4) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797)
[15:46:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi)
[15:46:42] <wikibugs>	 (03PS2) 10Jkieserman: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[15:47:24] <wikibugs>	 (03PS5) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797)
[15:49:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[15:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48750 and previous config saved to /var/cache/conftool/dbconfig/20230605-154905-ladsgroup.json
[15:49:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[15:49:09] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[15:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[15:49:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48751 and previous config saved to /var/cache/conftool/dbconfig/20230605-154926-ladsgroup.json
[15:50:04] <wikibugs>	 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10Gehel)
[15:51:13] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2013']
[15:51:30] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41549/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[15:51:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48752 and previous config saved to /var/cache/conftool/dbconfig/20230605-155134-ladsgroup.json
[15:52:48] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[15:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:55:10] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack)
[15:55:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2013.codfw.wmnet with OS bullseye
[15:55:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye
[15:56:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48753 and previous config saved to /var/cache/conftool/dbconfig/20230605-155617-ladsgroup.json
[15:58:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:58:36] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583)
[15:59:22] <bblack>	 !log mw1419: manually executing a php restart to test new safe-service-restart
[15:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:29] <wikibugs>	 (03PS4) 10JHathaway: add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972)
[15:59:56] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[16:00:15] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] add container facts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:01:04] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:01:14] <wikibugs>	 10SRE, 10Traffic: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10Krinkle) >>! In T86651#973435, @mark wrote: > FWIW: An alternative sh implementation that I've written for an old kernel and fixes some of these issues (a looong time ago), lives [[ http://svn.wikimedia.org/viewvc/mediawi...
[16:02:10] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] don't export resources when wmflib::have_puppetdb() is false [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[16:02:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey)
[16:03:04] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder)
[16:03:38] <wikibugs>	 (03PS1) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366)
[16:04:37] <leszek_wmde>	 hello sukhe - mind if we tried merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/923427/ ?
[16:05:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox/Netbox-next: disable public /metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi)
[16:05:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:05:33] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:05:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[16:06:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[16:06:16] <sukhe>	 leszek_wmde: hello
[16:06:18] <sukhe>	 yes, let's do it 
[16:06:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:06:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the abuse filter wikireplica view rules [puppet] - 10https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis)
[16:06:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:06:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P48754 and previous config saved to /var/cache/conftool/dbconfig/20230605-160640-ladsgroup.json
[16:06:42] <leszek_wmde>	 sukhe: great!
[16:07:30] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Allow HTTP PATCH requests on "beta" sites [puppet] - 10https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: 10WMDE-leszek)
[16:08:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:11:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48755 and previous config saved to /var/cache/conftool/dbconfig/20230605-161123-ladsgroup.json
[16:11:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:12:35] <wikibugs>	 (03PS2) 10Klausman: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121)
[16:14:10] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41551/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[16:16:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:16:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage
[16:17:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10WMF-General-or-Unknown, and 4 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10sbassett)
[16:18:28] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) Current Logstash SLO appears to be measuring the //number// of events encountered in lagged state.  This SLO affords us...
[16:18:38] <wikibugs>	 (03PS1) 10Elukey: admin_ng: bump limits for ml-serve's experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/927237
[16:19:28] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
[16:20:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage
[16:21:04] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
[16:21:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P48756 and previous config saved to /var/cache/conftool/dbconfig/20230605-162147-ladsgroup.json
[16:21:56] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal
[16:22:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: bump limits for ml-serve's experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/927237 (owner: 10Elukey)
[16:23:46] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal
[16:24:05] <sukhe>	 ^ replicas
[16:26:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48757 and previous config saved to /var/cache/conftool/dbconfig/20230605-162629-ladsgroup.json
[16:26:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[16:26:33] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[16:26:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[16:26:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:27:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:27:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48758 and previous config saved to /var/cache/conftool/dbconfig/20230605-162707-ladsgroup.json
[16:27:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey)
[16:33:38] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:24] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863
[16:34:28] <wikibugs>	 (03PS2) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863
[16:35:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:35:32] <wikibugs>	 (03PS1) 10CDanis: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024)
[16:35:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48759 and previous config saved to /var/cache/conftool/dbconfig/20230605-163545-ladsgroup.json
[16:35:48] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[16:36:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48760 and previous config saved to /var/cache/conftool/dbconfig/20230605-163653-ladsgroup.json
[16:36:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[16:37:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[16:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T336886)', diff saved to https://phabricator.wikimedia.org/P48761 and previous config saved to /var/cache/conftool/dbconfig/20230605-163714-ladsgroup.json
[16:37:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul)
[16:37:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:37:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2013.codfw.wmnet with OS bullseye
[16:37:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye completed...
[16:44:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T336886)', diff saved to https://phabricator.wikimedia.org/P48762 and previous config saved to /var/cache/conftool/dbconfig/20230605-164423-ladsgroup.json
[16:44:27] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[16:46:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "One nit, feel free to ignore.  LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis)
[16:49:34] <wikibugs>	 (03PS1) 10Herron: add 0.6.2 ui/package.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240
[16:49:58] <wikibugs>	 (03PS7) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995)
[16:50:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48763 and previous config saved to /var/cache/conftool/dbconfig/20230605-165051-ladsgroup.json
[16:51:19] <wikibugs>	 (03CR) 10CDanis: Enable user network probe events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis)
[16:56:13] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "Note that lvs2010's PCC diff shows that one instance of mh will be reverted to sh. The temporary hack to enable mh had switched sh→mh on l" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[16:57:28] <wikibugs>	 (03CR) 10EllenR: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[16:58:48] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:59:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P48764 and previous config saved to /var/cache/conftool/dbconfig/20230605-165929-ladsgroup.json
[16:59:33] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700)
[17:00:04] <jouncebot>	 ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700). nyaa~
[17:00:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 (owner: 10Ottomata)
[17:00:34] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10herron) a:05herron→03None
[17:01:21] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 (owner: 10Ottomata)
[17:02:16] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul)
[17:05:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48765 and previous config saved to /var/cache/conftool/dbconfig/20230605-170557-ladsgroup.json
[17:06:35] <cdanis>	 jouncebot: nowandnext
[17:06:35] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700)
[17:06:35] <jouncebot>	 For the next 0 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700)
[17:06:35] <jouncebot>	 In 2 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000)
[17:09:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cdanis@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis)
[17:09:35] <wikibugs>	 (03PS2) 10CDanis: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024)
[17:09:50] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by cdanis@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis)
[17:10:44] <wikibugs>	 (03Merged) 10jenkins-bot: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis)
[17:11:21] <cdanis>	 ottomata: were you just about to deploy your change to mediawiki-config .... ?
[17:11:33] <wikibugs>	 (03CR) 10Dzahn: releases: clone repos/releng/release from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy)
[17:11:36] <ottomata>	 yes...trying to verify in beta first but it is taking longer than I expected
[17:11:46] <cdanis>	 ack
[17:11:49] <ottomata>	 it should be a no-op, but last week it wasn't (i believe because train hadn't been fully deployed)
[17:12:00] <ottomata>	 i can try on a mwdebug host..
[17:12:30] <logmsgbot>	 !log cdanis@deploy1002 Backport cancelled.
[17:13:07] <ottomata>	 oh, am I in the way of a deploy?
[17:13:44] <ottomata>	 cdanis: let me revert again, didn't realize, i didn't see any changes listed, but I see now that for this window they don't need to be?
[17:14:08] <cdanis>	 ottomata: no I was sneaking in my patch during a quiet window :)
[17:14:10] <cdanis>	 np
[17:14:12] <cdanis>	 not exactly, I was sneaking my patch in
[17:14:14] <cdanis>	 we just had the same idea at the same time
[17:14:26] <ottomata>	 ha ok let me try real quick on mwdebug...if it doesn't work there i'll revert
[17:14:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P48766 and previous config saved to /var/cache/conftool/dbconfig/20230605-171436-ladsgroup.json
[17:16:32] <ottomata>	 looking fine, i'm proceeding with my deployment
[17:16:36] <cdanis>	 ty!
[17:17:12] <wikibugs>	 (03PS2) 10Dzahn: trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064)
[17:17:16] <wikibugs>	 (03PS3) 10Jkieserman: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[17:18:30] <wikibugs>	 (03CR) 10Jkieserman: [C: 03+1] "Doesn't look like I have merge rights on this repo..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[17:19:17] <wikibugs>	 (03PS3) 10Dzahn: trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064)
[17:19:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+1] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[17:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48767 and previous config saved to /var/cache/conftool/dbconfig/20230605-172103-ladsgroup.json
[17:21:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[17:21:07] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[17:21:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[17:21:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48768 and previous config saved to /var/cache/conftool/dbconfig/20230605-172124-ladsgroup.json
[17:23:26] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:24:41] <cdanis>	 btw ottomata next time you should try the new `scap backport` command :) https://phabricator.wikimedia.org/phame/post/view/297/scap_backport_makes_deployments_easy/
[17:24:54] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[17:25:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:26:33] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: no-op: Remove unneeded wgEventBusStreamNamesMap override setting (take 2) - T336817 (duration: 09m 25s)
[17:26:36] <stashbot>	 T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[17:26:59] <logmsgbot>	 !log cdanis@deploy1002 Backport cancelled.
[17:27:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48769 and previous config saved to /var/cache/conftool/dbconfig/20230605-172700-ladsgroup.json
[17:27:04] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[17:27:34] <brett>	 jouncebot: nowandnext
[17:27:35] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700)
[17:27:35] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700)
[17:27:35] <jouncebot>	 In 2 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000)
[17:28:14] <logmsgbot>	 !log cdanis@deploy1002 Started scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]]
[17:28:17] <stashbot>	 T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024
[17:28:21] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] varnish: remove/adjust rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[17:29:02] <sukhe>	 are deployments done?
[17:29:15] <cdanis>	 sukhe: I have one running now
[17:29:21] <sukhe>	 cdanis: ok! thanks
[17:29:24] <cdanis>	 it should be a no-op though
[17:29:30] <sukhe>	 ok
[17:29:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T336886)', diff saved to https://phabricator.wikimedia.org/P48770 and previous config saved to /var/cache/conftool/dbconfig/20230605-172942-ladsgroup.json
[17:29:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[17:29:56] <sukhe>	 cdanis: not urgent at all on our side
[17:29:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[17:30:01] <logmsgbot>	 !log cdanis@deploy1002 cdanis: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[17:30:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48771 and previous config saved to /var/cache/conftool/dbconfig/20230605-173002-ladsgroup.json
[17:30:24] <wikibugs>	 (03CR) 10Dzahn: "I will amend to it to switch it to the "insetup" role. That way gerrit role can be removed before decom cookbook destroyed server. @hashar" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[17:30:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:33:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48772 and previous config saved to /var/cache/conftool/dbconfig/20230605-173356-ladsgroup.json
[17:34:00] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[17:36:32] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:38:16] <logmsgbot>	 !log cdanis@deploy1002 Finished scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] (duration: 10m 02s)
[17:38:19] <stashbot>	 T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024
[17:39:28] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) a:03colewhite
[17:42:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48773 and previous config saved to /var/cache/conftool/dbconfig/20230605-174206-ladsgroup.json
[17:42:44] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:36] <wikibugs>	 (03PS2) 10Herron: add 0.6.2 ui/package*.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240
[17:46:48] <wikibugs>	 (03PS8) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995)
[17:47:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:47:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:49:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P48774 and previous config saved to /var/cache/conftool/dbconfig/20230605-174902-ladsgroup.json
[17:49:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:03] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: no-op: Remove unused page_change rc streams - T336817 (duration: 20m 11s)
[17:50:06] <stashbot>	 T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[17:52:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:54:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:51] <taavi>	 btullis: https://phabricator.wikimedia.org/T338172 seems related to the wiki replica view changes
[17:56:13] <wikibugs>	 (03CR) 10Dzahn: "oh right, once I put it back in "setup" role it will also remove shell access except for global roots, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[17:57:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48775 and previous config saved to /var/cache/conftool/dbconfig/20230605-175712-ladsgroup.json
[17:57:33] <wikibugs>	 (03PS1) 10Ssingh: Revert "Allow HTTP PATCH requests on "beta" sites" [puppet] - 10https://gerrit.wikimedia.org/r/926864
[17:58:26] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
[17:58:43] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
[17:59:31] <wikibugs>	 (03PS3) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427)
[18:00:54] <wikibugs>	 (03CR) 10Dzahn: "ok with your shell access being removed at this point? data is copied to gerrit1003 and bacula, as long as it's under /srv/gerrit or /var/" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:03:32] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "not yet, this will happen after it is disabled in trafficserver for some time" [puppet] - 10https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[18:04:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P48776 and previous config saved to /var/cache/conftool/dbconfig/20230605-180408-ladsgroup.json
[18:04:12] <wikibugs>	 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) @DAlangi_WMF we talked about this the other day, can you share your...
[18:04:27] <wikibugs>	 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) a:03DAlangi_WMF
[18:09:35] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Declare Metrics Platform stream for wikifunctionswiki on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922569 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[18:10:20] <wikibugs>	 (03Merged) 10jenkins-bot: Declare Metrics Platform stream for wikifunctionswiki on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922569 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin)
[18:10:27] <wikibugs>	 (03PS3) 10Jforrester: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin)
[18:10:31] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin)
[18:11:25] <wikibugs>	 (03Merged) 10jenkins-bot: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin)
[18:12:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48777 and previous config saved to /var/cache/conftool/dbconfig/20230605-181219-ladsgroup.json
[18:12:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:12:24] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[18:12:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:14:01] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817)
[18:14:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata)
[18:15:20] <wikibugs>	 (03PS2) 10Ottomata: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817)
[18:16:44] <wikibugs>	 (03PS1) 10Dzahn: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427)
[18:17:19] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata)
[18:18:04] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata)
[18:19:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48778 and previous config saved to /var/cache/conftool/dbconfig/20230605-181915-ladsgroup.json
[18:19:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[18:19:18] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[18:19:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[18:19:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48779 and previous config saved to /var/cache/conftool/dbconfig/20230605-181935-ladsgroup.json
[18:19:48] <wikibugs>	 (03CR) 10Dzahn: "CC: this is kind of a hard step to disable "gerrit-old.wikimedia.org" but of course it's revertable. so just fyi" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:21:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48780 and previous config saved to /var/cache/conftool/dbconfig/20230605-182144-ladsgroup.json
[18:21:59] <wikibugs>	 (03PS2) 10Dzahn: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427)
[18:22:22] <wikibugs>	 (03PS1) 10BCornwall: pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797)
[18:22:38] <wikibugs>	 (03CR) 10Dzahn: "I see more reviewers coming from the bot. so TLDR "what this really means is "cloud can't talk to gerrit1001 anymore" but gerrit on that h" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:25:01] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41555/console" [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[18:25:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:25:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:25:48] <wikibugs>	 (03Merged) 10jenkins-bot: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:26:06] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[18:26:16] <brett>	 jouncebot: nowandnext
[18:26:16] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 33 minute(s)
[18:26:16] <jouncebot>	 In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000)
[18:27:06] <wikibugs>	 (03CR) 10BBlack: "These rewrites were always a very ugly hack, given the interaction of the sitemaps scheme with our vcl-switching code (which is why they'r" [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[18:28:12] <brett>	 !log Maglev LVS scheduler rollout in eqiad (puppet disabled) - T263797
[18:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:15] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[18:28:50] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[18:28:57] <wikibugs>	 (03Abandoned) 10Ssingh: Revert "Allow HTTP PATCH requests on "beta" sites" [puppet] - 10https://gerrit.wikimedia.org/r/926864 (owner: 10Ssingh)
[18:29:46] <sukhe>	 !log homer "cr*-eqiad*" commit "Gerrit: 927246 remove old gerrit service IP"
[18:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:30:46] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: revert - Remove unused page_change rc streams - T336817 (duration: 11m 23s)
[18:30:48] <stashbot>	 T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[18:31:52] <wikibugs>	 (03PS2) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064)
[18:32:22] <inflatador>	 !log bking@cumin1001 depooling wdqs2010 for fw update T331297
[18:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:26] <stashbot>	 T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts - https://phabricator.wikimedia.org/T331297
[18:32:31] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] "Changes merged after homer run. Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:33:13] <wikibugs>	 (03PS1) 10Ottomata: Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248
[18:33:27] <wikibugs>	 (03PS3) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064)
[18:34:13] <wikibugs>	 (03CR) 10Dzahn: "@bblack ACK! thank you for the review, I amended to completely remove it. Is it right this way to remove the entire "sub_cluster" in both " [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[18:35:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2010.codfw.wmnet
[18:35:42] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:36:21] <wikibugs>	 (03CR) 10Eigyan: Deploy GDI safety survey to JA and RU wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan)
[18:36:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P48781 and previous config saved to /var/cache/conftool/dbconfig/20230605-183650-ladsgroup.json
[18:38:04] <wikibugs>	 (03CR) 10Dzahn: "thanks for review and deployment, that was super quick, appreciated" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[18:38:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248 (owner: 10Ottomata)
[18:39:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248 (owner: 10Ottomata)
[18:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:39:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:42:30] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:43:16] <wikibugs>	 (03CR) 10BBlack: varnish: remove rewrites and tests for sitemaps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[18:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:45:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2010.codfw.wmnet
[18:47:54] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:48:03] <inflatador>	 !log bking@cumin1001 repooling wdqs2010 T331297
[18:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:07] <stashbot>	 T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts - https://phabricator.wikimedia.org/T331297
[18:48:30] <wikibugs>	 (03PS4) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064)
[18:48:36] <inflatador>	 !log bking@cumin1001 depooling wdqs2011 for fw update T331297
[18:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:42] <icinga-wm>	 RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:48:50] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[18:48:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet
[18:49:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:50:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:51:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P48782 and previous config saved to /var/cache/conftool/dbconfig/20230605-185156-ladsgroup.json
[18:52:00] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "Perfect.  It feels good to see 67 lines of VCL varnish into the ether 😊" [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[18:52:09] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: no-op: revert - remove unneeded wgEventBusStreamNamesMap override setting (take 2) - T336817 (duration: 11m 54s)
[18:52:12] <stashbot>	 T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817
[18:53:14] <wikibugs>	 (03PS5) 10JHathaway: puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972)
[18:54:18] <wikibugs>	 (03CR) 10JHathaway: puppetserver: add additional config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[18:55:00] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[18:56:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet
[18:57:17] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder)
[18:58:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2011.codfw.wmnet
[18:59:12] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[19:03:45] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet
[19:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[19:05:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[19:05:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48783 and previous config saved to /var/cache/conftool/dbconfig/20230605-190528-ladsgroup.json
[19:05:32] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[19:05:36] <wikibugs>	 (03Abandoned) 10TChin: Fix overlapping names edge case in flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin)
[19:07:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48784 and previous config saved to /var/cache/conftool/dbconfig/20230605-190702-ladsgroup.json
[19:07:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[19:07:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[19:10:24] <icinga-wm>	 PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:11:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[19:12:28] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet
[19:12:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs2011.codfw.wmnet
[19:13:36] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[19:16:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Isaac) Excellent -- sounds like this task can be resolved then. I'll allow SRE to handle that in case they have a specific process but good luck @SalimJah and don't hesitate...
[19:17:05] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: subtract keys rather than passing to dump_params [puppet] - 10https://gerrit.wikimedia.org/r/927250
[19:19:34] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: subtract keys rather than passing to dump_params [puppet] - 10https://gerrit.wikimedia.org/r/927250 (owner: 10JHathaway)
[19:23:12] <wikibugs>	 (03PS9) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995)
[19:24:22] <icinga-wm>	 RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:24:42] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal
[19:25:02] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:26:26] <wikibugs>	 (03CR) 10JHathaway: bookworm: Change to deb822 format for sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[19:28:28] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:30:07] <wikibugs>	 (03PS3) 10JHathaway: bookworm: Change to deb822 format for sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495)
[19:32:29] <brett>	 !log Maglev LVS scheduler rollout in eqiad finished (puppet re-enabled) - T263797
[19:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:33] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[19:32:59] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] bookworm: Change to deb822 format for sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[19:38:55] <wikibugs>	 (03PS10) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995)
[19:42:05] <wikibugs>	 (03CR) 10Herron: pyrra: initial packaging for v0.6.2 (031 comment) [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:43:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48785 and previous config saved to /var/cache/conftool/dbconfig/20230605-194336-ladsgroup.json
[19:43:40] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[19:58:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P48786 and previous config saved to /var/cache/conftool/dbconfig/20230605-195842-ladsgroup.json
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000).
[20:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <icinga-wm>	 RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:21] <cjming>	 i'll deploy since i'm the only one with patches in the queue
[20:00:45] <urbanecm>	 cjming: would you mind pinging me once done?
[20:00:53] <urbanecm>	 I have some deployments to do :))
[20:01:30] <cjming>	 urbanecm: sure thing!
[20:01:45] <urbanecm>	 Ty! 
[20:01:54] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[20:03:18] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder)
[20:03:23] <wikibugs>	 (03PS5) 10Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355)
[20:04:42] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[20:04:52] <icinga-wm>	 PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:05:37] <wikibugs>	 (03PS4) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742
[20:05:45] <wikibugs>	 (03Merged) 10jenkins-bot: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming)
[20:06:35] <cjming>	 hmm - never encountered this before "backport failed: <CalledProcessError> Command '['git', '-C', '/srv/mediawiki-staging/php-1.41.0-wmf.11', 'fetch']' returned non-zero exit status 1."
[20:06:46] <cjming>	 should i retry?
[20:07:30] <urbanecm>	 cjming: let me see what's happening
[20:07:37] <urbanecm>	 does it include any details above it?
[20:07:39] <cjming>	 here's the error I'm seeing: error: insufficient permission for adding an object to repository database /srv/mediawiki-staging/php-1.41.0-wmf.11/.git/modules/extensions/DonationInterface/objects
[20:07:39] <cjming>	 fatal: failed to write object
[20:07:39] <cjming>	 fatal: unpack-objects failed
[20:07:55] <urbanecm>	 ah, okay. i know how to fix that one :)
[20:08:07] <cjming>	 phew - thanks! curious what the fix is
[20:08:41] <cjming>	 that error was preceded by "Fetching submodule extensions/DonationInterface"
[20:09:28] <urbanecm>	 !log [urbanecm@deploy1002 ~]$ sudo /usr/local/sbin/fix-staging-perms # attempt to fix permission errors when doing a backport
[20:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:32] <urbanecm>	 cjming: can you try again?
[20:09:36] <cjming>	 yup
[20:10:00] <urbanecm>	 fwiw, the fix should be to run `sudo /usr/local/sbin/fix-staging-perms`, which is supposed to fix permissions on the deployment host.
[20:10:12] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]]
[20:10:15] <stashbot>	 T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355
[20:10:19] <urbanecm>	 seems like a better outcome this time!
[20:10:21] <cjming>	 things look more promising - thanks!
[20:10:26] <urbanecm>	 any time
[20:13:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P48787 and previous config saved to /var/cache/conftool/dbconfig/20230605-201349-ladsgroup.json
[20:21:19] <wikibugs>	 10SRE: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Urbanecm)
[20:23:13] <logmsgbot>	 !log cjming@deploy1002 cjming: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:23:16] <stashbot>	 T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355
[20:24:58] <wikibugs>	 (03PS1) 10Dzahn: delete gerrit-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/927267 (https://phabricator.wikimedia.org/T336427)
[20:28:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48788 and previous config saved to /var/cache/conftool/dbconfig/20230605-202855-ladsgroup.json
[20:28:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[20:28:59] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[20:29:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[20:29:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48789 and previous config saved to /var/cache/conftool/dbconfig/20230605-202916-ladsgroup.json
[20:29:35] <wikibugs>	 (03PS1) 10Urbanecm: fix-stagging-perms: Fix group owner change for /srv/patches [puppet] - 10https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180)
[20:31:05] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[20:35:10] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]] (duration: 24m 57s)
[20:35:13] <stashbot>	 T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355
[20:35:30] <wikibugs>	 (03PS7) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797)
[20:35:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 (owner: 10Clare Ming)
[20:36:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Urbanecm) Hi @SlimJah,  not sure if this is helpful, but in addition to what @Isaac mentioned, there is also https://dumps.wikimedia.org/other/mediawiki_history/, which inclu...
[20:36:24] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 (owner: 10Clare Ming)
[20:36:45] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]]
[20:37:33] <wikibugs>	 (03PS1) 10Urbanecm: NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085)
[20:37:56] <urbanecm>	 cjming: would it be ok if i +2 my backport while your config deployment finishes, to save a bit time on CI? :)
[20:38:24] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41556/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[20:38:36] <cjming>	 urbanecm: of course! np
[20:38:40] <urbanecm>	 ty
[20:38:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085) (owner: 10Urbanecm)
[20:38:49] <logmsgbot>	 !log cjming@deploy1002 cjming: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:39:07] <wikibugs>	 (03Abandoned) 10BCornwall: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/924561 (https://phabricator.wikimedia.org/T263797) (owner: 10Ssingh)
[20:41:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "confirmed, obvious typo" [puppet] - 10https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180) (owner: 10Urbanecm)
[20:42:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] fix-stagging-perms: Fix group owner change for /srv/patches [puppet] - 10https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180) (owner: 10Urbanecm)
[20:42:35] <cjming>	 urbanecm: out of curiosity, i always meant to ask someone about this, so +2ing a backport manually for long-running CI on extensions means that wherever the deployer is in the process, there will be a notice at some point that there are diffs - as long as they are expected, i'm assuming it's ok to carry on with scap'ing -- i guess my Q is if there is ever a need to revert during a window, does manually +2ing cause problems?
[20:43:32] <wikibugs>	 (03PS1) 10JHathaway: bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495)
[20:44:34] <urbanecm>	 cjming: so, scap will tell you if it ever fetches a commit that you didn't specify on the commandline. it then gives you a chance to review and decide whether to continue. personally, i manually +2 to speed things up only after i tell scap to sync to the whole fleet, as by then it's highly unlikely i'll need to revert
[20:44:58] <mutante>	 Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/owner: owner changed 'cjming' to 'helm' (corrective)
[20:45:00] <urbanecm>	 even if i did need to revert, so long as it didn't get past mwdebug, i can deploy both the revert and the newly-merged patch together
[20:45:01] <mutante>	 Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/group: group changed 'wikidev' to 'deployment' (corrective)
[20:45:04] <mutante>	 Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/mode: mode changed '0644' to '0775' (corrective)
[20:45:07] <mutante>	 Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/File[/usr/local/sbin/fix-staging-perms]/content: 
[20:45:10] <mutante>	 +find /srv/patches -not -group wikidev -print0 | xargs -0 -r chgrp wikidev
[20:45:33] <cjming>	 urbanecm: got it - thanks - that makes sense
[20:45:36] <mutante>	 Profile::Mediawiki::Deployment::Server/File[/usr/local/sbin/fix-staging-perms]/content: content changed
[20:45:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[20:46:23] <urbanecm>	 mutante: seems like meaningful changes to me (assuming it's the diff for the change i proposed a few mins ago).
[20:46:38] <mutante>	 urbanecm: yea, I am sharing with you that it's done
[20:46:42] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]] (duration: 09m 57s)
[20:46:49] <mutante>	 merged and deployed the fix
[20:46:52] <urbanecm>	 thanks mutante!
[20:46:58] <cjming>	 urbanecm: all done - all yours
[20:47:01] <urbanecm>	 thanks!
[20:47:06] <cjming>	 np!
[20:47:10] <mutante>	 urbanecm: yw, also re: "scap will tell you.." , can you see https://phabricator.wikimedia.org/T338168
[20:47:24] <mutante>	 that ticket came out of incident review meeting today
[20:47:33] <urbanecm>	 now it (fix-staging-perms) finished without errors!
[20:47:35] <mutante>	 is it maybe already doing what is requested?
[20:47:42] <mutante>	 great, ok
[20:47:51] <mutante>	 yea, typo was obvious
[20:47:54] <urbanecm>	 !log [urbanecm@deploy1002 ~]$ sudo /usr/local/sbin/fix-staging-perms # verify T338180 fix
[20:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:56] <stashbot>	 T338180: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180
[20:48:06] <cjming>	 !log end of UTC late backport window
[20:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:10] <urbanecm>	 i'll check the ticket after my deployment :)
[20:48:27] <wikibugs>	 (03PS2) 10JHathaway: bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495)
[20:48:29] <mutante>	 of course, thanks
[20:48:47] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[20:49:34] <wikibugs>	 (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093)
[20:50:00] <wikibugs>	 (03CR) 10jenkins-bot: bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[20:50:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm)
[20:50:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm)
[20:50:53] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway)
[20:51:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm)
[20:51:29] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:926560|Update interwiki cache (T338093)]]
[20:51:32] <stashbot>	 T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093
[20:53:01] <wikibugs>	 10SRE, 10Patch-For-Review: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Dzahn) deployed in puppet and now:  ` [deploy1002:~] $ /usr/local/sbin/fix-staging-perms [deploy1002:~] $  `
[21:00:02] <icinga-wm>	 RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2100).
[21:00:48] <urbanecm>	 i'm still scap'ing, it is unreasonably slow :-/
[21:01:16] <wikibugs>	 (03Merged) 10jenkins-bot: NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085) (owner: 10Urbanecm)
[21:03:24] <wikibugs>	 (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094)
[21:04:42] <icinga-wm>	 PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:03] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:926560|Update interwiki cache (T338093)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:05:06] <urbanecm>	 scap is being UNREASONABLY slow :-/
[21:05:07] <stashbot>	 T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093
[21:05:29] <wikibugs>	 (03PS1) 10TheDJ: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183)
[21:06:00] <mutante>	 urbanecm: I ran the fix-perms script when you were already using scap. but seems unrelated
[21:06:19] <urbanecm>	 yeah, it was slow even with cj.ming's patch.
[21:06:24] <mutante>	 ok
[21:08:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48790 and previous config saved to /var/cache/conftool/dbconfig/20230605-210827-ladsgroup.json
[21:08:31] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[21:11:01] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[21:14:13] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm)
[21:14:40] <wikibugs>	 10SRE, 10User-Urbanecm: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Urbanecm) 05Open→03Resolved a:03Urbanecm
[21:14:58] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm)
[21:15:18] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[21:16:03] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:926560|Update interwiki cache (T338093)]] (duration: 24m 34s)
[21:16:06] <stashbot>	 T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093
[21:16:47] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[21:16:53] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn)
[21:17:08] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]]
[21:17:11] <stashbot>	 T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085
[21:18:15] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[21:19:13] <Amir1>	 jouncebot: nowandnext
[21:19:13] <jouncebot>	 For the next 1 hour(s) and 40 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2100)
[21:19:13] <jouncebot>	 In 4 hour(s) and 40 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0200)
[21:19:34] <Amir1>	 urbanecm: please tell me once you're done!
[21:19:36] <urbanecm>	 will do
[21:19:50] <urbanecm>	 probably in ~30 minutes if scap backport doesn't speed up itself.
[21:22:17] <Amir1>	 no worries
[21:23:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1015.eqiad.wmnet
[21:23:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P48791 and previous config saved to /var/cache/conftool/dbconfig/20230605-212333-ladsgroup.json
[21:25:38] <wikibugs>	 (03PS5) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064)
[21:25:45] <wikibugs>	 (03PS1) 10Urbanecm: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094)
[21:25:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1015.eqiad.wmnet
[21:25:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm)
[21:27:10] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm)
[21:27:56] <wikibugs>	 (03PS4) 10Dzahn: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[21:28:20] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[21:29:18] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[21:29:46] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[21:30:41] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:30:44] <stashbot>	 T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085
[21:31:09] <urbanecm>	 works, proceeding
[21:31:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[21:31:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[21:31:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[21:31:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[21:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48792 and previous config saved to /var/cache/conftool/dbconfig/20230605-213202-ladsgroup.json
[21:35:35] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1015.eqiad.wmnet
[21:35:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1015.eqiad.wmnet
[21:36:52] <wikibugs>	 (PS4) Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427)
[21:37:58] <wikibugs>	 (CR) Hashar: [C: -1] "This will work solely cause the target repositories have been fixed manually." [puppet] - https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: Reedy)
[21:38:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48793 and previous config saved to /var/cache/conftool/dbconfig/20230605-213819-ladsgroup.json
[21:38:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P48794 and previous config saved to /var/cache/conftool/dbconfig/20230605-213839-ladsgroup.json
[21:38:41] <wikibugs>	 (PS6) BCornwall: sre.cdn: move common functions to base class [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[21:42:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]] (duration: 25m 38s)
[21:42:50] <stashbot>	 T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085
[21:42:51] <urbanecm>	 finally
[21:42:54] <urbanecm>	 Amir1: stage's yours :)
[21:43:12] <Amir1>	 awesome
[21:43:28] <wikibugs>	 (CR) Dzahn: "would it be helpful if we made this an "if bullseye" thing for the migration period? so basically use gitlab on new hosts, don't touch old" [puppet] - https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: Reedy)
[21:44:34] <wikibugs>	 (CR) Ladsgroup: [C: +2] Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: Ladsgroup)
[21:46:04] <wikibugs>	 (CR) Dzahn: [V: +1] "fresh compiled. all looks good to me, including no change on doc hosts: https://puppet-compiler.wmflabs.org/output/914731/41557/" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[21:47:49] <wikibugs>	 (CR) BCornwall: "(rebased off of master so my broken ATS cookbook isn't included)." [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[21:50:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1016.eqiad.wmnet
[21:51:01] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1016.eqiad.wmnet
[21:52:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: Ladsgroup)
[21:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P48795 and previous config saved to /var/cache/conftool/dbconfig/20230605-215326-ladsgroup.json
[21:53:30] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[21:53:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48796 and previous config saved to /var/cache/conftool/dbconfig/20230605-215345-ladsgroup.json
[21:53:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[21:53:48] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[21:54:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[21:58:30] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-e3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[22:00:50] <wikibugs>	 (CR) JHathaway: "lgtm, ensureable not withstanding ;)" [puppet] - https://gerrit.wikimedia.org/r/926464 (owner: Jbond)
[22:00:57] <wikibugs>	 (CR) JHathaway: [C: +1] puppetserver::git: make ensureable [puppet] - https://gerrit.wikimedia.org/r/926464 (owner: Jbond)
[22:01:24] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1016.eqiad.wmnet
[22:01:26] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1016.eqiad.wmnet
[22:03:29] <wikibugs>	 (Merged) jenkins-bot: Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: Ladsgroup)
[22:03:50] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]]
[22:03:53] <stashbot>	 T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698
[22:05:30] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[22:06:55] <wikibugs>	 (PS1) Ladsgroup: moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700)
[22:07:15] <wikibugs>	 (CR) Ladsgroup: [C: +2] moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700) (owner: Ladsgroup)
[22:08:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P48797 and previous config saved to /var/cache/conftool/dbconfig/20230605-220833-ladsgroup.json
[22:08:49] <wikibugs>	 (CR) BCornwall: [C: +1] trafficserver::backend: Add a cache config for puppetboard-next (1 comment) [puppet] - https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: Jbond)
[22:13:39] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] (duration: 09m 48s)
[22:13:42] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:42] <stashbot>	 T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698
[22:15:52] <wikibugs>	 (PS1) Ladsgroup: Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - https://gerrit.wikimedia.org/r/927288
[22:15:58] <wikibugs>	 (PS2) Ladsgroup: Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - https://gerrit.wikimedia.org/r/927288
[22:16:04] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - https://gerrit.wikimedia.org/r/927288 (owner: Ladsgroup)
[22:16:36] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/927288 (owner: Ladsgroup)
[22:16:58] <wikibugs>	 (Merged) jenkins-bot: Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - https://gerrit.wikimedia.org/r/927288 (owner: Ladsgroup)
[22:17:13] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]]
[22:18:41] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[22:20:56] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:23:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48798 and previous config saved to /var/cache/conftool/dbconfig/20230605-222339-ladsgroup.json
[22:24:43] <wikibugs>	 (Merged) jenkins-bot: moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700) (owner: Ladsgroup)
[22:24:54] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]] (duration: 07m 40s)
[22:27:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[22:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[22:27:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48799 and previous config saved to /var/cache/conftool/dbconfig/20230605-222745-ladsgroup.json
[22:27:48] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[22:28:36] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]]
[22:28:38] <stashbot>	 T337700: Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via LqtVIew) - https://phabricator.wikimedia.org/T337700
[22:29:55] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:30:06] <wikibugs>	 (PS5) Dzahn: Use same php version for doc and integration websites [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:30:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[22:30:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48800 and previous config saved to /var/cache/conftool/dbconfig/20230605-223035-ladsgroup.json
[22:32:29] <wikibugs>	 (CR) CI reject: [V: -1] Use same php version for doc and integration websites [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:33:45] <wikibugs>	 (PS6) Dzahn: Use same php version for doc and integration websites [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:34:03] <zabe>	 Amir1: could you ping me when you are done?
[22:34:16] <Amir1>	 sure, almost done
[22:36:52] <wikibugs>	 (CR) Dzahn: [C: +2] Use same php version for doc and integration websites [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:37:40] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]] (duration: 09m 04s)
[22:37:43] <stashbot>	 T337700: Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via LqtVIew) - https://phabricator.wikimedia.org/T337700
[22:37:48] <Amir1>	 zabe: done ^
[22:38:33] <wikibugs>	 (PS2) Zabe: Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954)
[22:39:12] <wikibugs>	 (CR) Dzahn: [C: +2] "step 1: deployed on doc2002, then doc1003. noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:39:22] <wikibugs>	 (CR) Zabe: [C: +2] Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:40:11] <wikibugs>	 (Merged) jenkins-bot: Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:40:28] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]]
[22:40:30] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:41:48] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[22:42:06] * Amir1 grabs popcorn
[22:45:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48801 and previous config saved to /var/cache/conftool/dbconfig/20230605-224528-ladsgroup.json
[22:45:30] <wikibugs>	 (CR) Dzahn: [C: +2] "step 2: deployed on contint2002 (bullseye, not prod) - first puppet run errors, after second puppet run ok (dependencies). Otherwise looki" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:45:32] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[22:46:32] <mutante>	 !log contint2002, contint1002 - upgrading PHP from 7.3 to 7.4
[22:46:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:40] <James_F>	 mutante: Oooh.
[22:48:10] <mutante>	 James_F: 2001 as the actual prod server would be last. you can still veto it there :)
[22:48:16] <James_F>	 :-D
[22:48:22] <James_F>	 No no, I'm very happy it's happening.
[22:48:22] <mutante>	 or I jfdi
[22:48:25] <mutante>	 ok
[22:48:31] <James_F>	 Go go go.
[22:48:41] <James_F>	 Then I can land https://gerrit.wikimedia.org/r/c/integration/config/+/909388/
[22:48:43] <mutante>	 :) thanks, and also for the gerrit IP thing
[22:48:52] <mutante>	 nice
[22:49:05] <mutante>	 I want us to switch 2001 to 2002
[22:49:11] <James_F>	 Ack.
[22:49:20] <mutante>	 but if the PHP upgrade makes us feel better about it.. sure. we do that now
[22:49:27] <James_F>	 :-D
[22:49:41] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]] (duration: 09m 13s)
[22:49:44] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:51:05] <wikibugs>	 (CR) Dzahn: [C: +2] "step 4: deployed to contint1002, buster. 2 puppet runs needed, then manually removing 7.3 packages as above and restarting apache2" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:52:06] <mutante>	 James_F: do you know if contint1002 is used by anything? buster/prod but not the "main CI server"
[22:52:16] <mutante>	 I am not sure right now
[22:52:22] <mutante>	 regardless it is done there
[22:52:28] <mutante>	 now doing the main one
[22:52:34] <James_F>	 mutante: I *think* it's not currently used, but maybe it's used by releases-jenkins?
[22:52:56] <mutante>	 I had similar thoughts there. and a bit like gerrit-replica
[22:53:06] <James_F>	 Yeah.
[22:53:08] <mutante>	 going ahead
[22:53:39] <wikibugs>	 (CR) Cwhite: [C: +2] prometheus: add external swagger checks to all sites [puppet] - https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:53:43] <mutante>	 !log contint2001 (prod main CI server) - upgrading PHP 7.3 to 7.4
[22:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:19] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@8255d99]: I6c757561deb14e84a95ef9fc68053b3e48ff941c for T337425
[22:55:22] <stashbot>	 T337425: Re-implement post-merge publication of code coverage for Wikifunctions's repos on GitLab - https://phabricator.wikimedia.org/T337425
[22:55:30] <wikibugs>	 (CR) Cwhite: [C: +2] opensearch_dashboards: remove alerting and observability plugins [puppet] - https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[22:55:33] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@8255d99]: I6c757561deb14e84a95ef9fc68053b3e48ff941c for T337425 (duration: 00m 13s)
[22:56:02] <wikibugs>	 (PS1) Andrew Bogott: wmcs-cinder-backup-manager: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927308
[22:56:04] <wikibugs>	 (PS1) Andrew Bogott: wmcs-cinder-volume-backup: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927309
[22:56:23] <mutante>	 we shouldn't be using mod_php there
[22:56:35] <mutante>	 php-fpm... but also .. not worth it, afaict
[22:56:49] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927308 (owner: Andrew Bogott)
[22:57:00] <mutante>	 !log contint2001 - sudo apt-get remove --purge libapache2-mod-php7.3 php7.3-cli php7.3-common php7.3-json php7.3-opcache php7.3-readline
[22:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927309 (owner: Andrew Bogott)
[22:57:35] <mutante>	 !log contint2001 - sudo systemctl restart apache2
[22:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:52] <James_F>	 mutante: Old stuff that got copy-pasted into the present, or actually used?
[22:58:36] <mutante>	 James_F: which part do you mean? that we use mod_php ?
[22:58:41] <mutante>	 James_F: it is done 
[22:58:42] <James_F>	 mutante: Yeah.
[22:58:47] <James_F>	 \o/
[22:58:52] <mutante>	 https://integration.wikimedia.org/ is up
[22:58:55] <mutante>	 but is that the right test
[22:59:24] <mutante>	 wanna recheck your related change or something?
[22:59:24] <James_F>	 I've got a patch that'll stress-test it. ;-)
[22:59:30] <mutante>	 perfect
[22:59:51] <mutante>	 James_F: I think old stuff, just never migrated
[23:00:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P48802 and previous config saved to /var/cache/conftool/dbconfig/20230605-230034-ladsgroup.json
[23:00:44] <mutante>	 unsure how much "but soon all different anyways" applies for it :)
[23:01:14] <mutante>	 focuses on shutting down buster things
[23:02:01] <wikibugs>	 ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (phaultfinder)
[23:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48803 and previous config saved to /var/cache/conftool/dbconfig/20230605-230752-ladsgroup.json
[23:07:56] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[23:09:19] <wikibugs>	 (CR) Dzahn: [C: +2] "step 5: deployed to contint2001, main prod host, same. 2 puppet runs, remove old packages, restart apache2.. integration.wikimedia.org loo" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[23:09:34] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@ab77611]: Idf6c7ad01ed18785b850967252c6867d7871e902
[23:09:40] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:42] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@ab77611]: Idf6c7ad01ed18785b850967252c6867d7871e902 (duration: 00m 08s)
[23:10:56] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@6eefe56]: I5c1b92322ae59bfe8a9233ad23c3c89b844f5fb7 for T334492
[23:10:59] <stashbot>	 T334492: Create a new phan config file to make usage for libraries easier - https://phabricator.wikimedia.org/T334492
[23:11:02] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@6eefe56]: I5c1b92322ae59bfe8a9233ad23c3c89b844f5fb7 for T334492 (duration: 00m 05s)
[23:12:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:13:38] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:14:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove mgmt DNS for ssw1-a1 for testing - pt1979@cumin2002"
[23:15:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove mgmt DNS for ssw1-a1 for testing - pt1979@cumin2002"
[23:15:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:15:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P48804 and previous config saved to /var/cache/conftool/dbconfig/20230605-231540-ladsgroup.json
[23:15:57] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[23:22:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[23:22:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:22:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P48805 and previous config saved to /var/cache/conftool/dbconfig/20230605-232258-ladsgroup.json
[23:24:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002"
[23:25:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002"
[23:25:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:30:11] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) Please also see T334517#8904608 for a plan on how to proceed with contint* upgrades.  Also today we upgraded PHP from 7....
[23:30:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48806 and previous config saved to /var/cache/conftool/dbconfig/20230605-233046-ladsgroup.json
[23:30:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[23:30:50] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[23:31:01] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) @hashar This new machine is on buster. Somehow I thought we did bullseye from the start. I suggest we reimage it. See li...
[23:31:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[23:31:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T336886)', diff saved to https://phabricator.wikimedia.org/P48807 and previous config saved to /var/cache/conftool/dbconfig/20230605-233107-ladsgroup.json
[23:33:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T336886)', diff saved to https://phabricator.wikimedia.org/P48808 and previous config saved to /var/cache/conftool/dbconfig/20230605-233318-ladsgroup.json
[23:38:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P48809 and previous config saved to /var/cache/conftool/dbconfig/20230605-233804-ladsgroup.json
[23:39:12] <wikibugs>	 (PS1) Zabe: Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954)
[23:41:25] <wikibugs>	 (CR) Zabe: [C: +2] Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[23:42:11] <wikibugs>	 (Merged) jenkins-bot: Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[23:42:27] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]]
[23:42:30] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[23:43:47] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[23:48:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P48810 and previous config saved to /var/cache/conftool/dbconfig/20230605-234824-ladsgroup.json
[23:49:29] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]] (duration: 07m 02s)
[23:49:32] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[23:53:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48811 and previous config saved to /var/cache/conftool/dbconfig/20230605-235310-ladsgroup.json
[23:53:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:53:14] <stashbot>	 T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[23:53:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:53:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:53:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T336886)', diff saved to https://phabricator.wikimedia.org/P48812 and previous config saved to /var/cache/conftool/dbconfig/20230605-235346-ladsgroup.json