[00:19:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/926552
[00:39:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/926552 (owner: TrainBranchBot)
[00:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[00:57:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/926552 (owner: TrainBranchBot)
[01:33:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:33:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:38:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:52:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:00:31] (PS1) KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669)
[05:01:07] (PS2) KartikMistry: Use direct Parsoid in Small and Medium Wikis for Content Translation [mediawiki-config] - https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922)
[05:06:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:43] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50135 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:29:17] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:57:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:15:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 60427
[06:16:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 60427
[06:18:14] <_joe_> !log killing a pod with consistently high haproxy queue for thumbor in codfw
[06:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:22] SRE, serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (Joe) >>! In T329491#8898529, @Ladsgroup wrote: > So I looked at categorylinks tables everywhere. There are the top ten biggest ones: > ` > root@clouddb1021:/srv# ls -Ssh sqldata.s*/*/categorylinks.ibd | head...
[06:32:47] (CR) Elukey: "I like the approach! Left some ideas since the output of the template is not 100% correct at the moment." [deployment-charts] - https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[06:39:00] (CR) Elukey: java: ensure wmf-certificates is installed, when required (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[06:52:13] (CR) Matthias Mullie: [C: +1] [SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian [mediawiki-config] - https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: Matthias Mullie)
[06:58:11] (PS1) Giuseppe Lavagetto: Use the parsoid memory limit everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/927115 (https://phabricator.wikimedia.org/T334980)
[06:58:13] (PS1) Giuseppe Lavagetto: Load and enable parsoid everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/927116 (https://phabricator.wikimedia.org/T334980)
[07:00:05] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T0700)
[07:00:05] kart_ and matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:14] o/
[07:00:23] o/ I can deploy
[07:00:57] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: Matthias Mullie)
[07:02:05] (Merged) jenkins-bot: [SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian [mediawiki-config] - https://gerrit.wikimedia.org/r/926497 (https://phabricator.wikimedia.org/T336870) (owner: Matthias Mullie)
[07:02:50] !log taavi@deploy1002 Started scap: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]]
[07:02:54] T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:03:07] Sorry, late :/
[07:03:15] taavi: Let me know when done.
[07:03:20] kart_: sure, will do
[07:09:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[07:10:28] moritzm: o/ I was checking krb1001, there seem to be a lot of krb5-related log files in (deleted) state
[07:12:04] !log taavi@deploy1002 mlitn and taavi: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[07:12:08] matthiasmullie: please test
[07:12:10] T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:12:13] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:14] checking
[07:12:59] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:10] elukey: krb5kdc or something else? might be the rotation didn't kick in correctly or left some behind? https://phabricator.wikimedia.org/T337906
[07:13:59] df showed it as full, but then the actual file system usage was only like 7G, will keep an eye on it after the reboot
[07:14:38] taavi: LGTM!
[07:14:40] the disk ran full some time days ago and then we setup the increased rotation/compression for krb5kdc, so possibly it was still in a wedged state from the initial full disk
[07:14:45] thx, syncing
[07:15:19] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[07:15:21] RECOVERY - puppet last run on krb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:15:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[07:17:15] (CR) Elukey: [C: +2] kserve-inference: use dict instead of lists for inference_services [deployment-charts] - https://gerrit.wikimedia.org/r/925844 (owner: Elukey)
[07:18:20] moritzm: I left a comment in https://phabricator.wikimedia.org/T337906#8901375, I think it maybe related to the new logrotate rule
[07:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:21:17] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:926497|[SearchVue] Enable on Norwegian, Hungarian, Catalan, Dutch, and Ukrainian (T336870)]] (duration: 18m 27s)
[07:21:21] T336870: [S] Deploy Search Preview in 5 new wikis - https://phabricator.wikimedia.org/T336870
[07:21:21] taavi: thanks!
[07:21:26] yw
[07:22:07] kart_: I'm done, feel free to go ahead
[07:23:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:23:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:23:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:24:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:25:14] taavi: thanks
[07:25:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:25:45] (PS2) KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669)
[07:27:10] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry)
[07:27:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:28:06] (Merged) jenkins-bot: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/926833 (https://phabricator.wikimedia.org/T337669) (owner: KartikMistry)
[07:28:23] !log kartik@deploy1002 Started scap: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]]
[07:28:26] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669
[07:30:02] !log kartik@deploy1002 kartik: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:32:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:37:50] (PS7) Elukey: varnishkafka: add catch all systemd unit [puppet] - https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825)
[07:38:05] (PS11) Elukey: profile::cache::kafka: add support for PKI [puppet] - https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825)
[07:38:20] (PS7) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825)
[07:38:22] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:926833|testwiki: Enable Section Translation for 10 Wikipedias (T337669)]] (duration: 09m 58s)
[07:38:25] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669
[07:38:36] (PS12) Elukey: profile::cache::kafka: add support for PKI [puppet] - https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825)
[07:38:43] (PS8) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825)
[07:42:08] (CR) Jelto: [C: +2] Fix profile::gitlab::active_host and profile::gitlab::passive_hosts for devtools [puppet] - https://gerrit.wikimedia.org/r/926544 (https://phabricator.wikimedia.org/T338044) (owner: Ahmon Dancy)
[07:47:23] (CR) Muehlenhoff: gdnsd: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/926509 (owner: Muehlenhoff)
[07:50:09] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:50] !log installing containerd security updates
[07:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:08] (CR) Hashar: "Amending to replace `::facts` by `$facts['networking']['fqdn']` and I will rebase this change since Gerrit flags it as being in merge con" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[07:58:56] (CR) Hashar: "Addressing the few comments in next patchset. I am also rebasing the whole chain since at least the parent is marked as being as having a " [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[07:59:31] (CR) Filippo Giunchedi: "LGTM overall, see inline" [alerts] - https://gerrit.wikimedia.org/r/926425 (https://phabricator.wikimedia.org/T338010) (owner: Cparle)
[08:01:14] (CR) Filippo Giunchedi: [C: +1] opensearch_dashboards: remove alerting and observability plugins [puppet] - https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[08:02:32] (CR) Hashar: contint: set Jenkins agent username from hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:02:48] (PS2) Hashar: contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646)
[08:02:50] (PS6) Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:03:21] (CR) CI reject: [V: -1] contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:04:25] (CR) Filippo Giunchedi: [C: +2] prometheus: drop k8s pods-related metrics from cadvisor in 'ops' [puppet] - https://gerrit.wikimedia.org/r/925781 (https://phabricator.wikimedia.org/T337856) (owner: Filippo Giunchedi)
[08:05:15] AH puppet and tox fails
[08:05:16] 00:00:08.252 /usr/lib/python3/dist-packages/tox/config/__init__.py:579: UserWarning: conflicting basepython version (set 27, should be 2) for env 'py2-pep8';resolve conflict or set ignore_basepython_conflict
[08:05:16] 00:00:08.252 proposed_version, implied_version, testenv_config.envname
[08:05:17] :)
[08:05:28] 00:00:17.814 KeyError: key not found: "PARALLEL_PID_FILE" :D
[08:05:32] (CR) Filippo Giunchedi: [C: +2] webperf: Remove remnants of coal and coal-web [puppet] - https://gerrit.wikimedia.org/r/925918 (https://phabricator.wikimedia.org/T335242) (owner: Krinkle)
[08:11:41] (CR) DCausse: [C: +1] mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: Ottomata)
[08:13:11] (CR) Gmodena: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: Ottomata)
[08:13:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] (PS3) Hashar: contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646)
[08:17:26] (PS7) Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:17:28] (PS5) Hashar: contint: set Jenkins agent username from hiera [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646)
[08:22:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:28:50] (CR) Jaime Nuche: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:29:10] (CR) Filippo Giunchedi: [C: +1] prometheus: add external swagger checks to all sites [puppet] - https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[08:30:27] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:30:36] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:30:43] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:31:23] SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Stop cadvisor from collecting extra metrics from docker - https://phabricator.wikimedia.org/T337856 (fgiunchedi)
[08:31:36] SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi)
[08:31:38] SRE, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q4), User-fgiunchedi: Stop cadvisor from collecting extra metrics from docker - https://phabricator.wikimedia.org/T337856 (fgiunchedi) Open→Resolved a:fgiunchedi This is done!
[08:31:44] (PS2) Muehlenhoff: Cloud: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/925721
[08:32:59] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/926543 (https://phabricator.wikimedia.org/T337766) (owner: Cathal Mooney)
[08:34:39] (CR) Hashar: "PCC https://puppet-compiler.wmflabs.org/output/922515/1889/" [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:37:03] (CR) Hashar: [C: -1] "This one breaks on PCC https://puppet-compiler.wmflabs.org/output/922555/1888/ with:" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:38:07] (CR) Hashar: "PCC https://puppet-compiler.wmflabs.org/output/922554/1890/" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:40:07] jbond: jnuche: sorry for the spam on that series of changes, I think I screwed up the rebase :/
[08:40:53] !log power-cycling restbase1027 - T338122
[08:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:56] T338122: restbase1027.eqiad.wmnet down - https://phabricator.wikimedia.org/T338122
[08:41:08] (Abandoned) Klein Muçi: Content Translation: Set MT threshold to 90% for Albanian WP [mediawiki-config] - https://gerrit.wikimedia.org/r/925574 (owner: Klein Muçi)
[08:42:48] (PS1) Jcrespo: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - https://gerrit.wikimedia.org/r/927119
[08:43:00] (CR) CI reject: [V: -1] Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - https://gerrit.wikimedia.org/r/927119 (owner: Jcrespo)
[08:44:36] hashar: ah, no worries :)
[08:44:37] PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:44:39] PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:44:57] (PS6) Hashar: contint: set Jenkins agent username from hiera [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646)
[08:44:59] (PuppetDisabled) firing: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[08:45:03] PROBLEM - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:45:09] RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:45:16] I screwed up the rebase somehow
[08:45:24] (PS2) Jcrespo: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - https://gerrit.wikimedia.org/r/927119
[08:45:25] RECOVERY - Restbase root url on restbase1027 is OK: HTTP OK: HTTP/1.1 200 - 17613 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[08:45:39] (PS8) Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646)
[08:45:49] PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:45:54] (CR) Hashar: [C: +1] contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:46:03] PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:46:03] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:46:15] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[08:47:31] (PS3) Jcrespo: backup: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - https://gerrit.wikimedia.org/r/927119
[08:47:35] RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:47:35] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:48:11] RECOVERY - cassandra-b service on restbase1027 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:49:18] (PS2) Giuseppe Lavagetto: trafficserver: also match mobile domains in mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/924080
[08:49:43] RECOVERY - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.184 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] RECOVERY - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.185 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] RECOVERY - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.186 port 9042 https://phabricator.wikimedia.org/T93886
[08:49:43] RECOVERY - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2025-02-21 18:43:53 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:49:43] RECOVERY - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2025-02-21 18:43:51 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:49:43] RECOVERY - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is OK: SSL OK - Certificate restbase1027-c valid until 2025-02-21 18:43:55 +0000 (expires in 627 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[08:56:55] (PS1) Btullis: Update the abuse filter wikireplica view rules [puppet] - https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426)
[09:02:35] (CR) Muehlenhoff: [C: +2] Point codfw URL downloader to new bullseye host [dns] - https://gerrit.wikimedia.org/r/926421 (https://phabricator.wikimedia.org/T329945) (owner: Muehlenhoff)
[09:03:44] (CR) Hashar: "Sorry for the spam, I screwed the order of my changes when rebasing :/" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:04:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:04:59] (PuppetDisabled) resolved: Puppet disabled on puppetmaster2004:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:06:12] (CR) Jcrespo: [C: +2] backup: Add cloudcontrol2004-dev to the list of backup jobs to ignore [puppet] - https://gerrit.wikimedia.org/r/927119 (owner: Jcrespo)
[09:06:33] (Abandoned) Lucas Werkmeister (WMDE): Add outreachwiki to wikidataclient.dblis [mediawiki-config] - https://gerrit.wikimedia.org/r/789976 (https://phabricator.wikimedia.org/T171140) (owner: Stang)
[09:07:42] (CR) Hashar: [C: -1] "PCC https://puppet-compiler.wmflabs.org/output/922555/1892/" [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:07:55] RECOVERY - puppet last run on puppetmaster2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:09:20] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:11:01] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:14:47] (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[09:17:11] (CR) Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: Stang)
[09:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:17:30] (CR) Jbond: contint: rename jenkins-slave to jenkins-agent (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:21:32] (CR) Jbond: [C: +2] contint: Jenkins slave > agent [puppet] - https://gerrit.wikimedia.org/r/922515 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:21:35] (CR) Jbond: [C: +2] contint: set Jenkins agent username from hiera [puppet] - https://gerrit.wikimedia.org/r/922554 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:22:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:25:40] (CR) Hashar: [C: -1] contint: rename jenkins-slave to jenkins-agent (1 comment) [puppet] - https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: Hashar)
[09:25:52] (PS1) Lucas Werkmeister (WMDE): Make outreachwiki a multilingual Wikidata client [mediawiki-config] - https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140)
[09:25:55] (CR) Stevemunene: [V: +1 C: +2] Add new stat1009 to the stat servers rsync hosts_allow [puppet] - https://gerrit.wikimedia.org/r/926422 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[09:26:54] (CR) Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: Stang)
[09:27:12] (Abandoned) Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase languageLinkSiteGroup [mediawiki-config] - https://gerrit.wikimedia.org/r/789979 (https://phabricator.wikimedia.org/T171140) (owner: Stang)
[09:27:15] (Abandoned) Lucas Werkmeister (WMDE): Add outreachwiki to Wikibase SpecialSiteLinkGroups [mediawiki-config] - https://gerrit.wikimedia.org/r/789977 (https://phabricator.wikimedia.org/T171140) (owner: Stang)
[09:27:35] (PS2) Muehlenhoff: bookworm: Change to deb822 format for sources.list [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[09:27:50] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[09:29:14] (CR) Jbond: "lgtm once the file is renamed (feel free to assume a +1)" [puppet] - https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[09:30:07] (CR) Muehlenhoff: "(Rebased in PS2, since PCC failed after afb46a8742c4afe2a344790319e096e88dd36d57 was merged)" [puppet] - https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[09:31:25] _joe_: I
think T337649 needs some form of escalation. From a user perspective and for the relevant use case / user group it looks, in practical effect, as if "Commons is down". [09:31:26] T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 [09:31:47] <_joe_> xover: sorry, I disagree on that last statement. [09:32:02] Ok? [09:32:12] <_joe_> but while there is no reason to overdramatize, the situation is indeed not acceptable [09:32:23] <_joe_> sadly there's only one SRE team right now on the hook for thumbor [09:33:01] <_joe_> I was actually discussing this right now [09:33:14] <_joe_> so I'm not sure how much more escalation we can do [09:33:29] <_joe_> I mean how much escalation my team and I can do [09:33:52] (not overdramatizing: I'm just saying that the way the symptoms are presenting, that is what it will *look like* to those users that are affected and for the kinds of files that are affected) [09:34:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [09:34:57] (03CR) 10Hashar: "Puppet fails on releases1003.eqiad.wmnet with:" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [09:35:20] Yeah, I realise manpower is an issue. That's why I'm trying to wave the red flag. 
[09:36:04] (03CR) 10Hashar: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:37:51] <_joe_> xover: to be clear - I am fully sympathetic with the issues you're encountering, and we're trying to get at least some stopgaps in place [09:37:57] !log roll-restart thumbor in codfw - T337649 [09:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:01] T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 [09:38:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [09:38:56] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=thumbor.* [09:39:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [09:39:16] (03PS1) 10Btullis: "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 [09:39:38] (03CR) 10CI reject: [V: 04-1] "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (owner: 10Btullis) [09:39:53] (03CR) 10Vgutierrez: [C: 03+1] "looks good, please proceed with caution ;P" [puppet] - 10https://gerrit.wikimedia.org/r/924080 (owner: 10Giuseppe Lavagetto) [09:39:53] !log roll-restart thumbor in eqiad - T337649 [09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:56] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [09:40:28] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10SalimJah) Hi Aklapper. Thanks for coming back to us. We are actually working to complete a research project that leverages 10 years worth of en:wiki data, documented here:... 
[09:41:04] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [09:44:08] (03PS2) 10Btullis: "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) [09:44:31] (03CR) 10CI reject: [V: 04-1] "Add an extra property 'CollectMode' to each user's jupyter service"" [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [09:45:48] (03PS3) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) [09:46:10] (03CR) 10CI reject: [V: 04-1] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [09:48:11] (03PS12) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [09:48:13] (03PS10) 10Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749718 [09:48:15] (03PS3) 10Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920213 [09:50:04] (03PS1) 10Muehlenhoff: Revert "Point codfw URL downloader to new bullseye host" [dns] - 10https://gerrit.wikimedia.org/r/927129 [09:50:24] (03PS1) 10Muehlenhoff: Remove option to manage sources.list [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) [09:51:53] (03CR) 10Jelto: "For me two topics are not yet resolved:" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [09:51:56] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Point codfw URL downloader to new bullseye host" [dns] - 10https://gerrit.wikimedia.org/r/927129 (owner: 
10Muehlenhoff) [09:52:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:55:13] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1000) [10:00:59] (03PS4) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) [10:02:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [10:05:20] <_joe_> xover: can you confirm things are "better" now? [10:05:32] _joe_: Will test. [10:06:50] !log truncate xff.log and JobExecutor.log on mwlog1002 to reclaim space - T338127 [10:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:53] T338127: log rotation stopped on mwlog for all files but "api.log" - https://phabricator.wikimedia.org/T338127 [10:06:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:07:59] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:26] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [10:08:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:08:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:09:31] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:09:41] RECOVERY - Disk space on mwlog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwlog1002&var-datasource=eqiad+prometheus/ops [10:09:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [10:11:11] (03CR) 10Jbond: contint: rename jenkins-slave to jenkins-agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [10:11:31] !log installing openssl security updates on Bullseye [10:11:31] _joe_: Seems better. Retesting previous files no thumbs failed and thumbs loaded in about 10s total. Testing not-previously-tested showed intermittent 429 (first thumb requested for the file, at 500px) and ~15s load time for subsequent thumbs. Testing only a very limited number of files. 
[10:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:12:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927130 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [10:13:03] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts - aborrero@cumin1001" [10:13:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:14:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts - aborrero@cumin1001" [10:14:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:15:01] 10SRE, 10serviceops-radar: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Ladsgroup) Good question: Commons, arwiki, use the default and the rest don't ` 'enwiki' => 'uca-default-u-kn', // T136150 'ruwiktionary' => 'uca-ru', 'frwiki' => 'uca-fr-u-kn', // T56680, T146675 'fawik... 
[10:15:49] (03CR) 10Hashar: releases: clone repos/releng/release from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [10:17:36] (03CR) 10Jcrespo: [C: 03+2] "Another backup failing ^" [puppet] - 10https://gerrit.wikimedia.org/r/927119 (owner: 10Jcrespo) [10:22:19] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:23:51] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:26:53] (03CR) 10Jbond: [C: 03+1] "LGTM see nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez) [10:26:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:28:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:28:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:29:25] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas [10:30:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:31:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas [10:31:43] (03PS15) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [10:33:54] (03PS16) 10Jbond: profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) [10:33:56] (03PS5) 10Jbond: base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) [10:33:58] (03PS15) 10Jbond: firewall: add basic firewall class [puppet] - 10https://gerrit.wikimedia.org/r/919061 [10:34:00] (03PS17) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) [10:34:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:34:39] (03CR) 10Jbond: profile::base::firewall: move to profile::firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:35:38] (03CR) 10CI reject: [V: 04-1] firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:36:03] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:38:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:42:31] (03PS1) 10Ladsgroup: Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) [10:45:24] jouncebot: nowandnext [10:45:24] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1000) [10:45:24] In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300) [10:46:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: also match mobile domains in mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/924080 (owner: 10Giuseppe Lavagetto) [10:47:51] (03PS1) 10Jelto: gitlab: run four backups per day [puppet] - 10https://gerrit.wikimedia.org/r/927139 (https://phabricator.wikimedia.org/T316935) [10:48:06] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) > What kind of secret do we need to add to private puppet for the new OIDC GitLab client? you need to copy the secret from `... 
[10:49:28] (03CR) 10Jbond: [C: 03+2] profile::base::firewall: move to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:55:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:58:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:00:07] (03PS1) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) [11:06:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:07:46] (03PS1) 10Jbond: wmcs::firewall: use profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927143 (https://phabricator.wikimedia.org/T279683) [11:07:53] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:08:23] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [11:11:24] (03CR) 10Gmodena: "The change LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [11:11:51] PROBLEM - Check systemd state on ml-serve2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:13] !log bounced ferm on ml-serve2006 (race caused by firewall profile change) [11:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:27] RECOVERY - Check systemd state on ml-serve2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:50] thanks moritzm [11:14:47] (03CR) 10Jbond: [C: 03+2] wmcs::firewall: use profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927143 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:15:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [11:16:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:19:17] 10SRE, 10Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (10Pereibri) is there a way to have this map hosted somewhere else like google?
http://maps.wikimedia.org/osm-intl/%7Bz%7D/%7Bx%7D/%7By%7D.png [11:21:02] !log restarting Exim on MXes to pick up OpenSSL updates [11:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] (03CR) 10Jbond: [C: 03+2] base::firewall: remove the old firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/922816 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:22:48] (03PS1) 10KartikMistry: Update MinT to 2023-06-05-111431-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337708) [11:29:01] (03CR) 10Jbond: [C: 03+2] service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:29:11] (03PS3) 10Jbond: service::catalog: Add puppetboard-next service for puppet7 migration [puppet] - 10https://gerrit.wikimedia.org/r/925846 (https://phabricator.wikimedia.org/T330490) [11:31:22] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [11:32:12] (03PS2) 10Jbond: puppetboard-next: add a new name for the puppet7 migration [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490) [11:32:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:53] (03CR) 10Muehlenhoff: firewall: add basic firewall class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [11:37:46] (ConfdResourceFailed) firing: (4) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd 
- https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:37:58] this is me ^^ [11:38:05] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: api: cleanup unused network constant [puppet] - 10https://gerrit.wikimedia.org/r/927161 [11:39:11] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard-next [11:42:46] (ConfdResourceFailed) firing: (6) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:44:04] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/927161/41533/" [puppet] - 10https://gerrit.wikimedia.org/r/927161 (owner: 10Arturo Borrero Gonzalez) [11:45:17] (03PS2) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125) [11:45:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) [11:45:59] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge [11:46:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge [11:47:55] (03CR) 10Jbond: [C: 03+2] puppetboard-next: add a new name for the puppet7 migration [dns] - 10https://gerrit.wikimedia.org/r/925845 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:49:20] (03CR) 10Btullis: [C: 03+2] Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/926859 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [11:52:51] (03PS1) 10Jbond: firewall: update copmments to 
mention profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927163 [11:55:28] (03PS1) 10Jbond: conftool-data: also add service to conftool [puppet] - 10https://gerrit.wikimedia.org/r/927164 (https://phabricator.wikimedia.org/T330490) [11:55:50] (03CR) 10Jbond: [C: 03+2] firewall: update copmments to mention profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/927163 (owner: 10Jbond) [11:59:31] (03PS1) 10Jbond: profile::firewall: add missing keys [puppet] - 10https://gerrit.wikimedia.org/r/927165 [12:00:02] jouncebot: next [12:00:02] In 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300) [12:00:05] (03PS3) 10Arturo Borrero Gonzalez: cloudvirt: codfw1dev: add cloud_private_subnet [puppet] - 10https://gerrit.wikimedia.org/r/927131 (https://phabricator.wikimedia.org/T338125) [12:00:07] (03PS3) 10Arturo Borrero Gonzalez: openstack: rabbitmq: simplify cloud-private-subnet firewalling support [puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) [12:00:40] (03CR) 10Jbond: [C: 03+2] profile::firewall: add missing keys [puppet] - 10https://gerrit.wikimedia.org/r/927165 (owner: 10Jbond) [12:01:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41535/console" [puppet] - 10https://gerrit.wikimedia.org/r/927165 (owner: 10Jbond) [12:04:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:05:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:43] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:07:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.413 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:56] (03PS1) 10Jbond: drop globale acl's from cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/927167 [12:08:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] drop globale acl's from cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/927167 (owner: 10Jbond) [12:08:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50136 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:08:17] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:09:22] (03CR) 10Jbond: [C: 03+2] conftool-data: also add service to conftool [puppet] - 10https://gerrit.wikimedia.org/r/927164 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:10:01] jouncebot: nowandnext [12:10:01] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [12:10:01] In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300) [12:10:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:59] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:12:46] (ConfdResourceFailed) resolved: (6) confd resource _var_lib_gdnsd_discovery-puppetboard-next.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:15:06] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=puppetboard-next [12:15:55] !log lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703 [12:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:58] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [12:17:05] (03CR) 10BBlack: [C: 03+2] pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [12:17:08] !log creating a copy of db1157 binlogs on dbprov1004 T338128 [12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:10] T338128: Recovery text table in a couple of wikis - https://phabricator.wikimedia.org/T338128 [12:18:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "FYI this results in a relaxation of the firewall. But I don't think is very relevant. We control all IP addresses in the supernet." 
[puppet] - 10https://gerrit.wikimedia.org/r/927140 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [12:19:51] (03PS2) 10Btullis: Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) [12:20:15] (03CR) 10CI reject: [V: 04-1] Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) (owner: 10Btullis) [12:22:03] jynus: don't know if it's related but all of s3 seems to be lagged now [12:22:25] https://orchestrator.wikimedia.org/web/cluster/alias/s3 [12:22:34] ok, I was going to kill the transfer, but it finished [12:22:58] it seems it caused some network issues [12:23:02] should be gone now [12:23:15] ah, okay [12:23:18] thanks [12:23:39] (03PS1) 10BBlack: pybal: configure advertised_instrumentation_ips [puppet] - 10https://gerrit.wikimedia.org/r/927168 (https://phabricator.wikimedia.org/T334703) [12:23:41] I will check those hosts, maybe their replication connection broke [12:25:19] a spike of errors for a couple of minutes [12:25:31] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:07] is puppet compiler borked? https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41538/console [12:27:17] (03CR) 10BBlack: [C: 03+2] pybal: configure advertised_instrumentation_ips [puppet] - 10https://gerrit.wikimedia.org/r/927168 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [12:30:07] (03PS2) 10Arturo Borrero Gonzalez: interface::route: add persist option [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) [12:30:51] (03CR) 10Arturo Borrero Gonzalez: "Thanks for the review!"
[puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez) [12:31:26] jouncebot: nowandnext [12:31:26] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [12:31:26] In 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300) [12:32:30] (03PS3) 10Btullis: Fix the script to install the spark3 yarn shuffler jar symlink [puppet] - 10https://gerrit.wikimedia.org/r/922585 (https://phabricator.wikimedia.org/T332765) [12:32:43] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:35:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:35:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:35:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:36:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:38:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:38:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:38:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:07] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [12:39:10] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48708 and previous config saved to /var/cache/conftool/dbconfig/20230605-123915-ladsgroup.json [12:39:18] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10WMF-General-or-Unknown, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10MatthewVernon) [12:39:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48709 and previous config saved to /var/cache/conftool/dbconfig/20230605-124124-ladsgroup.json [12:41:29] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) Thank you for the supporting links. Having discussed this internally with other Foundation staff, there seems to... 
[12:42:22] (03PS1) 10Jbond: trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) [12:42:33] (03CR) 10CI reject: [V: 04-1] trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:42:50] (03PS2) 10Jbond: trafficserver::backend: Add a cache config for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) [12:43:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [12:44:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [12:44:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [12:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48710 and previous config saved to /var/cache/conftool/dbconfig/20230605-124444-ladsgroup.json [12:44:58] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [12:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [12:51:46] !log killed prioritizeFilesWithTemplate.php, stopping depool maint. 
[12:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:59] cc matthiasmullie and cormacparle ^ [12:52:06] please add waitForReplication [12:52:20] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [12:52:39] 10SRE, 10Maps: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083 (10MatthewVernon) 05Open→03Declined I'm afraid that "Wikimedia Maps may not be used by third-party services outside of the Wikimedia projects." (see [[ https://foundation.wikimedia.or... [12:54:53] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:25] ehh [12:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P48711 and previous config saved to /var/cache/conftool/dbconfig/20230605-125630-ladsgroup.json [12:56:45] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10MatthewVernon) @Isaac I'm afraid I'm still a bit confused as to what access is needed here (and/or which data set is being referred to); can you help, since I gather you dire... 
[12:56:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [12:57:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48712 and previous config saved to /var/cache/conftool/dbconfig/20230605-125754-ladsgroup.json [12:59:46] Amir1: ok, caught up; that's a cron script in MachineVision - will look into it [13:00:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10Ladsgroup) I sponsored Zabe for production access and I can sponsor him for access to the analytics private data (without kerberos) as well. In reality it doesn't change any ac... [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] o/ [13:00:23] Lucas_WMDE: I assume you will self-deploy? [13:00:26] thanks. 
It needs a "$this->waitForReplication()" somewhere [13:00:26] o/ [13:00:27] yup :) [13:02:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:02:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:02:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48713 and previous config saved to /var/cache/conftool/dbconfig/20230605-130228-ladsgroup.json [13:02:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:02:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140) (owner: 10Lucas Werkmeister (WMDE)) [13:03:01] let’s see if it works as expected [13:03:44] (03Merged) 10jenkins-bot: Make outreachwiki a multilingual Wikidata client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927127 (https://phabricator.wikimedia.org/T171140) (owner: 10Lucas Werkmeister (WMDE)) [13:04:00] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]] [13:04:03] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [13:05:25] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:05:32] let’s se [13:05:33] *see [13:07:19] https://www.wikidata.org/w/index.php?title=Q4115189&diff=prev&oldid=1908813124 works [13:07:44] 
https://outreach.wikimedia.org/w/index.php?title=Wikimedia:Sandbox&diff=prev&oldid=250097 also works [13:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48714 and previous config saved to /var/cache/conftool/dbconfig/20230605-130753-ladsgroup.json [13:07:57] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:08:23] language links on the outreachwiki page also look correct to me [13:08:27] good to go, I’ll sync [13:09:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Zabe - https://phabricator.wikimedia.org/T337703 (10Ottomata) K, thanks @ladsgroup! Approved. In this case I think its okay to skip the MOU/expiry, since Zabe has shell access for other reasons anyway, and doesn't have an exp... [13:09:45] !log lvs4* (ulsfo) - restart pybal for T334703 IPs [13:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:48] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [13:10:01] (03PS1) 10Muehlenhoff: Add Cumin alias for new cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/927177 [13:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P48715 and previous config saved to /var/cache/conftool/dbconfig/20230605-131136-ladsgroup.json [13:13:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P48716 and previous config saved to /var/cache/conftool/dbconfig/20230605-131301-ladsgroup.json [13:13:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[13:14:07] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:927127|Make outreachwiki a multilingual Wikidata client (T171140)]] (duration: 10m 06s) [13:14:10] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [13:14:15] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:14:31] anything else to deploy? [13:14:53] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:15:05] (03CR) 10Btullis: [C: 03+1] "Many thanks." [puppet] - 10https://gerrit.wikimedia.org/r/927177 (owner: 10Muehlenhoff) [13:15:29] !log lvs6* (drmrs) - restart pybal for T334703 IPs [13:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:32] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [13:17:23] !log UTC afternoon backport+config window done [13:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:08] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for new cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/927177 (owner: 10Muehlenhoff) [13:19:16] (03PS1) 10Muehlenhoff: Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 [13:19:21] !log lvs5* (eqsin) - restart pybal for T334703 IPs [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] (03CR) 10CI reject: [V: 04-1] Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [13:21:13] (03PS1) 10Hashar: rake_modules: apply early monkey patches earlier [puppet] - 
10https://gerrit.wikimedia.org/r/927181 [13:21:24] (03CR) 10CI reject: [V: 04-1] rake_modules: apply early monkey patches earlier [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar) [13:21:44] ... [13:22:06] (03PS1) 10Muehlenhoff: Fix role name in alias [puppet] - 10https://gerrit.wikimedia.org/r/927184 [13:22:29] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. [deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [13:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P48717 and previous config saved to /var/cache/conftool/dbconfig/20230605-132259-ladsgroup.json [13:23:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:14] (03PS2) 10Muehlenhoff: Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 [13:23:27] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - enable upgradeMode: savepoint, and take periodic savepoints. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/926601 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [13:25:00] !log lvs3* (esams) - restart pybal for T334703 IPs [13:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:04] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [13:25:54] !log Restarted Zuul due to stall ssh connection # T309376 [13:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] T309376: Zuul jenkins-bot user holding open SSH sessions - https://phabricator.wikimedia.org/T309376 [13:26:00] (03CR) 10Muehlenhoff: [C: 03+2] Fix role name in alias [puppet] - 10https://gerrit.wikimedia.org/r/927184 (owner: 10Muehlenhoff) [13:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T336886)', diff saved to https://phabricator.wikimedia.org/P48718 and previous config saved to /var/cache/conftool/dbconfig/20230605-132642-ladsgroup.json [13:26:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:26:45] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:26:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:27:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48719 and previous config saved to /var/cache/conftool/dbconfig/20230605-132703-ladsgroup.json [13:27:56] (03CR) 10Hashar: "recheck due to T309376" [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar) [13:28:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P48720 and previous config saved to /var/cache/conftool/dbconfig/20230605-132807-ladsgroup.json [13:28:53] (03CR) 10Hashar: "recheck cause I had to restart Zuul due to T309376" [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [13:29:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48721 and previous config saved to /var/cache/conftool/dbconfig/20230605-132911-ladsgroup.json [13:29:27] !log bblack@deploy1002 Locking from deployment [ALL REPOSITORIES]: temporary lock for LVS resarts in core DCs [13:29:56] !log lvs2* (codfw) - restart pybal for T334703 IPs [13:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:15] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: remove leftover role hieradata [puppet] - 10https://gerrit.wikimedia.org/r/909714 (owner: 10Majavah) [13:32:31] !log lvs1* (eqiad) - restart pybal for T334703 IPs [13:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [13:32:44] jouncebot: nowandnext [13:32:44] For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1300) [13:32:44] In 0 hour(s) and 27 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1400) [13:33:13] (03CR) 10Hashar: "And another follow up since 
`require 'puppet'` invokes `URI.escape` which triggers a ruby warning:" [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [13:33:39] (03CR) 10Hashar: "And another follow up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/889990 since `require 'puppet'` invokes `URI.escape` which t" [puppet] - 10https://gerrit.wikimedia.org/r/922565 (owner: 10Hashar) [13:33:47] sukhe: Lucas_WMDE has finished the backport window as an fyi [13:34:01] RhinosF1: thanks! bblack has the lock already so I will take it from him [13:34:13] with the changes he is rolling out, no more locking required for LVS work anyway, so that's good :) [13:35:21] !log bblack@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: temporary lock for LVS resarts in core DCs (duration: 05m 54s) [13:35:21] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 [13:35:25] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [13:35:52] hmm older message, let's be correct [13:35:53] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 (duration: 01m 06s) [13:36:02] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T326767 [13:36:05] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [13:36:23] (03CR) 10Jbond: [C: 03+1] interface::route: add persist option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez) [13:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P48722 and previous config saved to /var/cache/conftool/dbconfig/20230605-133805-ladsgroup.json [13:38:11] 10SRE, 10ops-eqiad: Move two GPUs 
from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) @elukey i am available any day this week except Thursday if you are available [13:39:33] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) @Jclark-ctr Thanks! I have time today and tomorrow in my afternoon, lemme know what time works best for you! [13:39:57] (03CR) 10Hashar: "I have no idea what the monkey patch is exactly doing or what kind of side effect this can have, but that surely mutes the warning when do" [puppet] - 10https://gerrit.wikimedia.org/r/927181 (owner: 10Hashar) [13:40:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:08] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:41:16] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335845)', diff saved to https://phabricator.wikimedia.org/P48723 and previous config saved to /var/cache/conftool/dbconfig/20230605-134313-ladsgroup.json [13:43:52] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Host under maintenance [13:44:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1002.eqiad.wmnet with reason: Host under maintenance [13:44:14] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b4799674-ad70-4117-a653-cdeaad02c246) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and 
their services with reason: Host under maintenanc... [13:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P48724 and previous config saved to /var/cache/conftool/dbconfig/20230605-134418-ladsgroup.json [13:44:54] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance [13:45:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance [13:45:13] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43b4a369-edbc-4df6-b931-f35757b38bf1) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenanc... [13:45:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:17] !log installing python-ipaddress security updates [13:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:41] (03PS1) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) [13:48:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:48:52] (03PS2) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) [13:49:27] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:50:12] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Isaac) I think my name is being brought up based on [[https://lists.wikimedia.org/hyperkitty/list/wiki-research-l@lists.wikimedia.org/thread/MWXIGG3F7UXIWXYJWH3X47NWWQLGSJWF/... [13:52:28] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) Thanks @jbond for the help! I added the secret to `profile::gitlab::omniauth_providers` in private puppet. After that puppet cre... 
[13:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T336886)', diff saved to https://phabricator.wikimedia.org/P48725 and previous config saved to /var/cache/conftool/dbconfig/20230605-135311-ladsgroup.json [13:53:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:53:16] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48726 and previous config saved to /var/cache/conftool/dbconfig/20230605-135332-ladsgroup.json [13:53:36] (03PS1) 10Hashar: rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194 [13:54:08] (03PS1) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) [13:54:12] (03CR) 10CI reject: [V: 04-1] rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194 (owner: 10Hashar) [13:55:00] (03PS1) 10Filippo Giunchedi: profile: exclude kubelet hosts from cadvisor rollout [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) [13:55:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41539/console" [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey) [13:56:08] (03PS2) 10Hashar: rake_modules: early monkey patch URI.unescape [puppet] - 10https://gerrit.wikimedia.org/r/927194 [13:56:10] (03CR) 10Gmodena: [C: 03+1] Add initial stream configs for Android 
article events using Metrics Platform Java client library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [13:56:56] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:57:07] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:57:18] (03CR) 10Gmodena: [C: 03+1] "CCing Andrew since this change will impact eventgate-analytics-external." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [13:57:20] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41540/console" [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:57:48] (03PS2) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) [13:58:15] (03CR) 10Filippo Giunchedi: "Untested but LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [13:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48727 and previous config saved to /var/cache/conftool/dbconfig/20230605-135859-ladsgroup.json [13:59:03] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:59:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P48728 and previous config saved to /var/cache/conftool/dbconfig/20230605-135924-ladsgroup.json [13:59:49] (03CR) 10Clare Ming: Add initial stream configs for Android article events using 
Metrics Platform Java client library (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [13:59:51] (03PS1) 10BBlack: wikidata maxlag maint script: use new pybal VIPs [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) [14:00:05] sukhe: It is that lovely time of the day again! You are hereby commanded to deploy LVS maintenance. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1400). [14:01:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:01:25] (03PS3) 10Elukey: Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) [14:02:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41543/console" [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey) [14:03:01] 10SRE, 10ops-eqiad, 10Patch-For-Review: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) Removed gpu from dse-k8s-worker1002 installed gpu into ml-serve1001 [14:04:17] (03CR) 10Gmodena: [C: 03+1] Add initial stream configs for Android article events using Metrics Platform Java client library (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [14:05:33] (03CR) 10Ottomata: [C: 03+1] varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:05:40] (03CR) 
10Ottomata: [C: 03+1] Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:05:49] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:36] BGP and Pybal alerts in codfw expected [14:06:41] ack [14:07:07] (03CR) 10Btullis: "This looks like it would work, but I wonder if it wouldn't be cleaner to use a systemd target to group all of the instances together, as o" [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:07:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:51] (03CR) 10Klausman: [C: 03+1] Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey) [14:08:42] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:08:55] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:09:07] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete sre.o11y.roll-restart-reboot-thanos-fe cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [14:09:16] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha 
- https://phabricator.wikimedia.org/T337857 (10SalimJah) @Isaac: thanks for your reply. You are correct. We are reacting to this suggestion in the thread you mention, which we thought looked very efficient for our purpo... [14:10:05] (03CR) 10Elukey: varnishkafka: add catch all systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:10:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:10:25] PROBLEM - pybal on lvs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:10:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:13] ^ expected [14:11:24] (03PS4) 10AikoChou: changeprop: allow match_not in match_config for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) [14:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P48729 and previous config saved to /var/cache/conftool/dbconfig/20230605-141405-ladsgroup.json [14:14:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T336886)', diff saved to https://phabricator.wikimedia.org/P48730 and previous config saved to /var/cache/conftool/dbconfig/20230605-141430-ladsgroup.json [14:14:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:14:33] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:14:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:14:49] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [14:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48731 and previous config saved to /var/cache/conftool/dbconfig/20230605-141451-ladsgroup.json [14:15:25] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:15:33] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:15:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48732 and previous config saved to /var/cache/conftool/dbconfig/20230605-141559-ladsgroup.json [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:41] (03PS3) 10Herron: mwlog: fix mw-log logrotate glob [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) [14:18:00] (03CR) 10Herron: mwlog: fix mw-log logrotate glob (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [14:18:33] (03PS1) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 [14:18:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:48] (03CR) 10Filippo Giunchedi: [C: 
03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [14:19:06] (03CR) 10Herron: [C: 03+2] "thx for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/927187 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [14:19:17] (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:20:16] (03CR) 10AikoChou: changeprop: allow match_not in match_config for liftwing (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:20:20] (03PS1) 10Ssingh: lvs2009: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/927206 (https://phabricator.wikimedia.org/T335777) [14:20:49] (03CR) 10MVernon: [C: 03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/927180 (owner: 10Muehlenhoff) [14:21:17] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/927208 (https://phabricator.wikimedia.org/T335777) [14:21:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] interface::route: add persist option [puppet] - 10https://gerrit.wikimedia.org/r/926433 (https://phabricator.wikimedia.org/T337758) (owner: 10Arturo Borrero Gonzalez) [14:22:40] (03CR) 10Elukey: [C: 03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:24:33] (03CR) 10Elukey: [C: 03+1] "Added Kamila since Hugh is out of the office :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925852 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:25:38] (03CR) 10Elukey: [V: 03+1 C: 03+2] Update kubernetes nodes with GPU settings [puppet] - 10https://gerrit.wikimedia.org/r/927197 (https://phabricator.wikimedia.org/T335031) (owner: 10Elukey) [14:27:03] (03PS3) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 [14:27:30] (03PS2) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 [14:27:53] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: persist static routes [puppet] - 10https://gerrit.wikimedia.org/r/927210 (https://phabricator.wikimedia.org/T337758) [14:27:55] (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:28:53] !log codfw low-traffic LVS: set routing-options static route 10.2.1.0/24 next-hop 10.192.49.7 [14:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] (03CR) 10Muehlenhoff: [C: 03+2] Cloud: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/925721 (owner: 10Muehlenhoff) [14:29:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P48733 and previous config saved to /var/cache/conftool/dbconfig/20230605-142911-ladsgroup.json [14:30:15] (03PS1) 10Elukey: role::ml_k8s::worker: set nodes as k8s nodes for the gpu profile [puppet] - 10https://gerrit.wikimedia.org/r/927213 [14:30:35] (03PS3) 10Jbond: interface::route: Make interface mandatory [puppet] - 
10https://gerrit.wikimedia.org/r/927204 [14:31:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P48734 and previous config saved to /var/cache/conftool/dbconfig/20230605-143105-ladsgroup.json [14:31:24] (03PS1) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/927214 (https://phabricator.wikimedia.org/T335777) [14:32:14] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: set nodes as k8s nodes for the gpu profile [puppet] - 10https://gerrit.wikimedia.org/r/927213 (owner: 10Elukey) [14:32:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2009.codfw.wmnet [14:33:08] (03CR) 10CI reject: [V: 04-1] interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:33:22] (03CR) 10Ssingh: [C: 03+1] Allow HTTP PATCH requests on "beta" sites [puppet] - 10https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: 10WMDE-leszek) [14:35:50] (03PS1) 10Muehlenhoff: Remove more cloud stretch support [puppet] - 10https://gerrit.wikimedia.org/r/927215 [14:40:32] (03PS1) 10Klausman: Add rate limiting class for high-traffic internal services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) [14:41:32] (03PS1) 10Ottomata: mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303) [14:41:55] (03PS2) 10Ottomata: mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303) [14:42:38] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:43:38] (03CR) 10Arturo Borrero Gonzalez: "We need to double check that this existing client 
modules/profile/manifests/cloudceph/osd.pp doesn't rely on this default semantic of not " [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 10Jbond) [14:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T336886)', diff saved to https://phabricator.wikimedia.org/P48735 and previous config saved to /var/cache/conftool/dbconfig/20230605-144417-ladsgroup.json [14:44:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [14:44:21] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:44:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [14:44:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48736 and previous config saved to /var/cache/conftool/dbconfig/20230605-144438-ladsgroup.json [14:44:58] (03PS1) 10Ottomata: Remove dse mediawiki-page-content-change-enrichment and stream-enrichment-poc ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/927224 (https://phabricator.wikimedia.org/T325303) [14:45:03] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:03] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P48737 and previous config saved to /var/cache/conftool/dbconfig/20230605-144611-ladsgroup.json [14:47:02] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:47:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2009.codfw.wmnet [14:47:13] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2009.codfw.wmnet` - lvs2009.codfw.wmnet (**WARN**) - Downtimed ho... [14:47:27] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/927208 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:48:09] (03CR) 10Ssingh: [C: 03+2] lvs2009: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/927206 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:48:55] !log homer "cr*-codfw*" commit "Gerrit: 927208 remove decommissioned host lvs2009": T335777 [14:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 [14:49:14] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - use kafka at least once delivery guarantee [deployment-charts] - 10https://gerrit.wikimedia.org/r/927219 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [14:49:45] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48738 and previous config saved to /var/cache/conftool/dbconfig/20230605-145003-ladsgroup.json [14:50:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:50:17] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:50:30] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:51:04] (03PS1) 10Muehlenhoff: Remove wmflib hack for logoutd scripts on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/927227 [14:52:05] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [14:52:18] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:54:21] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:05] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [14:55:12] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:55:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:57:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [14:58:25] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:58:43] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:59:05] (03CR) 10Volans: [C: 03+1] "LGTM, I don't think we need to cleanup those files manually either." 
[puppet] - 10https://gerrit.wikimedia.org/r/927227 (owner: 10Muehlenhoff) [14:59:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:33] (03CR) 10Muehlenhoff: Remove wmflib hack for logoutd scripts on Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927227 (owner: 10Muehlenhoff) [15:01:05] (03CR) 10Muehlenhoff: [C: 03+2] Remove wmflib hack for logoutd scripts on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/927227 (owner: 10Muehlenhoff) [15:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T336886)', diff saved to https://phabricator.wikimedia.org/P48739 and previous config saved to /var/cache/conftool/dbconfig/20230605-150117-ladsgroup.json [15:01:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:01:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:01:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:01:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48740 and previous config saved to /var/cache/conftool/dbconfig/20230605-150138-ladsgroup.json [15:02:36] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48741 and previous config saved to /var/cache/conftool/dbconfig/20230605-150347-ladsgroup.json [15:04:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:05:10] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48742 and previous config saved to /var/cache/conftool/dbconfig/20230605-150509-ladsgroup.json [15:05:16] !log installing avahi security updates [15:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:51] (03CR) 10Filippo Giunchedi: pyrra: initial packaging for v0.6.2 (031 comment) [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:06:01] (03PS2) 10BBlack: wikidata maxlag maint script: use new pybal VIPs [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) [15:06:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Setup DNS for lvs2013 - pt1979@cumin2002" [15:07:12] (03PS1) 10Ayounsi: Netbox/Netbox-next: disable public /metrics [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) [15:07:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Setup DNS for lvs2013 - pt1979@cumin2002" [15:07:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:42] (03CR) 10Ayounsi: "Manually tested on netbox-next." 
[puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi) [15:07:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2013.mgmt.codfw.wmnet with reboot policy FORCED [15:12:11] (03CR) 10BBlack: "PCC looks right (although it's a little confusing right now - normally there's 2x --lb here for 1019 + 2009, but 2009 is currently being d" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:12:24] (03CR) 10BBlack: "https://puppet-compiler.wmflabs.org/output/927200/41548/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:12:33] (03CR) 10BCornwall: "Looking at some older, similar commits shows that text_envoy.yaml and text_haproxy.yaml was also updated for these things. Indeed, it look" [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:14:12] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:44] (03CR) 10BBlack: [C: 03+2] "Pushing this now, because currently with the ongoing LVS replacement in T326767 , the wikidata maxlag calculation doesn't work at all beca" [puppet] - 10https://gerrit.wikimedia.org/r/927200 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:16:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [15:18:49] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T326767 (duration: 102m 46s) [15:18:49] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@674ec0a]: (no justification provided) [15:18:52] 
T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [15:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P48744 and previous config saved to /var/cache/conftool/dbconfig/20230605-151853-ladsgroup.json [15:19:01] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@674ec0a]: (no justification provided) (duration: 00m 17s) [15:19:26] !log installing debian-archive-keyring updates on bullseye hosts [15:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P48745 and previous config saved to /var/cache/conftool/dbconfig/20230605-152015-ladsgroup.json [15:24:23] (03CR) 10MVernon: [C: 03+1] "LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:24:35] (03CR) 10MVernon: [C: 03+1] "Also LGTM :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:24:50] (03CR) 10MVernon: [C: 03+1] "...likewise :-)" [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:26:24] (03PS3) 10JHathaway: java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) [15:27:01] (03CR) 10JHathaway: java: ensure wmf-certificates is installed, when required (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:27:12] !log on s3 master: update `text` set old_text = 'O:18:"historyblobcurstub":1:{s:6:"mCurId";i:5532;}', old_flags = 'object' where old_id= 14484; (T337700) [15:27:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:16] T337700: Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T337700 [15:29:28] (03CR) 10JHathaway: [C: 03+2] puppetserver: hiera type defs [puppet] - 10https://gerrit.wikimedia.org/r/925893 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:29:50] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1530). [15:30:24] (03CR) 10SBassett: [C: 03+1] "Looks correct in relation to what's on the bug." [puppet] - 10https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [15:30:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2013.mgmt.codfw.wmnet with reboot policy FORCED [15:31:49] (03PS1) 10Elukey: role::ml_k8s::worker: add more gpu settings [puppet] - 10https://gerrit.wikimedia.org/r/927232 [15:32:53] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: add more gpu settings [puppet] - 10https://gerrit.wikimedia.org/r/927232 (owner: 10Elukey) [15:33:17] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance [15:33:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve1001.eqiad.wmnet with reason: Host under maintenance [15:33:24] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2ef51d27-4384-414f-9fdf-8fe7b4c93b00) set by elukey@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: Host under maintenanc... 
[15:33:50] (03PS4) 10Jbond: interface::route: Make interface mandatory [puppet] - 10https://gerrit.wikimedia.org/r/927204 [15:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P48746 and previous config saved to /var/cache/conftool/dbconfig/20230605-153359-ladsgroup.json [15:34:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:35:09] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10Gehel) [15:35:14] (03CR) 10JHathaway: [C: 03+2] java: ensure wmf-certificates is installed, when required [puppet] - 10https://gerrit.wikimedia.org/r/925873 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T336886)', diff saved to https://phabricator.wikimedia.org/P48747 and previous config saved to /var/cache/conftool/dbconfig/20230605-153521-ladsgroup.json [15:35:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:35:25] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:35:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48748 and previous config saved to /var/cache/conftool/dbconfig/20230605-153542-ladsgroup.json [15:35:50] (03CR) 10Jbond: [C: 04-1] interface::route: Make interface mandatory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927204 (owner: 
10Jbond) [15:36:09] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2013'] [15:36:43] (03CR) 10Reedy: [C: 04-1] "This should be good to go when 1.41.0-wmf.11 is out and stable..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [15:37:00] (03PS1) 10Eigyan: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) [15:37:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2013'] [15:37:08] (03CR) 10CI reject: [V: 04-1] Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [15:37:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2013'] [15:40:35] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:41:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48749 and previous config saved to /var/cache/conftool/dbconfig/20230605-154110-ladsgroup.json [15:41:14] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:41:18] (03PS2) 10Krinkle: Fix oversample naming to match schema. [extensions/NavigationTiming] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917736 [15:41:36] (03Abandoned) 10Krinkle: Fix oversample naming to match schema. 
[extensions/NavigationTiming] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/917736 (owner: 10Krinkle) [15:44:54] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) 05Open→03Resolved I can confirm that the GPUs are working on ml-serve1001, thanks! [15:44:58] (03PS4) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) [15:46:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi) [15:46:42] (03PS2) 10Jkieserman: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [15:47:24] (03PS5) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) [15:49:04] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T336886)', diff saved to https://phabricator.wikimedia.org/P48750 and previous config saved to /var/cache/conftool/dbconfig/20230605-154905-ladsgroup.json [15:49:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:49:09] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:49:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48751 and previous 
config saved to /var/cache/conftool/dbconfig/20230605-154926-ladsgroup.json [15:50:04] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10Gehel) [15:51:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2013'] [15:51:30] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41549/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:51:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48752 and previous config saved to /var/cache/conftool/dbconfig/20230605-155134-ladsgroup.json [15:52:48] (03CR) 10Clément Goubert: [C: 03+1] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:10] (03CR) 10BBlack: [C: 03+2] safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:55:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2013.codfw.wmnet with OS bullseye [15:55:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye [15:56:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48753 and previous config saved to /var/cache/conftool/dbconfig/20230605-155617-ladsgroup.json [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:36] (03PS1) 10Elukey: admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583) [15:59:22] !log mw1419: manually executing a php restart to test new safe-service-restart [15:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:29] (03PS4) 10JHathaway: add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) [15:59:56] (03CR) 10Klausman: [C: 03+1] admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [16:00:15] (03CR) 10JHathaway: [C: 03+2] add container facts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:01:04] (03CR) 10JHathaway: [V: 03+2 C: 03+2] add container facts [puppet] - 10https://gerrit.wikimedia.org/r/925935 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:01:14] 10SRE, 10Traffic: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651 (10Krinkle) >>! 
In T86651#973435, @mark wrote: > FWIW: An alternative sh implementation that I've written for an old kernel and fixes some of these issues (a looong time ago), lives [[ http://svn.wikimedia.org/viewvc/mediawi... [16:02:10] (03CR) 10JHathaway: [C: 03+2] don't export resources when wmflib::have_puppetdb() is false [puppet] - 10https://gerrit.wikimedia.org/r/925968 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:02:26] (03CR) 10Elukey: [C: 03+2] admin_ng: add the ml-serve experimental namespace to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927235 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [16:03:04] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [16:03:38] (03PS1) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927236 (https://phabricator.wikimedia.org/T329366) [16:04:37] hello sukhe - mind if we tried merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/923427/ ? [16:05:05] (03CR) 10Ayounsi: [C: 03+2] Netbox/Netbox-next: disable public /metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927229 (https://phabricator.wikimedia.org/T309703) (owner: 10Ayounsi) [16:05:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:05:33] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:05:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:06:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:06:16] leszek_wmde: hello [16:06:18] yes, let's do it [16:06:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:06:33] (03CR) 10Btullis: [C: 03+2] Update the abuse filter wikireplica view rules [puppet] - 10https://gerrit.wikimedia.org/r/927120 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [16:06:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:06:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P48754 and previous config saved to /var/cache/conftool/dbconfig/20230605-160640-ladsgroup.json [16:06:42] sukhe: great! [16:07:30] (03CR) 10Ssingh: [C: 03+2] Allow HTTP PATCH requests on "beta" sites [puppet] - 10https://gerrit.wikimedia.org/r/923427 (https://phabricator.wikimedia.org/T336659) (owner: 10WMDE-leszek) [16:08:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P48755 and previous config saved to /var/cache/conftool/dbconfig/20230605-161123-ladsgroup.json [16:11:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:12:35] (03PS2) 10Klausman: Add rate limiting class for WME using LiftWing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927218 (https://phabricator.wikimedia.org/T338121) [16:14:10] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41551/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [16:16:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging 
- https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:16:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage [16:17:12] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10WMF-General-or-Unknown, and 4 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10sbassett) [16:18:28] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) Current Logstash SLO appears to be measuring the //number// of events encountered in lagged state. This SLO affords us... [16:18:38] (03PS1) 10Elukey: admin_ng: bump limits for ml-serve's experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/927237 [16:19:28] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [16:20:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2013.codfw.wmnet with reason: host reimage [16:21:04] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [16:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P48756 and previous config saved to /var/cache/conftool/dbconfig/20230605-162147-ladsgroup.json [16:21:56] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 
208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal [16:22:50] (03CR) 10Elukey: [C: 03+2] admin_ng: bump limits for ml-serve's experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/927237 (owner: 10Elukey) [16:23:46] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal [16:24:05] ^ replicas [16:26:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T336886)', diff saved to https://phabricator.wikimedia.org/P48757 and previous config saved to /var/cache/conftool/dbconfig/20230605-162629-ladsgroup.json [16:26:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:26:33] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:26:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [16:26:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:27:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:27:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48758 and previous config saved to /var/cache/conftool/dbconfig/20230605-162707-ladsgroup.json [16:27:36] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [16:33:38] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:24] (03PS1) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 [16:34:28] (03PS2) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 [16:35:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:35:32] (03PS1) 10CDanis: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) [16:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48759 and previous config saved to /var/cache/conftool/dbconfig/20230605-163545-ladsgroup.json [16:35:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:36:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T336886)', diff saved to https://phabricator.wikimedia.org/P48760 and previous config saved to /var/cache/conftool/dbconfig/20230605-163653-ladsgroup.json [16:36:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:37:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:37:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T336886)', diff saved to 
https://phabricator.wikimedia.org/P48761 and previous config saved to /var/cache/conftool/dbconfig/20230605-163714-ladsgroup.json [16:37:19] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [16:37:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:37:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2013.codfw.wmnet with OS bullseye [16:37:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2013.codfw.wmnet with OS bullseye completed... [16:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T336886)', diff saved to https://phabricator.wikimedia.org/P48762 and previous config saved to /var/cache/conftool/dbconfig/20230605-164423-ladsgroup.json [16:44:27] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:46:45] (03CR) 10Ottomata: [C: 03+1] "One nit, feel free to ignore. LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis) [16:49:34] (03PS1) 10Herron: add 0.6.2 ui/package.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240 [16:49:58] (03PS7) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) [16:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48763 and previous config saved to /var/cache/conftool/dbconfig/20230605-165051-ladsgroup.json [16:51:19] (03CR) 10CDanis: Enable user network probe events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis) [16:56:13] (03CR) 10BCornwall: [V: 03+1] "Note that lvs2010's PCC diff shows that one instance of mh will be reverted to sh. The temporary hack to enable mh had switched sh→mh on l" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [16:57:28] (03CR) 10EllenR: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [16:58:48] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:59:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P48764 and previous config saved to /var/cache/conftool/dbconfig/20230605-165929-ladsgroup.json [16:59:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [17:00:04] 
Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700) [17:00:04] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700). nyaa~ [17:00:26] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 (owner: 10Ottomata) [17:00:34] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10herron) a:05herron→03None [17:01:21] (03Merged) 10jenkins-bot: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926863 (owner: 10Ottomata) [17:02:16] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Papaul) [17:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P48765 and previous config saved to /var/cache/conftool/dbconfig/20230605-170557-ladsgroup.json [17:06:35] jouncebot: nowandnext [17:06:35] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700) [17:06:35] For the next 0 hour(s) and 23 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700) [17:06:35] In 2 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000) [17:09:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cdanis@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 
(https://phabricator.wikimedia.org/T332024) (owner: 10CDanis) [17:09:35] (03PS2) 10CDanis: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) [17:09:50] (03CR) 10TrainBranchBot: "Approved by cdanis@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis) [17:10:44] (03Merged) 10jenkins-bot: Enable user network probe events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927238 (https://phabricator.wikimedia.org/T332024) (owner: 10CDanis) [17:11:21] ottomata: were you just about to deploy your change to mediawiki-config .... ? [17:11:33] (03CR) 10Dzahn: releases: clone repos/releng/release from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [17:11:36] yes...trying to verify in beta first but it is taking longer than I expected [17:11:46] ack [17:11:49] it should be a no-op, but last week it wasn't (i believe because train hadn't been fully deployed) [17:12:00] i can try on a mwdebug host.. [17:12:30] !log cdanis@deploy1002 Backport cancelled. [17:13:07] oh, am I in the way of a deploy? [17:13:44] cdanis: let me revert again, didn't realize, i didn't see any changes listed, but I see now that for this window they don't need to be?
[17:14:08] ottomata: no I was sneaking in my patch during a quiet window :) [17:14:10] np [17:14:12] not exactly, I was sneaking my patch in [17:14:14] we just had the same idea at the same time [17:14:26] ha ok let me try real quick on mwdebug...if it doesn't work there i'll revert [17:14:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P48766 and previous config saved to /var/cache/conftool/dbconfig/20230605-171436-ladsgroup.json [17:16:32] looking fine, i'm proceeding with my deployment [17:16:36] ty! [17:17:12] (03PS2) 10Dzahn: trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) [17:17:16] (03PS3) 10Jkieserman: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [17:18:30] (03CR) 10Jkieserman: [C: 03+1] "Doesn't look like I have merge rights on this repo..."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [17:19:17] (03PS3) 10Dzahn: trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) [17:19:59] (03CR) 10Ssingh: [V: 03+1 C: 03+1] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [17:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T336886)', diff saved to https://phabricator.wikimedia.org/P48767 and previous config saved to /var/cache/conftool/dbconfig/20230605-172103-ladsgroup.json [17:21:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:21:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:21:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [17:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48768 and previous config saved to /var/cache/conftool/dbconfig/20230605-172124-ladsgroup.json [17:23:26] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:24:41] btw ottomata next time you should try the new `scap backport` command :) https://phabricator.wikimedia.org/phame/post/view/297/scap_backport_makes_deployments_easy/ [17:24:54] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:25:34] (KubernetesAPILatency) firing: 
High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:26:33] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: no-op: Remove undeeded wgEventBusStreamNamesMap override setting (take 2) - T336817 (duration: 09m 25s) [17:26:36] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [17:26:59] !log cdanis@deploy1002 Backport cancelled. [17:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48769 and previous config saved to /var/cache/conftool/dbconfig/20230605-172700-ladsgroup.json [17:27:04] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:27:34] jouncebot: nowandnext [17:27:35] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700) [17:27:35] For the next 0 hour(s) and 2 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T1700) [17:27:35] In 2 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000) [17:28:14] !log cdanis@deploy1002 Started scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] [17:28:17] T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 [17:28:21] (03CR) 10Ssingh: [C: 03+1] varnish: remove/adjust rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [17:29:02] are deployments done? [17:29:15] sukhe: I have one running now [17:29:21] cdanis: ok! 
thanks [17:29:24] it should be a no-op though [17:29:30] ok [17:29:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T336886)', diff saved to https://phabricator.wikimedia.org/P48770 and previous config saved to /var/cache/conftool/dbconfig/20230605-172942-ladsgroup.json [17:29:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:29:56] cdanis: not urgent at all on our side [17:29:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:30:01] !log cdanis@deploy1002 cdanis: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [17:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48771 and previous config saved to /var/cache/conftool/dbconfig/20230605-173002-ladsgroup.json [17:30:24] (03CR) 10Dzahn: "I will amend to it to switch it to the "insetup" role. That way gerrit role can be removed before decom cookbook destroyed server. 
@hashar" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [17:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48772 and previous config saved to /var/cache/conftool/dbconfig/20230605-173356-ladsgroup.json [17:34:00] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:36:32] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:38:16] !log cdanis@deploy1002 Finished scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] (duration: 10m 02s) [17:38:19] T332024: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 [17:39:28] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) a:03colewhite [17:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48773 and previous config saved to /var/cache/conftool/dbconfig/20230605-174206-ladsgroup.json [17:42:44] RECOVERY - Check systemd state on cumin2002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:36] (03PS2) 10Herron: add 0.6.2 ui/package*.json [debs/pyrra] - 10https://gerrit.wikimedia.org/r/927240 [17:46:48] (03PS8) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) [17:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:47:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:49:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P48774 and previous config saved to /var/cache/conftool/dbconfig/20230605-174902-ladsgroup.json [17:49:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:03] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: no-op: Remove unused page_change rc streams - T336817 (duration: 20m 11s) [17:50:06] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [17:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:54:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:51] btullis: https://phabricator.wikimedia.org/T338172 seems related to the wiki replica view changes [17:56:13] (03CR) 10Dzahn: "oh right, once I put it back in "setup" role it will also remove shell access except for global roots, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [17:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P48775 and previous config saved to /var/cache/conftool/dbconfig/20230605-175712-ladsgroup.json [17:57:33] (03PS1) 10Ssingh: Revert "Allow HTTP PATCH requests on "beta" sites" [puppet] - 10https://gerrit.wikimedia.org/r/926864 [17:58:26] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [17:58:43] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [17:59:31] (03PS3) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) [18:00:54] (03CR) 10Dzahn: "ok with your shell access being removed at this point? 
data is copied to gerrit1003 and bacula, as long as it's under /srv/gerrit or /var/" [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:03:32] (03CR) 10Dzahn: [C: 04-2] "not yet, this will happen after it is disabled in trafficserver for some time" [puppet] - 10https://gerrit.wikimedia.org/r/926606 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P48776 and previous config saved to /var/cache/conftool/dbconfig/20230605-180408-ladsgroup.json [18:04:12] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) @DAlangi_WMF we talked about this the other day, can you sahre your... [18:04:27] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) a:03DAlangi_WMF [18:09:35] (03CR) 10Jforrester: [C: 03+2] Declare Metrics Platform stream for wikifunctionswiki on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922569 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [18:10:20] (03Merged) 10jenkins-bot: Declare Metrics Platform stream for wikifunctionswiki on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922569 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [18:10:27] (03PS3) 10Jforrester: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin) [18:10:31] (03CR) 10Jforrester: [C: 03+2] Add a comment about the need to specify logstash=>debug 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin) [18:11:25] (03Merged) 10jenkins-bot: Add a comment about the need to specify logstash=>debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900752 (owner: 10David Martin) [18:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T336886)', diff saved to https://phabricator.wikimedia.org/P48777 and previous config saved to /var/cache/conftool/dbconfig/20230605-181219-ladsgroup.json [18:12:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:12:24] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [18:12:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:14:01] (03PS1) 10Ottomata: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) [18:14:41] (03CR) 10CI reject: [V: 04-1] EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [18:15:20] (03PS2) 10Ottomata: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) [18:16:44] (03PS1) 10Dzahn: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) [18:17:19] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [18:18:04] (03Merged) 
10jenkins-bot: EventStreamConfig - revert page_change changes, somehow this broke [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927245 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [18:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T336886)', diff saved to https://phabricator.wikimedia.org/P48778 and previous config saved to /var/cache/conftool/dbconfig/20230605-181915-ladsgroup.json [18:19:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [18:19:18] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [18:19:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1224.eqiad.wmnet with reason: Maintenance [18:19:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48779 and previous config saved to /var/cache/conftool/dbconfig/20230605-181935-ladsgroup.json [18:19:48] (03CR) 10Dzahn: "CC: this is kind of a hard step to disable "gerrit-old.wikimedia.org" but of course it's revertable. 
so just fyi" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48780 and previous config saved to /var/cache/conftool/dbconfig/20230605-182144-ladsgroup.json [18:21:59] (03PS2) 10Dzahn: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) [18:22:22] (03PS1) 10BCornwall: pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) [18:22:38] (03CR) 10Dzahn: "I see more reviewers coming from the bot. so TLDR "what this really means is "cloud can't talk to gerrit1001 anymore" but gerrit on that h" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:25:01] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41555/console" [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [18:25:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:11] (03CR) 10Ssingh: [C: 03+2] remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:25:48] (03Merged) 10jenkins-bot: remove old gerrit service IP from static definitions [homer/public] - 10https://gerrit.wikimedia.org/r/927246 
(https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:26:06] (03CR) 10Ssingh: [C: 03+1] pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [18:26:16] jouncebot: nowandnext [18:26:16] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [18:26:16] In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000) [18:27:06] (03CR) 10BBlack: "These rewrites were always a very ugly hack, given the interaction of the sitemaps scheme with our vcl-switching code (which is why they'r" [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:28:12] !log Maglev LVS scheduler rollout in eqiad (puppet disabled) - T263797 [18:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:15] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [18:28:50] (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch eqiad LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/927247 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [18:28:57] (03Abandoned) 10Ssingh: Revert "Allow HTTP PATCH requests on "beta" sites" [puppet] - 10https://gerrit.wikimedia.org/r/926864 (owner: 10Ssingh) [18:29:46] !log homer "cr*-eqiad*" commit "Gerrit: 927246 remove old gerrit service IP" [18:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:46] 
!log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: revert - Remove unused page_change rc streams - T336817 (duration: 11m 23s) [18:30:48] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [18:31:52] (03PS2) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) [18:32:22] !log bking@cumin1001 depooling wdqs2010 for fw update T331297 [18:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:26] T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts - https://phabricator.wikimedia.org/T331297 [18:32:31] (03CR) 10Ssingh: [C: 03+2] "Changes merged after homer run. Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:33:13] (03PS1) 10Ottomata: Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248 [18:33:27] (03PS3) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) [18:34:13] (03CR) 10Dzahn: "@bblack ACK! thank you for the review, I amended to completely remove it. Is it right this way to remove the entire "sub_cluster" in both " [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:35:40] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2010.codfw.wmnet [18:35:42] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:36:21] (03CR) 10Eigyan: Deploy GDI safety survey to JA and RU wikis. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [18:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P48781 and previous config saved to /var/cache/conftool/dbconfig/20230605-183650-ladsgroup.json [18:38:04] (03CR) 10Dzahn: "thanks for review and deployment, that was super quick, appreciated" [homer/public] - 10https://gerrit.wikimedia.org/r/927246 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:38:17] (03CR) 10Ottomata: [C: 03+2] Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248 (owner: 10Ottomata) [18:39:02] (03Merged) 10jenkins-bot: Revert - bring back wgEventBusStreamNamesMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927248 (owner: 10Ottomata) [18:39:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:39:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:42:30] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:43:16] (03CR) 10BBlack: varnish: remove rewrites and tests for sitemaps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:45:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:45:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
wdqs2010.codfw.wmnet [18:47:54] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:48:03] !log bking@cumin1001 repooling wdqs2010 T331297 [18:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:07] T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts - https://phabricator.wikimedia.org/T331297 [18:48:30] (03PS4) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) [18:48:36] !log bking@cumin1001 depooling wdqs2011for fw update T331297 [18:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:42] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:48:50] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:48:56] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet [18:49:28] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:50:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:51:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff 
saved to https://phabricator.wikimedia.org/P48782 and previous config saved to /var/cache/conftool/dbconfig/20230605-185156-ladsgroup.json [18:52:00] (03CR) 10BBlack: [C: 03+1] "Perfect. It feels good to see 67 lines of VCL varnish into the ether 😊" [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [18:52:09] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: no-op: revert - remove undeeded wgEventBusStreamNamesMap override setting (take 2) - T336817 (duration: 11m 54s) [18:52:12] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [18:53:14] (03PS5) 10JHathaway: puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) [18:54:18] (03CR) 10JHathaway: puppetserver: add additional config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [18:55:00] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:56:40] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet [18:57:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [18:58:08] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs2011.codfw.wmnet [18:59:12] (03CR) 10JHathaway: [C: 03+2] puppetserver: add additional config options [puppet] - 10https://gerrit.wikimedia.org/r/925919 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:03:45] !log bking@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host wdqs2011.codfw.wmnet [19:05:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:05:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48783 and previous config saved to /var/cache/conftool/dbconfig/20230605-190528-ladsgroup.json [19:05:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:05:36] (03Abandoned) 10TChin: Fix overlapping names edge case in flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [19:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T336886)', diff saved to https://phabricator.wikimedia.org/P48784 and previous config saved to /var/cache/conftool/dbconfig/20230605-190702-ladsgroup.json [19:07:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [19:07:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [19:10:24] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:11:04] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:12:28] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet [19:12:31] !log bking@cumin1001 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs2011.codfw.wmnet [19:13:36] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [19:16:51] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Isaac) Excellent -- sounds like this task can be resolved then. I'll allow SRE to handle that in case they have a specific process but good luck @SalimJah and don't hesitate... [19:17:05] (03PS1) 10JHathaway: puppetserver: subtract keys rather than passing to dump_params [puppet] - 10https://gerrit.wikimedia.org/r/927250 [19:19:34] (03CR) 10JHathaway: [C: 03+2] puppetserver: subtract keys rather than passing to dump_params [puppet] - 10https://gerrit.wikimedia.org/r/927250 (owner: 10JHathaway) [19:23:12] (03PS9) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) [19:24:22] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:24:42] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 34 connections established with conf1007.eqiad.wmnet:4001 (min=34) https://wikitech.wikimedia.org/wiki/PyBal [19:25:02] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:26:26] (03CR) 10JHathaway: bookworm: Change to deb822 format for sources.list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [19:28:28] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:30:07] (03PS3) 10JHathaway: bookworm: Change to deb822 format for 
sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) [19:32:29] !log Maglev LVS scheduler rollout in eqiad finished (puppet re-enabled) - T263797 [19:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:33] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [19:32:59] (03CR) 10JHathaway: [C: 03+2] bookworm: Change to deb822 format for sources.list [puppet] - 10https://gerrit.wikimedia.org/r/925878 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [19:38:55] (03PS10) 10Herron: pyrra: initial packaging for v0.6.2 [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) [19:42:05] (03CR) 10Herron: pyrra: initial packaging for v0.6.2 (031 comment) [debs/pyrra] - 10https://gerrit.wikimedia.org/r/922608 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:43:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48785 and previous config saved to /var/cache/conftool/dbconfig/20230605-194336-ladsgroup.json [19:43:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P48786 and previous config saved to /var/cache/conftool/dbconfig/20230605-195842-ladsgroup.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2000). [20:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:21] i'll deploy since i'm the only one with patches in the queue [20:00:45] cjming: would you mind pinging me once done? [20:00:53] I have some deployments to do :)) [20:01:30] urbanecm: sure thing! [20:01:45] Ty! [20:01:54] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [20:03:18] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (10phaultfinder) [20:03:23] (03PS5) 10Clare Ming: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) [20:04:42] (03CR) 10TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [20:04:52] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:37] (03PS4) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 [20:05:45] (03Merged) 10jenkins-bot: Add initial stream configs for Android article events using Metrics Platform Java client library [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/926617 (https://phabricator.wikimedia.org/T330355) (owner: 10Clare Ming) [20:06:35] hmm - never encountered this before "backport failed: Command '['git', '-C', '/srv/mediawiki-staging/php-1.41.0-wmf.11', 'fetch']' returned non-zero exit status 1." [20:06:46] should i retry? [20:07:30] cjming: let me see what's happening [20:07:37] does it include any details above it? [20:07:39] here's the error I'm seeing: error: insufficient permission for adding an object to repository database /srv/mediawiki-staging/php-1.41.0-wmf.11/.git/modules/extensions/DonationInterface/objects [20:07:39] fatal: failed to write object [20:07:39] fatal: unpack-objects failed [20:07:55] ah, okay. i know how to fix that one :) [20:08:07] phew - thanks! curious what the fix is [20:08:41] that error was preceded by "Fetching submodule extensions/DonationInterface" [20:09:28] !log [urbanecm@deploy1002 ~]$ sudo /usr/local/sbin/fix-staging-perms # attempt to fix permission errors when doing a backport [20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:32] cjming: can you try again? [20:09:36] yup [20:10:00] fwiw, the fix should be to run `sudo /usr/local/sbin/fix-staging-perms`, which is supposed to fix permissions on the deployment host. [20:10:12] !log cjming@deploy1002 Started scap: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]] [20:10:15] T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355 [20:10:19] seems like a better outcome this time! [20:10:21] things look more promising - thanks! 
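Note on the error above: git prints "insufficient permission for adding an object to repository database" when the user running the fetch cannot write into the repository's objects directory. The sketch below is one way to look for such directories. It is not the contents of `fix-staging-perms`, which this log does not show. The helper name `find_unwritable_dirs` and the choice to test the group-write bit are assumptions made for illustration.

```python
import os
import stat


def find_unwritable_dirs(root):
    """Return directories under root (root included) that lack the group-write bit.

    Hypothetical diagnostic helper: it only reads mode bits and changes nothing.
    """
    unwritable = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if not mode & stat.S_IWGRP:
            unwritable.append(dirpath)
    return sorted(unwritable)
```

Pointing it at the objects directory named in the error message would list the directories a group member could not write to. An empty list would suggest the problem lies elsewhere, for example in ownership rather than mode bits.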
[20:10:26] any time [20:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P48787 and previous config saved to /var/cache/conftool/dbconfig/20230605-201349-ladsgroup.json [20:21:19] 10SRE: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Urbanecm) [20:23:13] !log cjming@deploy1002 cjming: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:23:16] T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355 [20:24:58] (03PS1) 10Dzahn: delete gerrit-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/927267 (https://phabricator.wikimedia.org/T336427) [20:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T336886)', diff saved to https://phabricator.wikimedia.org/P48788 and previous config saved to /var/cache/conftool/dbconfig/20230605-202855-ladsgroup.json [20:28:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [20:28:59] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:29:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [20:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48789 and previous config saved to /var/cache/conftool/dbconfig/20230605-202916-ladsgroup.json [20:29:35] (03PS1) 10Urbanecm: fix-stagging-perms: Fix group owner change for /srv/patches 
[puppet] - 10https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180) [20:31:05] (03CR) 10BCornwall: [C: 03+1] varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [20:35:10] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:926617|Add initial stream configs for Android article events using Metrics Platform Java client library (T330355)]] (duration: 24m 57s) [20:35:13] T330355: Incorporate librarized Metrics Platform Java client into the Android app - https://phabricator.wikimedia.org/T330355 [20:35:30] (03PS7) 10BCornwall: lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) [20:35:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 (owner: 10Clare Ming) [20:36:05] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Urbanecm) Hi @SlimJah, not sure if this is helpful, but in addition to what @Isaac mentioned, there is also https://dumps.wikimedia.org/other/mediawiki_history/, which inclu... 
[20:36:24] (03Merged) 10jenkins-bot: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 (owner: 10Clare Ming) [20:36:45] !log cjming@deploy1002 Started scap: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]] [20:37:33] (03PS1) 10Urbanecm: NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085) [20:37:56] cjming: would it be ok if i +2 my backport while your config deployment finishes, to save a bit time on CI? :) [20:38:24] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41556/console" [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [20:38:36] urbanecm: of course! np [20:38:40] ty [20:38:47] (03CR) 10Urbanecm: [C: 03+2] NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085) (owner: 10Urbanecm) [20:38:49] !log cjming@deploy1002 cjming: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:39:07] (03Abandoned) 10BCornwall: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/924561 (https://phabricator.wikimedia.org/T263797) (owner: 10Ssingh) [20:41:57] (03CR) 10Dzahn: [C: 03+1] "confirmed, obvious typo" [puppet] - 10https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180) (owner: 10Urbanecm) [20:42:24] (03CR) 10Dzahn: [C: 03+2] fix-stagging-perms: Fix group owner change for /srv/patches [puppet] - 
https://gerrit.wikimedia.org/r/927269 (https://phabricator.wikimedia.org/T338180) (owner: Urbanecm)
[20:42:35] urbanecm: out of curiosity, i always meant to ask someone about this, so +2ing a backport manually for long-running CI on extensions means that wherever the deployer is in the process, there will be a notice at some point that there are diffs - as long as they are expected, i'm assuming it's ok to carry on with scap'ing -- i guess my Q is if there is ever need to revert during a window, does manually +2ing cause
[20:42:35] problems?
[20:43:32] (PS1) JHathaway: bookworm: improve deb822 puppet warning [puppet] - https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495)
[20:44:34] cjming: so, scap will tell you if it ever fetches a commit that you didn't specify on the commandline. it then gives you a chance to review and decide whether to continue. personally, i manually +2 to speed things up after i start i tell scap to sync to the whole fleet, as then it's highly unlikely i'll need to revert
[20:44:58] Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/owner: owner changed 'cjming' to 'helm' (corrective)
[20:45:00] even if i did need to revert, so long it didn't get past mwdebug, i can deploy both the revert and the newly-merged patch together
[20:45:01] Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/group: group changed 'wikidev' to 'deployment' (corrective)
[20:45:04] Notice: /Stage[main]/Helm/File[/var/cache/helm/repository/mediawiki-0.4.15.tgz]/mode: mode changed '0644' to '0775' (corrective)
[20:45:07] Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/File[/usr/local/sbin/fix-staging-perms]/content:
[20:45:10] +find /srv/patches -not -group wikidev -print0 | xargs -0 -r chgrp wikidev
[20:45:33] urbanecm: got it - thanks - that makes sense
[20:45:36] Profile::Mediawiki::Deployment::Server/File[/usr/local/sbin/fix-staging-perms]/content: content changed
[20:45:51] (CR) CI reject: [V: -1] bookworm: improve deb822 puppet warning [puppet] - https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[20:46:23] mutante: seems like meaningful changes to me (assuming it's diff for the change i proposed few mins ago).
[20:46:38] urbanecm: yea, I am sharing with you that it's done
[20:46:42] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:920742|Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere""]] (duration: 09m 57s)
[20:46:49] merged and deployed the fix
[20:46:52] thanks mutante!
[20:46:58] urbanecm: all done - all yours
[20:47:01] thanks!
[20:47:06] np!
[20:47:10] urbanecm: yw, also re: "scap will tell you.." , can you see https://phabricator.wikimedia.org/T338168
[20:47:24] that ticket came out of incident review meeting today
[20:47:33] now it (fix-staging-perms) finished without errors!
[20:47:35] is it maybe already doing what is requested
[20:47:42] great, ok
[20:47:51] yea, typo was obvious
[20:47:54] !log [urbanecm@deploy1002 ~]$ sudo /usr/local/sbin/fix-staging-perms # verify T338180 fix
[20:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:56] T338180: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180
[20:48:06] !log end of UTC late backport window
[20:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:10] i'll check the ticket after my deployment :)
[20:48:27] (PS2) JHathaway: bookworm: improve deb822 puppet warning [puppet] - https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495)
[20:48:29] of course, thanks
[20:48:47] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: JHathaway)
[20:49:34] (PS1) Urbanecm: Update interwiki cache
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) [20:50:00] (03CR) 10jenkins-bot: bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [20:50:11] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm) [20:50:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm) [20:50:53] (03CR) 10JHathaway: [C: 03+2] bookworm: improve deb822 puppet warning [puppet] - 10https://gerrit.wikimedia.org/r/927270 (https://phabricator.wikimedia.org/T330495) (owner: 10JHathaway) [20:51:02] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926560 (https://phabricator.wikimedia.org/T338093) (owner: 10Urbanecm) [20:51:29] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:926560|Update interwiki cache (T338093)]] [20:51:32] T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093 [20:53:01] 10SRE, 10Patch-For-Review: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Dzahn) deployed in puppet and now: ` [deploy1002:~] $ /usr/local/sbin/fix-staging-perms [deploy1002:~] $ ` [21:00:02] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2100). 
[21:00:48] i'm still scap'ing, it is unreasonably slow :-/
[21:01:16] (Merged) jenkins-bot: NewImpact: Fix renderMode parsing for Special:Impact [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/926865 (https://phabricator.wikimedia.org/T338085) (owner: Urbanecm)
[21:03:24] (PS1) Urbanecm: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094)
[21:04:42] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:03] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:926560|Update interwiki cache (T338093)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:05:06] scap is being UNREASONABLY slow :-/
[21:05:07] T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093
[21:05:29] (PS1) TheDJ: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183)
[21:06:00] urbanecm: I ran the fix-perms script when you were already using scap. but seems unrelated
[21:06:19] yeah, it was slow even with cj.ming's patch.
[21:06:24] ok [21:08:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48790 and previous config saved to /var/cache/conftool/dbconfig/20230605-210827-ladsgroup.json [21:08:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:11:01] (03CR) 10BBlack: [C: 03+1] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [21:14:13] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [21:14:40] 10SRE, 10User-Urbanecm: fix-stagging-perms errors out with "find: paths must precede expression: `group'" - https://phabricator.wikimedia.org/T338180 (10Urbanecm) 05Open→03Resolved a:03Urbanecm [21:14:58] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/927277 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [21:15:18] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [21:16:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:926560|Update interwiki cache (T338093)]] (duration: 24m 34s) [21:16:06] T338093: Interwiki map update required - https://phabricator.wikimedia.org/T338093 [21:16:47] (03CR) 10Krinkle: [C: 03+1] trafficserver: remove map for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926605 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [21:16:53] (03CR) 10Krinkle: [C: 03+1] varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) (owner: 10Dzahn) [21:17:08] !log urbanecm@deploy1002 Started scap: 
Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]] [21:17:11] T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085 [21:18:15] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [21:19:13] jouncebot: nowandnext [21:19:13] For the next 1 hour(s) and 40 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230605T2100) [21:19:13] In 4 hour(s) and 40 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230606T0200) [21:19:34] urbanecm: please tell me once you're done! [21:19:36] will do [21:19:50] probably in ~30 minutes if scap backport doesn't speed up itself. [21:22:17] no worries [21:23:27] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1015.eqiad.wmnet [21:23:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P48791 and previous config saved to /var/cache/conftool/dbconfig/20230605-212333-ladsgroup.json [21:25:38] (03PS5) 10Dzahn: varnish: remove rewrites and tests for sitemaps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/926611 (https://phabricator.wikimedia.org/T338064) [21:25:45] (03PS1) 10Urbanecm: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094) [21:25:48] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1015.eqiad.wmnet [21:25:57] (03CR) 10Urbanecm: [C: 03+2] Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [21:27:10] (03Merged) 10jenkins-bot: Revert 
"linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927286 (https://phabricator.wikimedia.org/T338094) (owner: 10Urbanecm) [21:27:56] (03PS4) 10Dzahn: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [21:28:20] (03CR) 10BCornwall: [V: 03+1 C: 03+2] lvs: Switch text/upload 'sh' schedulers to 'mh' [puppet] - 10https://gerrit.wikimedia.org/r/911350 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [21:29:18] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [21:29:46] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [21:30:41] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:30:44] T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085 [21:31:09] works, proceeding [21:31:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [21:31:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [21:31:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:31:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [21:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48792 and previous config 
saved to /var/cache/conftool/dbconfig/20230605-213202-ladsgroup.json [21:35:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1015.eqiad.wmnet [21:35:37] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1015.eqiad.wmnet [21:36:52] (03PS4) 10Dzahn: site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) [21:37:58] (03CR) 10Hashar: [C: 04-1] "This will work solely cause the target repositories have been fixed manually." [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [21:38:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48793 and previous config saved to /var/cache/conftool/dbconfig/20230605-213819-ladsgroup.json [21:38:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P48794 and previous config saved to /var/cache/conftool/dbconfig/20230605-213839-ladsgroup.json [21:38:41] (03PS6) 10BCornwall: sre.cdn: move common functions to base class [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [21:42:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:926865|NewImpact: Fix renderMode parsing for Special:Impact (T338085)]] (duration: 25m 38s) [21:42:50] T338085: Special:Impact fails to load - https://phabricator.wikimedia.org/T338085 [21:42:51] finally [21:42:54] Amir1: stage's yours :) [21:43:12] awesome [21:43:28] (03CR) 10Dzahn: "would it be helpful if we made this an "if bullseye" thing for the migration period? 
so basically use gitlab on new hosts, don't touch old" [puppet] - 10https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: 10Reedy) [21:44:34] (03CR) 10Ladsgroup: [C: 03+2] Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [21:46:04] (03CR) 10Dzahn: [V: 03+1] "fresh compiled. all looks good to me, including no change on doc hosts: https://puppet-compiler.wmflabs.org/output/914731/41557/" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [21:47:49] (03CR) 10BCornwall: "(rebased off of master so my broken ATS cookbook isn't included)." [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [21:50:19] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1016.eqiad.wmnet [21:51:01] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1016.eqiad.wmnet [21:52:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [21:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P48795 and previous config saved to /var/cache/conftool/dbconfig/20230605-215326-ladsgroup.json [21:53:30] (Device rebooted) firing: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:53:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48796 and previous config saved to /var/cache/conftool/dbconfig/20230605-215345-ladsgroup.json 
[21:53:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:53:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:54:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [21:58:30] (Device rebooted) resolved: Device ps1-e3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:00:50] (03CR) 10JHathaway: "lgtm, ensureable not withstanding ;)" [puppet] - 10https://gerrit.wikimedia.org/r/926464 (owner: 10Jbond) [22:00:57] (03CR) 10JHathaway: [C: 03+1] puppetserver::git: make ensureable [puppet] - 10https://gerrit.wikimedia.org/r/926464 (owner: 10Jbond) [22:01:24] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1016.eqiad.wmnet [22:01:26] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1016.eqiad.wmnet [22:03:29] (03Merged) 10jenkins-bot: Help measure the impact of saneitizer jobs [extensions/CirrusSearch] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/926860 (https://phabricator.wikimedia.org/T336698) (owner: 10Ladsgroup) [22:03:50] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] [22:03:53] T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698 [22:05:30] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [22:06:55] (03PS1) 10Ladsgroup: moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - 
10https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700) [22:07:15] (03CR) 10Ladsgroup: [C: 03+2] moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700) (owner: 10Ladsgroup) [22:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P48797 and previous config saved to /var/cache/conftool/dbconfig/20230605-220833-ladsgroup.json [22:08:49] (03CR) 10BCornwall: [C: 03+1] trafficserver::backend: Add a cache config for puppetboard-next (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927172 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [22:13:39] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:926860|Help measure the impact of saneitizer jobs (T336698)]] (duration: 09m 48s) [22:13:42] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:42] T336698: Reduce the load of CirrusSearch update jobs on MW jobrunners - https://phabricator.wikimedia.org/T336698 [22:15:52] (03PS1) 10Ladsgroup: Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927288 [22:15:58] (03PS2) 10Ladsgroup: Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927288 [22:16:04] (03CR) 10Ladsgroup: [C: 03+2] Revert "Remove legacy encoding option from dawiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927288 (owner: 10Ladsgroup) [22:16:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927288 (owner: 10Ladsgroup) [22:16:58] (03Merged) 10jenkins-bot: Revert 
"Remove legacy encoding option from dawiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927288 (owner: 10Ladsgroup) [22:17:13] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]] [22:18:41] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [22:20:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:23:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T335845)', diff saved to https://phabricator.wikimedia.org/P48798 and previous config saved to /var/cache/conftool/dbconfig/20230605-222339-ladsgroup.json [22:24:43] (03Merged) 10jenkins-bot: moveToExternal: Actually convert encoding of cur_text [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/927287 (https://phabricator.wikimedia.org/T337700) (owner: 10Ladsgroup) [22:24:54] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:927288|Revert "Remove legacy encoding option from dawiktionary"]] (duration: 07m 40s) [22:27:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [22:27:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [22:27:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48799 and previous config saved to /var/cache/conftool/dbconfig/20230605-222745-ladsgroup.json [22:27:48] T336886: Add user_is_temp field to the user table in 
MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:28:36] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]] [22:28:38] T337700: Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via LqtVIew) - https://phabricator.wikimedia.org/T337700 [22:29:55] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:30:06] (03PS5) 10Dzahn: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:30:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:30:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:30:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48800 and previous config saved to /var/cache/conftool/dbconfig/20230605-223035-ladsgroup.json [22:32:29] (03CR) 10CI reject: [V: 04-1] Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:33:45] (03PS6) 10Dzahn: Use same php version for doc and integration websites [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [22:34:03] Amir1: could you ping me when you are done? 
[22:34:16] sure, almost done
[22:36:52] (CR) Dzahn: [C: +2] Use same php version for doc and integration websites [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:37:40] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:927287|moveToExternal: Actually convert encoding of cur_text (T337700)]] (duration: 09m 04s)
[22:37:43] T337700: Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via LqtVIew) - https://phabricator.wikimedia.org/T337700
[22:37:48] zabe: done ^
[22:38:33] (PS2) Zabe: Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954)
[22:39:12] (CR) Dzahn: [C: +2] "step 1: deployed on doc2002, then doc1003. noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:39:22] (CR) Zabe: [C: +2] Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:40:11] (Merged) jenkins-bot: Stop writing to revision_comment_temp in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[22:40:28] !log zabe@deploy1002 Started scap: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]]
[22:40:30] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:41:48] !log zabe@deploy1002 zabe: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[22:42:06] * Amir1 grabs popcorn
[22:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48801 and previous config saved to /var/cache/conftool/dbconfig/20230605-224528-ladsgroup.json
[22:45:30] (CR) Dzahn: [C: +2] "step 2: deployed on contint2002 (bullseye, not prod) - first puppet run errors, after second puppet run ok (dependencies). Otherwise looki" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:45:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[22:46:32] !log contint2002, contint1002 - upgrading PHP from 7.3 to 7.4
[22:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:40] mutante: Oooh.
[22:48:10] James_F: 2001 as the actual prod server would be last. you can still veto it there :)
[22:48:16] :-D
[22:48:22] No no, I'm very happy it's happening.
[22:48:22] or I jfdi
[22:48:25] ok
[22:48:31] Go go go.
[22:48:41] Then I can land https://gerrit.wikimedia.org/r/c/integration/config/+/909388/
[22:48:43] :) thanks, and also for the gerrit IP thing
[22:48:52] nice
[22:49:05] I want us to switch 2001 to 2002
[22:49:11] Ack.
[22:49:20] but if the PHP upgrade makes us feel better about it.. sure. we do that now
[22:49:27] :-D
[22:49:41] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:925047|Stop writing to revision_comment_temp in testwiki (T299954)]] (duration: 09m 13s)
[22:49:44] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[22:51:05] (CR) Dzahn: [C: +2] "step 4: deployed to contint1002, buster. 2 puppet runs needed, then manually removing 7.3 packages as above and restarting apache2" [puppet] - https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[22:52:06] James_F: do you know if contint1002 is used by anything? buster/prod but not the "main CI server"
[22:52:16] I am not sure right now
[22:52:22] regardless it is done there
[22:52:28] now doing the main one
[22:52:34] mutante: I *think* it's not currently used, but maybe it's used by releases-jenkins?
[22:52:56] I had similar thoughts there. and a bit like gerrit-replica
[22:53:06] Yeah.
[22:53:08] going ahead
[22:53:39] (CR) Cwhite: [C: +2] prometheus: add external swagger checks to all sites [puppet] - https://gerrit.wikimedia.org/r/925119 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:53:43] !log contint2001 (prod main CI server) - upgrading PHP 7.3 to 7.4
[22:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:19] !log jforrester@deploy1002 Started deploy [integration/docroot@8255d99]: I6c757561deb14e84a95ef9fc68053b3e48ff941c for T337425
[22:55:22] T337425: Re-implement post-merge publication of code coverage for Wikifunctions's repos on GitLab - https://phabricator.wikimedia.org/T337425
[22:55:30] (CR) Cwhite: [C: +2] opensearch_dashboards: remove alerting and observability plugins [puppet] - https://gerrit.wikimedia.org/r/925114 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite)
[22:55:33] !log jforrester@deploy1002 Finished deploy [integration/docroot@8255d99]: I6c757561deb14e84a95ef9fc68053b3e48ff941c for T337425 (duration: 00m 13s)
[22:56:02] (PS1) Andrew Bogott: wmcs-cinder-backup-manager: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927308
[22:56:04] (PS1) Andrew Bogott: wmcs-cinder-volume-backup: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927309
[22:56:23] we shouldn't be using mod_php there
[22:56:35] php-fpm... but also .. not worth it, afaict
[22:56:49] (CR) Andrew Bogott: [C: +2] wmcs-cinder-backup-manager: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927308 (owner: Andrew Bogott)
[22:57:00] !log contint2001 - sudo apt-get remove --purge libapache2-mod-php7.3 php7.3-cli php7.3-common php7.3-json php7.3-opcache php7.3-readline
[22:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:02] (CR) Andrew Bogott: [C: +2] wmcs-cinder-volume-backup: log to /var/log [puppet] - https://gerrit.wikimedia.org/r/927309 (owner: Andrew Bogott)
[22:57:35] !log contint2001 - sudo systemctl restart apache2
[22:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:52] mutante: Old stuff that got copy-pasted into the present, or actually used?
[22:58:36] James_F: which part do you mean? that we use mod_php ?
[22:58:41] James_F: it is done
[22:58:42] mutante: Yeah.
[22:58:47] \o/
[22:58:52] https://integration.wikimedia.org/ is up
[22:58:55] but is that the right test
[22:59:24] wanna recheck your related change or something?
[22:59:24] I've got a patch that'll stress-test it. ;-)
[22:59:30] perfect
[22:59:51] James_F: I think old stuff, just never migrated
[23:00:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P48802 and previous config saved to /var/cache/conftool/dbconfig/20230605-230034-ladsgroup.json
[23:00:44] unsure how much "but soon all different anyways" applies for it :)
[23:01:14] focuses on shutting down buster things
[23:02:01] ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338152 (phaultfinder)
[23:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48803 and previous config saved to /var/cache/conftool/dbconfig/20230605-230752-ladsgroup.json
[23:07:56] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[23:09:19] (CR) Dzahn: [C: +2] "step 5: deployed to contint2001, main prod host, same. 2 puppet runs, remove old packages, restart apache2..
integration.wikimedia.org loo" [puppet] - 10https://gerrit.wikimedia.org/r/914731 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [23:09:34] !log jforrester@deploy1002 Started deploy [integration/docroot@ab77611]: Idf6c7ad01ed18785b850967252c6867d7871e902 [23:09:40] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:42] !log jforrester@deploy1002 Finished deploy [integration/docroot@ab77611]: Idf6c7ad01ed18785b850967252c6867d7871e902 (duration: 00m 08s) [23:10:56] !log jforrester@deploy1002 Started deploy [integration/docroot@6eefe56]: I5c1b92322ae59bfe8a9233ad23c3c89b844f5fb7 for T334492 [23:10:59] T334492: Create a new phan config file to make usage for libraries easier - https://phabricator.wikimedia.org/T334492 [23:11:02] !log jforrester@deploy1002 Finished deploy [integration/docroot@6eefe56]: I5c1b92322ae59bfe8a9233ad23c3c89b844f5fb7 for T334492 (duration: 00m 05s) [23:12:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:13:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:14:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove mgmt DNS for ssw1-a1 for testing - pt1979@cumin2002" [23:15:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove mgmt DNS for ssw1-a1 for testing - pt1979@cumin2002" [23:15:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:15:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P48804 and previous 
config saved to /var/cache/conftool/dbconfig/20230605-231540-ladsgroup.json [23:15:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device ssw1-a1-codfw.mgmt.codfw.wmnet [23:22:08] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [23:22:10] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P48805 and previous config saved to /var/cache/conftool/dbconfig/20230605-232258-ladsgroup.json [23:24:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [23:25:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [23:25:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:30:11] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) Please also see T334517#8904608 for a plan on how to proceed with contint* upgrades. Also today we upgraded PHP from 7.... 
[23:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T336886)', diff saved to https://phabricator.wikimedia.org/P48806 and previous config saved to /var/cache/conftool/dbconfig/20230605-233046-ladsgroup.json [23:30:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:30:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:31:01] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) @hashar This new machine is on buster. Somehow I thought we did bullseye from the start. I suggest we reimage it. See li... [23:31:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:31:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T336886)', diff saved to https://phabricator.wikimedia.org/P48807 and previous config saved to /var/cache/conftool/dbconfig/20230605-233107-ladsgroup.json [23:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T336886)', diff saved to https://phabricator.wikimedia.org/P48808 and previous config saved to /var/cache/conftool/dbconfig/20230605-233318-ladsgroup.json [23:38:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P48809 and previous config saved to /var/cache/conftool/dbconfig/20230605-233804-ladsgroup.json [23:39:12] (03PS1) 10Zabe: Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954) [23:41:25] (03CR) 10Zabe: [C: 03+2] Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [23:42:11] (03Merged) 10jenkins-bot: Stop writing to revision_comment_temp in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927312 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [23:42:27] !log zabe@deploy1002 Started scap: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]] [23:42:30] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:43:47] !log zabe@deploy1002 zabe: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [23:48:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P48810 and previous config saved to /var/cache/conftool/dbconfig/20230605-234824-ladsgroup.json [23:49:29] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:927312|Stop writing to revision_comment_temp in group0 wikis (T299954)]] (duration: 07m 02s) [23:49:32] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T336886)', diff saved to https://phabricator.wikimedia.org/P48811 and previous config saved to /var/cache/conftool/dbconfig/20230605-235310-ladsgroup.json [23:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [23:53:14] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:53:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with 
reason: Maintenance [23:53:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:53:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:53:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T336886)', diff saved to https://phabricator.wikimedia.org/P48812 and previous config saved to /var/cache/conftool/dbconfig/20230605-235346-ladsgroup.json