[00:01:16] (03CR) 10Eccenux: "^" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo) [00:01:28] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486 (owner: 10TrainBranchBot) [00:02:07] (03PS1) 10Arlolra: Change Linter log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051487 [00:05:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T367856)', diff saved to https://phabricator.wikimedia.org/P65683 and previous config saved to /var/cache/conftool/dbconfig/20240703-000506-marostegui.json [00:05:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [00:05:13] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:05:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [00:06:45] (03CR) 10Arlolra: Change Linter log level to info (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051487 (owner: 10Arlolra) [00:09:38] (03CR) 10Arlolra: [C:04-1] Change Linter log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051487 (owner: 10Arlolra) [00:15:13] (03PS2) 10Arlolra: Change Linter log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051487 [00:15:15] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9947564 (10wiki_willy) Hi @Eevans - since we've replaced all hardware parts on this host, and the error is still showing up, it doesn't seem like it's a hardware problem. It's also really odd th... 
[00:16:00] (03CR) 10Arlolra: Change Linter log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051487 (owner: 10Arlolra) [00:16:16] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_drmrs [00:20:11] (03PS1) 10RLazarus: deployment_server: Add a daily systemd timer for mwscript_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) [00:27:06] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_drmrs [01:15:53] PROBLEM - Disk space on restbase2023 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 96068 MB (5% inode=99%): /srv/sdc4 68760 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops [01:16:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [01:16:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [01:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65684 and previous config saved to /var/cache/conftool/dbconfig/20240703-011701-marostegui.json [01:17:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:17:57] (03CR) 10Scott French: [C:03+1] "Had to check that "1 day" is a valid time span definition :) LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [01:25:41] (03CR) 10Scott French: [C:03+1] "Hmmm ... actually, I ran a PCC diff on this, and indeed it complains about the interval definition [0]." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [01:54:16] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:11] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:48:11] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 9.81 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:50:33] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:16] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:47:52] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65685 and previous config saved to /var/cache/conftool/dbconfig/20240703-034751-marostegui.json [03:47:54] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:56:02] (03CR) 10Krinkle: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [04:00:34] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:02:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P65686 and previous config saved to /var/cache/conftool/dbconfig/20240703-040258-marostegui.json [04:03:14] (03CR) 10Krinkle: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [04:03:59] (03CR) 10Krinkle: [C:04-1] "Looks like the wgLoadScript comment still applies." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [04:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P65687 and previous config saved to /var/cache/conftool/dbconfig/20240703-041805-marostegui.json [04:19:48] (03PS1) 10Andrew Bogott: deployment-prep mcrouter: replace old memc servers with new ones [puppet] - 10https://gerrit.wikimedia.org/r/1051499 (https://phabricator.wikimedia.org/T361384) [04:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65688 and previous config saved to /var/cache/conftool/dbconfig/20240703-043312-marostegui.json [04:33:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [04:33:16] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:33:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [04:33:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T364069)', diff saved to https://phabricator.wikimedia.org/P65689 and previous config saved to /var/cache/conftool/dbconfig/20240703-043335-marostegui.json [04:46:36] (03PS1) 10Marostegui: db22[21-40]: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1051500 (https://phabricator.wikimedia.org/T368922) [04:47:26] (03CR) 10Marostegui: [C:03+2] db22[21-40]: Add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1051500 (https://phabricator.wikimedia.org/T368922) (owner: 10Marostegui) [04:50:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65690 and previous config saved to 
/var/cache/conftool/dbconfig/20240703-045018-root.json [04:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool with small weight T365805', diff saved to https://phabricator.wikimedia.org/P65691 and previous config saved to /var/cache/conftool/dbconfig/20240703-045109-marostegui.json [04:51:12] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [04:57:55] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9947893 (10Marostegui) >>! In T368098#9946355, @xcollazo wrote: >>>! In T36809... [05:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65692 and previous config saved to /var/cache/conftool/dbconfig/20240703-050523-root.json [05:06:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1051502 (https://phabricator.wikimedia.org/T369130) [05:06:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T369130 [05:06:43] T369130: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T369130 [05:06:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T369130', diff saved to https://phabricator.wikimedia.org/P65693 and previous config saved to /var/cache/conftool/dbconfig/20240703-050647-root.json [05:07:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T369130 [05:07:41] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1051502 (https://phabricator.wikimedia.org/T369130) (owner: 10Gerrit maintenance bot) [05:14:19] (03PS1) 10Marostegui: 
site.pp: Add db22[21-40] [puppet] - 10https://gerrit.wikimedia.org/r/1051504 (https://phabricator.wikimedia.org/T368922) [05:14:53] (03CR) 10Marostegui: [C:03+2] site.pp: Add db22[21-40] [puppet] - 10https://gerrit.wikimedia.org/r/1051504 (https://phabricator.wikimedia.org/T368922) (owner: 10Marostegui) [05:20:06] !log Starting s2 codfw failover from db2207 to db2204 - T369130 [05:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:11] T369130: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T369130 [05:20:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T369130', diff saved to https://phabricator.wikimedia.org/P65694 and previous config saved to /var/cache/conftool/dbconfig/20240703-052029-root.json [05:20:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65695 and previous config saved to /var/cache/conftool/dbconfig/20240703-052035-root.json [05:21:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 T369130', diff saved to https://phabricator.wikimedia.org/P65696 and previous config saved to /var/cache/conftool/dbconfig/20240703-052118-root.json [05:22:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Long schema change [05:23:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Long schema change [05:23:48] !log Deploy schema change on db2207 s2 codfw dbmaint T367856 [05:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:51] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:23:58] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 
Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9947951 (10SGupta-WMF) @xcollazo The column renaming is done to match api outp... [05:35:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65697 and previous config saved to /var/cache/conftool/dbconfig/20240703-053541-root.json [05:50:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65698 and previous config saved to /var/cache/conftool/dbconfig/20240703-055046-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65699 and previous config saved to /var/cache/conftool/dbconfig/20240703-060552-root.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65700 and previous config saved to /var/cache/conftool/dbconfig/20240703-062057-root.json [06:43:58] !log 
ayounsi@cumin1002 START - Cookbook sre.dns.netbox [06:46:38] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 208.80.152.129 - ayounsi@cumin1002" [06:47:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 208.80.152.129 - ayounsi@cumin1002" [06:47:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:58:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T364069)', diff saved to https://phabricator.wikimedia.org/P65701 and previous config saved to /var/cache/conftool/dbconfig/20240703-065759-marostegui.json [06:58:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:59:02] (03CR) 10Slyngshede: [C:03+2] LDAP key sync: Improvements to SSH key sync with LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0700). nyaa~ [07:00:05] wargo: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:32] (03Merged) 10jenkins-bot: LDAP key sync: Improvements to SSH key sync with LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [07:01:54] Can I deploy MinT since there are no patches to deploy in the backport/config? [07:03:20] 1.. 2.. 3.. seems no one is deploying. I'll go ahead. 
[07:03:53] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-07-02-060114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051290 (https://phabricator.wikimedia.org/T364525) (owner: 10KartikMistry) [07:04:45] (03Merged) 10jenkins-bot: Update MinT to 2024-07-02-060114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051290 (https://phabricator.wikimedia.org/T364525) (owner: 10KartikMistry) [07:07:29] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [07:12:06] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:13:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P65702 and previous config saved to /var/cache/conftool/dbconfig/20240703-071306-marostegui.json [07:14:09] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [07:17:43] (03CR) 10Arnaudb: [C:03+1] DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [07:21:59] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:23:40] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:24:49] (03CR) 10Superpes15: [C:04-1] "It doesn't work like this, you have to follow logos/README.md and run Tox, thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo) [07:28:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P65704 and previous config saved to /var/cache/conftool/dbconfig/20240703-072814-marostegui.json [07:31:51] (03PS1) 10JMeybohm: kubernetes: Remove etcd_urls from wikikube clusters [puppet] - 
10https://gerrit.wikimedia.org/r/1051678 (https://phabricator.wikimedia.org/T353464) [07:32:13] (03CR) 10CI reject: [V:04-1] kubernetes: Remove etcd_urls from wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/1051678 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [07:32:19] (03CR) 10JMeybohm: "Feel free to merge as you see fit" [puppet] - 10https://gerrit.wikimedia.org/r/1051678 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [07:32:36] (03PS2) 10JMeybohm: kubernetes: Remove etcd_urls from wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/1051678 (https://phabricator.wikimedia.org/T353464) [07:33:09] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051678 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [07:33:38] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:34:09] (03CR) 10Ayounsi: [C:03+2] DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [07:36:57] !log Updated MinT to 2024-07-02-060114-production (T364525) [07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:00] T364525: Ignore extra spaces form source text in the MinT test instance - https://phabricator.wikimedia.org/T364525 [07:38:41] (03PS1) 10Brouberol: OpenJDK: build JDK/JDE 17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051679 (https://phabricator.wikimedia.org/T363461) [07:40:11] (03PS2) 10Brouberol: OpenJDK: build JDK/JRE 17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051679 (https://phabricator.wikimedia.org/T363461) [07:43:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T364069)', diff saved to https://phabricator.wikimedia.org/P65705 and previous config 
saved to /var/cache/conftool/dbconfig/20240703-074321-marostegui.json [07:43:25] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:47:57] (03CR) 10Filippo Giunchedi: [C:03+1] logstash: route thumbor logs in routing filter [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [07:48:24] (03CR) 10Filippo Giunchedi: [C:03+1] logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite) [07:50:41] (03CR) 10Volans: "Given the comment from Andrew Otto on task I think it's fine with just Research as approvers." [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [07:52:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:52:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:52:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T364069)', diff saved to https://phabricator.wikimedia.org/P65706 and previous config saved to /var/cache/conftool/dbconfig/20240703-075245-marostegui.json [07:52:48] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:57:35] (03PS2) 10Filippo Giunchedi: shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) [08:00:05] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0800) [08:00:41] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948101 (10dcaro) 
Doing some tests this morning with rados bench from several of the nodes. Running on 12 osd nodes... [08:00:57] jouncebot: now [08:00:57] For the next 1 hour(s) and 59 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0800) [08:00:58] hi [08:00:59] ;) [08:02:57] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051681 (https://phabricator.wikimedia.org/T366957) [08:02:59] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051681 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:03:40] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051681 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:04:56] (03CR) 10Filippo Giunchedi: "Cleaning up my queue, feel free to add me again as needed" [puppet] - 10https://gerrit.wikimedia.org/r/912872 (owner: 10Majavah) [08:05:12] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:05:37] (03CR) 10Filippo Giunchedi: "Cleaning up my queue, feel free to add me again as needed" [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [08:09:22] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [08:09:24] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host karapace1001.eqiad.wmnet [08:09:40] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [08:10:05] (03PS1) 10JMeybohm: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051685 
(https://phabricator.wikimedia.org/T362978) [08:10:47] (03CR) 10JMeybohm: "Differences in manifests are:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051685 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:11:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to db2136 - running 10.11 T365805', diff saved to https://phabricator.wikimedia.org/P65707 and previous config saved to /var/cache/conftool/dbconfig/20240703-081059-marostegui.json [08:11:03] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [08:11:25] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.12 refs T366957 [08:11:27] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [08:15:12] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:18:38] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host karapace1001.eqiad.wmnet [08:20:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [08:22:36] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host karapace1002.eqiad.wmnet [08:22:41] (03CR) 10Elukey: [C:03+1] "LGTM! I guess that a similar thing should be done for istio sidecars in ML-land, adding Tobias as FYI." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051685 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:23:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9948167 (10WMDECyn) hello @Dzahn , @AndyRussG WMDE email address is: andrew.green@extern.wikimedia.de in case this is still required. 
[08:23:43] (03CR) 10JMeybohm: [C:03+2] Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051685 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:24:18] (03Merged) 10jenkins-bot: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051685 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:31:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9948184 (10elukey) Reporting some thoughts from IRC: ` 10:48 Generic question about the future of puppet-merge, I'll write some... [08:31:46] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host karapace1002.eqiad.wmnet [08:35:44] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet [08:36:28] jouncebot: now and next [08:36:29] For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0800) [08:36:36] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: add public v4 tap_ip [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [08:36:48] I'm going ahead with a few mesh tracing patches for non-mw services btw [08:37:09] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:38:31] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [08:38:55] (03PS1) 10Jgiannelos: pcs: Connect to eventgate staging using ip for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 [08:39:01] !log filippo@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/shellbox-constraints: apply [08:39:06] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [08:39:37] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [08:39:59] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [08:40:11] (03PS2) 10Jgiannelos: pcs: Connect to eventgate staging using ipv4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 [08:40:13] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [08:40:13] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:40:22] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [08:40:29] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:40:35] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:40:37] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [08:40:47] (03PS1) 10JMeybohm: Add securityContext to opentelemetry pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051690 (https://phabricator.wikimedia.org/T362978) [08:40:48] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2008.wikimedia.org [08:40:49] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:40:51] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:40:57] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [08:41:20] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [08:41:23] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [08:41:27] !log filippo@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/shellbox-timeline: apply [08:41:28] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [08:41:54] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [08:41:58] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [08:42:13] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1001.eqiad.wmnet [08:42:38] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [08:42:39] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [08:42:54] how do I clean up after a broken mwscript-k8s command? [08:42:59] (03CR) 10Arturo Borrero Gonzalez: "consider having interface primary from facts instead of hiera. Can't think of a VM in toolforge with an interface different than interface" [puppet] - 10https://gerrit.wikimedia.org/r/1051444 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott) [08:43:04] there’s a broken pod in `kube_env mw-script eqiad` now [08:43:15] (`mwscript-cleanup --dry-run eqiad` prints no output…) [08:43:16] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [08:43:24] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [08:43:32] (03CR) 10Brouberol: [C:03+1] "Let's find out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051415 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [08:44:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2008.wikimedia.org - ayounsi@cumin1002" [08:44:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:44:26] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2008.wikimedia.org on all recursors [08:44:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2008.wikimedia.org on all recursors [08:44:42] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1002.eqiad.wmnet [08:44:43] (03PS4) 10Filippo Giunchedi: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto) [08:44:43] (03PS3) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) [08:44:43] (03PS2) 10Filippo Giunchedi: mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) [08:44:58] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [08:45:49] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948214 (10dcaro) To minimize the routers load I'm going to use a spread-out set of nodes for the tests and try agai... 
[08:45:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2008.wikimedia.org - ayounsi@cumin1002" [08:45:59] okay, if I add --debug I can see that mwscript-cleanup is skipping release r72z2aop because the job completed recently [08:46:47] (03CR) 10Filippo Giunchedi: [C:03+2] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:46:50] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2008.wikimedia.org with OS bookworm [08:47:38] Lucas_WMDE: yeah the script skips removing deployments if they're less than 5 minutes old [08:47:45] (03CR) 10Filippo Giunchedi: "LGTM! Adding Chris too as heads up" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051690 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:47:56] (I'm currently reading it) [08:48:16] alright, this worked: [08:48:16] RELEASE_NAME=r72z2aop helmfile --file /srv/deployment-charts/helmfile.d/services/mw-script/helmfile.yaml --environment eqiad --selector name=r72z2aop destroy [08:48:28] just cobbled together from what mwscript-cleanup would’ve done ^^ [08:48:36] (should I !log that?) 
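(Editor's sketch, not part of the channel log.) The exchange above describes two things: mwscript-cleanup skips releases whose job finished recently, and a single release can be removed by hand with a helmfile destroy call. The Python below is only an illustration of that flow under stated assumptions. The helper names `is_too_recent` and `destroy_command` are hypothetical, the 5-minute threshold is taken from the chat message rather than from the script itself, and the real mwscript-cleanup may well work differently. The helmfile path and flags are copied from the command quoted in the log.

```python
import shlex
from datetime import datetime, timedelta, timezone

# Path copied from the command quoted in the log.
HELMFILE = "/srv/deployment-charts/helmfile.d/services/mw-script/helmfile.yaml"

# Assumption: the threshold mentioned in chat ("5 minutes"); the real
# script's value and comparison are not shown in the log.
MIN_AGE = timedelta(minutes=5)


def is_too_recent(completed_at, now, min_age=MIN_AGE):
    """Return True if a job finished less than min_age ago (hypothetical helper)."""
    return now - completed_at < min_age


def destroy_command(release, environment):
    """Build the manual cleanup command for one release as a shell string.

    Hypothetical helper: it only assembles the string, it does not run it.
    """
    argv = [
        "helmfile",
        "--file", HELMFILE,
        "--environment", environment,
        "--selector", f"name={release}",
        "destroy",
    ]
    return f"RELEASE_NAME={shlex.quote(release)} " + shlex.join(argv)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    finished = now - timedelta(minutes=2)
    if is_too_recent(finished, now):
        print("cleanup would skip this release; manual command:")
    print(destroy_command("r72z2aop", "eqiad"))
```

Building the command as a string rather than executing it keeps the sketch safe to run anywhere; on a deployment host the printed line would still need to be reviewed and !logged by whoever runs it.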
[08:48:44] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:48:49] (03PS4) 10Filippo Giunchedi: wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) [08:49:02] Lucas_WMDE: if in doubt: !log [08:49:15] sure [08:49:26] !log RELEASE_NAME=r72z2aop helmfile --file /srv/deployment-charts/helmfile.d/services/mw-script/helmfile.yaml --environment eqiad --selector name=r72z2aop destroy # clean up broken mwscript-k8s run I did just to test something [08:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:30] (03CR) 10Filippo Giunchedi: [C:03+1] Add securityContext to opentelemetry pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051690 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:49:32] maybe that can prove to be helpful later ;) [08:49:35] Lucas_WMDE: You can yeah. 
Also drop a message to r.zl to ask why we're not cleaning up fail releases immediately [08:49:42] failed* [08:49:53] (03CR) 10JMeybohm: [C:03+2] Add securityContext to opentelemetry pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051690 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:51:35] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1002.eqiad.wmnet [08:51:40] claime: I’m filing a few tasks yeah [08:51:48] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet [08:53:02] !log deployed istio (adding securityContext) to wikikube clusters - T362978 [08:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:05] T362978: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978 [08:54:12] (03Merged) 10jenkins-bot: Add securityContext to opentelemetry pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051690 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:56:33] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948349 (10dcaro) using 12 spread nodes hits the discards again: {F56197512} and nothing popping up on the disks s... [08:57:11] (03PS1) 10Matthias Mullie: Handle campaigns where wikibase is not enabled [extensions/UploadWizard] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051696 (https://phabricator.wikimedia.org/T369085) [08:57:13] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[08:57:14] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [08:57:37] claime: created T369142 and T369143 if you’re interested [08:57:38] T369142: Show more useful information when mwscript-k8s fails to launch - https://phabricator.wikimedia.org/T369142 [08:57:38] T369143: Allow cleaning up specific mwscript-k8s runs - https://phabricator.wikimedia.org/T369143 [08:58:07] (03CR) 10Matthias Mullie: [C:03+2] Handle campaigns where wikibase is not enabled [extensions/UploadWizard] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051696 (https://phabricator.wikimedia.org/T369085) (owner: 10Matthias Mullie) [08:58:37] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet [08:58:49] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:59:25] Lucas_WMDE: Thanks for that [08:59:59] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [09:00:20] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [09:00:21] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [09:00:52] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [09:01:12] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:01:19] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:01:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/UploadWizard] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051696 (https://phabricator.wikimedia.org/T369085) (owner: 10Matthias Mullie) [09:02:04] !log brouberol@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet [09:02:41] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2008.wikimedia.org with reason: host reimage [09:04:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9948452 (10JMeybohm) I've deleted the node from the k8s API as a required istio update would not finish successfully because it was waiting... [09:06:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2008.wikimedia.org with reason: host reimage [09:07:34] (03CR) 10Cathal Mooney: [C:03+2] Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:08:19] (03Merged) 10jenkins-bot: Handle campaigns where wikibase is not enabled [extensions/UploadWizard] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051696 (https://phabricator.wikimedia.org/T369085) (owner: 10Matthias Mullie) [09:09:06] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet [09:11:47] 10SRE-swift-storage, 10CX-deployments, 10LPL Essential, 10MinT: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9948480 (10Pginer-WMF) [09:13:37] (03CR) 10Elukey: [C:03+1] "I agree 100%" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 
(https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [09:14:02] (03PS5) 10Filippo Giunchedi: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto) [09:14:03] (03PS1) 10Filippo Giunchedi: wikifeeds: lower tracing sample rate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051699 (https://phabricator.wikimedia.org/T320563) [09:15:09] gah [09:15:54] I accidentally +2'ed (now merged) a patch-to-be-backported later today [09:16:06] (03CR) 10Clément Goubert: [C:03+1] wikifeeds: lower tracing sample rate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051699 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:16:23] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] wikifeeds: lower tracing sample rate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051699 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:16:26] should we leave it merged (and deploy in couple of hrs), revert, or deploy now? 
[09:16:28] (03PS2) 10Filippo Giunchedi: wikifeeds: lower tracing sample rate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051699 (https://phabricator.wikimedia.org/T320563) [09:17:06] matthiasmullie: I’d lean towards “deploy now” if that’s okay with hashar and jeena [09:17:15] (03PS4) 10Anzx: mswikisource: create author and translation namespaces and add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) [09:17:20] please do yes [09:17:27] will do, thanks [09:17:40] and thanks Lucas_WMDE for pointing it out; wasn't aware I had merged that :D [09:17:48] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] wikifeeds: lower tracing sample rate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051699 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:17:48] np ^^ [09:17:53] hehe [09:17:54] I was just randomly looking at the deployment calendar and noticed it [09:18:09] (and apparently I happened to look at it like one or two minutes after the merge) [09:18:13] if it is already merged, I imagine it is quite quick to deploy it [09:18:15] (03CR) 10Elukey: [C:03+1] "One question though - should we merge this after removing it from the api-gateway in deployment-charts first?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [09:18:56] !log mlitn@deploy1002 Started scap sync-world: Backport for [[gerrit:1051696|Handle campaigns where wikibase is not enabled (T369085)]] [09:18:59] T369085: Cannot upload! 
– TypeError: Cannot read properties of undefined (reading 'dataValueType') - https://phabricator.wikimedia.org/T369085 [09:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to db2136 - running 10.11 T365805', diff saved to https://phabricator.wikimedia.org/P65709 and previous config saved to /var/cache/conftool/dbconfig/20240703-091956-marostegui.json [09:19:59] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [09:20:28] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [09:20:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2008.wikimedia.org with OS bookworm [09:20:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2008.wikimedia.org [09:20:42] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [09:20:43] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [09:20:55] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [09:21:29] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:1051696|Handle campaigns where wikibase is not enabled (T369085)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:26:13] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testvm2008 - ayounsi@cumin1002" [09:26:33] !log mlitn@deploy1002 mlitn: Continuing with sync [09:27:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testvm2008 - ayounsi@cumin1002" [09:28:27] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948601 (10dcaro) Created the data: ` dcaro@cumin1002:~$ sudo cumin -x cloudcephosd[1006,1016,1021].eqiad.wmnet,clou... 
[09:31:56] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:1051696|Handle campaigns where wikibase is not enabled (T369085)]] (duration: 12m 59s) [09:32:00] T369085: Cannot upload! – TypeError: Cannot read properties of undefined (reading 'dataValueType') - https://phabricator.wikimedia.org/T369085 [09:32:43] Lucas_WMDE & hashar - backport of my messed up merge is complete; thanks! [09:32:53] \o/ thanks for deploying it ^^ [09:33:32] haha; was the least I could do :D [09:37:29] 10SRE-swift-storage, 10CX-deployments, 10LPL Essential, 10MinT: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9948650 (10elukey) >>! In T335491#9925777, @santhosh wrote: > @elukey Thanks for these details. Currently in our code, models are downloaded [[... [09:41:48] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948660 (10dcaro) Compare the traffic generated when the cluster is rebalancing some data: {F56198741} :/ [09:46:38] matthiasmullie: congratulations! :) [09:48:50] 06SRE, 06Infrastructure-Foundations, 10netops: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358#9948690 (10cmooney) 05Open→03Resolved Gonna close this one as the design is finalised, see detail on wikitech here: https://wikitech.wikimedia.org/wik... 
[09:49:48] !log andrewtavis-wmde@deploy1002 Started deploy [airflow-dags/wmde@d773cac]: (no justification provided) [09:49:55] !log andrewtavis-wmde@deploy1002 Finished deploy [airflow-dags/wmde@d773cac]: (no justification provided) (duration: 00m 07s) [09:51:08] (03PS1) 10Gmodena: beta: eventbus: enable instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) [09:53:55] (03PS3) 10Jgiannelos: pcs: Connect to eventgate staging using cluster IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 (https://phabricator.wikimedia.org/T366819) [09:58:21] (03PS1) 10Clément Goubert: mw-on-k8s: Move php.envvars to mediawiki-common [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051711 (https://phabricator.wikimedia.org/T365265) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1000) [10:01:28] (03CR) 10Alexandros Kosiaris: [C:03+1] pcs: Connect to eventgate staging using cluster IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 (https://phabricator.wikimedia.org/T366819) (owner: 10Jgiannelos) [10:05:47] (03CR) 10Jgiannelos: [C:03+2] pcs: Connect to eventgate staging using cluster IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 (https://phabricator.wikimedia.org/T366819) (owner: 10Jgiannelos) [10:06:42] (03Merged) 10jenkins-bot: pcs: Connect to eventgate staging using cluster IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051688 (https://phabricator.wikimedia.org/T366819) (owner: 10Jgiannelos) [10:12:28] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9948751 (10dcaro) Ok, using 16 nodes, with 64 parallel operations each still does not trigger any issues on the driv... 
[10:16:47] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051716 [10:28:48] (03PS1) 10Jgiannelos: mobileapps: Bump staging to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051719 [10:28:56] (03CR) 10CI reject: [V:04-1] mobileapps: Bump staging to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051719 (owner: 10Jgiannelos) [10:29:01] (03PS2) 10Jgiannelos: mobileapps: Bump staging to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051719 [10:29:13] 06SRE, 10SRE-swift-storage, 10Thumbor, 06Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9948790 (10Midleading) Due to T266155, I have to keep refreshing the category page, about 5~10 times, until all 200 thumbnails are generated. Therefore some "c... [10:30:37] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Bump staging to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051719 (owner: 10Jgiannelos) [10:31:25] (03Merged) 10jenkins-bot: mobileapps: Bump staging to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051719 (owner: 10Jgiannelos) [10:32:06] (03PS1) 10Btullis: Add an-conf100[4-6] role and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1051720 (https://phabricator.wikimedia.org/T364429) [10:32:35] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:32:41] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:33:07] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:36:50] (03PS2) 10Clément Goubert: mw-on-k8s: Move php.envvars to mediawiki-common [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051711 (https://phabricator.wikimedia.org/T365265) [10:37:55] (03CR) 10Jgiannelos: [C:03+1] Change Linter log level to info [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1051487 (owner: 10Arlolra) [10:38:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [10:38:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [10:38:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T367856)', diff saved to https://phabricator.wikimedia.org/P65710 and previous config saved to /var/cache/conftool/dbconfig/20240703-103839-marostegui.json [10:38:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:41:53] (03CR) 10Btullis: [C:03+1] "Looks good to me." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051679 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [10:45:19] (03CR) 10Milimetric: [C:03+1] Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: 10David Martin) [10:52:39] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-on-k8s: Move php.envvars to mediawiki-common [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051711 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [10:54:53] 06SRE, 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454#9948905 (10Aklapper) 05Stalled→03Invalid Unfortunately closing this Phabricat... 
[10:55:31] (03PS3) 10Fabfur: benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) [10:58:46] (03PS4) 10Fabfur: benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) [10:59:37] (03CR) 10Fabfur: benthos:cache: encode problematic fields as b64url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [11:00:01] (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [11:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1100). [11:02:36] I moved this window to later today but I guess the bot put it back. Anyway, this window is free :). [11:03:50] jouncebot: now [11:03:50] For the next 0 hour(s) and 56 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1100) [11:03:53] jouncebot: refresh [11:03:53] I refreshed my knowledge about deployments. [11:03:55] jouncebot: now [11:03:55] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [11:04:04] I thought it was supposed to auto-refresh before each window, weird [11:04:22] maybe it was only done for the backport+config windows? 
idk [11:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T364069)', diff saved to https://phabricator.wikimedia.org/P65711 and previous config saved to /var/cache/conftool/dbconfig/20240703-110627-marostegui.json [11:06:31] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:08:47] ah, IIUC it’ll refresh the *contents* of each window just before notifying about it, but if the window was dropped in the meantime it won’t delete it [11:12:39] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Move php.envvars to mediawiki-common [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051711 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [11:14:13] (03Merged) 10jenkins-bot: mw-on-k8s: Move php.envvars to mediawiki-common [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051711 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [11:15:24] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:15:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:16:55] !log cgoubert@deploy1002 Started scap sync-world: mw-on-k8s: Move php.envvars to mediawiki-common - T365265 [11:16:58] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [11:18:32] (03PS1) 10Btullis: cephcsi: Grant elevated privileges to the driver-registrar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051732 (https://phabricator.wikimedia.org/T327259) [11:19:54] (03PS2) 10Btullis: cephcsi: Grant elevated privileges to the driver-registrar container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051732 (https://phabricator.wikimedia.org/T327259) [11:20:11] (03CR) 10Brouberol: [C:03+1] Add an-conf100[4-6] role and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1051720 (https://phabricator.wikimedia.org/T364429) (owner: 
10Btullis) [11:20:22] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9948995 (10Ladsgroup) The explain: ` *************************** 1. row ******... [11:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P65712 and previous config saved to /var/cache/conftool/dbconfig/20240703-112135-marostegui.json [11:21:45] !log cgoubert@deploy1002 Finished scap: mw-on-k8s: Move php.envvars to mediawiki-common - T365265 (duration: 05m 22s) [11:24:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P65713 and previous config saved to /var/cache/conftool/dbconfig/20240703-112452-ladsgroup.json [11:27:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [11:27:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [11:27:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T352010)', diff saved to https://phabricator.wikimedia.org/P65714 and previous config saved to /var/cache/conftool/dbconfig/20240703-112728-ladsgroup.json [11:27:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:27:45] (03CR) 10Vgutierrez: benthos:cache: encode problematic fields as b64url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [11:31:56] (03CR) 10Btullis: [C:03+2] OpenJDK: build JDK/JRE 17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051679 (https://phabricator.wikimedia.org/T363461) (owner: 
10Brouberol) [11:31:59] (03CR) 10Btullis: [V:03+2 C:03+2] OpenJDK: build JDK/JRE 17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051679 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:32:57] jouncebot: nowandnext [11:32:57] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [11:32:57] In 1 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1300) [11:33:04] (03PS2) 10VolkerE: Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) [11:33:12] (03CR) 10Ladsgroup: [C:03+2] Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: 10VolkerE) [11:33:56] (03Merged) 10jenkins-bot: Optimize static footer 'a Wikimedia project' icon further [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047521 (https://phabricator.wikimedia.org/T256190) (owner: 10VolkerE) [11:34:48] (03CR) 10Btullis: [C:03+2] Add an-conf100[4-6] role and partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1051720 (https://phabricator.wikimedia.org/T364429) (owner: 10Btullis) [11:35:35] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1047521|Optimize static footer 'a Wikimedia project' icon further (T256190)]] [11:35:38] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [11:36:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P65715 and previous config saved to /var/cache/conftool/dbconfig/20240703-113642-marostegui.json [11:38:15] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install 
an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949035 (10BTullis) a:05BTullis→03Jclark-ctr Hi @Jclark-ctr - apologies for the delay. I've updated the required files, so please feel free to reimag... [11:38:17] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949037 (10BTullis) [11:38:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) (owner: 10Anzx) [11:39:17] !log ladsgroup@deploy1002 volker-e, ladsgroup: Backport for [[gerrit:1047521|Optimize static footer 'a Wikimedia project' icon further (T256190)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:39:33] (03PS2) 10Anzx: kawikisource: create author namespace, add namespace aliases and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) [11:39:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P65716 and previous config saved to /var/cache/conftool/dbconfig/20240703-113958-ladsgroup.json [11:40:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) (owner: 10Anzx) [11:40:02] !log ladsgroup@deploy1002 volker-e, ladsgroup: Continuing with sync [11:42:57] (03PS1) 10Jcrespo: dbbackups: Set dbprov[12]00[12] to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) [11:44:41] (03CR) 10Jcrespo: "This requires a follow up patch 
to clean up ip grants from those servers and a careful deploy, but wanted at least to be aware of this." [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [11:45:03] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1047521|Optimize static footer 'a Wikimedia project' icon further (T256190)]] (duration: 09m 28s) [11:45:06] T256190: Update footer image links on all MediaWiki skins to be legible and accessible - https://phabricator.wikimedia.org/T256190 [11:45:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup) [11:46:32] (03Merged) 10jenkins-bot: rpc: Update function call in RunSingleJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup) [11:47:02] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1038785|rpc: Update function call in RunSingleJob (T363839)]] [11:47:04] (03PS2) 10Jcrespo: dbbackups: Set dbprov[12]00[12] to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) [11:47:05] T363839: Remove old/unused/internal methods in rdbms library from the public APIs - https://phabricator.wikimedia.org/T363839 [11:48:02] (03PS1) 10Effie Mouzeli: mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) [11:48:13] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9949051 (10phaultfinder) [11:48:16] (03PS1) 10Brouberol: OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) [11:48:54] (03CR) 10CI reject: [V:04-1] 
mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:49:56] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1038785|rpc: Update function call in RunSingleJob (T363839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:50:01] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [11:50:29] (03PS1) 10Kevin Bazira: ml-services: use MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051738 (https://phabricator.wikimedia.org/T368875) [11:50:41] (03CR) 10Btullis: OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:51:32] (03PS2) 10Brouberol: OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) [11:51:42] (03CR) 10Brouberol: OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:51:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T364069)', diff saved to https://phabricator.wikimedia.org/P65717 and previous config saved to /var/cache/conftool/dbconfig/20240703-115149-marostegui.json [11:51:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:51:53] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:52:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet 
with reason: Maintenance [11:52:10] (03PS2) 10Effie Mouzeli: mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) [11:52:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T364069)', diff saved to https://phabricator.wikimedia.org/P65718 and previous config saved to /var/cache/conftool/dbconfig/20240703-115211-marostegui.json [11:52:36] (03CR) 10Btullis: [C:03+2] OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:52:38] (03CR) 10Btullis: [V:03+2 C:03+2] OpenJRE 17: prevent the openjdk-jre-headless post-inst step from crashing [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051737 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:54:33] (03CR) 10Clément Goubert: [C:03+1] copy patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051449 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [11:54:48] (03CR) 10Clément Goubert: [C:03+1] mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [11:55:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P65719 and previous config saved to /var/cache/conftool/dbconfig/20240703-115504-ladsgroup.json [11:55:11] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038785|rpc: Update function call in RunSingleJob (T363839)]] (duration: 08m 08s) [11:55:13] T363839: Remove old/unused/internal methods in rdbms library from the public APIs - https://phabricator.wikimedia.org/T363839 [11:55:35] (03PS1) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) [11:56:04] (03CR) 10Clément Goubert: [C:03+1] Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [11:56:32] (03CR) 10CI reject: [V:04-1] mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [11:59:59] (03PS5) 10Fabfur: benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) [12:01:12] (03CR) 10Fabfur: benthos:cache: encode problematic fields as b64url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [12:06:03] (03PS2) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) [12:06:51] (03PS1) 10Jcrespo: dbbackups: Disable es read-only backups and reenable rw ones [puppet] - 10https://gerrit.wikimedia.org/r/1051744 (https://phabricator.wikimedia.org/T363812) [12:07:22] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1051744 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [12:07:55] (03PS3) 10Effie Mouzeli: mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) [12:08:21] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [12:08:40] (03CR) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [12:08:48] 
(03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:09:17] (03CR) 10Volans: "Will this leave processes on the hosts running that might fail and are not managed anymore by Puppet?" [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [12:09:41] (03Merged) 10jenkins-bot: mw-mcrouter: bump number of proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051736 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:09:56] (03PS7) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [12:10:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P65720 and previous config saved to /var/cache/conftool/dbconfig/20240703-121009-ladsgroup.json [12:11:29] (03PS3) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) [12:11:46] (03PS4) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) [12:11:54] PROBLEM - Check whether ferm is active by checking the default input chain on mw1454 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:12:31] jouncebot: now [12:12:31] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [12:12:32] (03PS1) 10Brouberol: OpenJDK17: sync both the openjdk-{jdk,jre} debian version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051745 
(https://phabricator.wikimedia.org/T363461) [12:12:35] jouncebot: next [12:12:36] In 0 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1300) [12:13:06] (03CR) 10Btullis: [C:03+2] OpenJDK17: sync both the openjdk-{jdk,jre} debian version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051745 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [12:13:08] (03CR) 10Btullis: [V:03+2 C:03+2] OpenJDK17: sync both the openjdk-{jdk,jre} debian version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051745 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [12:13:38] (03CR) 10Jcrespo: [C:03+1] "I will make sure it doesn't, by disabling them manually + deleting the existing passwords. If it was important, I would disable and delete" [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [12:14:14] (03CR) 10Vgutierrez: [C:04-1] "it looks like tests benthos tests need to be updated as well" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [12:15:03] (03CR) 10Jcrespo: [C:03+1] "Note also dbprovs are for the most part "temporary storage" (glorified disk space), other than the backups they shouldn't have any importa" [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [12:15:48] (03CR) 10Jcrespo: [C:03+2] dbbackups: Disable es read-only backups and reenable rw ones [puppet] - 10https://gerrit.wikimedia.org/r/1051744 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [12:16:12] (03PS1) 10Ayounsi: Add public1-virtual-codfw PTR [dns] - 10https://gerrit.wikimedia.org/r/1051746 (https://phabricator.wikimedia.org/T362330) [12:17:11] (03CR) 10CI reject: [V:04-1] Add public1-virtual-codfw PTR [dns] - 10https://gerrit.wikimedia.org/r/1051746 
(https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [12:17:23] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [12:20:56] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:28] (03CR) 10Volans: "Sure, but if CI was able to run 3.12 it would fail. Same on any local checkout. Hence my reluctance to commit to the repo something that i" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [12:22:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:23:31] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9949203 (10dcaro) I'll try adding the `sdc` drive to `cloudcephosd1034`, that should force it to get populated with... [12:23:42] (03CR) 10Filippo Giunchedi: "LGTM, though hard to judge accurately until the heartbeat metrics are in prometheus" [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [12:23:48] effie: ^ [12:23:53] expected? 
[12:24:08] yes, it is still rolling out [12:24:14] ack [12:24:30] (03CR) 10Volans: Spicerack: fix Netbox 4 breaking changes (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:24:32] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9949207 (10dcaro) Current error counters (before adding `sdc`): ` root@cloudcephosd1034:~# for i in /dev/sd?; do ech... [12:24:34] RECOVERY - Check whether ferm is active by checking the default input chain on mw1454 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:25:47] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9949210 (10Ladsgroup) The prefetch has been done now so these are causing issu... 
[12:27:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:27:56] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:29:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:29:56] (03PS1) 10Lucas Werkmeister (WMDE): PropertyValueExpertsModule: Turn on enableModuleContentVersion() [extensions/Wikibase] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051748 (https://phabricator.wikimedia.org/T369155) [12:30:19] sigh, 199 out of 214 new pods have been updated, it shouldnt complain [12:30:49] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:30:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Wikibase] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051748 (https://phabricator.wikimedia.org/T369155) (owner: 10Lucas Werkmeister (WMDE)) [12:31:35] (03PS1) 10Kosta Harlan: GlobalRenameQueue: Fix issues with wiki ID and row query [extensions/CentralAuth] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051749 (https://phabricator.wikimedia.org/T369147) [12:31:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" 
[extensions/CentralAuth] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051749 (https://phabricator.wikimedia.org/T369147) (owner: 10Kosta Harlan) [12:32:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [12:33:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037783 (owner: 10DCausse) [12:33:54] (03PS2) 10Ayounsi: Add public1-virtual-codfw PTR [dns] - 10https://gerrit.wikimedia.org/r/1051746 (https://phabricator.wikimedia.org/T362330) [12:34:20] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [12:34:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:36:02] jouncebot: nowandnext [12:36:02] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [12:36:02] In 0 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1300) [12:36:36] hm that's a busy patch window, I'll wait [12:36:48] yeah, pretty full [12:37:57] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:38:35] (03CR) 10Dzahn: [C:04-1] admin: add approvers to group analytics-research-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [12:39:47] !log jiji@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mw-mcrouter: apply [12:39:49] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:42:43] (03Abandoned) 10Wargo: Set logo and favicon for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo) [12:42:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] mswikisource: create author and translation namespaces and add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) (owner: 10Anzx) [12:43:36] (03PS1) 10Brouberol: OpenJDK17: fix typos in the changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051753 (https://phabricator.wikimedia.org/T363461) [12:44:49] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] kawikisource: create author namespace, add namespace aliases and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) (owner: 10Anzx) [12:45:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9949268 (10Dzahn) @Miriam Would you be ok with becoming a formal "group approver" for the group "analytics-research-admins"? That would mean we'd ask... 
[12:47:26] (03CR) 10Jgiannelos: [C:04-1] Remove page html endpoints from changeprop (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [12:47:37] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [12:47:44] (03CR) 10Dzahn: [C:04-1] "The email address has been provided now: andrew.green@extern.wikimedia.de" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [12:48:32] (03PS5) 10Anzx: mswikisource: create author and translation namespaces and add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) [12:48:46] (03PS3) 10Anzx: kawikisource: create author namespace, add namespace aliases and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) [12:49:00] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:49:46] (03PS4) 10Dzahn: admin: Extend access for AndyRussG [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [12:50:02] (03CR) 10Dzahn: admin: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [12:50:37] (03CR) 10Dzahn: [C:03+1] admin: Extend access for AndyRussG [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [12:51:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9949324 (10Dzahn) 
@WMDECyn Yes, it was still needed. Thank you! I updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047473 [12:51:08] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9949326 (10Dzahn) 05Stalled→03In progress [12:51:36] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "kicking off gate-and-submit ahead of backport window" [extensions/Wikibase] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051748 (https://phabricator.wikimedia.org/T369155) (owner: 10Lucas Werkmeister (WMDE)) [12:51:44] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9949330 (10Dzahn) a:05AndyRussG→03None [12:52:00] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [12:59:20] (03CR) 10Klausman: "I'd prefer going with Go 1.22 (I added the image for that a week or two back). Go has strong compatibility guarantees: code building and w" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1300). [13:00:05] anzx, Lucas_WMDE, kostajh, and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:11] o/ [13:00:11] hi [13:00:13] I can deploy! [13:00:14] o/ [13:00:17] thx! 
[13:00:17] thanks Lucas_WMDE [13:00:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) (owner: 10Anzx) [13:00:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) (owner: 10Anzx) [13:00:38] let’s start with anzx [13:00:42] Lucas_WMDE: urbanecm will verify the CentralAuth patch [13:00:45] (03CR) 10CDanis: [C:03+2] CHANGELOG for configuration 1.8.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051412 (https://phabricator.wikimedia.org/T362310) (owner: 10CDanis) [13:00:47] (03CR) 10CDanis: [C:03+2] copy patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051449 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [13:00:50] ack [13:00:51] * urbanecm waves [13:00:54] (03CR) 10CDanis: [C:03+2] mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [13:01:03] (03CR) 10CDanis: [C:04-2] DO NOT SUBMIT, testing mesh change against mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051466 (owner: 10CDanis) [13:01:08] dcausse: hi! could you maybe take a quick look at https://phabricator.wikimedia.org/T369149? 
[13:01:14] (03Merged) 10jenkins-bot: mswikisource: create author and translation namespaces and add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051503 (https://phabricator.wikimedia.org/T369047) (owner: 10Anzx) [13:01:15] mainly in case I shouldn’t run the maintenance script ^^ [13:01:18] (03Merged) 10jenkins-bot: kawikisource: create author namespace, add namespace aliases and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051733 (https://phabricator.wikimedia.org/T363243) (owner: 10Anzx) [13:01:26] Lucas_WMDE: looking [13:01:29] thanks [13:01:39] (03Merged) 10jenkins-bot: CHANGELOG for configuration 1.8.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051412 (https://phabricator.wikimedia.org/T362310) (owner: 10CDanis) [13:01:47] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1051503|mswikisource: create author and translation namespaces and add namespace aliases (T369047)]], [[gerrit:1051733|kawikisource: create author namespace, add namespace aliases and sitename (T363243)]] [13:01:51] T369047: Configure the namespaces on Malay Wikisource - https://phabricator.wikimedia.org/T369047 [13:01:52] T363243: Post-creation work for kawikisource - https://phabricator.wikimedia.org/T363243 [13:02:06] (that’s a property with a new datatype that we enabled yesterday, so it’s possible some code is erroring out about it… but I didn’t see anything in logstash) [13:02:07] (03Merged) 10jenkins-bot: copy patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051449 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [13:02:08] (03Merged) 10jenkins-bot: mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [13:02:39] Lucas_WMDE: ForceSearchIndex might not work... I'll dig into it [13:02:50] alright, then I’ll skip that for now [13:02:51] thanks! 
[13:04:31] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1051503|mswikisource: create author and translation namespaces and add namespace aliases (T369047)]], [[gerrit:1051733|kawikisource: create author namespace, add namespace aliases and sitename (T363243)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:04:35] (03CR) 10Elukey: "Sure, feel free to amend the patch with go 1.22, I'd be more conservative but if you feel strong about it I am +1." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [13:04:36] Lucas_WMDE: checking [13:04:38] thanks! [13:05:42] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9949479 (10elukey) Sorry for the late reply, I had a chat with Willy and with the I/F team, this is our proposal: * We create a new POSIX group for `dcops` that gets deployed to all productio... [13:06:23] hm, I just realized that the namespace (alias) Perbualan Wikisource now completely vanished from mswikisource [13:06:24] Lucas_WMDE: look good to me [13:06:27] (it has Perbualan Wikisumber instead) [13:06:29] is that okay? 
[13:06:51] (03PS6) 10Fabfur: benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) [13:07:11] (03Merged) 10jenkins-bot: PropertyValueExpertsModule: Turn on enableModuleContentVersion() [extensions/Wikibase] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051748 (https://phabricator.wikimedia.org/T369155) (owner: 10Lucas Werkmeister (WMDE)) [13:07:17] eh, it’s a pretty new wiki, probably okay [13:07:19] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, anzx: Continuing with sync [13:07:56] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9949511 (10Ladsgroup) >>! In T368136#9935249, @fnegri wrote: > Can we somehow remove the data that is currently filtered at the view layer, and inste... [13:08:51] Lucas_WMDE: Ok , I thought it would get fixed through namespacesdupes.php or I can add perbulan wikisumber as namespace alias [13:09:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#9949514 (10elukey) To keep archives happy: T360356#9949479 We filed a proposal to basically implement sudo_pair "socially", as starting experiment. While at it... 
[13:09:15] I’ll run namespaceDupes once the deployment is done [13:12:27] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1051503|mswikisource: create author and translation namespaces and add namespace aliases (T369047)]], [[gerrit:1051733|kawikisource: create author namespace, add namespace aliases and sitename (T363243)]] (duration: 10m 39s) [13:12:31] T369047: Configure the namespaces on Malay Wikisource - https://phabricator.wikimedia.org/T369047 [13:12:31] T363243: Post-creation work for kawikisource - https://phabricator.wikimedia.org/T363243 [13:12:49] Lucas_WMDE: there were no pages left on that namespace, probably no need for adding alias [13:13:05] “TypeError: 'NoneType' object is not iterable” meh [13:13:10] let’s try non-k8s mwscript then [13:13:25] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [13:14:25] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes mswikisource --fix # T369047; 6 pages to fix, 6 were resolvable; 76 links to fix, 73 were resolvable, 3 were deleted [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:46] (03CR) 10Bking: "Thanks for the tip, I didn't know that was an option. Will start checking this out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [13:15:50] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [13:15:50] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes kawikisource --fix # T363243; 34 pages to fix, 34 were resolvable; 774 links to fix, 774 were resolvable, 0 were deleted [13:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [13:16:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:06] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1051748|PropertyValueExpertsModule: Turn on enableModuleContentVersion() (T369155)]] [13:17:08] T369155: New data type not available for all users after being enabled - https://phabricator.wikimedia.org/T369155 [13:17:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365994 - depool db1191,db1196,db1197', diff saved to https://phabricator.wikimedia.org/P65721 and previous config saved to /var/cache/conftool/dbconfig/20240703-131715-arnaudb.json [13:17:18] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [13:17:21] Lucas_WMDE: thanks for deployment [13:17:26] np :) [13:17:30] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2002.mgmt.codfw.wmnet on all recursors [13:17:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2002.mgmt.codfw.wmnet on all recursors [13:17:56] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 49.3.193.10.in-addr.arpa. 
on all recursors [13:17:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 49.3.193.10.in-addr.arpa. on all recursors [13:18:00] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: encode problematic fields as b64url [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [13:18:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db[1191,1196-1197].eqiad.wmnet with reason: T365994 [13:18:20] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9949542 (10xcollazo) >>! In T361835#9947951, @SGupta-WMF wrote: > @xcollazo Th... [13:18:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[1191,1196-1197].eqiad.wmnet with reason: T365994 [13:19:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051749 (https://phabricator.wikimedia.org/T369147) (owner: 10Kosta Harlan) [13:19:43] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host parsoidtest1001 [13:19:43] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1051748|PropertyValueExpertsModule: Turn on enableModuleContentVersion() (T369155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:19:49] testing [13:20:22] seems to work as far as I can tell [13:20:24] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:20:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host parsoidtest1001 [13:22:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host parsoidtest1001.eqiad.wmnet with OS 
bullseye [13:22:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye [13:23:23] (03PS1) 10Kgraessle: Remove QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) [13:23:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949564 (10Jclark-ctr) [13:23:54] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9949559 (10elukey) Proposed plan: * In T368023 we move the private repo to puppetserver1001, and we add a git pre-commit hook config to t... 
[13:24:10] (03PS6) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) [13:24:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949567 (10Jclark-ctr) a:03Jclark-ctr [13:24:21] (03PS7) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) [13:24:23] (03CR) 10CI reject: [V:04-1] role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:25:26] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1051748|PropertyValueExpertsModule: Turn on enableModuleContentVersion() (T369155)]] (duration: 08m 20s) [13:25:29] T369155: New data type not available for all users after being enabled - https://phabricator.wikimedia.org/T369155 [13:25:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051749 (https://phabricator.wikimedia.org/T369147) (owner: 10Kosta Harlan) [13:26:21] (03PS2) 10Vgutierrez: varnish: Fix text/02-frontend-headers.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) [13:26:21] (03CR) 10Vgutierrez: "tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) (owner: 10Vgutierrez) [13:26:26] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9949572 (10xcollazo) >>! In T368098#9949210, @Ladsgroup wrote: > The prefetch... 
[13:27:33] (03PS8) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) [13:28:06] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9949574 (10xcollazo) Ok I am going to postpone re-enabling the Commons RDF/JSO... [13:28:14] (03Merged) 10jenkins-bot: GlobalRenameQueue: Fix issues with wiki ID and row query [extensions/CentralAuth] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051749 (https://phabricator.wikimedia.org/T369147) (owner: 10Kosta Harlan) [13:28:44] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1051749|GlobalRenameQueue: Fix issues with wiki ID and row query (T369147)]] [13:28:47] T369147: GlobalRenameQueue shows internal error Wikimedia\Assert\PreconditionException when opening requests - https://phabricator.wikimedia.org/T369147 [13:29:09] (03PS9) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) [13:29:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1004.eqiad.wmnet with OS bookworm [13:29:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1005.eqiad.wmnet with OS bookworm [13:29:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1006.eqiad.wmnet with OS bookworm [13:29:51] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949577 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm [13:29:52] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: 
Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm [13:29:58] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm [13:30:04] jouncebot: now [13:30:04] For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1300) [13:30:09] (03CR) 10CDanis: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) (owner: 10Vgutierrez) [13:30:37] /14 [13:30:39] err :) [13:31:14] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:31:21] !log lucaswerkmeister-wmde@deploy1002 kharlan, lucaswerkmeister-wmde: Backport for [[gerrit:1051749|GlobalRenameQueue: Fix issues with wiki ID and row query (T369147)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:31:42] urbanecm: can you test the CentralAuth change? 
[13:32:27] Lucas_WMDE: sure [13:32:55] (03PS6) 10DCausse: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 [13:32:59] Lucas_WMDE: it works [13:33:00] (03PS1) 10Superpes15: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) [13:33:02] !log lucaswerkmeister-wmde@deploy1002 kharlan, lucaswerkmeister-wmde: Continuing with sync [13:33:05] great, thanks! [13:33:08] (03Abandoned) 10Jforrester: wikifunctions: Reduce helm deploy timeout from 600s default to 120s [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [13:33:14] (03PS2) 10DCausse: CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037783 [13:33:35] (03CR) 10CI reject: [V:04-1] [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) (owner: 10Superpes15) [13:34:02] thank you urbanecm [13:34:07] np [13:34:30] (03CR) 10Btullis: [C:03+2] OpenJDK17: fix typos in the changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051753 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [13:34:31] (03CR) 10Btullis: [V:03+2 C:03+2] OpenJDK17: fix typos in the changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051753 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [13:34:46] (03PS1) 10Kgraessle: Remove QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) [13:35:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [13:35:52] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9949603 (10Eevans) >>! In T368766#9935779, @VRiley-WMF wrote: > Not that I'm aware of. I used the same cable for everything. @Eevans would you happen to know if the IP address changed on this? @VRiley-WMF when the ma... [13:36:04] (03PS2) 10Superpes15: [sysop_plwiki] Change the logo/icon and the favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051757 (https://phabricator.wikimedia.org/T368712) [13:36:42] (03PS2) 10Kgraessle: Remove QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) [13:38:13] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1051749|GlobalRenameQueue: Fix issues with wiki ID and row query (T369147)]] (duration: 09m 28s) [13:38:16] T369147: GlobalRenameQueue shows internal error Wikimedia\Assert\PreconditionException when opening requests - https://phabricator.wikimedia.org/T369147 [13:38:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [13:38:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037783 (owner: 10DCausse) [13:39:14] (03Merged) 10jenkins-bot: noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037587 (owner: 10DCausse) [13:39:16] (03Merged) 10jenkins-bot: CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037783 (owner: 10DCausse) [13:39:47] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for 
[[gerrit:1037587|noc: fail with a 404 when the selected wiki is nonexistent]], [[gerrit:1037783|CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup]] [13:40:00] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9949616 (10colewhite) 05Open→03Resolved a:03Joe [13:42:29] !log lucaswerkmeister-wmde@deploy1002 dcausse, lucaswerkmeister-wmde: Backport for [[gerrit:1037587|noc: fail with a 404 when the selected wiki is nonexistent]], [[gerrit:1037783|CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:37] dcausse: can the wgCirrusSearchIndexFieldsToCleanup change be tested on mwdebug? [13:42:43] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949632 (10JMeybohm) [13:42:52] (the other change already seems to be live, I can see the new message at https://noc.wikimedia.org/wiki.php?wiki=foobar) [13:43:05] Lucas_WMDE: the noc change seems good, the wgCirrusSearchIndexFieldsToCleanup change can't be tested [13:43:11] yes [13:43:12] !log lucaswerkmeister-wmde@deploy1002 dcausse, lucaswerkmeister-wmde: Continuing with sync [13:43:14] alright :) [13:43:18] thanks! [13:43:18] thanks! 
:) [13:44:04] !log draining wikikube-worker1007.eqiad.wmnet wikikube-worker1021.eqiad.wmnet kubernetes1060.eqiad.wmnet for T365994 [13:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:08] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [13:48:26] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1037587|noc: fail with a 404 when the selected wiki is nonexistent]], [[gerrit:1037783|CirrusSearch: add wgCirrusSearchIndexFieldsToCleanup]] (duration: 08m 38s) [13:48:49] !log UTC afternoon backport+config window done [13:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:50] * Lucas_WMDE done [13:49:00] cc effie you pinged jouncebot earlier ^^ [13:49:03] thanks for the deploy! [13:49:21] I’m shocked, six patches deployed and we finished *before* time :D [13:49:29] two of them backports even [13:49:45] kicking bare-metal (almost fully) out of the deploy really sped it up [13:50:35] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9949639 (10fnegri) 05Open→03Declined > I highly doubt it'd be possible honestly for everything. I tend to agree, I underestimated the amount... 
[13:51:22] (03PS1) 10Ssingh: dnsbox and Wikimedia DNS: revert usage of LE's alternate chain [puppet] - 10https://gerrit.wikimedia.org/r/1051759 [13:52:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3160/co" [puppet] - 10https://gerrit.wikimedia.org/r/1051759 (owner: 10Ssingh) [13:52:47] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-e2-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e2-eqiad [13:53:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-e2-eqiad.mgmt with reason: prep JunOS upgrade lsw1-e2-eqiad [13:53:17] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949642 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c8dbb89d-640c-4078-bc10-bbbe9c30f3ef) set by cmooney... 
[13:55:44] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 1:20:00 on kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet with reason: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 [13:55:54] RECOVERY - Disk space on restbase2023 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops [13:56:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:20:00 on kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet with reason: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 [13:56:12] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949650 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=753739a5-e1fb-44b6-9174-f7b3a8c4b73b) set by jayme@c... 
[13:56:45] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1091*,elastic1092* for T348977 - bking@cumin2002 [13:56:48] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [13:56:48] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1091*,elastic1092* for T348977 - bking@cumin2002 [13:57:43] !log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet) [13:58:15] (03CR) 10CDanis: [C:03+1] "I'd like it a little better if running puppet-merge on a puppetserver gave you a helpful error message instead of just command not found, " [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:58:30] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-e2-eqiad,lsw1-e2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e2-eqiad [13:58:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-e2-eqiad,lsw1-e2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-e2-eqiad [13:58:48] (03CR) 10Herron: [C:03+1] admin: add new ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051421 (owner: 10Cwhite) [13:58:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949656 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=185956f6-b0e6-4a89-9e32-6a8223f5678e) set by cmooney... 
[13:59:13] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1091-1092].eqiad.wmnet,wdqs[1018,1020].eqiad.wmnet with reason: T348977 [13:59:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1091-1092].eqiad.wmnet,wdqs[1018,1020].eqiad.wmnet with reason: T348977 [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1400) [14:00:05] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949655 (10JMeybohm) !log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad.wmnet|wikik... [14:00:54] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 22 hosts with reason: JunOS upgrade lsw1-e2-eqiad [14:01:03] (03PS5) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) [14:01:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 22 hosts with reason: JunOS upgrade lsw1-e2-eqiad [14:01:25] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949662 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=11036a9f-0b48-4b07-9e63-571b4f67c201) set by cmooney... [14:03:05] (03CR) 10Ssingh: [C:03+1] "Nice catch and fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) (owner: 10Vgutierrez) [14:03:25] (03CR) 10Jgiannelos: Remove page html endpoints from changeprop (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [14:04:09] (03PS3) 10Vgutierrez: varnish: Fix text/02-frontend-headers.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) [14:04:33] !log rebooting lsw1-e2-eqiad to install updated JunOS version T365994 [14:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:36] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [14:04:37] (03CR) 10Ssingh: [C:03+1] varnish: Fix text/02-frontend-headers.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) (owner: 10Vgutierrez) [14:06:20] (03CR) 10Kamila Součková: [C:03+2] benthos/mw_accesslog_metrics: Add buffer [puppet] - 10https://gerrit.wikimedia.org/r/1051415 (https://phabricator.wikimedia.org/T367076) (owner: 10Kamila Součková) [14:07:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parsoidtest1001.eqiad.wmnet with OS bullseye [14:07:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed... [14:08:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[14:09:04] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [14:09:07] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [14:09:33] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [14:10:01] Lucas_WMDE: than you [14:10:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host parsoidtest1001.eqiad.wmnet with OS bullseye [14:10:34] FIRING: [3x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:58] PROBLEM - MariaDB Replica IO: s1 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1196.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1196.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:11:00] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[14:11:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye [14:11:56] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix text/02-frontend-headers.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1051750 (https://phabricator.wikimedia.org/T369162) (owner: 10Vgutierrez) [14:14:16] FIRING: [4x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:48] (03PS1) 10Effie Mouzeli: mw-mcrouter: bump eqiad proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051762 (https://phabricator.wikimedia.org/T346690) [14:15:18] ah I forgot about that [14:15:23] fixing, sorry for the alert [14:16:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on db1154.eqiad.wmnet with reason: T365994 [14:16:53] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [14:17:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on db1154.eqiad.wmnet with reason: T365994 [14:17:36] PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:17:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:45:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet with reason: T365994 [14:17:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet with reason: T365994 
[14:18:03] FIRING: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:18:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367856)', diff saved to https://phabricator.wikimedia.org/P65722 and previous config saved to /var/cache/conftool/dbconfig/20240703-141826-marostegui.json [14:18:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:18:39] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [14:21:39] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet) [14:22:09] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949750 (10cmooney) Switch is back up, all looks good at first glance from the network side. 
[14:22:19] (03PS1) 10Effie Mouzeli: mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) [14:22:33] (03PS2) 10Ssingh: dnsbox and Wikimedia DNS: revert usage of LE's alternate chain [puppet] - 10https://gerrit.wikimedia.org/r/1051759 [14:22:46] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: bump eqiad proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051762 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:22:58] RECOVERY - MariaDB Replica IO: s1 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:23:31] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3162/co" [puppet] - 10https://gerrit.wikimedia.org/r/1051759 (owner: 10Ssingh) [14:23:35] (03Merged) 10jenkins-bot: mw-mcrouter: bump eqiad proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051762 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:24:08] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1051759 (owner: 10Ssingh) [14:24:16] RESOLVED: [4x] ProbeDown: Service aqs1020-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:36] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:25:11] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949772 (10ABran-WMF) db hosts as well, repooling [14:25:17] !log 
klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [14:25:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65723 and previous config saved to /var/cache/conftool/dbconfig/20240703-142541-arnaudb.json [14:25:44] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [14:25:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65724 and previous config saved to /var/cache/conftool/dbconfig/20240703-142553-arnaudb.json [14:26:04] (03CR) 10Filippo Giunchedi: [C:03+1] admin: add new ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051421 (owner: 10Cwhite) [14:26:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65725 and previous config saved to /var/cache/conftool/dbconfig/20240703-142614-arnaudb.json [14:26:31] (03CR) 10Ssingh: [V:03+1 C:03+2] dnsbox and Wikimedia DNS: revert usage of LE's alternate chain [puppet] - 10https://gerrit.wikimedia.org/r/1051759 (owner: 10Ssingh) [14:26:54] (03PS2) 10Effie Mouzeli: mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) [14:27:03] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:27:11] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:28:03] RESOLVED: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:30:04] PROBLEM - Uncommitted DNS changes in Netbox 
on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:30:59] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet) [14:32:24] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet [14:32:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet [14:32:31] (03CR) 10Vgutierrez: "please could you rebase this change on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051750 and adjust the 02-frontend-head" [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [14:32:45] !log sudo cumin "A:wikidough" "run-puppet-agent" [14:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:02] !log sudo cumin "A:dnsbox" "run-puppet-agent" [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:13] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949834 (10JMeybohm) >>! In T365994#9949655, @JMeybohm wrote: > !log jayme@cumin1002 conftool action : set/pooled=no; selector:... 
[14:33:20] (03CR) 10Volans: [C:03+1] "Ack, thx for the context" [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [14:33:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P65726 and previous config saved to /var/cache/conftool/dbconfig/20240703-143334-marostegui.json [14:33:46] (03CR) 10Clément Goubert: "LGTM, extremely minor nit that can be fixed when going global." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:33:54] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:34:35] (03CR) 10Clément Goubert: [C:03+1] mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:34:58] (03CR) 10Volans: [C:03+2] admin: Extend access for AndyRussG [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [14:35:15] !log [correction of previous A:dnsbox run] sudo cumin -b1 -s60 "A:dnsbox" "run-puppet-agent" [14:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:56] (03PS3) 10Effie Mouzeli: mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) [14:36:09] (03CR) 10Effie Mouzeli: mw-parsoid: enable mcrouter ds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:36:30] (03CR) 10Andrea Denisse: [C:03+1] admin: add new ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051421 (owner: 10Cwhite) 
[14:37:17] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:38:07] (03Merged) 10jenkins-bot: mw-parsoid: enable mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:38:54] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:38:57] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:39:05] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9949849 (10Volans) 05In progress→03Resolved a:03Volans This should be all done, resolving. Please feel free to re-open it if you encounter... [14:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:21] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:39:49] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:40:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-conf1004.eqiad.wmnet with OS bookworm [14:40:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-conf1005.eqiad.wmnet with OS bookworm [14:40:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-conf1006.eqiad.wmnet with OS bookworm [14:40:40] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949855 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm execute... [14:40:44] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm execute... [14:40:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65727 and previous config saved to /var/cache/conftool/dbconfig/20240703-144046-arnaudb.json [14:40:48] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9949857 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm execute... 
[14:40:50] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [14:41:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65728 and previous config saved to /var/cache/conftool/dbconfig/20240703-144059-arnaudb.json [14:41:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65729 and previous config saved to /var/cache/conftool/dbconfig/20240703-144119-arnaudb.json [14:41:38] PROBLEM - Druid overlord on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:41:38] PROBLEM - Druid coordinator on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:45:37] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:46:24] (03PS1) 10Alexandros Kosiaris: deployment::rsync: Remove long absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1051772 (https://phabricator.wikimedia.org/T364417) [14:46:50] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:48:15] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9949876 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [14:48:38] RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:48:38] RECOVERY - Druid coordinator on druid1009 is OK: PROCS OK: 1 process with command name java, args 
org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P65730 and previous config saved to /var/cache/conftool/dbconfig/20240703-144841-marostegui.json [14:49:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949915 (10Eevans) >>! In T365994#9949750, @cmooney wrote: > Switch is back up, all looks good at first glance from the network... [14:50:27] 06SRE, 10MW-on-K8s, 10Observability-Logging, 06serviceops, 13Patch-For-Review: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9949907 (10kamila) 05Open→03Resolved a:03kamila Increasing batch size slightly improved the situation, very... [14:50:41] (03PS3) 10Kgraessle: Remove QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) [14:50:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:57] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9949920 (10AndyRussG) Yaayy thanks so much @Volans, @Dzahn, @kamila, @WMDECyn! 
[14:51:20] !log start rebooting A:cp-drmrs (upload|text in parallel) for T366555 [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:28] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs [14:51:31] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs [14:53:30] (03PS1) 10Arnaudb: mariadb: recording rules to monitor [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) [14:54:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parsoidtest1001.eqiad.wmnet with OS bullseye [14:55:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9949929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed... [14:55:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65731 and previous config saved to /var/cache/conftool/dbconfig/20240703-145552-arnaudb.json [14:55:55] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [14:56:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65732 and previous config saved to /var/cache/conftool/dbconfig/20240703-145604-arnaudb.json [14:56:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65733 and previous config saved to /var/cache/conftool/dbconfig/20240703-145625-arnaudb.json [14:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:20] (03PS5) 10Arnaudb: mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) [15:00:50] (03CR) 10Arnaudb: mariadb: monitoring memory pressure (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [15:00:55] (03CR) 10CI reject: [V:04-1] mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [15:00:56] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:01:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T364069)', diff saved to https://phabricator.wikimedia.org/P65734 and previous config saved to /var/cache/conftool/dbconfig/20240703-150121-marostegui.json [15:01:25] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:02:53] (03PS1) 10Alexandros Kosiaris: WIP deployment::rsync: Temporarily disable stunnel [puppet] - 10https://gerrit.wikimedia.org/r/1051782 (https://phabricator.wikimedia.org/T364417) [15:03:27] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: use MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051738 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [15:03:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T367856)', diff saved to https://phabricator.wikimedia.org/P65735 and previous config saved 
to /var/cache/conftool/dbconfig/20240703-150348-marostegui.json [15:03:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [15:03:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:04:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [15:04:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T367856)', diff saved to https://phabricator.wikimedia.org/P65736 and previous config saved to /var/cache/conftool/dbconfig/20240703-150411-marostegui.json [15:04:55] (03PS6) 10Arnaudb: mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) [15:05:54] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:06:06] (03CR) 10CI reject: [V:04-1] mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [15:06:16] (03CR) 10Jsn.sherman: [C:03+1] "thanks for the decommissioning patch!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [15:08:05] (03PS3) 10Jcrespo: dbbackups: Set dbprov[12]00[12] to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) [15:09:13] (03PS1) 10Ahmon Dancy: InitialiseSettings-dev: Disable wmgUseEntitySchema,enable wgShowExceptionDetails [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051785 [15:09:24] (03CR) 10Cwhite: [C:03+2] "Attested the accuracy of this in our team meeting." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051421 (owner: 10Cwhite) [15:10:19] (03CR) 10Ahmon Dancy: [C:03+2] InitialiseSettings-dev: Disable wmgUseEntitySchema,enable wgShowExceptionDetails [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051785 (owner: 10Ahmon Dancy) [15:10:32] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [15:10:33] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#9949993 (10elukey) Reporting a summary of various chats with Moritz: * On `puppetmasterXXXX` (Puppet 5 infra), the authoritat... [15:10:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65737 and previous config saved to /var/cache/conftool/dbconfig/20240703-151057-arnaudb.json [15:11:01] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [15:11:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65738 and previous config saved to /var/cache/conftool/dbconfig/20240703-151110-arnaudb.json [15:11:23] (03Merged) 10jenkins-bot: InitialiseSettings-dev: Disable wmgUseEntitySchema,enable wgShowExceptionDetails [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051785 (owner: 10Ahmon Dancy) [15:11:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65739 and previous config saved to /var/cache/conftool/dbconfig/20240703-151131-arnaudb.json [15:11:46] (03PS1) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) [15:12:35] (03CR) 10CI reject: 
[V:04-1] datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [15:13:23] (03PS7) 10Arnaudb: mariadb: monitoring memory pressure [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) [15:13:41] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 208.80.152.129 v6 - ayounsi@cumin1002" [15:14:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 208.80.152.129 v6 - ayounsi@cumin1002" [15:14:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:14:47] (03CR) 10Jcrespo: [C:03+2] dbbackups: Set dbprov[12]00[12] to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1051735 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [15:15:25] (03PS2) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) [15:16:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P65740 and previous config saved to /var/cache/conftool/dbconfig/20240703-151628-marostegui.json [15:18:46] (03PS3) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) [15:20:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:20:53] (03PS3) 10Arnaudb: mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 
(https://phabricator.wikimedia.org/T367281) [15:21:12] (03CR) 10Arnaudb: mariadb: add monitoring on io pressure for mariadb hosts (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [15:26:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65741 and previous config saved to /var/cache/conftool/dbconfig/20240703-152603-arnaudb.json [15:26:07] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [15:26:14] (03CR) 10Ayounsi: "> comments where we use the [0] approach would be awesome :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:26:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65742 and previous config saved to /var/cache/conftool/dbconfig/20240703-152616-arnaudb.json [15:26:21] (03PS4) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [15:26:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65743 and previous config saved to /var/cache/conftool/dbconfig/20240703-152636-arnaudb.json [15:27:06] (03PS5) 10Ayounsi: Spicerack: fix Netbox 4 breaking changes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) [15:27:06] (03PS2) 10Ayounsi: Tox: add Python3.12 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 [15:27:20] (03CR) 10Kevin Bazira: [C:03+2] ml-services: use MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051738 
(https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [15:27:32] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9950063 (10dcaro) The osd in now in, no changes in the error counter: ` root@cloudcephosd1034:~# for i in /dev/sd?;... [15:27:49] (03CR) 10Ayounsi: "Sounds good! Ok to wait. I re-ordered them so we can merge the Netbox 4 breaking changes." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [15:28:14] (03Merged) 10jenkins-bot: ml-services: use MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051738 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [15:28:31] (03PS8) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [15:29:12] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [15:29:20] (03CR) 10Jforrester: [C:03+1] "Cherry-picked without incident to Beta Cluster's puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/1051499 (https://phabricator.wikimedia.org/T361384) (owner: 10Andrew Bogott) [15:29:45] (03CR) 10Ayounsi: "It's in a lot of places so I worry it would make the code more difficult to read." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:31:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P65744 and previous config saved to /var/cache/conftool/dbconfig/20240703-153136-marostegui.json [15:31:53] !log restart haproxy on dns1005 [15:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:47] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:33:05] (03CR) 10Dzahn: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [15:35:32] (03PS2) 10Elukey: knative: upgrade all images to Bullseye and Golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) [15:35:32] (03PS2) 10Elukey: wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) [15:36:10] (03CR) 10Elukey: "Found some time and I tried the upgrade, so far it seems building:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [15:41:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65746 and previous config saved to /var/cache/conftool/dbconfig/20240703-154109-arnaudb.json [15:41:13] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994 [15:41:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65747 and previous config saved to 
/var/cache/conftool/dbconfig/20240703-154121-arnaudb.json [15:41:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65748 and previous config saved to /var/cache/conftool/dbconfig/20240703-154142-arnaudb.json [15:41:43] (03PS3) 10Elukey: knative: upgrade all images to Bullseye and Golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) [15:41:43] (03PS3) 10Elukey: wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) [15:46:08] (03PS4) 10Elukey: knative: upgrade all images to Bookworm and Golang 1.22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) [15:46:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T364069)', diff saved to https://phabricator.wikimedia.org/P65749 and previous config saved to /var/cache/conftool/dbconfig/20240703-154643-marostegui.json [15:46:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:46:48] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:47:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:47:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T364069)', diff saved to https://phabricator.wikimedia.org/P65750 and previous config saved to /var/cache/conftool/dbconfig/20240703-154716-marostegui.json [15:47:19] (03CR) 10Aaron Schulz: Set "s3" as the default section name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 (owner: 10Aaron Schulz) [15:47:28] (03PS5) 10Aaron Schulz: Set "s3" as the 
default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909763 [15:49:09] (03CR) 10Klausman: [C:03+1] "Thank you for taking care of this!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [15:56:20] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9950212 (10ayounsi) 05Open→03Resolved All is done here. [15:57:08] (03PS1) 10JHathaway: add mx-in{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1051797 (https://phabricator.wikimedia.org/T367517) [15:58:15] (03PS1) 10DCausse: cirrus: re-enable search updates on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051798 [15:58:51] (03PS1) 10Kevin Bazira: ml-services: assign MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051799 (https://phabricator.wikimedia.org/T368875) [16:00:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 5.61% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:00:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 112665312 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:01:04] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: assign MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051799 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [16:01:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 71912 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:02:09] (03CR) 10Kevin Bazira: [C:03+2] ml-services: assign MAX_FEATURE_VALS in articlequality [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1051799 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [16:02:58] (03Merged) 10jenkins-bot: ml-services: assign MAX_FEATURE_VALS in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051799 (https://phabricator.wikimedia.org/T368875) (owner: 10Kevin Bazira) [16:04:12] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:05:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 6.735% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:05:16] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/1051746 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [16:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 1%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240703-160521-root.json [16:06:23] (03CR) 10Ayounsi: [C:03+2] Add public1-virtual-codfw PTR [dns] - 10https://gerrit.wikimedia.org/r/1051746 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [16:06:46] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:06:56] (03PS2) 10DCausse: cirrus: re-enable search updates on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051798 [16:07:37] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9950297 (10Sharvaniharan) Hi @Volans Thank you for getting the patch going. Confirming that I have read the user responsibilities doc and will adhere t... 
[16:09:44] ^ analytics https://librenms.wikimedia.org/bill/bill_id=28/ (cc topranks ) [16:09:56] ah, I was about to ask [16:10:41] do we know who ran the job? and if they can stop it? [16:11:47] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:12:52] XioNoX: has this been happening more often lately? [16:14:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1103717840 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:15:01] (03PS1) 10Kevin Bazira: Revert "ml-services: assign MAX_FEATURE_VALS in articlequality" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051800 [16:16:12] (03PS13) 10Jdlrobson: [July 4th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) [16:16:44] last time we had something similar it was tricky to find exactly who [16:16:45] https://phabricator.wikimedia.org/T364893#9800673 [16:17:16] jouncebot: nowandnext [16:17:16] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [16:17:16] In 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1700) [16:17:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:17:58] looks like all the an-worker nodes are pulling a lot of data [16:18:37] (03PS2) 10Kevin Bazira: Revert "ml-services: assign MAX_FEATURE_VALS in articlequality" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051800 [16:18:50] ok [16:18:58] I'm checking if it is 
Presto again [16:19:38] quick check of that dashboard doesn't look like it was the exact same this time [16:19:42] https://grafana-rw.wikimedia.org/d/000000006/presto-server-utilization-btullis?orgId=1&refresh=30s&viewPanel=27&from=1720012683206&to=1720023483207 [16:19:44] (03CR) 10Kevin Bazira: [C:03+2] Revert "ml-services: assign MAX_FEATURE_VALS in articlequality" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051800 (owner: 10Kevin Bazira) [16:19:46] yeah, not nearly the same magnitude [16:20:14] (03Merged) 10jenkins-bot: Revert "ml-services: assign MAX_FEATURE_VALS in articlequality" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051800 (owner: 10Kevin Bazira) [16:20:23] topranks: https://grafana-rw.wikimedia.org/d/ZvSPbGOnz/hadoop-server-utilization-btullis?orgId=1&from=1720017829797&to=1720023514494 [16:20:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65751 and previous config saved to /var/cache/conftool/dbconfig/20240703-162032-root.json [16:20:49] so I'm guessing it's HDFS + a query on Hive/Yarn (/Spark?) [16:22:23] (03PS2) 10Andrew Bogott: deployment-prep mcrouter: replace old memc servers with new ones [puppet] - 10https://gerrit.wikimedia.org/r/1051499 (https://phabricator.wikimedia.org/T361384) [16:22:23] (03PS1) 10Andrew Bogott: environment: add wikimediacloud.org to no_proxy domains [puppet] - 10https://gerrit.wikimedia.org/r/1051802 [16:22:35] (03CR) 10Vgutierrez: [C:03+1] "just out of curiosity, what are we considering here "low traffic"?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1047191 (owner: 10BCornwall) [16:23:12] I'm trying to grok https://yarn.wikimedia.org/cluster/scheduler [16:24:46] (03PS1) 10JHathaway: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051803 (https://phabricator.wikimedia.org/T365395) [16:26:09] (03CR) 10Tjones: [C:03+1] "looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051798 (owner: 10DCausse) [16:26:35] network rx picked up around 15:58 - one job started ~10min before that (https://yarn.wikimedia.org/cluster/app/application_1719935448343_10585) [16:27:19] although starttime on that page differs from the scheduler page [16:27:53] nvm, wrong link [16:27:58] https://yarn.wikimedia.org/cluster/app/application_1719935448343_13378 [16:28:12] (03PS2) 10JHathaway: add mx-in{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1051797 (https://phabricator.wikimedia.org/T367517) [16:28:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051803 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:32:04] (03PS2) 10JHathaway: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051803 (https://phabricator.wikimedia.org/T365395) [16:32:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051803 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:34:13] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep mcrouter: replace old memc servers with new ones [puppet] - 10https://gerrit.wikimedia.org/r/1051499 (https://phabricator.wikimedia.org/T361384) (owner: 10Andrew Bogott) [16:35:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65752 and previous config saved to /var/cache/conftool/dbconfig/20240703-163538-root.json [16:38:54] PROBLEM - Postgres Replication 
Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1236769896 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:41:54] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 168840 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:42:23] (03CR) 10JHathaway: [C:03+2] add mx-in{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1051797 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [16:44:20] !log adding inbound email servers mx-in{1001,2001} to our MX record [16:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:50] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-presto1004.eqiad.wmnet with reason: Cold booting to investigate RAM issue [16:47:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-presto1004.eqiad.wmnet with reason: Cold booting to investigate RAM issue [16:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:48:32] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy gemma2-27b-it on ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051806 (https://phabricator.wikimedia.org/T369055) [16:50:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65754 and previous config saved to /var/cache/conftool/dbconfig/20240703-165044-root.json [16:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki 
mw-api-ext at eqiad: 20.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:54:28] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9950626 (10WDoranWMF) [16:56:45] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9950632 (10WDoranWMF) [17:00:03] (03CR) 10Ottomata: [C:03+1] beta: eventbus: enable instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1700) [17:02:09] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9950686 (10xcollazo) I played with the offending SQL statements from T368098#9... 
[17:02:17] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-06-17-221517 to 2024-07-03-155425 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051807 (https://phabricator.wikimedia.org/T364413) [17:02:34] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-06-11-161031 to 2024-07-03-153821 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051808 (https://phabricator.wikimedia.org/T364413) [17:03:49] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [17:05:20] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-06-17-221517 to 2024-07-03-155425 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051807 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [17:05:47] 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9950703 (10bcampbell) I see the new MX records in Google Workspace Admin now @jhathaway. 
{F56203753} [17:05:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65755 and previous config saved to /var/cache/conftool/dbconfig/20240703-170549-root.json [17:06:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-06-17-221517 to 2024-07-03-155425 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051807 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [17:07:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9950724 (10Volans) a:03ATsay-WMF [17:07:32] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:08:08] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:09:08] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:10:19] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:10:23] 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9950740 (10wiki_willy) Thanks so much @elukey for putting this proposal together, and for the chat during office hours today. I like the entire idea, and will run it by the rest of the team d... 
[17:10:38] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [17:11:47] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:13:16] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-06-11-161031 to 2024-07-03-153821 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051808 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [17:14:16] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-06-11-161031 to 2024-07-03-153821 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051808 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [17:14:46] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9950763 (10cmooney) 05Open→03Resolved [17:15:19] jouncebot: nowandnext [17:15:19] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1700) [17:15:20] In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1800) [17:15:38] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:15:42] (03CR) 10CDanis: [C:03+2] Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [17:17:16] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:17:23] (03Merged) 10jenkins-bot: Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [17:17:53] !log jforrester@deploy1002 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [17:19:46] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:19:50] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [17:20:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65756 and previous config saved to /var/cache/conftool/dbconfig/20240703-172055-root.json [17:22:06] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:22:45] (03PS1) 10Dreamrimmer: Remove "Create a book" link from sidebar on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) [17:23:11] (03PS1) 10CDanis: actually bump mediawiki chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051810 [17:23:23] (03CR) 10CDanis: [C:03+2] actually bump mediawiki chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051810 (owner: 10CDanis) [17:25:12] (03Merged) 10jenkins-bot: actually bump mediawiki chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051810 (owner: 10CDanis) [17:25:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051809 (https://phabricator.wikimedia.org/T368900) (owner: 10Dreamrimmer) [17:26:38] (03PS2) 10Dreamrimmer: [Wikitech] Remove namespace 666 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254) [17:27:31] (03PS1) 10Jforrester: wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) [17:27:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 04 UTC 
afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254) (owner: 10Dreamrimmer) [17:28:25] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:28:53] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:29:54] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:30:15] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:30:18] (03PS21) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [17:31:30] (03CR) 10Gergő Tisza: "DOne." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [17:31:54] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:33:13] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:33:14] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:34:26] (03CR) 10Andrew Bogott: [C:03+2] environment: add wikimediacloud.org to no_proxy domains [puppet] - 10https://gerrit.wikimedia.org/r/1051802 (owner: 10Andrew Bogott) [17:34:27] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:34:28] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:34:50] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:34:51] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:35:12] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:35:13] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [17:35:36] !log 
cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [17:35:37] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [17:35:57] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [17:36:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65758 and previous config saved to /var/cache/conftool/dbconfig/20240703-173601-root.json [17:36:41] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:37:54] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:37:55] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:37:58] (03PS1) 10Pppery: WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538) [17:38:38] (03CR) 10CI reject: [V:04-1] WIP: Add wmf-config changes for mos: interwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051814 (https://phabricator.wikimedia.org/T363538) (owner: 10Pppery) [17:40:11] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:40:11] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:41:30] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:41:31] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:41:33] (03PS8) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [17:41:46] 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9950976 (10CDanis) 05In progress→03Resolved a:03CDanis Boldly closing this because we've resolved all of {T353464} and the two tasks for 10G NICs 
T366204 T366205 [17:42:02] (03CR) 10Gergő Tisza: "Done." [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:43:06] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:43:25] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:44:38] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:44:39] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:44:44] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:45:30] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [17:45:33] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [17:46:08] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:48:09] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:49:30] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:49:31] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:50:46] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:53:18] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 304364432 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:54:20] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 
56520 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:00:05] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T1800) [18:02:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1091-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:04:12] ^^ Elastic alert should clear shortly, see unban cmd a few lines up [18:10:36] (03PS61) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [18:11:26] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 29524856 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:11:35] (03CR) 10Bking: "Yeah, this seems like the proper path forward. I've actually already started work on this, I just wanted to leave that for a different CR." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [18:11:40] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9951125 (10cmooney) So one thing I noticed is that we are not getting the stats for LAG/ae interfaces with the current setup, nor routed... 
[18:12:26] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2448 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:14:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:17:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1433152560 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:17:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1091-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:21:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 11848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:22:39] RESOLVED: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1091-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:25:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:17] (03PS1) 10CDanis: otelcol: update hardcoded k8s master IPs for the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051820 (https://phabricator.wikimedia.org/T365855) [18:28:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52339 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:10] cdanis: did someone already give you the "last famous words" punchline on this already or not :) [18:29:32] sukhe: lol [18:29:41] every time I put last/temporary/fix somewhere, someone messages me to tell me that it won't be the case [18:29:48] well it *looks* trivial to do it the right way soon, even with an external chart [18:31:36] (03CR) 10CDanis: [C:03+2] otelcol: update hardcoded k8s master IPs for the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051820 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [18:34:33] (03Merged) 10jenkins-bot: otelcol: update hardcoded k8s master IPs for the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051820 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis) [18:34:54] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:35:05] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:36:48] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:36:56] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:39:33] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: 10David Martin) [18:40:38] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 568625416 and 61 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:41:38] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 20624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:54:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [18:54:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo... [18:55:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T364069)', diff saved to https://phabricator.wikimedia.org/P65759 and previous config saved to /var/cache/conftool/dbconfig/20240703-185511-marostegui.json [18:55:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:59:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:08:57] !log deploying airflow dags [19:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P65760 and previous config saved to /var/cache/conftool/dbconfig/20240703-191019-marostegui.json [19:11:54] !log ebysans@deploy1002 Started 
deploy [airflow-dags/analytics@d773cac]: (no justification provided) [19:12:27] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@d773cac]: (no justification provided) (duration: 00m 33s) [19:16:16] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2002.codfw.wmnet with OS bookworm [19:19:03] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [19:19:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo... [19:24:22] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bookworm [19:24:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951473 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookwo... [19:25:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [19:25:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951474 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo... 
[19:25:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P65761 and previous config saved to /var/cache/conftool/dbconfig/20240703-192526-marostegui.json [19:30:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:30:21] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:38:21] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9951496 (10KFrancis) I went ahead and processed an NDA here. It's just better to have our bases covered. I'll confirm when it's complete. [19:40:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T364069)', diff saved to https://phabricator.wikimedia.org/P65765 and previous config saved to /var/cache/conftool/dbconfig/20240703-194033-marostegui.json [19:40:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [19:40:37] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:40:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1195.eqiad.wmnet with reason: Maintenance [19:40:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65766 and previous config saved to /var/cache/conftool/dbconfig/20240703-194055-marostegui.json [19:49:47] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2002.codfw.wmnet with OS bookworm [19:54:05] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - 
https://phabricator.wikimedia.org/T368098#9951531 (10xcollazo) In {T29112} they modified the code to `ORDER BY page_id A... [19:54:24] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bookworm [19:55:05] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:55:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951547 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo... [19:56:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T2000). nyaa~ [20:00:05] katherine_g: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] here! [20:00:38] hi katherine_g - i can deploy for you unless you can self-deploy? [20:00:54] I cannot self deploy yet, so that would be great! [20:01:04] alrighty - let's go! 
[20:01:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:02:34] (03Merged) 10jenkins-bot: Remove QuickSurvey for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051756 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:03:07] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1051756|Remove QuickSurvey for Automoderator patroller workstream survey (T362969)]] [20:03:10] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:04:41] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [20:04:43] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [20:05:06] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [20:05:48] !log cjming@deploy1002 kgraessle, cjming: Backport for [[gerrit:1051756|Remove QuickSurvey for Automoderator patroller workstream survey (T362969)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:05:51] katherine_g: up on test servers if you want to check - lmk if/when to sync [20:06:10] looks good to sync! [20:06:17] yay! 
[20:06:20] !log cjming@deploy1002 kgraessle, cjming: Continuing with sync [20:09:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:10:13] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2002.codfw.wmnet with OS bookworm [20:11:29] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1051756|Remove QuickSurvey for Automoderator patroller workstream survey (T362969)]] (duration: 08m 22s) [20:11:32] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:11:41] katherine_g: should be live! [20:11:55] ok, thanks! [20:12:05] yw [20:13:07] ok - closing window bec i have to wrap up stuff for the upcoming long weekend - if someone needs something deployed in the next 45 mins, please ping me here or on slack and i can hop on [20:13:52] !log end of UTC late backport window [20:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:31] (03PS1) 10Andrew Bogott: wmcs-image-create: update with g4 flavors [puppet] - 10https://gerrit.wikimedia.org/r/1051836 [20:47:01] (03PS1) 10RLazarus: systemd: Expand Systemd::Timer::Interval pattern [puppet] - 10https://gerrit.wikimedia.org/r/1051839 [20:47:23] (03CR) 10RLazarus: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051839 (owner: 10RLazarus) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T2100) [21:04:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:09:37] (03PS2) 10RLazarus: 
systemd: Expand Systemd::Timer::Interval pattern [puppet] - 10https://gerrit.wikimedia.org/r/1051839 [21:10:26] (03CR) 10RLazarus: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051839 (owner: 10RLazarus) [21:12:30] (03PS3) 10Bking: wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) [21:13:32] (03CR) 10RLazarus: "As discussed!" [puppet] - 10https://gerrit.wikimedia.org/r/1051839 (owner: 10RLazarus) [21:14:21] (03PS2) 10RLazarus: deployment_server: Add a daily systemd timer for mwscript_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) [21:16:04] (03CR) 10RLazarus: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:16:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [21:18:09] (03PS3) 10RLazarus: deployment_server: Add a daily systemd timer for mwscript_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) [21:18:20] (03CR) 10RLazarus: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:20:12] (03PS4) 10Bking: wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) [21:22:15] (03CR) 10JHathaway: [C:03+2] Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051803 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:24:03] (03PS4) 10RLazarus: deployment_server: Add a daily systemd timer for mwscript_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) [21:24:14] (03PS5) 10Bking: wdqs: detune blackbox checks [puppet] 
- 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) [21:24:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:25:43] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9951919 (10jhathaway) [21:25:54] (03CR) 10RLazarus: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:29:25] (03CR) 10Ryan Kemper: [C:03+1] wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [21:29:27] (03CR) 10Ryan Kemper: [C:03+2] wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [21:33:16] (03CR) 10RLazarus: "Damn, nice digging! As discussed I addressed this at the regex, mostly out of indignation." [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:34:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:35:21] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [21:36:25] (03CR) 10Scott French: [C:03+1] "Nice! Thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/1051839 (owner: 10RLazarus) [21:39:12] (03CR) 10Scott French: "Thanks for updating the regex!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:40:33] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [21:40:49] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [21:42:28] (03Abandoned) 10Dzahn: gerrit: remove NRPE process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1032526 (owner: 10Dzahn) [21:43:18] PROBLEM - Dell PowerEdge RAID Controller on db2161 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [21:43:20] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db2161 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T369229 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [21:43:24] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229 (10ops-monitoring-bot) 03NEW [21:49:27] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:50:08] (03PS2) 10Andrew Bogott: wmcs-image-create: update with g4 flavors [puppet] - 10https://gerrit.wikimedia.org/r/1051836 [21:50:08] (03PS1) 10Andrew Bogott: wmcs-image-create: clear image id in base image [puppet] - 10https://gerrit.wikimedia.org/r/1051845 (https://phabricator.wikimedia.org/T351507) [21:51:18] (03CR) 10Andrew Bogott: [C:03+2] wmcs-image-create: update with g4 flavors [puppet] - 10https://gerrit.wikimedia.org/r/1051836 (owner: 10Andrew Bogott) [21:51:30] (03CR) 10Andrew Bogott: [C:03+2] wmcs-image-create: clear image id in base image [puppet] - 10https://gerrit.wikimedia.org/r/1051845 (https://phabricator.wikimedia.org/T351507) (owner: 10Andrew Bogott) [21:55:32] (03PS1) 10Dzahn: 
puppetmaster: change git sender email address to git@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1051846 [21:56:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host parsoidtest1001.eqiad.wmnet with OS bullseye [21:57:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye [22:08:31] (03CR) 10RLazarus: [C:03+2] systemd: Expand Systemd::Timer::Interval pattern [puppet] - 10https://gerrit.wikimedia.org/r/1051839 (owner: 10RLazarus) [22:08:40] (03CR) 10RLazarus: [C:03+2] deployment_server: Add a daily systemd timer for mwscript_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1051489 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [22:20:10] (03PS1) 10RLazarus: deployment_server: Run mwscript-cleanup as mwdeploy, not www-data [puppet] - 10https://gerrit.wikimedia.org/r/1051848 [22:21:25] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:30] (03PS5) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) [22:36:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parsoidtest1001.eqiad.wmnet with OS bullseye [22:36:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952219 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye 
executed... [22:36:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T352010)', diff saved to https://phabricator.wikimedia.org/P65768 and previous config saved to /var/cache/conftool/dbconfig/20240703-223632-ladsgroup.json [22:36:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:37:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65769 and previous config saved to /var/cache/conftool/dbconfig/20240703-223659-marostegui.json [22:37:03] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:37:41] 06SRE, 06Infrastructure-Foundations, 10netops: Should we add links between our spine switches aggregating each row of two? - https://phabricator.wikimedia.org/T369238 (10cmooney) 03NEW p:05Triage→03Low [22:38:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952225 (10Jclark-ctr) @Papaul if you get a chance can you look at this one? 
[22:47:28] (03CR) 10Scott French: [C:03+1] deployment_server: Run mwscript-cleanup as mwdeploy, not www-data [puppet] - 10https://gerrit.wikimedia.org/r/1051848 (owner: 10RLazarus) [22:51:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P65770 and previous config saved to /var/cache/conftool/dbconfig/20240703-225139-ladsgroup.json [22:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P65771 and previous config saved to /var/cache/conftool/dbconfig/20240703-225206-marostegui.json [22:53:50] PROBLEM - Check unit status of mwscript-cleanup on deploy1002 is CRITICAL: CRITICAL: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T2300). 
[23:00:31] PROBLEM - Check unit status of mwscript-cleanup on deploy1003 is CRITICAL: CRITICAL: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:03:17] PROBLEM - Check unit status of mwscript-cleanup on deploy2002 is CRITICAL: CRITICAL: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:06:45] ^ mwscript-cleanup is me, working on it [23:06:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P65772 and previous config saved to /var/cache/conftool/dbconfig/20240703-230646-ladsgroup.json [23:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P65773 and previous config saved to /var/cache/conftool/dbconfig/20240703-230713-marostegui.json [23:08:00] (03CR) 10RLazarus: [C:03+2] deployment_server: Run mwscript-cleanup as mwdeploy, not www-data [puppet] - 10https://gerrit.wikimedia.org/r/1051848 (owner: 10RLazarus) [23:08:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952366 (10Dzahn) We can see in reimage-extended.log that the reimage fails but it's not immediately clear why. ` 2024-07-03 22:36:13,115 jclark 2636322 [ERRO... 
[23:09:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:16:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T352010)', diff saved to https://phabricator.wikimedia.org/P65774 and previous config saved to /var/cache/conftool/dbconfig/20240703-232154-ladsgroup.json [23:21:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:22:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T364069)', diff saved to https://phabricator.wikimedia.org/P65775 and previous config saved to /var/cache/conftool/dbconfig/20240703-232221-marostegui.json [23:22:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:22:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [23:22:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [23:22:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:22:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:23:03] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Depooling db1196 (T364069)', diff saved to https://phabricator.wikimedia.org/P65776 and previous config saved to /var/cache/conftool/dbconfig/20240703-232302-marostegui.json [23:23:17] RECOVERY - Check unit status of mwscript-cleanup on deploy2002 is OK: OK: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:23:51] RECOVERY - Check unit status of mwscript-cleanup on deploy1002 is OK: OK: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:24:04] (03PS1) 10Dwisehaupt: crm: Stop civicrm callouts to the internet for version checks [puppet] - 10https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) [23:24:38] (03CR) 10Dwisehaupt: "This is the ET change we worked through at the offsite." [puppet] - 10https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:26:25] RESOLVED: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:30:10] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486 (owner: 10TrainBranchBot) [23:30:31] RECOVERY - Check unit status of mwscript-cleanup on deploy1003 is OK: OK: Status of the systemd unit mwscript-cleanup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:03] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate 
Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9952430 (10Dzahn) [23:34:07] (03PS1) 10Dzahn: Revert "Phabricator: Add safe.directory directive" [puppet] - 10https://gerrit.wikimedia.org/r/1051852 [23:34:38] (03PS1) 10Catrope: Graph extension: Add tracking for data sources used in tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 [23:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051854 [23:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051854 (owner: 10TrainBranchBot) [23:40:11] (03CR) 10Dzahn: [C:03+2] Revert "Phabricator: Add safe.directory directive" [puppet] - 10https://gerrit.wikimedia.org/r/1051852 (owner: 10Dzahn) [23:41:20] (03PS1) 10RLazarus: deployment_server: Handle None container_statuses in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1051855 (https://phabricator.wikimedia.org/T369175) [23:46:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9952433 (10Papaul) @Jclark-ctr @Dzahn this is what i have on the console [ (1*installer) 2 shell 3 shell 4- log ][ Jul 03 23:44 ]... 
[23:47:39] !log removing 11 files for legal compliance [23:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:32] (03CR) 10Scott French: [C:03+1] deployment_server: Handle None container_statuses in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1051855 (https://phabricator.wikimedia.org/T369175) (owner: 10RLazarus) [23:56:54] (03CR) 10RLazarus: [C:03+2] deployment_server: Handle None container_statuses in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1051855 (https://phabricator.wikimedia.org/T369175) (owner: 10RLazarus) [23:59:43] (03PS1) 10Dzahn: installserver: add parsoidtest1001 to partman [puppet] - 10https://gerrit.wikimedia.org/r/1051856 (https://phabricator.wikimedia.org/T363399)