[00:04:05] PROBLEM - SSH on puppetserver1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65438 and previous config saved to /var/cache/conftool/dbconfig/20240626-000534-marostegui.json [00:07:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1049644 (owner: 10TrainBranchBot) [00:09:17] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:12:57] RECOVERY - SSH on puppetserver1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:18:43] (03PS1) 10Scott French: auto_schema: use dbctl config diff return status [software] - 10https://gerrit.wikimedia.org/r/1049648 [00:19:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9924471 (10Papaul) [00:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T367856)', diff saved to https://phabricator.wikimedia.org/P65439 and previous config saved to 
/var/cache/conftool/dbconfig/20240626-002041-marostegui.json [00:20:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [00:20:51] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:20:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [00:21:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T367856)', diff saved to https://phabricator.wikimedia.org/P65440 and previous config saved to /var/cache/conftool/dbconfig/20240626-002103-marostegui.json [00:23:48] (03PS2) 10Scott French: auto_schema: use dbctl config diff return status [software] - 10https://gerrit.wikimedia.org/r/1049648 [00:43:26] (03CR) 10Scott French: "Context: A change in conftool briefly added some log-spam on stderr (appeared in 3.0.0, fixed in 3.0.1), which surprisingly stalled schema" [software] - 10https://gerrit.wikimedia.org/r/1049648 (owner: 10Scott French) [00:44:53] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:46:57] RECOVERY - Disk space on mw1446 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops [00:59:11] FIRING: Temperature: Temp issue on wdqs2025:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2025 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [00:59:59] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:04:11] RESOLVED: Temperature: Temp issue on wdqs2025:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs2025 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [01:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:13:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#9924644 (10Jgreen) [01:13:06] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#9924645 (10Jgreen) [01:13:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#9924646 (10Jgreen) [01:13:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#9924648 (10Jgreen) [01:13:15] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frand200[12] - https://phabricator.wikimedia.org/T367804#9924647 (10Jgreen) [01:13:21] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#9924649 (10Jgreen) [01:32:51] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:48:21] (03PS4) 10NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) 
[01:51:26] (03PS5) 10NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) [02:54:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T364069)', diff saved to https://phabricator.wikimedia.org/P65441 and previous config saved to /var/cache/conftool/dbconfig/20240626-025412-marostegui.json [02:54:17] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:05:25] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P65442 and previous config saved to /var/cache/conftool/dbconfig/20240626-030919-marostegui.json [03:24:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P65443 and previous config saved to /var/cache/conftool/dbconfig/20240626-032426-marostegui.json [03:39:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T364069)', diff saved to https://phabricator.wikimedia.org/P65444 and previous config saved to /var/cache/conftool/dbconfig/20240626-033933-marostegui.json [03:39:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:39:39] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:39:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:39:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 
(T364069)', diff saved to https://phabricator.wikimedia.org/P65445 and previous config saved to /var/cache/conftool/dbconfig/20240626-033955-marostegui.json [04:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:39] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [04:08:43] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [04:35:09] (03CR) 10Marostegui: [C:03+1] wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1049551 (https://phabricator.wikimedia.org/T368401) (owner: 10Gerrit maintenance bot) [04:35:39] (03CR) 10Marostegui: [C:03+1] mariadb: disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049555 (https://phabricator.wikimedia.org/T368401) (owner: 10Arnaudb) [04:36:03] (03CR) 10Marostegui: [C:03+1] mariadb: Promote es1038 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1049550 (https://phabricator.wikimedia.org/T368401) (owner: 10Gerrit maintenance bot) [04:40:13] (03PS1) 10Marostegui: m2 proxies: Test db1228 [puppet] - 10https://gerrit.wikimedia.org/r/1049664 (https://phabricator.wikimedia.org/T368494) [04:41:22] (03CR) 10Marostegui: [C:03+2] m2 proxies: Test db1228 [puppet] - 10https://gerrit.wikimedia.org/r/1049664 (https://phabricator.wikimedia.org/T368494) (owner: 10Marostegui) [04:44:08] (03PS1) 10Marostegui: Revert "m2 proxies: Test db1228" [puppet] - 10https://gerrit.wikimedia.org/r/1049665 [04:44:19] (03CR) 10Marostegui: "Test done" [puppet] - 10https://gerrit.wikimedia.org/r/1049665 (owner: 10Marostegui) [04:44:47] (03CR) 10Marostegui: [C:03+2] Revert "m2 proxies: Test db1228" [puppet] - 
10https://gerrit.wikimedia.org/r/1049665 (owner: 10Marostegui) [04:51:05] !log dbmaint eqiad Drop ipblocks in s8 T367632 [04:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:10] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [04:53:29] !log dbmaint eqiad Drop ipblocks in s2 T367632 [04:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:15] (03PS1) 10Dreamrimmer: Meta-Wiki: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) [05:01:56] !log dbmaint eqiad Drop ipblocks in s5 T367632 [05:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:08] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [05:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:04:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) (owner: 10Dreamrimmer) [05:05:12] (03PS5) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) [05:06:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [05:25:53] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP 
CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:32:07] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1717167405[2](2024-06-22T22:42:07.829Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [05:36:28] !log [Elastic] One unassigned shard; cluster status yellow. Not a big deal, looks like `shard has exceeded the maximum number of retries [5] on failed allocation attempts`, I'll try a manual `/_cluster/reroute?retry_failed=true` [05:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:00] !log [Elastic] `curl -s -X POST https://search.svc.eqiad.wmnet:9243/_cluster/reroute?retry_failed=true` did the trick. Shard initializing, cluster should be back to green soon enough [05:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:57] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 27, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:57:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:57:36] !log dbmaint eqiad Drop ipblocks in s4 T367632 [05:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:41] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [05:59:56] !log dbmaint eqiad Drop ipblocks in s3 T367632 [06:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0600) [06:01:37] !log dbmaint eqiad Drop ipblocks in s1 T367632 
[06:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:20:57] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9924911 (10Marostegui) So in terms of data, my recap is: - root password is different from production - the data that is present there is sanitized... 
[06:31:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136 T365805', diff saved to https://phabricator.wikimedia.org/P65446 and previous config saved to /var/cache/conftool/dbconfig/20240626-063109-root.json [06:31:13] (03PS1) 10Marostegui: db2136: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1049673 (https://phabricator.wikimedia.org/T365805) [06:31:15] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [06:32:10] (03CR) 10Marostegui: [C:03+2] db2136: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1049673 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [06:39:26] !log Install mariadb 10.11 on s4 db2136 (depooled for now) T365805 [06:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:31] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [06:44:15] (03CR) 10Ayounsi: [C:03+2] Netbox puppet import: ignore ipip interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1049499 (owner: 10Ayounsi) [06:44:43] (03PS1) 10Slyngshede: R:idp New CAS 7 hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) [06:45:17] (03Merged) 10jenkins-bot: Netbox puppet import: ignore ipip interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1049499 (owner: 10Ayounsi) [06:52:29] !log Enable slow query log on db2136 running 10.11 T365805 [06:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:35] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [06:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db2136 - running 10.11 with minium weight T365805', diff saved to https://phabricator.wikimedia.org/P65447 and previous config saved to /var/cache/conftool/dbconfig/20240626-065636-marostegui.json [06:58:28] (03PS1) 10Marostegui: db2136: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1049764 [06:59:02] (03CR) 10Marostegui: [C:03+2] db2136: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1049764 (owner: 10Marostegui) [07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0700). [07:00:05] DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:30] I am around [07:03:55] !log installing emacs security updates [07:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:29] any deployer around? [07:07:28] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: return logo-detection latency metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962) (owner: 10Kevin Bazira) [07:13:38] (03CR) 10Ayounsi: [C:03+1] "nice !" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1049566 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [07:19:28] (03CR) 10Muehlenhoff: "Can be abandoned, similar patch already merged" [puppet] - 10https://gerrit.wikimedia.org/r/1036592 (https://phabricator.wikimedia.org/T365574) (owner: 10Cwhite) [07:19:35] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for MX hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049481 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:22:58] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for archiva [puppet] - 10https://gerrit.wikimedia.org/r/1049515 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:26:46] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for icinga [puppet] - 10https://gerrit.wikimedia.org/r/1049502 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:28:11] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2025 with low load for warmup', diff saved to https://phabricator.wikimedia.org/P65448 and previous config saved to /var/cache/conftool/dbconfig/20240626-072810-jynus.json [07:31:41] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925038 (10dcaro) >>! In T348643#9921767, @CDanis wrote: > Ha, I had also made a silly little dashboard yesterday bu... 
[07:33:05] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2025 at 50% load', diff saved to https://phabricator.wikimedia.org/P65449 and previous config saved to /var/cache/conftool/dbconfig/20240626-073304-jynus.json [07:33:16] (03PS1) 10Muehlenhoff: Point idp-test to idp-test1003 [dns] - 10https://gerrit.wikimedia.org/r/1049818 (https://phabricator.wikimedia.org/T368503) [07:33:35] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962) (owner: 10Kevin Bazira) [07:34:20] (03CR) 10Cathal Mooney: [C:03+2] Re-mark external traffic to DSCP BE on CRs, and rename fwd classes [homer/public] - 10https://gerrit.wikimedia.org/r/1049566 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [07:34:30] (03Merged) 10jenkins-bot: ml-services: return logo-detection latency metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049082 (https://phabricator.wikimedia.org/T367962) (owner: 10Kevin Bazira) [07:34:54] (03Merged) 10jenkins-bot: Re-mark external traffic to DSCP BE on CRs, and rename fwd classes [homer/public] - 10https://gerrit.wikimedia.org/r/1049566 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [07:37:53] (03PS2) 10Muehlenhoff: Point idp-test to idp-test2002 [dns] - 10https://gerrit.wikimedia.org/r/1049818 (https://phabricator.wikimedia.org/T368503) [07:38:15] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/1049505 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:38:17] (03CR) 10Arnaudb: mariadb: monitoring memory pressure (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [07:38:53] (03CR) 10Slyngshede: [C:03+2] Point idp-test to idp-test2002 [dns] - 10https://gerrit.wikimedia.org/r/1049818 (https://phabricator.wikimedia.org/T368503) 
(owner: 10Muehlenhoff) [07:39:00] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1049818 (https://phabricator.wikimedia.org/T368503) (owner: 10Muehlenhoff) [07:39:22] (03CR) 10Muehlenhoff: [C:03+2] Point idp-test to idp-test2002 [dns] - 10https://gerrit.wikimedia.org/r/1049818 (https://phabricator.wikimedia.org/T368503) (owner: 10Muehlenhoff) [07:44:29] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:44:43] (03PS1) 10Muehlenhoff: Remove acmechief annotations for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1049821 (https://phabricator.wikimedia.org/T365799) [07:45:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049821 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:46:43] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:50:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T364069)', diff saved to https://phabricator.wikimedia.org/P65451 and previous config saved to /var/cache/conftool/dbconfig/20240626-075043-marostegui.json [07:50:49] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:52:18] (03CR) 10Jcrespo: "I not 100% convinced this will be of much usefulness to an io-heavy process like mysql. 
This will monitor process stalls due to malloc; ho" [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [07:52:24] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1049517 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [07:54:29] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2025 at 100% load', diff saved to https://phabricator.wikimedia.org/P65453 and previous config saved to /var/cache/conftool/dbconfig/20240626-075428-jynus.json [07:54:31] (03CR) 10Marostegui: "+1 I wouldn't alert on this - I don't ever remember seeing an alert (only warnings) for production hosts. I have seen them for clouddb* an" [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [07:58:56] (03PS1) 10Elukey: Makefile: use 'go install' instead of 'go get' [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1049825 (https://phabricator.wikimedia.org/T368366) [07:59:47] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es1022 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65454 and previous config saved to /var/cache/conftool/dbconfig/20240626-075946-jynus.json [07:59:53] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [08:00:05] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0800) [08:00:05] jeena and jnuche: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0800) [08:01:20] o/ [08:01:31] jouncebot: now [08:01:31] For the next 0 hour(s) and 58 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0800) [08:01:31] For the next 1 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T0800) [08:01:33] :) [08:01:51] damn,I have caused stashbot AND ircservserv-vm to bail out? [08:02:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) (owner: 10Dreamrimmer) [08:03:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [08:03:27] I am going to upgrade Gerrit [08:03:41] (03CR) 10Hashar: [C:03+2] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:03:50] (03CR) 10Hashar: [C:03+2] Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:49] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Fix weights for es2021 and es2024', diff saved to https://phabricator.wikimedia.org/P65455 and previous config saved to /var/cache/conftool/dbconfig/20240626-080649-marostegui.json [08:06:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P65456 and previous config saved to /var/cache/conftool/dbconfig/20240626-080657-marostegui.json [08:07:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:46] (03PS1) 10Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) [08:10:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1023 as es5 master - this is a NOOP', diff saved to https://phabricator.wikimedia.org/P65457 and previous config saved to /var/cache/conftool/dbconfig/20240626-081014-marostegui.json [08:11:21] (03PS1) 10Marostegui: wmnet: Update es5 CNAME [dns] - 10https://gerrit.wikimedia.org/r/1049829 [08:11:29] (03PS1) 10Slyngshede: R:idp Reimage idp-test1002 as CAS 6. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049830 [08:11:30] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es1025 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65458 and previous config saved to /var/cache/conftool/dbconfig/20240626-081130-jynus.json [08:11:35] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [08:11:52] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for IDM/IDP [puppet] - 10https://gerrit.wikimedia.org/r/1049453 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [08:12:45] (03CR) 10Marostegui: [C:03+2] wmnet: Update es5 CNAME [dns] - 10https://gerrit.wikimedia.org/r/1049829 (owner: 10Marostegui) [08:15:17] (03CR) 10CI reject: [V:04-1] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:15:41] (03CR) 10Hashar: [C:04-2] Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:15:53] pfff [08:18:50] (03CR) 10Hashar: [C:03+2] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:19:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1049830 (owner: 10Slyngshede) [08:19:37] (03CR) 10JMeybohm: [C:03+1] Makefile: use 'go install' instead of 'go get' [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1049825 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:20:07] (03PS1) 10Muehlenhoff: Remove acmechief annotations for LDAP roles [puppet] - 10https://gerrit.wikimedia.org/r/1049833 (https://phabricator.wikimedia.org/T365799) [08:20:15] (03CR) 
10Slyngshede: [C:03+2] R:idp Reimage idp-test1002 as CAS 6. [puppet] - 10https://gerrit.wikimedia.org/r/1049830 (owner: 10Slyngshede) [08:20:34] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1049512 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [08:22:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P65460 and previous config saved to /var/cache/conftool/dbconfig/20240626-082204-marostegui.json [08:22:19] the Gerrit upgrade takes a bit longer than expected cause I got hit by a flappy tes [08:22:20] t [08:22:21] :/ [08:22:39] (03PS3) 10Fabfur: benthos:cache: added catch resource to log errors in parse_log [puppet] - 10https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718) [08:24:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:25:05] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1002.wikimedia.org with OS bookworm [08:26:34] (03CR) 10Elukey: [V:03+2 C:03+2] Makefile: use 'go install' instead of 'go get' [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1049825 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:27:03] (03CR) 10Fabfur: [C:03+2] benthos:cache: added catch resource to log errors in parse_log [puppet] - 10https://gerrit.wikimedia.org/r/1049498 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:27:38] (03PS1) 10Muehlenhoff: Switch acmechief1001/2001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1049837 (https://phabricator.wikimedia.org/T365799) [08:29:04] (03Merged) 10jenkins-bot: Merge branch 'stable-3.10' into wmf/stable-3.10 
[software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:29:23] (03CR) 10Hashar: [C:03+2] Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:29:43] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 33.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:29:59] (03Merged) 10jenkins-bot: Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [08:31:21] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for LDAP roles [puppet] - 10https://gerrit.wikimedia.org/r/1049833 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [08:31:39] (03PS1) 10Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [08:31:40] !log hashar@deploy1002 Started deploy [gerrit/gerrit@2fc2b03]: Gerrit to 3.10 on gerrit2002 # T367419 [08:31:45] T367419: Upgrade to Gerrit 3.10 - https://phabricator.wikimedia.org/T367419 [08:32:16] (03PS2) 10Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [08:32:28] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@2fc2b03]: Gerrit to 3.10 on gerrit2002 # T367419 (duration: 00m 48s) [08:37:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T364069)', diff saved to https://phabricator.wikimedia.org/P65461 and previous config saved to /var/cache/conftool/dbconfig/20240626-083711-marostegui.json [08:37:13] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [08:37:17] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:37:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [08:37:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65462 and previous config saved to /var/cache/conftool/dbconfig/20240626-083733-marostegui.json [08:38:12] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage [08:39:13] !log hashar@deploy1002 Started deploy [gerrit/gerrit@2fc2b03]: Gerrit to 3.10 on gerrit1003 # T367419 [08:39:18] T367419: Upgrade to Gerrit 3.10 - https://phabricator.wikimedia.org/T367419 [08:39:56] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@2fc2b03]: Gerrit to 3.10 on gerrit1003 # T367419 (duration: 00m 43s) [08:40:08] I am upgrading Gerrit right now [08:40:15] \o/ [08:40:49] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1002.wikimedia.org with reason: host reimage [08:41:05] 06SRE, 06Infrastructure-Foundations, 10netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513 (10cmooney) 03NEW p:05Triage→03Medium [08:42:43] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:42:56] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for puppetmaster1003.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [08:43:31] FIRING: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for puppetmaster1003.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [08:48:27] (03PS2) 10Majavah: P:netbox: Don't show status MOTD for active hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049525 (https://phabricator.wikimedia.org/T352957) [08:48:31] RESOLVED: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:50:09] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3070/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049525 (https://phabricator.wikimedia.org/T352957) (owner: 10Majavah) [08:52:17] (03CR) 10Majavah: [V:03+1 C:03+2] P:netbox: Don't show status MOTD for active hosts [puppet] - 10https://gerrit.wikimedia.org/r/1049525 (https://phabricator.wikimedia.org/T352957) (owner: 10Majavah) [08:52:25] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining Collab roles [puppet] - 10https://gerrit.wikimedia.org/r/1049850 (https://phabricator.wikimedia.org/T365799) [08:54:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:04] (03PS1) 10Muehlenhoff: Remove acmechief annotations for memcached/redis [puppet] - 10https://gerrit.wikimedia.org/r/1049851 (https://phabricator.wikimedia.org/T365799) [08:55:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136 
T365805', diff saved to https://phabricator.wikimedia.org/P65463 and previous config saved to /var/cache/conftool/dbconfig/20240626-085511-root.json [08:55:17] T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805 [08:56:16] (03CR) 10Jelto: [V:03+1 C:03+2] buildkitd: Bump to version wmf-v0.14.1-3 [puppet] - 10https://gerrit.wikimedia.org/r/1049268 (https://phabricator.wikimedia.org/T367352) (owner: 10Ahmon Dancy) [08:58:25] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1049854 (https://phabricator.wikimedia.org/T365799) [09:01:14] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1002.wikimedia.org with OS bookworm [09:03:45] (03PS1) 10Jforrester: CodeEditor.vue: add watcher for disabled state [extensions/WikiLambda] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049857 (https://phabricator.wikimedia.org/T368504) [09:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:05:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:44] (03PS1) 10Marostegui: db2136: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1049858 (https://phabricator.wikimedia.org/T365805) [09:07:20] (03CR) 10Marostegui: [C:03+2] db2136: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1049858 (https://phabricator.wikimedia.org/T365805) (owner: 10Marostegui) [09:08:57] (03PS1) 10Muehlenhoff: Remove acmechief annotations for swift/ceph [puppet] - 10https://gerrit.wikimedia.org/r/1049859 
(https://phabricator.wikimedia.org/T365799) [09:11:38] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for various DE roles [puppet] - 10https://gerrit.wikimedia.org/r/1049469 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:14:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:29] (03PS3) 10Effie Mouzeli: modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [09:17:35] (03CR) 10CI reject: [V:04-1] modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [09:18:12] (03CR) 10Elukey: "Left some comments even if I am not 100% sure about the convention used for versioning. 
To double check - dropping the icu component (that" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [09:23:55] (03CR) 10Elukey: [C:03+1] Remove acmechief annotations for remaining IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1049854 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:25:06] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining o11y roles [puppet] - 10https://gerrit.wikimedia.org/r/1049863 (https://phabricator.wikimedia.org/T365799) [09:25:37] (03CR) 10JMeybohm: [C:03+1] mediawiki: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:26:01] (03CR) 10JMeybohm: [C:03+1] mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:26:57] (03PS1) 10Muehlenhoff: Remove acmechief annotations for Cassandra roles [puppet] - 10https://gerrit.wikimedia.org/r/1049864 (https://phabricator.wikimedia.org/T365799) [09:27:27] (03CR) 10JMeybohm: [C:03+1] kubernetes: promote unavailable replicas alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [09:32:24] (03CR) 10Arnaudb: "note taken, lets keep alerts to 2 level then and see if it needs to stay/go/change in the next iterations" [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [09:34:23] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1049869 (https://phabricator.wikimedia.org/T365799) [09:34:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049864 
(https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:37:14] (03PS1) 10Muehlenhoff: Remove acmechief annotations for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1049871 (https://phabricator.wikimedia.org/T365799) [09:38:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [09:39:00] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining Traffic roles [puppet] - 10https://gerrit.wikimedia.org/r/1049872 (https://phabricator.wikimedia.org/T365799) [09:41:05] (03CR) 10Vgutierrez: [C:03+1] Remove acmechief annotations for remaining Traffic roles [puppet] - 10https://gerrit.wikimedia.org/r/1049872 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:43:12] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remaining Search roles [puppet] - 10https://gerrit.wikimedia.org/r/1049873 (https://phabricator.wikimedia.org/T365799) [09:43:30] (03PS6) 10Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) [09:43:47] (03CR) 10Jforrester: Switch php7.4-cli to bullseye and cascade (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [09:44:17] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:44:54] (03PS1) 10Muehlenhoff: Remove acmechief annotations for dumps roles [puppet] - 10https://gerrit.wikimedia.org/r/1049874 (https://phabricator.wikimedia.org/T365799) [09:45:06] (03CR) 10Arnaudb: [C:03+1] Remove acmechief annotations for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1049871 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:45:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049873 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:45:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049873 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:46:17] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [09:47:20] (03CR) 10Majavah: [C:03+1] Remove acmechief annotations for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1049869 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:47:48] (03PS1) 10Muehlenhoff: Remove acmechief annotations for misc roles [puppet] - 10https://gerrit.wikimedia.org/r/1049875 (https://phabricator.wikimedia.org/T365799) [09:47:56] (03PS1) 10Elukey: docker_registry_ha: add more info to the nginx's access log [puppet] - 10https://gerrit.wikimedia.org/r/1049876 [09:48:46] (03CR) 10Btullis: [C:03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1049874 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [09:48:54] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [09:49:17] (03PS2) 10Elukey: docker_registry_ha: add more info to the nginx's access log [puppet] - 10https://gerrit.wikimedia.org/r/1049876 [09:50:19] (03PS1) 10Slyngshede: IDP: Failover to CAS 6.6.15.2 host. 
[dns] - 10https://gerrit.wikimedia.org/r/1049877 (https://phabricator.wikimedia.org/T368503) [09:50:48] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:50:48] (03CR) 10Muehlenhoff: [C:03+1] IDP: Failover to CAS 6.6.15.2 host. [dns] - 10https://gerrit.wikimedia.org/r/1049877 (https://phabricator.wikimedia.org/T368503) (owner: 10Slyngshede) [09:51:15] (03CR) 10Slyngshede: [C:03+2] IDP: Failover to CAS 6.6.15.2 host. [dns] - 10https://gerrit.wikimedia.org/r/1049877 (https://phabricator.wikimedia.org/T368503) (owner: 10Slyngshede) [09:51:16] (03PS3) 10Elukey: docker_registry_ha: add more info to the nginx's access log [puppet] - 10https://gerrit.wikimedia.org/r/1049876 [09:51:17] (03CR) 10Slyngshede: [V:03+2 C:03+2] IDP: Failover to CAS 6.6.15.2 host. [dns] - 10https://gerrit.wikimedia.org/r/1049877 (https://phabricator.wikimedia.org/T368503) (owner: 10Slyngshede) [09:54:14] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9925639 (10cmooney) >>! In T367439#9921613, @ayounsi wrote: > Your proposal seems good to me. > > Adding the anycast AS makes sens, I think I in... [09:55:04] !log Update idp.wikimedia.org to CAS 6.6.15.2 (T368503) [09:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:09] T368503: Update CAS to 6.6.15.2 - https://phabricator.wikimedia.org/T368503 [09:56:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:57:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[09:59:54] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9925649 (10fnegri) > there's some data there that we filter via the views and not only via sanitarium, but I guess that's fine Do you know what... [10:00:05] claime: Time to do the MediaWiki infrastructure (UTC mid-day) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1000). [10:01:01] ok let's go then [10:01:40] fabfur, _joe_, kamila_, heads up, deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049150 [10:02:11] !log disabling puppet on cp-text - T367949 [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:17] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [10:02:37] (03CR) 10JMeybohm: [C:03+1] cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:02:50] (03CR) 10JMeybohm: [C:03+1] echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:03:44] (03CR) 10JMeybohm: [C:03+1] nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:04:02] !log enabling puppet on cp4037 - T367949 [10:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:51] ack, thanks claime [10:06:51] (03CR) 10Elukey: "The config should work fine, but I'll apply to the standby eqiad dc first with puppet disabled on codfw just-in-case. Lemme know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [10:06:55] (03CR) 10Clément Goubert: [C:03+2] trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:07:07] (03PS5) 10Clément Goubert: trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) [10:07:27] oops kinda forgot to merge the patch [10:08:05] (03CR) 10Clément Goubert: [V:03+2 C:03+2] trafficserver: complete switch to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1049150 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:09:36] V:+2? :) [10:09:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:25] (03CR) 10JMeybohm: [C:03+1] "$ sudo -i reprepro ls python3-service-checker" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:11:20] (03CR) 10JMeybohm: [C:03+1] prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:11:55] ok, local queries for enwiki go to codfw, loginwiki to eqiad, looks good [10:12:34] (03CR) 10JMeybohm: "This will also change the mcrouter version by quite a bit:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:12:51] (03CR) 10JMeybohm: [C:03+1] helm-state-metrics: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049586 
(https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:13:49] !log enabling puppet on cp-text - T367949 [10:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:54] T367949: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [10:14:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:41] (03PS1) 10Slyngshede: R:idp_test: Separate testing environment for CAS 7 [puppet] - 10https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) [10:19:58] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3071/console" [puppet] - 10https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:20:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:21:04] (03CR) 10Giuseppe Lavagetto: [C:04-1] "See my suggestions, I think this can be simplified a lot and the compatibility issues can be rooted out easily." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [10:21:50] (03CR) 10JMeybohm: [C:03+1] envoy: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049578 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:22:12] (03CR) 10JMeybohm: [C:03+1] coredns: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049577 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:22:25] (03CR) 10JMeybohm: [C:03+1] config.yaml: remove wikimedia-stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049576 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [10:22:40] (03CR) 10Ladsgroup: [C:03+2] auto_schema: use dbctl config diff return status [software] - 10https://gerrit.wikimedia.org/r/1049648 (owner: 10Scott French) [10:23:04] (03CR) 10Jcrespo: [C:03+1] "Go ahead any time with deployment." [puppet] - 10https://gerrit.wikimedia.org/r/1049871 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [10:23:12] (03Merged) 10jenkins-bot: auto_schema: use dbctl config diff return status [software] - 10https://gerrit.wikimedia.org/r/1049648 (owner: 10Scott French) [10:23:41] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [10:24:55] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925737 (10dcaro) There's definitely some load coming in: {F55892394} Though no spikes on the latencies so far: {F... 
[10:25:23] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2022 after backup T363812', diff saved to https://phabricator.wikimedia.org/P65464 and previous config saved to /var/cache/conftool/dbconfig/20240626-102523-jynus.json [10:25:25] (03PS1) 10Dreamrimmer: Add VK namespace alias to Azerbaijani Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049886 (https://phabricator.wikimedia.org/T368237) [10:25:29] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [10:26:17] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2024-April-June): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#9925777 (10santhosh) @elukey Thanks for these details. Currently in our code, models are downloaded [[ https://githu... [10:28:45] (03CR) 10Effie Mouzeli: mcrouter: upgrade to Bookworm (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:29:54] (03CR) 10Effie Mouzeli: [C:03+2] Remove acmechief annotations for memcached/redis [puppet] - 10https://gerrit.wikimedia.org/r/1049851 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [10:37:01] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925808 (10dcaro) changed the graphs to use rate of the stat, instead of the raw counter value, now there's some inf... [10:39:17] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925810 (10dcaro) The host with the old non-error-reporting drives has a similar shape (just a bit higher latency):... 
[10:39:30] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9925811 (10dcaro) read has even less difference, and flush only happens for the os-dedicated drives. [10:39:32] (03PS1) 10Superpes15: [u4c] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) [10:39:34] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2022 at 50% T363812', diff saved to https://phabricator.wikimedia.org/P65465 and previous config saved to /var/cache/conftool/dbconfig/20240626-103933-jynus.json [10:39:40] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [10:40:53] (03PS2) 10Superpes15: [u4c] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) [10:43:59] (03PS1) 10Aklapper: Provide weekly Phabricator data for Tech News [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) [10:44:20] (03PS1) 10Mvolz: Update Zotero to node 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049891 (https://phabricator.wikimedia.org/T361728) [10:45:29] (03PS3) 10Superpes15: [u4cwiki] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) [10:55:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:32] (03CR) 10Aklapper: "Note that I am not sure about the "hour" parameter, trying to trigger the email on approx UTC-7 late night/early 
morning on Thursdays. If " [puppet] - 10https://gerrit.wikimedia.org/r/1049890 (https://phabricator.wikimedia.org/T368460) (owner: 10Aklapper) [11:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1100). [11:00:44] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 46.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:02:32] (03CR) 10Clément Goubert: [C:03+1] docker_registry_ha: add more info to the nginx's access log [puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [11:02:32] (03CR) 10Giuseppe Lavagetto: [C:04-1] "The label is wrong. Otherwise, LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:03:06] (03CR) 10Mvolz: [C:03+2] Update Zotero to node 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049891 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [11:06:02] (03PS11) 10Hnowlan: Add shellbox-video vars/config, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) [11:06:07] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:06:36] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:07:03] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:07:06] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:07:30] (03CR) 10Giuseppe Lavagetto: [C:03+1] Add shellbox-video vars/config, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:09:10] (03Merged) 10jenkins-bot: Update Zotero to node 18 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1049891 (https://phabricator.wikimedia.org/T361728) (owner: 10Mvolz) [11:09:18] (03CR) 10Hnowlan: Add shellbox-video vars/config, enable on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:10:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:12:13] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:12:19] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for dumps roles [puppet] - 10https://gerrit.wikimedia.org/r/1049874 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:12:34] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:13:26] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remaining Traffic roles [puppet] - 10https://gerrit.wikimedia.org/r/1049872 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:13:38] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:14:07] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:14:23] (03PS1) 10KartikMistry: Enable MinT for Wikipedia readers MVP on a set of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049898 (https://phabricator.wikimedia.org/T363465) [11:14:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:43] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remaining WMCS roles 
[puppet] - 10https://gerrit.wikimedia.org/r/1049869 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:15:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049898 (https://phabricator.wikimedia.org/T363465) (owner: 10KartikMistry) [11:16:16] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remaining IF roles [puppet] - 10https://gerrit.wikimedia.org/r/1049854 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:17:33] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for misc roles [puppet] - 10https://gerrit.wikimedia.org/r/1049875 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:18:58] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for backup roles [puppet] - 10https://gerrit.wikimedia.org/r/1049871 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:19:06] noooo it made codfw have internal server errors *sigh* [11:19:35] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es2022 fully T363812', diff saved to https://phabricator.wikimedia.org/P65466 and previous config saved to /var/cache/conftool/dbconfig/20240626-111934-jynus.json [11:19:39] not sure if related, reverting... [11:19:41] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [11:19:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:21:41] yup. 
[11:22:02] (03PS1) 10Mvolz: Revert "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049899 [11:22:21] (03CR) 10Mvolz: [C:03+2] Revert "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049899 (owner: 10Mvolz) [11:22:24] Swagger is "yours", mvolz ? [11:22:38] jynus: yup [11:22:38] I confess I get lost with the names [11:22:46] sorry for the alert. [11:22:46] thanks [11:23:05] oh, no problem, was just thanking for reacting so fast or searching someone if it was unrelated [11:23:17] (03Merged) 10jenkins-bot: Revert "Update Zotero to node 18" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049899 (owner: 10Mvolz) [11:23:59] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:24:01] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:24:25] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye [11:24:43] (03PS4) 10Majavah: conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463) [11:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367856)', diff saved to https://phabricator.wikimedia.org/P65467 and previous config saved to /var/cache/conftool/dbconfig/20240626-112614-marostegui.json [11:26:19] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:26:20] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:26:32] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:26:43] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:26:50] (03PS1) 10Muehlenhoff: Extend access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1049902 [11:27:14] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:28:08] (03CR) 
10Muehlenhoff: [C:03+2] Extend access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1049902 (owner: 10Muehlenhoff) [11:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:30:07] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for dns/ncredir/durum/doh [puppet] - 10https://gerrit.wikimedia.org/r/1049461 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:30:45] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:34:34] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048019 (owner: 10PipelineBot) [11:35:32] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1049821 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:35:38] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048019 (owner: 10PipelineBot) [11:35:50] !log installing emacs security updates [11:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:54] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1049850 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:36:19] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remaining Collab roles [puppet] - 10https://gerrit.wikimedia.org/r/1049850 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [11:38:37] (03PS1) 10Muehlenhoff: Remove acmechief annotations for Druid/Kafka roles [puppet] - 
10https://gerrit.wikimedia.org/r/1049907 (https://phabricator.wikimedia.org/T365799) [11:39:30] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:39:53] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:40:42] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:41:15] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:41:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65468 and previous config saved to /var/cache/conftool/dbconfig/20240626-114121-marostegui.json [11:41:28] (03PS1) 10Muehlenhoff: Remove acmechief annotations for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/1049908 (https://phabricator.wikimedia.org/T365799) [11:43:42] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:44:10] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:47:01] (03PS1) 10Ladsgroup: rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049909 (https://phabricator.wikimedia.org/T368289) [11:47:05] jouncebot: nowandnext [11:47:05] For the next 0 hour(s) and 12 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1100) [11:47:05] In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1300) [11:47:29] (03CR) 10Ladsgroup: [C:03+2] rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049909 (https://phabricator.wikimedia.org/T368289) (owner: 10Ladsgroup) [11:49:30] (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049910 [11:49:36] (03CR) 10Mvolz: [C:03+2] Revert "citoid: pipeline 
bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049910 (owner: 10Mvolz) [11:50:01] 0/2 for deploys today *sigh* [11:51:00] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:51:02] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:51:56] (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049910 (owner: 10Mvolz) [11:52:04] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:52:06] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:54:00] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:54:12] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:54:18] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:54:47] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:55:08] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:55:32] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65469 and previous config saved to /var/cache/conftool/dbconfig/20240626-115628-marostegui.json [11:56:43] (03PS4) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [11:58:01] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [12:00:18] i am done unbreaking things [12:00:58] (03PS1) 10Dreamy Jazz: [GlobalBlocking] Enable global account blocks on all wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) [12:02:00] (03CR) 10Dreamy Jazz: [C:04-2] "DNM until patches in T356935 and T356932 are merged, and have been deployed to all wikis via the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) (owner: 10Dreamy Jazz) [12:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:31] (03PS1) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) [12:09:25] (03PS2) 10Stevemunene: wdqs graph-split: add final svcs [dns] - 10https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364) [12:11:24] (03PS2) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [12:11:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T367856)', diff saved to https://phabricator.wikimedia.org/P65470 and previous config saved to /var/cache/conftool/dbconfig/20240626-121136-marostegui.json [12:11:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:11:41] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:11:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:11:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T367856)', diff saved to 
https://phabricator.wikimedia.org/P65471 and previous config saved to /var/cache/conftool/dbconfig/20240626-121158-marostegui.json [12:12:00] (03CR) 10CI reject: [V:04-1] Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:12:04] (03CR) 10Stevemunene: [C:03+2] wdqs graph-split: add final svcs [dns] - 10https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [12:15:47] (03Merged) 10jenkins-bot: rdbms: Reduce log severity of "found writes pending" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049909 (https://phabricator.wikimedia.org/T368289) (owner: 10Ladsgroup) [12:20:54] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:22:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1049909|rdbms: Reduce log severity of "found writes pending" (T368289)]] [12:23:18] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:25:55] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1049909|rdbms: Reduce log severity of "found writes pending" (T368289)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:26:00] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:26:34] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [12:29:06] (03CR) 10Brouberol: [C:03+1] Remove acmechief annotations for Druid/Kafka roles [puppet] - 10https://gerrit.wikimedia.org/r/1049907 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [12:29:17] (03CR) 10Brouberol: [C:03+1] Remove acmechief annotations for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/1049908 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [12:30:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:30:54] !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bullseye [12:31:40] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1049909|rdbms: Reduce log severity of "found writes pending" (T368289)]] (duration: 08m 43s) [12:35:13] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[12:36:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikiLambda] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049857 (https://phabricator.wikimedia.org/T368504) (owner: 10Jforrester) [12:37:38] (03PS2) 10Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - 10https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) [12:37:38] (03PS1) 10Jcrespo: dbbackups: Disable regular es backups (es6, es7) while es4/5 run [puppet] - 10https://gerrit.wikimedia.org/r/1049921 (https://phabricator.wikimedia.org/T363812) [12:37:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:39:47] (03PS2) 10Jcrespo: dbbackups: Disable regular es backups (es6, es7) while es4/5 run [puppet] - 10https://gerrit.wikimedia.org/r/1049921 (https://phabricator.wikimedia.org/T363812) [12:40:48] (03PS2) 10Ssingh: P:systemd::timesyncd: switch to anycast NTP peers [puppet] - 10https://gerrit.wikimedia.org/r/1048018 (https://phabricator.wikimedia.org/T366360) [12:41:17] (03PS3) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [12:41:48] (03CR) 10CI reject: [V:04-1] Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:42:38] (03PS2) 10Cathal Mooney: Add class-of-service scheduler and classifiers 
plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) [12:42:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:25] (03CR) 10Elukey: "I am very surprised that PCC shows no diff, not getting why." [puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [12:45:56] (03CR) 10Kamila Součková: opentelemetry: update k8s API IP addresses (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 (owner: 10Kamila Součková) [12:46:04] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) [12:46:12] (03PS2) 10Kamila Součková: opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 [12:46:26] (03PS4) 10Cory Massaro: Add addNestedMetadata to production orchestrator config. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) [12:46:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65476 and previous config saved to /var/cache/conftool/dbconfig/20240626-124654-marostegui.json [12:47:00] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:47:54] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1049921 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [12:48:37] (03CR) 10Jcrespo: [C:03+2] dbbackups: Disable regular es backups (es6, es7) while es4/5 run [puppet] - 10https://gerrit.wikimedia.org/r/1049921 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [12:50:41] !rolling out CR border-in dscp marking config to core routers T339850 [12:50:42] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [12:55:19] (03PS2) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [12:55:20] (03PS2) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) [12:55:20] (03PS2) 10Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) [12:55:20] (03PS2) 10Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) [12:55:22] (03PS2) 10Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) [12:55:23] (03PS3) 10Elukey: cfssl-issuer: 
upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [12:55:44] (03PS4) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [12:56:11] (03PS3) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [12:56:11] (03PS3) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) [12:56:11] (03PS3) 10Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) [12:56:11] (03PS3) 10Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) [12:56:13] (03PS3) 10Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 (https://phabricator.wikimedia.org/T368366) [12:56:14] (03PS4) 10Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [12:56:55] (03CR) 10Elukey: mcrouter: upgrade to Bookworm (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [12:57:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) 
[12:59:42] (03Abandoned) 10Cwhite: admin: add rickijay to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036592 (https://phabricator.wikimedia.org/T365574) (owner: 10Cwhite) [12:59:46] (03CR) 10Elukey: [C:03+2] docker_registry_ha: add more info to the nginx's access log [puppet] - 10https://gerrit.wikimedia.org/r/1049876 (owner: 10Elukey) [12:59:51] (03PS1) 10Lucas Werkmeister (WMDE): wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049924 (https://phabricator.wikimedia.org/T368010) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1300). [13:00:04] Dreamy_Jazz, DreamRimmer, Superpes, James_F, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:14] o/ [13:00:18] goddammit I was *seconds* too slow [13:00:23] \o [13:00:25] and now schedule-backport won’t offer me the ongoing backport window anymore [13:00:31] I’ll just have to add it to the calendar manually I guess [13:00:32] :( [13:00:33] anyway, I can deploy ^^ [13:00:47] :P [13:01:02] You could use it for a later timeslot, and then cut-and-paste the output to the current window? 
[13:01:11] oh, busy window already [13:01:15] well let’s see [13:01:18] :) [13:01:43] I don't mind deploying my change, but happy for someone else to be in charge of `scap backport` [13:01:52] I can test the change [13:02:00] alright, then let’s start with that [13:02:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P65478 and previous config saved to /var/cache/conftool/dbconfig/20240626-130201-marostegui.json [13:02:05] (03PS4) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) [13:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [13:02:53] (03Merged) 10jenkins-bot: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [13:03:21] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1038741|[CheckUser] Stop writing old for event tables migration on group1 (T360685)]] [13:03:26] T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685 [13:03:36] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1048018/3072/" [puppet] - 10https://gerrit.wikimedia.org/r/1048018 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:04:28] Could I sneak a patch in also (just added to the 
calendar)? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1043812/ should be a noop in prod [13:05:01] There is a patch that Lucas_WMDE wanted to also add [13:05:11] hnowlan: hm, it’ll still require a deploy because it touches non-labs files :/ [13:05:16] let’s just see how far we get I think [13:05:20] Dreamy_Jazz: already added mine [13:05:26] 👍 [13:05:42] * Lucas_WMDE sees that Wikifunctions *does* have changes to deploy this time so we shouldn’t run too far into their window [13:05:59] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamyjazz: Backport for [[gerrit:1038741|[CheckUser] Stop writing old for event tables migration on group1 (T360685)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:03] (03CR) 10Ssingh: "PCC errors for moss* are unrelated to this change and are related to PQL: https://phabricator.wikimedia.org/T366387" [puppet] - 10https://gerrit.wikimedia.org/r/1048018 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:06:07] Testing... [13:06:18] * Lucas_WMDE looks at the next changes [13:07:21] Lucas_WMDE: yeah :( won't affect prod configs in theory but I can easily verify [13:07:32] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "sounds reasonable and matches mediawikiwiki above 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) (owner: 10Dreamrimmer) [13:08:02] I’ll probably still be there for a bit after 17:00, so we could continue deploying after wikifunctions is done [13:08:51] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [13:09:59] Lucas_WMDE: Yes, but also I'm one of the patches for this window too. If necessary we can skip mine and I'll do the MW patch in the service window alongside the services. 
[13:10:01] (03CR) 10DCausse: [C:03+1] wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049924 (https://phabricator.wikimedia.org/T368010) (owner: 10Lucas Werkmeister (WMDE)) [13:10:05] (03CR) 10Ssingh: [C:03+2] P:systemd::timesyncd: switch to anycast NTP peers [puppet] - 10https://gerrit.wikimedia.org/r/1048018 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:10:20] James_F: I was just thinking that, yeah ^^ [13:10:21] Lucas_WMDE: Testing complete and successful [13:10:25] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamyjazz: Continuing with sync [13:10:31] that would be nice to save some time if it’s okay with you [13:10:34] Dreamy_Jazz: thanks! [13:11:00] Sure. [13:13:25] !log reload nginx on registry* nodes (Docker registry) to pick up new logging changes [13:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1038741|[CheckUser] Stop writing old for event tables migration on group1 (T360685)]] (duration: 12m 09s) [13:15:36] T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685 [13:15:50] DreamRimmer: we can probably deploy both of yours together? 
[13:16:00] (03PS1) 10Superpes15: [arbcom_itwiki] Change the logo and a new wordmark and a favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049929 (https://phabricator.wikimedia.org/T368532) [13:16:08] sure [13:16:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) (owner: 10Dreamrimmer) [13:16:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [13:16:34] “13:16:14 'https://gerrit.wikimedia.org/r/c/1049667/' is not a valid change number or URL” grmblgrmbl [13:16:50] it’s only the URL I copied from the deployments page… [13:16:53] Thanks for the deploy. [13:16:57] * Lucas_WMDE checks how easy that matching is to change in scap [13:16:59] Dreamy_Jazz: np :) [13:17:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P65479 and previous config saved to /var/cache/conftool/dbconfig/20240626-131709-marostegui.json [13:17:13] (03Merged) 10jenkins-bot: Meta-Wiki: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049667 (https://phabricator.wikimedia.org/T368416) (owner: 10Dreamrimmer) [13:17:15] (03Merged) 10jenkins-bot: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [13:17:46] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1049667|Meta-Wiki: restrict unfuzzy rights to autoconfirmed (T368416)]], [[gerrit:1031533|maiwiki: Remove 'CA' namespace alias (T363667)]] [13:17:53] T368416: Restrict unfuzzy rights on Meta - 
https://phabricator.wikimedia.org/T368416 [13:17:54] T363667: Remove 'CA' namespace alias in maiwiki - https://phabricator.wikimedia.org/T363667 [13:18:58] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database btmwiki (T368066) [13:19:04] T368066: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066 [13:20:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1049667|Meta-Wiki: restrict unfuzzy rights to autoconfirmed (T368416)]], [[gerrit:1031533|maiwiki: Remove 'CA' namespace alias (T363667)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:34] https://mai.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json&formatversion=2 already looks good to me (CA alias gone) [13:21:53] yeah [13:22:17] (03PS1) 10Eevans: sessionstore: Upgrade to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049930 (https://phabricator.wikimedia.org/T354970) [13:22:19] and https://meta.wikimedia.org/w/api.php?action=query&meta=userinfo&uiprop=rights&format=json&formatversion=2 seems fine too (unfuzzy gone but only when logged out) [13:23:01] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049930 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [13:23:18] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye [13:23:20] (03PS5) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [13:23:33] looks good to me too [13:23:36] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync [13:23:40] alright, syncing then [13:24:15] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 
(https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [13:24:26] (03PS1) 10FNegri: wmcs-wikireplica-dns: update cloudvps project [puppet] - 10https://gerrit.wikimedia.org/r/1049933 (https://phabricator.wikimedia.org/T365975) [13:25:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1405 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:26:07] (03CR) 10Majavah: [C:03+1] wmcs-wikireplica-dns: update cloudvps project [puppet] - 10https://gerrit.wikimedia.org/r/1049933 (https://phabricator.wikimedia.org/T365975) (owner: 10FNegri) [13:26:29] (03CR) 10FNegri: [C:03+2] wmcs-wikireplica-dns: update cloudvps project [puppet] - 10https://gerrit.wikimedia.org/r/1049933 (https://phabricator.wikimedia.org/T365975) (owner: 10FNegri) [13:27:30] (03PS6) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [13:28:15] !log elukey@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [13:28:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1049667|Meta-Wiki: restrict unfuzzy rights to autoconfirmed (T368416)]], [[gerrit:1031533|maiwiki: Remove 'CA' namespace alias (T363667)]] (duration: 10m 50s) [13:28:40] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [13:28:44] T368416: Restrict unfuzzy rights on Meta - 
https://phabricator.wikimedia.org/T368416 [13:28:44] T363667: Remove 'CA' namespace alias in maiwiki - https://phabricator.wikimedia.org/T363667 [13:29:16] right, let’s run namespaceDupes on maiwiki then [13:30:15] !log lucaswerkmeister-wmde@deploy1002 /srv/mediawiki-staging (master $ u=) $ mwscript-k8s namespaceDupes maiwiki -- --fix # T363667, 0 pages/links to fix, i.e. no-op [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:54] 0 pages to fix, lol [13:31:08] yeah, expected, but still good to confirm ^^ [13:31:34] it’s a good thing too, that the [[CA:]] namespace alias wasn’t used, because otherwise it would parse as a [[ca:]] language interwiki now IIUC [13:31:55] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database btmwiki (T368066) [13:31:59] (03PS3) 10Superpes15: [ltwiki] Add a new 'rollbacker' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) [13:32:01] T368066: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066 [13:32:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T364069)', diff saved to https://phabricator.wikimedia.org/P65480 and previous config saved to /var/cache/conftool/dbconfig/20240626-133216-marostegui.json [13:32:18] You can also merge all the 3 patches together if you want Lucas_WMDE :D [13:32:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [13:32:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) (owner: 10Superpes15) [13:32:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:32:32] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [13:32:38] Superpes: I’m starting with just the one because the second one looks a bit bigger and idk how fast I can review it [13:32:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T364069)', diff saved to https://phabricator.wikimedia.org/P65481 and previous config saved to /var/cache/conftool/dbconfig/20240626-133239-marostegui.json [13:32:54] * Lucas_WMDE doesn’t see a third one o_O [13:33:06] aha, F5 and it appears [13:33:09] (03Merged) 10jenkins-bot: [ltwiki] Add a new 'rollbacker' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) (owner: 10Superpes15) [13:33:12] Ah gotcha :P Naa All 3 are small actually :D [13:33:16] you know the window is already at like double capacity :P [13:33:26] :D [13:33:40] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1048408|[ltwiki] Add a new 'rollbacker' usergroup (T367993)]] [13:33:46] T367993: Creation of Rollbacker group on Lithuanian Wikipedia - https://phabricator.wikimedia.org/T367993 [13:34:19] LMAO I noticed it later, that's why I said if you want you can do everything together, they are very small changes and there shouldn't be any problems :P [13:34:41] given how busy the window is I'll move my change to another window [13:35:20] (03PS2) 10Ssingh: hiera dnsbox and P:bird: remove references to ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) [13:35:27] hnowlan: ok [13:35:45] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database btmwiki (T368066) [13:36:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, superpes: Backport for [[gerrit:1048408|[ltwiki] Add a new 'rollbacker' usergroup (T367993)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:36:26] 06SRE, 10MW-on-K8s, 
10Observability-Logging, 06serviceops: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9926423 (10kamila) I believe the errors are unrelated (they are due to T340935 and we've had bad messages before and they didn't cause the p... [13:37:08] Lucas_WMDE Looks fine [13:37:12] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3073/co" [puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:37:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, superpes: Continuing with sync [13:38:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [u4cwiki] Enable importing from dewiki/enwiki/metawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) (owner: 10Superpes15) [13:38:37] (I uploaded https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/366 for the scap thing I whined about earlier, btw ^^) [13:39:46] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/1049908 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [13:40:00] (03PS1) 10Klausman: httpbb/liftwing: Split up test definitions by k8s NS [puppet] - 10https://gerrit.wikimedia.org/r/1049943 [13:40:00] (03CR) 10Klausman: "Currently all the files have a `test_` prefix. I am considering dropping it since it doesn't really provide any extra info. 
Lmk what you t" [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [13:41:01] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for Druid/Kafka roles [puppet] - 10https://gerrit.wikimedia.org/r/1049907 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [13:41:24] (03PS1) 10Fabfur: benthos:cache: moving parse_log directive to input [puppet] - 10https://gerrit.wikimedia.org/r/1049944 (https://phabricator.wikimedia.org/T365718) [13:42:04] (03CR) 10CI reject: [V:04-1] httpbb/liftwing: Split up test definitions by k8s NS [puppet] - 10https://gerrit.wikimedia.org/r/1049943 (owner: 10Klausman) [13:42:17] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [arbcom_itwiki] Change the logo and a new wordmark and a favicon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049929 (https://phabricator.wikimedia.org/T368532) (owner: 10Superpes15) [13:42:29] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1048408|[ltwiki] Add a new 'rollbacker' usergroup (T367993)]] (duration: 08m 48s) [13:42:35] T367993: Creation of Rollbacker group on Lithuanian Wikipedia - https://phabricator.wikimedia.org/T367993 [13:42:44] I think we can do the other two Superpes changes and then I’ll hand over to James_F [13:42:54] and potentially do Nemoralis, me and/or hnowlan afterwards [13:43:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) (owner: 10Superpes15) [13:43:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049929 (https://phabricator.wikimedia.org/T368532) (owner: 10Superpes15) [13:43:26] Ack. 
[13:43:48] and I guess we can kick off the gate-and-submit for the backport pretty soon already [13:43:57] (I’ll do it once the config changes merge and the scap sync starts) [13:44:09] seems to have taken ~10 minutes on the master branch [13:44:33] (03CR) 10Eevans: [C:03+2] sessionstore: Upgrade to Cassandra 4.1.5 [puppet] - 10https://gerrit.wikimedia.org/r/1049930 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [13:44:34] (03Merged) 10jenkins-bot: [arbcom_itwiki] Change the logo and a new wordmark and a favicon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049929 (https://phabricator.wikimedia.org/T368532) (owner: 10Superpes15) [13:45:27] ugh, gerrit says merge conflict on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1049888 [13:45:32] (03PS4) 10Superpes15: [u4cwiki] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) [13:45:37] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] [u4cwiki] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) (owner: 10Superpes15) [13:45:42] but rebase was clean 🤷 [13:45:44] (03PS2) 10Klausman: httpbb/liftwing: Split up test definitions by k8s NS [puppet] - 10https://gerrit.wikimedia.org/r/1049943 [13:45:57] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) (owner: 10Superpes15) [13:46:25] (03Merged) 10jenkins-bot: [u4cwiki] Enable importing from dewiki/enwiki/metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049888 (https://phabricator.wikimedia.org/T368522) (owner: 10Superpes15) [13:46:27] "Merge conflict" means "speculative light-weight rebase didn't return true within 10ms" or whatever, it's not a full attempted rebase. 
[13:46:48] And even then, that's gerrit's jgit speaking, not actual git, hence why rebases locally "just work" when they don't in gerrit. [13:46:51] Isn't it grand? [13:46:56] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1049888|[u4cwiki] Enable importing from dewiki/enwiki/metawiki (T368522)]], [[gerrit:1049929|[arbcom_itwiki] Change the logo and a new wordmark and a favicon (T368532)]] [13:47:05] T368522: Enable importing from metawiki/dewiki/enwiki on u4cwiki - https://phabricator.wikimedia.org/T368522 [13:47:05] T368532: Change logo and add a wordmark and a favicon on private arbcom_itwiki - https://phabricator.wikimedia.org/T368532 [13:47:35] James_F: I know “merge conflict” usually just means “touches the same files as anything else” but I was hoping it would try a bit harder when actually +2ed ^^ [13:47:54] but eh, maybe I should’ve rebased first. it’s fine anyway [13:49:25] It does in normal repos, but not FF-only. [13:49:31] Because Reasons™. [13:49:34] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, superpes: Backport for [[gerrit:1049888|[u4cwiki] Enable importing from dewiki/enwiki/metawiki (T368522)]], [[gerrit:1049929|[arbcom_itwiki] Change the logo and a new wordmark and a favicon (T368532)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:35] ok [13:49:39] Superpes: please test :) [13:49:45] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[2005-2006].codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [13:49:48] Yep just a second to check everything :P [13:49:50] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [13:49:52] * Lucas_WMDE unsure how to test anything u4cwiki [13:50:32] Both patches are fine Lucas_WMDE :D [13:50:36] Lol [13:50:38] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, superpes: Continuing with sync [13:50:39] yay [13:50:45] on 
https://wikipedia-it-arbcom.wikimedia.org/ I can at least see a logo :P [13:50:52] even without any permissions ww [13:50:53] * ^^ [13:51:09] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] CodeEditor.vue: add watcher for disabled state [extensions/WikiLambda] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049857 (https://phabricator.wikimedia.org/T368504) (owner: 10Jforrester) [13:51:40] And here you can check the other skin https://wikipedia-it-arbcom.wikimedia.org/w/index.php?title=Pagina_principale&useskin=vector :D [13:52:34] I also don't have access there :P [13:53:37] god damn, what's happening on my ethernet [13:53:49] it keeps disconnecting me [13:55:09] PROBLEM - Check whether ferm is active by checking the default input chain on mw1362 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:55:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1405 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:55:46] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1049888|[u4cwiki] Enable importing from dewiki/enwiki/metawiki (T368522)]], [[gerrit:1049929|[arbcom_itwiki] Change the logo and a new wordmark and a favicon (T368532)]] (duration: 08m 49s) [13:55:52] T368522: Enable importing from metawiki/dewiki/enwiki on u4cwiki - https://phabricator.wikimedia.org/T368522 [13:55:52] T368532: Change logo and add a wordmark and a favicon on private arbcom_itwiki - https://phabricator.wikimedia.org/T368532 [13:56:00] one sec, I will reconnect [13:56:27] !log UTC afternoon backport+config window done (I might deploy a few more patches later out-of-window) [13:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] James_F: over to you [13:56:35] Thanks! 
[13:56:39] Nemoralis: we’re out of time anyway, sorry :( [13:56:57] no problem [13:56:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049857 (https://phabricator.wikimedia.org/T368504) (owner: 10Jforrester) [13:58:12] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) (owner: 10Jforrester) [13:58:36] (03Merged) 10jenkins-bot: CodeEditor.vue: add watcher for disabled state [extensions/WikiLambda] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049857 (https://phabricator.wikimedia.org/T368504) (owner: 10Jforrester) [13:59:09] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1049857|CodeEditor.vue: add watcher for disabled state (T368504)]] [13:59:14] T368504: CodeEditor: When selecting ZCode language, the CodeEditor is not set back to disabled=false - https://phabricator.wikimedia.org/T368504 [13:59:15] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046787 (https://phabricator.wikimedia.org/T325793) (owner: 10Jforrester) [13:59:21] (03CR) 10Lucas Werkmeister (WMDE): "The diffConfig build shows that this also resets the `$wgUploadNavigationUrl`, which IIUC means that the “upload” link in the sidebar woul" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1400) [14:00:58] (03CR) 10Clément Goubert: [C:03+2] mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 
(https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:01:12] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:01:30] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database btmwiki (T368066) [14:01:35] T368066: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066 [14:01:36] !log Deploying statsd-exporter for mw-api-int - T365265 [14:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:42] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [14:01:48] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:01:49] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1049857|CodeEditor.vue: add watcher for disabled state (T368504)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:01:51] !log jforrester@deploy1002 jforrester: Continuing with sync [14:01:53] (03Merged) 10jenkins-bot: mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:01:57] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[2005-2006].codfw.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [14:01:59] James_F: can you ping me when you’re done with your changes? (no rush ^^) [14:02:07] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [14:02:15] Sure. 
[14:02:19] (03CR) 10NMW03: "First sentence in task description points to local upload page:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [14:02:20] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:02:28] thanks! [14:02:30] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[1004-1006].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [14:02:35] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:02:49] Lucas_WMDE I replied to your comment on gerrit [14:04:00] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:04:02] jforrester@deploy1002 Started scap: Backport for [[gerrit:1049857|CodeEditor.vue: add watcher for... < that backport redeploys mw-on-k8s [14:04:08] so it'll pull my change as well probably [14:04:14] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:04:18] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:04:35] claime: Ah, yes, it's my window. 
[14:04:43] yeah it's ok [14:04:57] (03PS1) 10Kgraessle: Update QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) [14:05:08] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:05:49] sorry for stepping on your window [14:05:59] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:01] No worries.:-) [14:06:15] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [14:06:18] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:06:24] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:06:50] (03PS5) 10Jforrester: wikifunctions: Add addNestedMetadata to production orchestrator config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) (owner: 10Cory Massaro) [14:06:55] (03CR) 10Jforrester: [C:03+2] wikifunctions: Add addNestedMetadata to production orchestrator config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) (owner: 10Cory Massaro) [14:07:07] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:07:09] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1049857|CodeEditor.vue: add watcher for disabled state (T368504)]] (duration: 08m 00s) [14:07:15] T368504: CodeEditor: When selecting ZCode language, the CodeEditor is not set back to disabled=false - https://phabricator.wikimedia.org/T368504 [14:07:41] (03CR) 10Lucas Werkmeister (WMDE): "Yes, but that doesn’t mean they want everyone to upload their files there… I think there have been some other recent-ish changes where wik" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 
(https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [14:08:03] (03Merged) 10jenkins-bot: wikifunctions: Add addNestedMetadata to production orchestrator config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046769 (https://phabricator.wikimedia.org/T366829) (owner: 10Cory Massaro) [14:08:06] Lucas_WMDE: OK, I'm done on the MW side; still doing prod deploys for the services, but if you want scap, go for it. [14:08:17] claime: Nothing seems to have exploded yet. [14:08:41] James_F: no it's a fairly safe change [14:08:42] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:55] we've already deployed it to other mw-on-k8s deployments [14:09:01] * James_F nods. [14:09:13] ok, thanks! [14:09:17] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [14:09:23] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:51] (03PS1) 10Ssingh: cookbooks/sre/dns: add a cookbook for roll restart of ntpd.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 [14:10:27] (03PS2) 10Lucas Werkmeister (WMDE): wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049924 (https://phabricator.wikimedia.org/T368010) [14:10:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049924 (https://phabricator.wikimedia.org/T368010) (owner: 10Lucas Werkmeister (WMDE)) [14:11:12] hnowlan: if you’re still around, I could deploy your config change in ~10 minutes probably [14:11:19] (03Merged) 10jenkins-bot: wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049924 (https://phabricator.wikimedia.org/T368010) (owner: 10Lucas Werkmeister (WMDE)) 
[14:11:50] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]] [14:11:54] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:55] T368010: Search not working for entity schemas - https://phabricator.wikimedia.org/T368010 [14:11:58] (03CR) 10Eevans: [C:03+1] Remove acmechief annotations for Cassandra roles [puppet] - 10https://gerrit.wikimedia.org/r/1049864 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:12:14] (or, if there’s nothing to test in prod, I guess you could just tell me to deploy it now and I’ll do it without having to bother you again ^^) [14:12:21] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for Cassandra roles [puppet] - 10https://gerrit.wikimedia.org/r/1049864 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:13:06] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:06] Lucas_WMDE: That would be great, thanks - testing in prod would just be to verify that everything behaves as it already does [14:13:16] (03CR) 10NMW03: "They already said they want to upload files to local wiki (adding a link to the page). 
When I asked about regular users' behaviour earlier" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) (owner: 10NMW03) [14:13:20] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:37] (03PS12) 10Hnowlan: Add shellbox-video vars/config, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) [14:14:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:24] testing… [14:14:37] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:14:45] yup, diff in https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&format=json&formatversion=2 LGTM [14:14:49] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [14:15:33] And now I'm done fully. [14:15:51] I test changes like this using https://github.com/lucaswerkmeister/home/blob/master/.bashrc.d/wikimedia-debug-diff btw, thought that might be useful to others ww [14:15:52] * ^^ [14:15:56] this damn keyboard [14:16:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9926641 (10Jhancock.wm) [14:17:48] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368079#9926632 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:17:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9926647 (10Jhancock.wm) fyi, this one is ready up to the point of imaging. Waiting for a work around on the PXE boot issue from Supermicro. 
[14:19:41] (03PS1) 10Andrew Bogott: Move cloudvirt200[1,2,3]-dev to insetup, prepare for decom [puppet] - 10https://gerrit.wikimedia.org/r/1049951 (https://phabricator.wikimedia.org/T368536) [14:19:42] (03PS1) 10Andrew Bogott: Remove mention of cloudvirt200[1,2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/1049952 (https://phabricator.wikimedia.org/T368536) [14:19:48] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]] (duration: 07m 57s) [14:19:54] T368010: Search not working for entity schemas - https://phabricator.wikimedia.org/T368010 [14:20:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:20:50] (03Merged) 10jenkins-bot: Add shellbox-video vars/config, enable on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:21:19] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[1004-1006].eqiad.wmnet: Apply Cassandra upgrade to 4.1.5 — T354970 - eevans@cumin1002 [14:21:20] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1043812|Add shellbox-video vars/config, enable on beta (T356241)]] [14:21:24] T354970: Upgrade Cassandra to 4.1.5 - https://phabricator.wikimedia.org/T354970 [14:21:26] (03CR) 10Elukey: [V:03+2 C:03+2] config.yaml: remove wikimedia-stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049576 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [14:21:29] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [14:22:02] (03CR) 10Andrew Bogott: [C:03+2] Move cloudvirt200[1,2,3]-dev to insetup, prepare for decom 
[puppet] - 10https://gerrit.wikimedia.org/r/1049951 (https://phabricator.wikimedia.org/T368536) (owner: 10Andrew Bogott) [14:22:47] (03CR) 10Elukey: Initial import of ceph-csi-rbd chart for inspection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [14:24:07] !log lucaswerkmeister-wmde@deploy1002 hnowlan, lucaswerkmeister-wmde: Backport for [[gerrit:1043812|Add shellbox-video vars/config, enable on beta (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:22] !log lucaswerkmeister-wmde@deploy1002 hnowlan, lucaswerkmeister-wmde: Continuing with sync [14:24:24] (03CR) 10Elukey: [C:03+1] "After all the reviews I think we are ready to proceed, there will be something that we probably forgot to fix/improve but as first import " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [14:25:08] RECOVERY - Check whether ferm is active by checking the default input chain on mw1362 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:26:03] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: moving parse_log directive to input [puppet] - 10https://gerrit.wikimedia.org/r/1049944 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:26:11] (03CR) 10Elukey: [C:03+1] Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [14:26:38] (03PS1) 10JHathaway: Revert "postfix: always send local mail to smarthosts" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 [14:26:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 (owner: 10JHathaway) [14:27:03] (03CR) 10Elukey: [C:03+1] Deploy the ceph-csi-rbd chart to dse-k8s with default values 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [14:28:03] (03CR) 10Elukey: [C:03+1] Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [14:29:42] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1043812|Add shellbox-video vars/config, enable on beta (T356241)]] (duration: 08m 22s) [14:29:53] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [14:30:03] (03CR) 10CI reject: [V:04-1] Revert "postfix: always send local mail to smarthosts" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 (owner: 10JHathaway) [14:30:24] (03PS1) 10Muehlenhoff: Remove acmechief annotations for remainign Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1049954 (https://phabricator.wikimedia.org/T365799) [14:30:49] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9926695 (10VirginiaPoundstone) @SGupta-WMF and @Scott_French Thank you for you... 
[14:31:14] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1049821 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:33:29] (03Abandoned) 10Hnowlan: mw-web, mw-api-ext: bump replicas in advance of traffic shift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028842 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:33:32] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:34:40] (03PS2) 10JHathaway: Revert "postfix: always send local mail to smarthosts" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 (https://phabricator.wikimedia.org/T325407) [14:34:49] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 (https://phabricator.wikimedia.org/T325407) (owner: 10JHathaway) [14:34:51] (03PS2) 10Hnowlan: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) [14:35:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9926712 (10Jhancock.wm) [14:36:45] * Lucas_WMDE done deploying [14:37:14] thanks Lucas_WMDE! [14:37:15] hm, bit of a spike in logspam-watch, let me take a look at that [14:37:22] hnowlan: np :) good luck with the testing in beta! 
[14:38:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2007 to codfw - jhancock@cumin2002" [14:39:03] (03CR) 10Effie Mouzeli: [C:03+1] prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:39:10] (03CR) 10Brouberol: [C:03+1] Remove acmechief annotations for remainign Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1049954 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:39:22] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:39:24] (03PS1) 10Hashar: gerrit: fix motd/role description [puppet] - 10https://gerrit.wikimedia.org/r/1049957 [14:40:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbproxy2007 to codfw - jhancock@cumin2002" [14:40:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:40:17] (03CR) 10Muehlenhoff: [C:03+2] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1049957 (owner: 10Hashar) [14:40:39] (03PS2) 10Fabfur: benthos:cache: moving parse_log directive to input [puppet] - 10https://gerrit.wikimedia.org/r/1049944 (https://phabricator.wikimedia.org/T365718) [14:40:59] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remainign Data Engineering roles [puppet] - 10https://gerrit.wikimedia.org/r/1049954 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:43:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - 
https://phabricator.wikimedia.org/T364870#9926724 (10cmooney) >>! In T364870#9865334, @wiki_willy wrote: > Hi @dcaro - just following up on this. Can you provide the racking information for us, t... [14:44:12] (03CR) 10Fabfur: [C:03+2] benthos:cache: moving parse_log directive to input [puppet] - 10https://gerrit.wikimedia.org/r/1049944 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [14:45:26] (03CR) 10Muehlenhoff: "Looks good, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1049883 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [14:45:36] (03CR) 10JHathaway: [C:03+2] Revert "postfix: always send local mail to smarthosts" [puppet] - 10https://gerrit.wikimedia.org/r/1049953 (https://phabricator.wikimedia.org/T325407) (owner: 10JHathaway) [14:46:23] filed T368543 for the log errors I saw FTR [14:46:24] T368543: Error: Call to a member function getPageAsLinkTarget() on null - https://phabricator.wikimedia.org/T368543 [14:46:44] (03PS3) 10Clément Goubert: prometheus-php-fpm-exporter, prometheus-apache-exporter: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/997896 (https://phabricator.wikimedia.org/T283861) [14:46:45] (03PS7) 10Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [14:47:43] (03CR) 10Herron: [C:03+1] "LGTM! 
🧹🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1049274 (https://phabricator.wikimedia.org/T368327) (owner: 10Cwhite) [14:48:34] (03CR) 10CI reject: [V:04-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [14:48:40] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt2001-dev.codfw.wmnet [14:53:12] (03PS1) 10JHathaway: postfix: support for discarding select recipients [puppet] - 10https://gerrit.wikimedia.org/r/1049959 (https://phabricator.wikimedia.org/T325406) [14:53:13] (03PS1) 10JHathaway: postfix: add more smtp restrictions [puppet] - 10https://gerrit.wikimedia.org/r/1049960 (https://phabricator.wikimedia.org/T325406) [14:53:14] (03PS1) 10JHathaway: postfix: revert to traditional local mail setup [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) [14:53:16] (03PS1) 10JHathaway: postfix: add dev hiera data for mx-out boxen [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) [14:53:27] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9926798 (10Clement_Goubert) [14:54:24] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [14:54:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049959 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [14:54:34] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049960 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [14:54:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [14:54:42] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [14:57:02] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [14:57:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1049863 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:57:34] (03PS1) 10Hnowlan: LabsServices: add port for shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049963 (https://phabricator.wikimedia.org/T357309) [14:57:44] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#9926829 (10MoritzMuehlenhoff) [14:58:01] (03CR) 10Muehlenhoff: [C:03+2] Remove acmechief annotations for remaining o11y roles [puppet] - 10https://gerrit.wikimedia.org/r/1049863 (https://phabricator.wikimedia.org/T365799) (owner: 10Muehlenhoff) [14:58:38] !log taavi@cumin1002 START - Cookbook sre.puppet.renew-cert for cloudcephosd1006.eqiad.wmnet: Renew puppet certificate - taavi@cumin1002 [14:58:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [14:58:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt2001-dev.codfw.wmnet [14:59:20] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt2002-dev.codfw.wmnet [15:00:37] !log taavi@cumin1002 END (ERROR) - Cookbook sre.puppet.renew-cert (exit_code=97) for cloudcephosd1006.eqiad.wmnet: Renew puppet certificate 
- taavi@cumin1002 [15:04:01] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:04:07] (03PS2) 10Kgraessle: Update QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) [15:05:08] (03PS1) 10David Caro: cloudcephosd1006: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1049964 (https://phabricator.wikimedia.org/T309789) [15:06:01] (03CR) 10David Caro: [C:03+2] cloudcephosd1006: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1049964 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [15:06:24] (03PS1) 10Fabfur: Revert "benthos:cache: moving parse_log directive to input" [puppet] - 10https://gerrit.wikimedia.org/r/1049965 [15:06:27] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:07:01] (03PS1) 10Elukey: Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T367427) [15:07:03] (03PS3) 10Kgraessle: Update QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) [15:07:53] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2002-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:07:53] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt2002-dev.codfw.wmnet [15:08:39] !log 
andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt2003-dev.codfw.wmnet [15:09:06] (03CR) 10Andrew Bogott: [C:03+2] Remove mention of cloudvirt200[1,2,3]-dev [puppet] - 10https://gerrit.wikimedia.org/r/1049952 (https://phabricator.wikimedia.org/T368536) (owner: 10Andrew Bogott) [15:10:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: 10KCVelaga) [15:11:48] (03CR) 10Fabfur: [C:03+2] Revert "benthos:cache: moving parse_log directive to input" [puppet] - 10https://gerrit.wikimedia.org/r/1049965 (owner: 10Fabfur) [15:12:25] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:13:27] (03CR) 10Jsn.sherman: [C:03+1] "Thanks! looks good to me, and aligns with the feedback we got from @kcvelaga@wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [15:13:37] 10ops-codfw, 06DC-Ops: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9926882 (10Andrew) [15:14:48] !log sudo cumin "A:dnsbox" 'disable-puppet "rolling out CR 1048064"' [15:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:59] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2003-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:15:37] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:15:43] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bullseye [15:16:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt2003-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [15:16:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:05] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt2003-dev.codfw.wmnet [15:16:10] 10ops-codfw, 06DC-Ops: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9926896 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudvirt2003-dev.codfw.wmnet` - cloudvirt2003-dev.codfw.wmnet (**FAIL**) - Downtimed ho... [15:19:12] 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9926900 (10jcrespo) p:05Triage→03Low This seems to not be reproducible, maybe it was related to cold caches after reboot? Lowerin... 
[15:20:21] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on logstash1023.eqiad.wmnet with reason: Temporary stop to migrate the VM away from the ganeti node [15:20:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on logstash1023.eqiad.wmnet with reason: Temporary stop to migrate the VM away from the ganeti node [15:23:32] (03PS1) 10Ssingh: hiera: dnsbox: add acmechief1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1049969 [15:24:10] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3074/co" [puppet] - 10https://gerrit.wikimedia.org/r/1049969 (owner: 10Ssingh) [15:25:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:25:13] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [15:27:02] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:27:06] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:27:15] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [15:27:27] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [15:28:40] (03CR) 10Elukey: "Good point, I didn't notice https://gerrit.wikimedia.org/r/c/operations/debs/mcrouter/+/959212" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:29:37] (03PS1) 10Hnowlan: testwiki: use shellbox-video for scaling video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049970 (https://phabricator.wikimedia.org/T357309) [15:29:53] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9926945 (10Jdforrester-WMF) 
Should we call this Resolved and track the remaining migrations in the parent, T290536? [15:32:04] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt2003-dev.codfw.wmnet [15:32:45] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on logstash1024.eqiad.wmnet with reason: Temporary stop to migrate the VM away from the ganeti node [15:32:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on logstash1024.eqiad.wmnet with reason: Temporary stop to migrate the VM away from the ganeti node [15:34:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9926991 (10Scott_French) [15:35:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [15:36:57] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [15:38:00] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [15:38:32] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9927038 (10Scott_French) @VirginiaPoundstone - I believe there was one tick ma... 
[15:38:36] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:36] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt2003-dev.codfw.wmnet [15:38:42] 10ops-codfw, 06SRE, 06DC-Ops: decommission cloudvirt200[1,2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T368536#9927040 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudvirt2003-dev.codfw.wmnet` - cloudvirt2003-dev.codfw.wmnet (**FAIL**) - Dow... [15:40:38] (03PS1) 10Jdlrobson: Enable user pages and select special pages in dark mode (1.43.0-wmf.11) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049972 (https://phabricator.wikimedia.org/T366364) [15:42:26] (03CR) 10JHathaway: [C:03+2] postfix: support for discarding select recipients [puppet] - 10https://gerrit.wikimedia.org/r/1049959 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [15:43:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [15:44:10] (03PS4) 10Elukey: mcrouter: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) [15:44:10] (03PS4) 10Elukey: prometheus-exporters: upgrade mcrouter and statsd to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049588 (https://phabricator.wikimedia.org/T368366) [15:44:10] (03PS4) 10Elukey: service-checker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049590 (https://phabricator.wikimedia.org/T368366) [15:44:11] (03PS4) 10Elukey: nutcracker: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049591 (https://phabricator.wikimedia.org/T368366) [15:44:12] (03PS4) 10Elukey: echoserver: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049828 
(https://phabricator.wikimedia.org/T368366) [15:44:14] (03PS5) 10Elukey: cfssl-issuer: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049838 (https://phabricator.wikimedia.org/T368366) [15:44:37] (03CR) 10Elukey: "It seems the way that upstream currently releases software (sigh), so I changed it accordingly, lemme know." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049587 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:44:45] (03CR) 10Elukey: [V:03+2 C:03+2] coredns: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049577 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:44:56] (03CR) 10Elukey: [V:03+2 C:03+2] envoy: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049578 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:45:07] (03CR) 10Elukey: [V:03+2 C:03+2] helm-state-metrics: upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1049586 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:45:35] (03PS2) 10Jdlrobson: Enable user pages and select special pages in dark mode (1.43.0-wmf.11) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049972 (https://phabricator.wikimedia.org/T366364) [15:45:37] (03PS1) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [15:46:18] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera dnsbox and P:bird: remove references to ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1048064 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [15:46:30] (03CR) 10JHathaway: [C:03+2] postfix: add more smtp restrictions [puppet] - 10https://gerrit.wikimedia.org/r/1049960 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [15:48:09] jouncebot: nowandnext 
[15:48:09] No deployments scheduled for the next 1 hour(s) and 11 minute(s) [15:48:09] In 1 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1700) [15:54:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#9927154 (10dcaro) [15:54:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9927160 (10dcaro) >>! In T364870#9926724, @cmooney wrote: >>>! In T364870#9865334, @wiki_willy wrote: >> Hi @dcaro - just following up on this. Can you p... [15:58:01] !log sudo cumin -b1 -s120 "A:dnsbox and not P{dns6001*}" "run-puppet-agent --enable 'rolling out CR 1049969'" [15:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:36] !log sudo cumin -b1 -s120 "A:dnsbox and not P{dns6001*}" "run-puppet-agent --enable 'rolling out CR 1048064'" [15:58:44] (03PS1) 10Ladsgroup: Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049982 (https://phabricator.wikimedia.org/T368098) [15:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:57] (03PS1) 10Ladsgroup: Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) [15:59:04] (03CR) 10Ladsgroup: [C:03+2] Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049982 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [15:59:08] (03CR) 10Ladsgroup: [C:03+2] Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) (owner: 
10Ladsgroup) [16:00:25] xcollazo: here! [16:00:37] I +2'ed the backports, once merged, we deploy [16:00:54] great, thank you. [16:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:02] (03CR) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [16:10:46] (03PS2) 10JHathaway: postfix: revert to traditional local mail setup [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) [16:10:46] (03PS2) 10JHathaway: postfix: add dev hiera data for mx-out boxen [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) [16:11:14] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:11:21] (03CR) 10CI reject: [V:04-1] Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [16:12:01] (03CR) 10Ladsgroup: [C:03+2] "again" [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [16:14:35] !log mnz@deploy1002 Started deploy [airflow-dags/research@1996a7a]: (no justification provided) [16:15:08] !log mnz@deploy1002 Finished deploy [airflow-dags/research@1996a7a]: (no justification provided) (duration: 00m 33s) [16:15:33] (03CR) 10CI reject: [V:04-1] Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) 
(owner: 10Ladsgroup) [16:19:22] (03PS3) 10JHathaway: postfix: revert to traditional local mail setup [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) [16:19:22] (03PS3) 10JHathaway: postfix: add dev hiera data for mx-out boxen [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) [16:19:34] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:20:18] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9927276 (10WDoranWMF) p:05Unbreak!→03High [16:23:39] (03CR) 10CI reject: [V:04-1] Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049982 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [16:24:26] (03PS1) 10JHathaway: postfix: mx domain aliases [labs/private] - 10https://gerrit.wikimedia.org/r/1049987 [16:25:31] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5018.eqsin.wmnet [16:25:48] (03CR) 10BCornwall: [C:03+2] cp5018: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049169 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [16:25:53] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:26:01] (03CR) 10JHathaway: [C:03+2] postfix: revert to traditional local mail setup [puppet] - 10https://gerrit.wikimedia.org/r/1049961 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:26:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9927314 (10VRiley-WMF) 05Open→03In progress I am proceeding with moving the server physically. I will update this ticket once it's completed and updated... [16:27:28] !log xcollazo@deploy1002 Started deploy [analytics/refinery@ca1acb3]: Regular analytics weekly train [analytics/refinery@ca1acb34] [16:27:58] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@ca1acb3]: Regular analytics weekly train [analytics/refinery@ca1acb34] (duration: 00m 29s) [16:29:07] (03PS1) 10Ladsgroup: Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049988 (https://phabricator.wikimedia.org/T362425) [16:29:12] (03CR) 10Ladsgroup: [C:03+2] Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049988 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:29:22] (03PS1) 10Ladsgroup: Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049989 (https://phabricator.wikimedia.org/T362425) [16:29:28] (03CR) 10Ladsgroup: [C:03+2] Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049989 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:30:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS bullseye [16:30:39] (03CR) 10Hashar: [C:03+1] Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049989 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:30:43] (03CR) 10Hashar: [C:03+1] Skip failing 
ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049988 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:30:45] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS b... [16:31:22] xcollazo: due to unrelated reasons the build is failing, fixing that [16:33:19] Amir1: ack [16:36:18] (03Merged) 10jenkins-bot: Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049984 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [16:36:39] (03PS4) 10JHathaway: postfix: add dev hiera data for mx-out boxen [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) [16:37:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:38:25] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [16:39:16] (03CR) 10Ladsgroup: [C:03+2] "again" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049982 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [16:39:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [16:39:32] (03CR) 10JHathaway: [C:03+2] postfix: add dev hiera data for mx-out boxen [puppet] - 10https://gerrit.wikimedia.org/r/1049962 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:39:47] (03CR) 10JHathaway: [C:03+2] postfix: mx domain aliases [labs/private] - 10https://gerrit.wikimedia.org/r/1049987 (owner: 10JHathaway) [16:39:49] (03CR) 10JHathaway: [V:03+2 C:03+2] postfix: mx domain aliases [labs/private] - 
10https://gerrit.wikimedia.org/r/1049987 (owner: 10JHathaway) [16:42:07] (03CR) 10Ottomata: [C:03+2] Configurably remove varnish handling of /beacon/event [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [16:43:23] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9927398 (10VRiley-WMF) 05In progress→03Open The server has been physically moved from U 42 to 33. No other changes happened (such as CableID) also, power... [16:44:28] !log mnz@deploy1002 Started deploy [airflow-dags/research@1996a7a]: (no justification provided) [16:44:32] !log mnz@deploy1002 Finished deploy [airflow-dags/research@1996a7a]: (no justification provided) (duration: 00m 03s) [16:48:15] PROBLEM - eventlogging Varnishkafka log producer on cp1100 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:50:36] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [16:50:45] T362033: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033 [16:50:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033 [16:51:02] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9927424 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d957387f-e2c5-4ff4-9a63-38c743e151c4) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r... 
[16:51:52] (03Merged) 10jenkins-bot: Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049988 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:52:21] !log disable puppet on A:cp-text [16:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:32] !log xcollazo@deploy1002 Started deploy [analytics/refinery@ca1acb3]: Regular analytics weekly train [analytics/refinery@ca1acb34] [16:54:24] (03Merged) 10jenkins-bot: Skip failing ForeignResourceStructureTest [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1049989 (https://phabricator.wikimedia.org/T362425) (owner: 10Ladsgroup) [16:56:22] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1049574 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [16:57:07] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9927454 (10cmooney) [16:59:14] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:05] swfrench-wmf: May I have your attention please! MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1700) [17:00:37] (03CR) 10Cwhite: [C:03+2] logstash: clean up remnants of logstash200[123] [puppet] - 10https://gerrit.wikimedia.org/r/1049274 (https://phabricator.wikimedia.org/T368327) (owner: 10Cwhite) [17:01:48] here, but holding for there moment. there are some patches that are higher priority that may need deployed instead. 
[17:01:49] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@ca1acb3]: Regular analytics weekly train [analytics/refinery@ca1acb34] (duration: 09m 16s) [17:01:56] (03Merged) 10jenkins-bot: Modify WikiExporter's BATCH_SIZE from 50000 to 10000 [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1049982 (https://phabricator.wikimedia.org/T368098) (owner: 10Ladsgroup) [17:03:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [17:03:30] (03PS1) 10Ottomata: varnishkafka::instance - let service_unit manage service enable param [puppet] - 10https://gerrit.wikimedia.org/r/1049994 (https://phabricator.wikimedia.org/T238230) [17:04:14] Amir1: let me know when you're done deploying your changes, and I can gauge whether there's enough time to fit mine in the infra window. [17:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:04:38] sure [17:04:57] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1049994 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:05:36] xcollazo: backporting now [17:05:51] (03CR) 10Ssingh: [C:03+1] varnishkafka::instance - let service_unit manage service enable param [puppet] - 10https://gerrit.wikimedia.org/r/1049994 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:05:57] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1049982|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], 
[[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] [17:06:05] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [17:06:05] T362425: ForeignResourceStructureTest flaky in CI due to "Failed to download resource at https://codeload.github.com" - https://phabricator.wikimedia.org/T362425 [17:06:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [17:06:59] (03CR) 10Ottomata: [V:03+1 C:03+2] varnishkafka::instance - let service_unit manage service enable param [puppet] - 10https://gerrit.wikimedia.org/r/1049994 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:08:53] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1049982|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:59] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [17:11:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:14] FIRING: [4x] ProbeDown: Service 
aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:49] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1049982|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] (duration: 08m 52s) [17:14:55] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098 [17:14:56] T362425: ForeignResourceStructureTest flaky in CI due to "Failed to download resource at https://codeload.github.com" - https://phabricator.wikimedia.org/T362425 [17:16:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:16:25] FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:32] !log re-enable puppet on A:cp-text [17:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:07] !log mnz@deploy1002 Started deploy [airflow-dags/research@1996a7a]: (no justification provided) [17:17:11] !log mnz@deploy1002 Finished deploy [airflow-dags/research@1996a7a]: (no justification provided) (duration: 00m 03s) [17:18:02] 06SRE, 
06Infrastructure-Foundations, 10Puppet-Core, 06Traffic, 07patch-welcome: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724#9927634 (10Dzahn) status of this ticket in 2024. remaining services using this: [] base::service_unit { 'prometheus-node-exporter': [] base... [17:21:25] FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1049969 (owner: 10Ssingh) [17:22:12] !log xcollazo@deploy1002 Started deploy [analytics/refinery@ca1acb3] (thin): Regular analytics weekly train THIN [analytics/refinery@ca1acb34] [17:22:25] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: dnsbox: add acmechief1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1049969 (owner: 10Ssingh) [17:26:25] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@ca1acb3] (thin): Regular analytics weekly train THIN [analytics/refinery@ca1acb34] (duration: 04m 12s) [17:26:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:26:41] !log xcollazo@deploy1002 Started deploy [analytics/refinery@ca1acb3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ca1acb34] [17:29:31] Amir1: how are things progressing - do you need more time? 
[17:29:35] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@ca1acb3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ca1acb34] (duration: 02m 54s) [17:30:53] swfrench-wmf: sorry I meant to ping you [17:30:57] I am done [17:31:14] Amir1: ack, thanks and no worries [17:31:25] FIRING: [18x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:32:08] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:38] (03PS1) 10Ottomata: Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) [17:34:45] rescheduling my changes for tomorrow's UTC-late infra window, as there's a train window starting in 26m (and I need more time than that) [17:35:27] (03PS5) 10Cathal Mooney: Update aggregate route creation policy for network pops [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [17:36:25] FIRING: [18x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:43] (03PS6) 10Cathal Mooney: Update aggregate route creation policy for network pops [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) [17:37:42] !log mnz@deploy1002 Started deploy [airflow-dags/research@5121748]: (no justification provided) [17:37:51] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - 
https://phabricator.wikimedia.org/T362033#9927686 (10VRiley-WMF) Attempted to swap drives into decomm unit snapshot1009. However, the server wasn't powering up. Suspected issue on that unit and will test with a different decomm server. [17:37:53] !log mnz@deploy1002 Finished deploy [airflow-dags/research@5121748]: (no justification provided) (duration: 00m 11s) [17:38:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T364069)', diff saved to https://phabricator.wikimedia.org/P65486 and previous config saved to /var/cache/conftool/dbconfig/20240626-173810-marostegui.json [17:38:16] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:39:12] !log sudo cumin "A:dnsbox" "run-puppet-agent" [17:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:36] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:40:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:40:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS bullseye [17:40:30] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927697 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS bulls... 
[17:40:45] (03CR) 10Scott French: "FYI, deferred until the 6-27 UTC-late infra window due a conflict with higher priority changes for T368098." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049607 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:41:25] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:43:07] (03PS2) 10Ottomata: Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) [17:43:24] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5018.eqsin.wmnet [17:44:03] (03CR) 10BCornwall: [C:03+2] cp5019: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049170 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [17:44:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927710 (10BCornwall) [17:45:07] (03CR) 10Ssingh: [C:03+1] "🚢 it" [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:45:07] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:45:32] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:25] FIRING: [19x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:31] !log disable puppet in A:cp-text [17:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:47] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368564 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:46:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368564 (10ops-monitoring-bot) 03NEW [17:49:51] (03CR) 10Ottomata: [V:03+1 C:03+2] Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1050000 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:51:12] !log disabling varnishkafka-eventlogging and varnish /beacon/event handling on cache text nodes. Puppet is disabled on all cache text, will test a few at a time first. 
- T238230 [17:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:18] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [17:53:12] (03PS1) 10Michael Große: Homepage: log rendering time for each module and each wiki [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050002 (https://phabricator.wikimedia.org/T368405) [17:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P65487 and previous config saved to /var/cache/conftool/dbconfig/20240626-175317-marostegui.json [17:54:14] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050002 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [17:55:34] (03PS1) 10Eevans: aptrepo: remove component/cassandra311 (no longer needed) [puppet] - 10https://gerrit.wikimedia.org/r/1050004 (https://phabricator.wikimedia.org/T354970) [17:56:25] FIRING: [18x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:53] (03PS1) 10Michael Große: Homepage: don't load yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 
(https://phabricator.wikimedia.org/T368405) [17:57:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [17:57:46] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9927760 (10Dzahn) Hi @DSmit-WMF the request has been approved by the group owner of deployment. Now we just still need manager approval and then w... [17:58:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9927764 (10Dzahn) And while at it, please also point him to the other comment from Andre at T368159#9917785 [17:58:35] !log sudo cumin -b1 -s30 "A:cp-text" "run-puppet-agent" [17:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:22] !log sudo cumin -b10 "A:cp-text" "run-puppet-agent" [17:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] jeena and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T1800). 
[18:00:40] (03CR) 10Ssingh: [C:03+2] conftool-data: remove ntp service [puppet] - 10https://gerrit.wikimedia.org/r/1048067 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [18:01:23] (03PS2) 10Clare Ming: extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) [18:01:25] FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:01:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9927777 (10Dzahn) Thanks! Gotcha. We are waiting for approval from data engineering. [18:02:02] (03CR) 10Clare Ming: "1.43.0-wmf.11 has been cut - does this need to wait until all groups are on it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [18:04:14] FIRING: [6x] ProbeDown: Service aqs1013-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:44] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050007 (https://phabricator.wikimedia.org/T128546) [18:04:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9927779 (10Dzahn) 05Open→03In progress p:05Triage→03High a:03Dzahn [18:04:57] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash200[123] - https://phabricator.wikimedia.org/T368327#9927783 (10colewhite) [18:05:03] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash200[123] - https://phabricator.wikimedia.org/T368327#9927790 (10colewhite) 05In progress→03Open [18:05:32] RESOLVED: [5x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:05:42] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash200[123] - https://phabricator.wikimedia.org/T368327#9927791 (10colewhite) a:05colewhite→03None [18:05:53] (03PS1) 10Ottomata: MediaWikiPingback is now on event platform. Use eventlogging_legacy refine job [puppet] - 10https://gerrit.wikimedia.org/r/1050008 (https://phabricator.wikimedia.org/T323828) [18:06:24] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@5121748]: Deploying latest DAGs to analytics Airflow instance. 
[18:06:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:03] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@5121748]: Deploying latest DAGs to analytics Airflow instance. (duration: 00m 39s) [18:08:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P65488 and previous config saved to /var/cache/conftool/dbconfig/20240626-180824-marostegui.json [18:09:00] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050010 (https://phabricator.wikimedia.org/T366956) [18:09:02] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050010 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:09:42] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050010 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:10:08] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:12:40] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5019.eqsin.wmnet [18:14:56] !log # etcdctl --username root --endpoints https://conf1007.eqiad.wmnet:4001 rmdir /conftool/v1/pools/${site}/dnsbox/ntp: T366360 [18:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:01] T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360 [18:15:28] 06SRE, 06Data-Engineering, 
10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9927826 (10cwylo) >>! In T368027#9913044, @kamila wrote: > @cwylo Can you please confirm that you have read the [[ https://wikitech.wikimedia.org/wiki/Analytics... [18:15:57] (03PS1) 10JHathaway: postfix: add recipient discards from exim [puppet] - 10https://gerrit.wikimedia.org/r/1050013 (https://phabricator.wikimedia.org/T325406) [18:16:11] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3078/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050008 (https://phabricator.wikimedia.org/T323828) (owner: 10Ottomata) [18:16:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050013 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:17:38] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.11 refs T366956 [18:17:44] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [18:18:24] (03CR) 10Muehlenhoff: [C:03+1] "We don't strictly need to, usually these get garbage collected when we retire a full distro (like soon buster). But the patch is correct, " [puppet] - 10https://gerrit.wikimedia.org/r/1050004 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [18:19:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS bullseye [18:19:32] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS b... 
[18:22:26] (03CR) 10JHathaway: [C:03+2] postfix: add recipient discards from exim [puppet] - 10https://gerrit.wikimedia.org/r/1050013 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:23:13] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [18:23:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T364069)', diff saved to https://phabricator.wikimedia.org/P65489 and previous config saved to /var/cache/conftool/dbconfig/20240626-182333-marostegui.json [18:23:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [18:23:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:23:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [18:23:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T364069)', diff saved to https://phabricator.wikimedia.org/P65490 and previous config saved to /var/cache/conftool/dbconfig/20240626-182355-marostegui.json [18:25:29] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove ntp.anycast.wmnet - sukhe@cumin1002" [18:26:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove ntp.anycast.wmnet - sukhe@cumin1002" [18:26:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:28] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:54] 06SRE, 06Traffic: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9927897 (10ssingh) 05Open→03Resolved a:03ssingh This was rolled out to all 
2166 hosts today that are now using `ntp-[abc].anycast.wmnet`. All traces of `ntp.anycast.wm... [18:36:30] (03PS1) 10Ottomata: Remove profile::cache::kafka::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) [18:36:32] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:38:06] (03CR) 10Ottomata: "Let's wait a day before we merge this, in case we want to easily revert the previous patch." [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:38:16] (03CR) 10Eevans: "Will do!" [puppet] - 10https://gerrit.wikimedia.org/r/1050004 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [18:38:21] (03CR) 10Eevans: [C:03+2] aptrepo: remove component/cassandra311 (no longer needed) [puppet] - 10https://gerrit.wikimedia.org/r/1050004 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [18:38:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:43:06] (03CR) 10Scott French: "Thanks, Janis!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [18:43:09] (03CR) 10Scott French: [C:03+2] kubernetes: promote unavailable replicas alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [18:43:34] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 9 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:44:21] (03Merged) 10jenkins-bot: kubernetes: promote unavailable replicas alert to critical [alerts] - 10https://gerrit.wikimedia.org/r/1049627 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [18:46:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049972 (https://phabricator.wikimedia.org/T366364) (owner: 10Jdlrobson) [18:53:31] Rolling back the train due to higher than normal DB query error rates [18:54:07] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050021 (https://phabricator.wikimedia.org/T366956) [18:54:13] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050021 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:54:56] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050021 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [18:55:51] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5019.eqsin.wmnet with OS bullseye [18:56:04] 06SRE, 06serviceops, 10Data Products 
(Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9927952 (10Scott_French) @SGupta-WMF - Ahmon merged [0] this morning, so you s... [18:56:05] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS bullseye [18:56:06] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927953 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bulls... [18:56:22] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9927954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS b... 
[18:56:26] (03PS1) 10Eevans: Default Cassandra clusters back to target_version '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1050022 (https://phabricator.wikimedia.org/T354970) [18:59:00] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050022 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [19:00:07] (03PS2) 10Eevans: Default Cassandra clusters back to target_version '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1050022 (https://phabricator.wikimedia.org/T354970) [19:02:51] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.11 refs T366956 [19:02:57] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [19:04:54] (03PS1) 10Ottomata: Revert "Disable varnish handling of /beacon/event to decommission eventlogging backend" [puppet] - 10https://gerrit.wikimedia.org/r/1050027 [19:04:58] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050022 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [19:07:15] (03PS2) 10Ottomata: Revert "Disable varnish handling of /beacon/event to decommission eventlogging backend" [puppet] - 10https://gerrit.wikimedia.org/r/1050027 [19:10:05] (03CR) 10Ssingh: [C:03+1] Revert "Disable varnish handling of /beacon/event to decommission eventlogging backend" [puppet] - 10https://gerrit.wikimedia.org/r/1050027 (owner: 10Ottomata) [19:10:21] (03CR) 10Ssingh: [V:03+2 C:03+2] Revert "Disable varnish handling of /beacon/event to decommission eventlogging backend" [puppet] - 10https://gerrit.wikimedia.org/r/1050027 (owner: 10Ottomata) [19:10:32] (03CR) 10Eevans: [C:03+2] Default Cassandra clusters back to target_version '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1050022 (https://phabricator.wikimedia.org/T354970) (owner: 10Eevans) [19:10:32] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:24] !log re-enabling varnishkafka-eventlogging and varnish /beacon/event handling on cache text nodes. /beacon/event/ redirects which breaks the MediaWikiPingback usage - T238230 [19:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:29] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [19:12:54] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:32] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1050031 [19:14:39] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (owner: 10CDanis) [19:16:32] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:17:06] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1050031 [19:17:09] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (owner: 10CDanis) [19:18:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:03] (03PS3) 10CDanis: haproxy: add notion of trusted IP space [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) [19:20:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:23:46] (03PS1) 10JHathaway: Revert^2 "mw: change mail_host" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050032 [19:23:51] (03PS2) 10JHathaway: Revert^2 "mw: change mail_host" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050032 [19:27:38] (03PS4) 10CDanis: haproxy: add notion of trusted IP space [puppet] - 10https://gerrit.wikimedia.org/r/1050031 
(https://phabricator.wikimedia.org/T368557) [19:27:39] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1050034 [19:27:45] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (owner: 10CDanis) [19:28:37] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [19:30:15] (03PS2) 10CDanis: haproxy: don't allow x-req-id from outside [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) [19:32:08] (03CR) 10Vgutierrez: "looking good, see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:33:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [19:33:30] (03PS5) 10CDanis: haproxy: add notion of trusted IP space [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) [19:33:30] (03PS3) 10CDanis: haproxy: don't allow x-req-id from outside [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) [19:33:50] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:33:54] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:33:58] (03CR) 10JHathaway: [C:03+2] Revert^2 "mw: change mail_host" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050032 (owner: 10JHathaway) [19:34:12] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:35:33] (03Merged) 10jenkins-bot: Revert^2 "mw: change mail_host" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050032 (owner: 
10JHathaway) [19:36:08] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:37:40] (03CR) 10Vgutierrez: [C:03+1] "thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:37:55] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050036 (https://phabricator.wikimedia.org/T366956) [19:37:57] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050036 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [19:38:40] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050036 (https://phabricator.wikimedia.org/T366956) (owner: 10TrainBranchBot) [19:39:31] !log jhathaway@deploy1002 Started scap: (no justification provided) [19:39:36] (03CR) 10Vgutierrez: [C:03+1] haproxy: don't allow x-req-id from outside [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:40:57] !log jhathaway@deploy1002 Finished scap: (no justification provided) (duration: 02m 38s) [19:42:27] jhathaway: Next time you would like to use scap during the train window it would be helpful if you could ping one of the train conductors just in case [19:43:01] sorry jeena that was my ignorance on proper protocol, will do [19:43:06] next time [19:43:08] np, thank you! 
[19:46:50] (03CR) 10CDanis: [C:03+2] haproxy: add notion of trusted IP space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050031 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:46:56] (03CR) 10CDanis: [C:03+2] haproxy: don't allow x-req-id from outside [puppet] - 10https://gerrit.wikimedia.org/r/1050034 (https://phabricator.wikimedia.org/T368557) (owner: 10CDanis) [19:48:49] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.11 refs T366956 [19:48:54] T366956: 1.43.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T366956 [19:52:54] (03PS1) 10Eevans: cassandra: remove support for 2.x versions [puppet] - 10https://gerrit.wikimedia.org/r/1050041 [19:55:36] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050041 (owner: 10Eevans) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T2000). [20:00:05] Katherine_g, jan_drewniak, MichaelG_WMF, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] * MichaelG_WMF is here [20:00:26] here [20:00:55] o/ I can self-deploy my portals patch [20:03:14] present [20:04:14] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:24] I added a last minute patch, can self deploy [20:04:43] well, that alert is not the common one.. I will take a look [20:04:53] oh, nevermind, the old server.. 
duh [20:05:00] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5019.eqsin.wmnet with OS bullseye [20:05:09] putting servers into "insetup" does not remove all the timers [20:05:12] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bulls... [20:06:14] hi - i can deploy if no one else has shown up yet [20:07:23] i'll do katherine_g's patch and then ping jan_drewniak [20:07:44] (03PS4) 10Kgraessle: Update QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) [20:08:20] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet [20:08:24] (03CR) 10BCornwall: [C:03+2] cp5020: update hieradata for dual NVMe disks configuration [puppet] - 10https://gerrit.wikimedia.org/r/1049171 (https://phabricator.wikimedia.org/T365763) (owner: 10Ssingh) [20:08:56] !log lists1001:/lib/systemd/system# rm wmf_auto_restart_apache2.* ; systemctl reset-failed - reaction to monitoring alert "FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100" [20:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:03] MichaelG_WMF: can your 2 patches go out together? [20:09:18] jan_drewniak: are you able to deploy mine too? [20:09:28] cjming: yes they can, I think [20:10:16] oh hi Jdlrobson! didn't notice you had a patch in as well. I can deploy that as well. 
[20:10:21] thanks mutante [20:10:32] RESOLVED: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:11:36] ok - so after we do the first patch, i'll pass to you jan_drewniak to do your + Jon's patches [20:11:44] (03Merged) 10jenkins-bot: Update QuickSurvey coverage rate for Automoderator patroller workstream survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049947 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:12:05] cjming: sounds good [20:12:05] jan_drewniak: please ping me when you're done and i can do MichaelG_WMF's backports [20:12:18] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1049947|Update QuickSurvey coverage rate for Automoderator patroller workstream survey (T362969)]] [20:12:24] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:13:21] MichaelG_WMF: i'm going to merge both your patches now since I suspect CI will take 20+ mins (while Jan deploys his stuff) and then scap backport yours together [20:13:43] cjming: yes that makes sense. 
Thank you 🙏 [20:14:09] (03CR) 10Clare Ming: [C:03+2] Homepage: log rendering time for each module and each wiki [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050002 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [20:14:12] (03CR) 10Clare Ming: [C:03+2] Homepage: don't load yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [20:14:51] !log cjming@deploy1002 cjming, kgraessle: Backport for [[gerrit:1049947|Update QuickSurvey coverage rate for Automoderator patroller workstream survey (T362969)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:13] katherine_g: can i sync? [20:15:42] yep [20:15:48] !log cjming@deploy1002 cjming, kgraessle: Continuing with sync [20:18:49] jan_drewniak: you might see the GrowthExperiment's 2 patches in git if they merge before you get your + Jon's patches fully deployed -- i think that's ok tho - at least i've just plowed ahead in the past when that has happened before and no one gave me grief [20:20:19] cjming: no problem [20:20:32] cool [20:21:04] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1049947|Update QuickSurvey coverage rate for Automoderator patroller workstream survey (T362969)]] (duration: 08m 46s) [20:21:10] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:21:15] jan_drewniak: all yours [20:21:24] thanks clare! [20:21:39] you're welcome katherine_g - should be live! [20:22:12] cjming: thanks! [20:22:20] np! 
lmk when you're done [20:22:40] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050007 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:23:00] (03CR) 10Jdrewniak: [C:03+2] Enable user pages and select special pages in dark mode (1.43.0-wmf.11) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049972 (https://phabricator.wikimedia.org/T366364) (owner: 10Jdlrobson) [20:23:24] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050007 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:23:47] (03Merged) 10jenkins-bot: Enable user pages and select special pages in dark mode (1.43.0-wmf.11) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049972 (https://phabricator.wikimedia.org/T366364) (owner: 10Jdlrobson) [20:24:27] tgr: if all goes well, i'll ping you to self-deploy after Jan is done or after the Michael's backports - whichever comes first [20:27:25] (03PS1) 10Dzahn: admin: convert dmuthuri from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050049 (https://phabricator.wikimedia.org/T367872) [20:28:07] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5020.eqsin.wmnet [20:29:43] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9928408 (10Dzahn) 05In progress→03Stalled [20:29:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9928409 (10Dzahn) p:05Triage→03High [20:30:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9928410 (10Dzahn) p:05Triage→03High [20:32:21] 06SRE, 10SRE-Access-Requests, 
06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928412 (10Dzahn) 05Open→03Stalled p:05Triage→03High [20:33:47] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1050007| Bumping portals to master (T128546)]] (duration: 07m 27s) [20:33:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:37:15] cjming the second of the two changes has failed in Gate-and-Submit: flaky network when installing npm dependencies [20:37:29] ugh - bummer [20:38:33] jan_drewniak: ready for testing? [20:38:37] ok - i'll backport separately then -- and re-merge 2nd -- maybe in between tgr can deploy his config patch [20:39:34] Jdlrobson: the patch is merged and on mwdebug, but I'm still syncing the portal change. Should be testable now though [20:39:48] cjming: thank you 👍 [20:40:14] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9928460 (10Dzahn) Hello @Sharvaniharan I see there is a Wikitech user called "Sharvaniharan" but it doesn't use the work E-mail address from WMF. Vice versa when checking for a user using you... [20:40:46] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1050007| Bumping portals to master (T128546)]] (duration: 06m 58s) [20:40:46] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368566#9928473 (10Dzahn) a:03Sharvaniharan [20:40:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:41:28] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:41:51] looking! 
[20:42:24] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1049972|Enable user pages and select special pages in dark mode (1.43.0-wmf.11) (T366364 T366375 T367375 T367581 T367582 T367583)]] [20:42:37] T366364: Enable night theme on user pages - https://phabricator.wikimedia.org/T366364 [20:42:38] T366375: Special:QrCode and Special:UrlShortener has night mode issues - https://phabricator.wikimedia.org/T366375 [20:42:38] T367375: Search page doesn't work in dark mode - https://phabricator.wikimedia.org/T367375 [20:42:38] T367581: Dark mode is not available on Special:MathStatus - https://phabricator.wikimedia.org/T367581 [20:42:39] T367582: Dark mode is not available on Special:GlobalRenameRequest - https://phabricator.wikimedia.org/T367582 [20:42:39] T367583: Dark mode is not available on Special:Block - https://phabricator.wikimedia.org/T367583 [20:42:40] wow - i think 28+ mins for a merge is a new record [20:43:26] I'm *really* looking forward to that change to run the phpunit tests in parallel [20:43:32] jan_drewniak: LGTM! please sync! [20:43:54] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:44:13] Jdlrobson: Ok I'll go ahead with the sync [20:44:55] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1049972|Enable user pages and select special pages in dark mode (1.43.0-wmf.11) (T366364 T366375 T367375 T367581 T367582 T367583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:11] tgr: do you want me to do your config patch? 
i can do it in between the 2 GrowthExperiments backports since we'll have to wait for the 2nd patch to merge again [20:45:23] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Continuing with sync [20:47:05] (03Merged) 10jenkins-bot: Homepage: log rendering time for each module and each wiki [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050002 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [20:47:14] finally! [20:47:17] finally! [20:47:32] jinx - 32 mins! [20:47:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [20:48:36] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928552 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS b... [20:49:17] (03PS1) 10JHathaway: postfix: add gitlab to recipient discards [puppet] - 10https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) [20:49:52] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:49:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050058 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [20:50:14] (03CR) 10CI reject: [V:04-1] Homepage: don't load yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [20:50:34] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1049972|Enable user pages and select special pages in dark mode (1.43.0-wmf.11) (T366364 T366375 T367375 T367581 T367582 T367583)]] (duration: 08m 09s) [20:50:46] (03PS2) 10Michael Große: Homepage: don't load 
yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) [20:50:47] cjming: ok I'm all done [20:50:48] T366364: Enable night theme on user pages - https://phabricator.wikimedia.org/T366364 [20:50:49] T366375: Special:QrCode and Special:UrlShortener has night mode issues - https://phabricator.wikimedia.org/T366375 [20:50:50] T367375: Search page doesn't work in dark mode - https://phabricator.wikimedia.org/T367375 [20:50:51] T367581: Dark mode is not available on Special:MathStatus - https://phabricator.wikimedia.org/T367581 [20:50:51] T367582: Dark mode is not available on Special:GlobalRenameRequest - https://phabricator.wikimedia.org/T367582 [20:50:52] T367583: Dark mode is not available on Special:Block - https://phabricator.wikimedia.org/T367583 [20:50:55] jan_drewniak: thanks! [20:51:50] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] [20:51:55] T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec) - https://phabricator.wikimedia.org/T368405 [20:54:14] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1105-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:55:16] !log cjming@deploy1002 cjming, migr: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:55:23] MichaelG_WMF: 1st patch on mwdebug if it's testable [20:55:31] I'll have a look [20:56:40] (03PS7) 10Scott French: services: add commons-impact-analytics service helmfile configs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [20:56:40] (03PS7) 10Scott French: rest-gateway: route commons-impact via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [20:57:39] cjming: I'm seeing the data in graphite, Thanks! [20:57:46] cool - syncing [20:57:49] !log cjming@deploy1002 cjming, migr: Continuing with sync [20:59:15] (03CR) 10Clare Ming: [C:03+2] Homepage: don't load yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [20:59:43] 🤞 [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T2100) [21:00:43] gah - am i all right to extend the backport window? [21:01:10] 2 patches left - one config + one backport (which might take a bit to merge) [21:01:42] (03CR) 10Scott French: "We almost have an image, so I figured now is a good time to send this out. Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [21:02:01] If that's not possible then the backport can also be moved to a window tomorrow. It would have been nice to get it out today, but it is not essential [21:02:08] PROBLEM - Check whether ferm is active by checking the default input chain on parse2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:03:43] (03CR) 10Scott French: "FYI, I don't intend to merge / apply this until the Data Products folks confirm they're ready for the endpoints to go live. 
Thanks in adva" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [21:03:43] MichaelG_WMF: let's see if the Wikifunction Services folks need the window -- if not, happy to get it out - it's just a test of our patience waiting for CI [21:04:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:04:25] 🧘 [21:05:51] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] (duration: 14m 01s) [21:05:57] T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec) - https://phabricator.wikimedia.org/T368405 [21:06:33] MichaelG_WMF: 1st patch should be live - waiting on 2nd [21:06:57] cjming: Thanks [21:06:59] in the meantime, tgr are you around? i can do yours while waiting or happy to let you self-deploy [21:09:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9928609 (10Dzahn) Ok, thanks for confirming both things! Sounds good. I also made a mistake there in my original question. When I mentioned the "wmf" group that... 
[21:09:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9928614 (10Dzahn) [21:09:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9928615 (10Dzahn) [21:10:56] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:10:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:11:08] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9928619 (10Dzahn) @MoritzMuehlenhoff I _think_ this is complete? Except we might want to follow-up regarding the part where the offboard scr... [21:11:16] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9928620 (10Dzahn) p:05Triage→03Medium [21:12:03] (03PS1) 10JHathaway: frtech: Update civicrm email server [puppet] - 10https://gerrit.wikimedia.org/r/1050061 (https://phabricator.wikimedia.org/T329882) [21:12:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050061 (https://phabricator.wikimedia.org/T329882) (owner: 10JHathaway) [21:12:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9928622 (10Ottomata) Approved. 
[21:13:38] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5020.eqsin.wmnet with OS bullseye [21:13:47] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bulls... [21:13:57] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS bullseye [21:14:08] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS b... [21:14:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9928626 (10Ottomata) Approved. [21:14:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9928629 (10Ottomata) Approved [21:15:26] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9928635 (10Dzahn) Hello @Soda, this is still pending. Based on the tickets you linked that you work on, maybe you can ask @Samwalton9-WMF to be your sponsor? Do you normally chat with any WM... 
[21:15:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928632 (10Ottomata) Approved [21:17:43] jouncebot: now [21:17:44] For the next 0 hour(s) and 42 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240626T2100) [21:18:04] time hurtles at light speed in every part of my experience except for when waiting for CI to finish [21:18:24] oh CI merely waits for tests to complete [21:18:25] :D [21:18:38] sorry - yes - tests [21:18:46] but yeah it is troublesome [21:19:16] but there are sooo many of those. And they run all one after another [21:20:41] yeah [21:20:53] the wmf-quibble jobs take 25+ minutes nowadays [21:21:43] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928641 (10Dzahn) Thanks! Confirmed user with check_user script Now has all needed approvals. Uploading change to code review. [21:21:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928645 (10Dzahn) 05Stalled→03In progress [21:22:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928646 (10Dzahn) a:03Dzahn [21:22:41] hashar: not sure who is part of Wikifunctions Services but is it ok that the backport window goes over? just 1, possibly 2, more patches [21:23:05] no idea [21:23:09] lol [21:23:11] James_F I guess [21:23:32] but if you ask, I am usually letting the window to extend [21:23:36] Yeah it’s fine. [21:23:46] as long as the next persons are aware of it [21:23:48] Thanks! [21:23:49] thanks James_F [21:23:54] I am gonna restart Jenkins [21:23:56] I’ve got deploys to do but I can go in parallel. 
[21:24:50] (03CR) 10JHathaway: [C:03+2] frtech: Update civicrm email server [puppet] - 10https://gerrit.wikimedia.org/r/1050061 (https://phabricator.wikimedia.org/T329882) (owner: 10JHathaway) [21:25:14] hashar wait! we have a change to finish [21:25:35] it should be done any second [21:25:41] I am waiting for that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1050005 to merge [21:25:49] the jenkins restarts take just a minute or so [21:25:50] cool thanks [21:25:55] Thanks [21:26:11] and Zuul will wait for it to be back up [21:26:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928655 (10Dzahn) Can I assume you mean "private data in Superset" and not just "Superset, dashboards without private data"? Because those are different requ... [21:27:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928657 (10Kgraessle) >>! In T367747#9928655, @Dzahn wrote: > Can I assume you mean "private data in Superset" and not just "Superset, dashboards without pri... [21:27:42] it's like watching a flower grow - any minute now [21:28:01] (03PS1) 10Dzahn: admin: convert kgraessle from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) [21:29:00] (03Merged) 10jenkins-bot: Homepage: don't load yesterdays edits on desktop [extensions/GrowthExperiments] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050005 (https://phabricator.wikimedia.org/T368405) (owner: 10Michael Große) [21:29:03] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9928661 (10Dzahn) Alright, thanks. Code change uploaded. You can expect this to be merged tomorrow. 
[21:29:09] that got merged [21:29:22] thank goodness [21:29:22] !log restarting CI Jenkins [21:29:23] whee [21:29:24] (03PS2) 10Dzahn: admin: convert kgraessle from ldap_only to analytics-privatedata, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1050062 (https://phabricator.wikimedia.org/T367747) [21:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:41] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] [21:29:46] T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec) - https://phabricator.wikimedia.org/T368405 [21:32:08] RECOVERY - Check whether ferm is active by checking the default input chain on parse2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:32:21] !log cjming@deploy1002 cjming, migr: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:26] MichaelG_WMF: ok to sync? [21:32:32] Jenkins is back [21:33:12] yep [21:33:17] !log cjming@deploy1002 cjming, migr: Continuing with sync [21:33:28] works as expected :D [21:33:35] nice [21:34:28] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9928699 (10Soda) >>! In T366032#9928634, @Dzahn wrote: > Hello @Soda, > > this is still pending. Based on the tickets you linked that you work on, maybe you can ask @Samwalton9-WMF to be you... [21:38:29] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] (duration: 08m 48s) [21:38:34] T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec) - https://phabricator.wikimedia.org/T368405 [21:38:45] MichaelG_WMF: should be live! [21:39:48] can confirm! [21:39:57] yay! 
[21:40:02] Thank you so much for sticking with it! [21:40:15] you're welcome - glad it all worked out [21:40:21] tgr: finally done - looks like you're away and since we're way over, i think i'll close the window [21:40:34] !log end of UTC late backport window [21:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:44] James_F: all yours - thanks for your patience [21:42:35] Thanks! [21:46:43] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:46:45] (03PS2) 10Eevans: cassandra: remove support for 2.x versions [puppet] - 10https://gerrit.wikimedia.org/r/1050041 [21:50:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:50:42] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9928727 (10bd808) >>! In T368136#9925649, @fnegri wrote: >> there's some data data there that we filter via the views and not only via sanitarium, bu... [21:57:56] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9928743 (10Dzahn) Great! Let us know how it goes. Ideally let one of them just comment here on the ticket please. We will then make sure this moves forward soon. [21:57:57] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9928742 (10bd808) >>! In T368136#9924910, @Marostegui wrote: > So in terms of data, my recap is: > - non-public data (such as suppressed edits or ban...
[22:02:04] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9928764 (10jhathaway) [22:06:11] (03PS1) 10Andrew Bogott: codfw1dev: update bastion IP [puppet] - 10https://gerrit.wikimedia.org/r/1050068 (https://phabricator.wikimedia.org/T368341) [22:06:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9928791 (10Dzahn) 05Stalled→03In progress a:03Dzahn [22:09:31] (03PS1) 10Dzahn: admin: add dreamyjazz to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050069 (https://phabricator.wikimedia.org/T368260) [22:11:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for WBrown (WMF) - https://phabricator.wikimedia.org/T368260#9928799 (10Dzahn) [22:13:42] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: update bastion IP [puppet] - 10https://gerrit.wikimedia.org/r/1050068 (https://phabricator.wikimedia.org/T368341) (owner: 10Andrew Bogott) [22:14:22] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9928803 (10Dzahn) 05Stalled→03In progress a:03Dzahn [22:15:02] (03PS1) 10Dzahn: admin: add jsn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1050070 (https://phabricator.wikimedia.org/T367295) [22:18:44] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093#9928813 (10Andrew) 
05Open→03Resolved thank you! I've put this back in service; we'll see if it cooks again. [22:19:09] (03PS1) 10Cathal Mooney: Do add QoS configuration for fasw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1050071 (https://phabricator.wikimedia.org/T339850) [22:19:59] (03CR) 10Cathal Mooney: [C:03+2] Do add QoS configuration for fasw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1050071 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [22:20:32] (03Merged) 10jenkins-bot: Do add QoS configuration for fasw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1050071 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [22:22:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS bullseye [22:23:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928841 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bulls... 
[22:24:25] (03PS1) 10JHathaway: postfix: vtrs, use cname [puppet] - 10https://gerrit.wikimedia.org/r/1050072 (https://phabricator.wikimedia.org/T329882) [22:24:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T364069)', diff saved to https://phabricator.wikimedia.org/P65493 and previous config saved to /var/cache/conftool/dbconfig/20240626-222434-marostegui.json [22:24:43] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:26:02] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5020.eqsin.wmnet [22:26:23] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9928849 (10Dzahn) a:05Dzahn→03Urbanecm Let's chat abou... [22:26:29] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928853 (10BCornwall) [22:33:56] cjming: sorry had to go afk and then forgot about it! in any case, the patch wasn't urgent. [22:35:29] tgr: no worries! 
i thought about just doing it but we were running late already [22:35:30] (03CR) 10Gergő Tisza: [beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [22:39:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050072 (https://phabricator.wikimedia.org/T329882) (owner: 10JHathaway) [22:39:24] (03PS1) 10Dzahn: delete cache.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) [22:39:37] (03PS2) 10Dzahn: delete cache.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) [22:39:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P65494 and previous config saved to /var/cache/conftool/dbconfig/20240626-223944-marostegui.json [22:39:56] (03CR) 10Dzahn: "isn't this a fun title :)" [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn) [22:41:07] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5021.eqsin.wmnet [22:44:05] (03CR) 10Dzahn: "Oh! I searched wikitech and I found something! This was a thing before 2005 :) the 4th SAL ever mentions it. That was _imported_ in Septem" [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn) [22:46:00] (03CR) 10Dzahn: "mentioned by Mark on June 23 2005" [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn) [22:46:29] (03CR) 10JHathaway: [C:03+2] postfix: vtrs, use cname [puppet] - 10https://gerrit.wikimedia.org/r/1050072 (https://phabricator.wikimedia.org/T329882) (owner: 10JHathaway) [22:47:28] andrewbogott: shall I merge your puppet change? 
[22:47:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS bullseye [22:47:56] yes please [22:48:06] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS b... [22:48:31] great, done [22:54:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P65495 and previous config saved to /var/cache/conftool/dbconfig/20240626-225451-marostegui.json [23:00:04] (03PS1) 10JHathaway: vrts: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) [23:00:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [23:02:15] (03PS2) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [23:02:24] (03CR) 10CI reject: [V:04-1] Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) (owner: 10Jdlrobson) [23:02:28] (03PS3) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [23:02:33] (03PS4) 10Jdlrobson: Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366384) [23:02:36] (03CR) 10CI reject: [V:04-1] Enable special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049975 
(https://phabricator.wikimedia.org/T366384) (owner: 10Jdlrobson) [23:03:30] 10ops-eqdfw, 06SRE, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9928914 (10Papaul) 05Open→03Resolved a:03Papaul Order # 1-235341265861 was created on June 12th at 21:53 and it was resolved without the issue been fixed. I open another order # 1-2... [23:04:35] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1050076 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [23:09:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T364069)', diff saved to https://phabricator.wikimedia.org/P65496 and previous config saved to /var/cache/conftool/dbconfig/20240626-230958-marostegui.json [23:10:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:10:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [23:10:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:10:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T364069)', diff saved to https://phabricator.wikimedia.org/P65497 and previous config saved to /var/cache/conftool/dbconfig/20240626-231020-marostegui.json [23:20:31] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [23:23:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [23:26:07] !log people1004 - stopped confd which logs every 3 seconds that it can't find any templates (T356296) [23:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:12] T356296: confd setup left without configuration doesn't stop confd - 
https://phabricator.wikimedia.org/T356296 [23:33:16] (03CR) 10Ssingh: [C:03+1] delete cache.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1050075 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn) [23:36:04] (03PS1) 10Dzahn: peopleweb: set profile::firewall::defs_from_etcd to false [puppet] - 10https://gerrit.wikimedia.org/r/1050080 (https://phabricator.wikimedia.org/T356296) [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050081 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050081 (owner: 10TrainBranchBot) [23:55:37] (03CR) 10Dzahn: "compiler shows this does indeed remove the entire confd stuff plus the prometheus exporter for it etc.. This means no requestctl rules on " [puppet] - 10https://gerrit.wikimedia.org/r/1050080 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn) [23:56:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5021.eqsin.wmnet with OS bullseye [23:56:49] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9928977 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bulls... [23:59:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050081 (owner: 10TrainBranchBot)