[00:00:13] SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247 (Ladsgroup) NEW
[00:02:06] SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096575 (Ladsgroup) rsyslog logs don't give anything useful. It turns on, immediately segfaults, tries again and so on. This showed up only once: ` Aug 18 23:44:07 ms-be1071 rsyslogd[3526091]: fatal...
[00:06:14] swfrench-wmf/zabe: The last scap deployment before zabe's was a security patch deployment which uses sync-file which skips l10n stuff. Zabe things should be normal after this deployment but let me know if not.
[00:08:03] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765
[00:08:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765 (owner: TrainBranchBot)
[00:08:05] dancy: ah, there we go.
yeah, that would be consistent with potentially leaving latent changes that could trigger a full build, which succeeded for -81 on z.abe's first attempt, but failed for -83 due to the disk ussye
[00:08:09] *issue
[00:10:08] yes, `539 languages rebuilt out of 539` at 23:04:15.217 from scap logs
[00:11:30] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] (duration: 37m 15s)
[00:11:35] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[00:12:28] Ah, so it rebuilt l10n on my try
[00:12:32] * first try
[00:12:53] Which is why it did not do it on the current sync
[00:13:02] exactly, yeah
[00:17:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:23:17] !log Clearing corrupted logs on ms-be1071 - T402247
[00:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:21] T402247: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247
[00:30:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765 (owner: TrainBranchBot)
[00:31:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:33:09] SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096653 (andrea.denisse) I think that the drive 
is failing: `sudo dmesg | grep -i 'error\|fail\|ata'`: ` [37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags...
[00:36:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:40:38] I need to revert a security patch that was deployed a few hours earlier during the security deployment window
[00:40:48] Would that conflict with anything anyone is doing?
[00:40:58] it's causing an unbreak now issue
[00:41:07] jouncebot: nowandnext
[00:41:07] No deployments scheduled for the next 1 hour(s) and 18 minute(s)
[00:41:07] In 1 hour(s) and 18 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0200)
[00:41:13] maryum: I think you're clear
[00:41:23] thanks!
[00:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:42:26] SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096685 (andrea.denisse) >>! In T402247#11096653, @andrea.denisse wrote: > I think that the drive is failing: > > `sudo dmesg | grep -i 'error\|fail\|ata'`: > ` > [37968478.484217] blk_update_request... 
[00:45:39] (PS3) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592)
[00:45:53] (CR) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: Krinkle)
[00:49:42] okay about to run scap to undeploy security fix
[00:50:37] scap running now
[01:03:16] scap finished
[01:03:35] !log undeploy security fix for T397396
[01:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376)
[01:08:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[01:22:40] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[01:24:18] (PS6) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:26:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:27:16] (PS7) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:27:42] (PS8) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain 
[puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:27:50] (CR) Krinkle: "Done" [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: Krinkle)
[01:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0200)
[02:01:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[02:06:07] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[02:15:28] huh, that spike in logic successes is massive
[02:15:51] login*
[02:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:33:14] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:37:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:42:59] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0300) [03:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:13:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:25:38] (CR) BCornwall: [C:+2] hiera: ncmonitor: add wikimedia.ee to ignored_domains [puppet] - https://gerrit.wikimedia.org/r/1179688 (owner: Ssingh)
[03:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0400)
[04:04:25] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.12 (duration: 04m 23s)
[04:13:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:28:01] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[04:28:39] FIRING: [2x] TransitBGPDown: Transit BGP session down 
between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [04:30:59] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:43:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:43:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:45:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:48:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:50:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and 
cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:08:07] (CR) Muehlenhoff: [C:+2] "Also confirmed out of band" [puppet] - https://gerrit.wikimedia.org/r/1178960 (owner: Jdlrobson)
[05:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:18:08] (CR) Muehlenhoff: [C:+2] Also update tracked email address [puppet] - https://gerrit.wikimedia.org/r/1179712 (https://phabricator.wikimedia.org/T401882) (owner: Muehlenhoff)
[05:18:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:20:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[05:21:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:25:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:42:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:47:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [05:52:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:54:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:55:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0600) [06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0600) [06:02:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:11:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae5 (External: Arelion Transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:16:51] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae5 (External: Arelion Transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:18:37] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:39] (CR) Muehlenhoff: [C:+2] ganeti-routed: Enable bird component for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179706 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:32:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:35:49] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[06:37:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:43:36] (CR) Muehlenhoff: [C:+2] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:50:44] (PS1) Giuseppe Lavagetto: hiddenparma: add policy file [puppet] - https://gerrit.wikimedia.org/r/1179971
[06:53:08] (PS1) Muehlenhoff: durum: Enable bird component in magru [puppet] - 
https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392)
[07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:38] I'm here, will start the deployment..
[07:00:53] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry)
[07:02:25] (Merged) jenkins-bot: Content Translation: Remove unused configuration parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry)
[07:02:56] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]]
[07:03:01] T400671: Cleanup unused ContentTranslation configuration parameters - https://phabricator.wikimedia.org/T400671
[07:04:04] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6625/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:04:11] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:04:52] !log kartik@deploy1003 kartik: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:49] (PS1) Muehlenhoff: doh: Enable bird component in magru [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392)
[07:10:03] !log kartik@deploy1003 kartik: Continuing with sync
[07:10:19] (PS3) Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298)
[07:10:51] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:10:56] (CR) Brouberol: [C:+2] airflow-dev: don't report dag runs to datahub [deployment-charts] - https://gerrit.wikimedia.org/r/1178905 (https://phabricator.wikimedia.org/T401932) (owner: Brouberol)
[07:12:47] (PS3) Stevemunene: dns: Define DNS records for the dse-k8s-codfw ingress [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298)
[07:14:12] ops-eqiad, SRE, DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11096965 (ayounsi) Resolved→Open @Jclark-ctr could you ask them if a device reboot would clear the alarm ? 
We would ideally need to upgrade all switches of the VXLAN domain (so rows E and...
[07:14:14] (PS1) Slyngshede: hiddenparma::api_tokens add dummy tokens [labs/private] - https://gerrit.wikimedia.org/r/1179986
[07:15:22] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]] (duration: 12m 26s)
[07:15:26] T400671: Cleanup unused ContentTranslation configuration parameters - https://phabricator.wikimedia.org/T400671
[07:18:13] (CR) Huei Tan: "we have postponed this backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan)
[07:22:42] (PS2) KartikMistry: Update cxserver to 2025-08-14-134810-production [deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117)
[07:23:03] Backport done, I'll deploy cxserver as there are no other patches in the window..
[07:24:09] (CR) Ayounsi: "change lgtm but leaving it to Sukhe for the final +1" [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:24:16] (CR) Ayounsi: "change lgtm but leaving it to Sukhe for the final +1" [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:24:16] (PS19) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[07:25:01] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6626/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:25:09] (CR) KartikMistry: [C:+2] Update cxserver to 2025-08-14-134810-production 
[deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117) (owner: KartikMistry)
[07:26:03] (CR) Stevemunene: "Added the A record for this and updated I83d6df36c9fa08eeabab4b724ed87e9345284175 with the codfw ingres IP as well." [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene)
[07:26:49] (Merged) jenkins-bot: Update cxserver to 2025-08-14-134810-production [deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117) (owner: KartikMistry)
[07:27:37] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:27:59] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:29:08] (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6628/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:30:38] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:30:41] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6629/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:31:36] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:32:40] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:33:13] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:33:26] kart_: I would like to start working on the train, please let me know when you're finished
[07:33:45] (PS1) DCausse: eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 
https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564)
[07:33:53] !log Updated cxserver to 2025-08-14-134810-production (T399117, T393705)
[07:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:59] T399117: Support querying "easy" translation recommendations - https://phabricator.wikimedia.org/T399117
[07:33:59] T393705: Remove CXStats related code - https://phabricator.wikimedia.org/T393705
[07:34:02] jnuche: I'm done.
[07:34:07] o/
[07:34:15] I am floating around if you need assistance :]
[07:34:16] kart_: ty
[07:34:25] thanks hashar :)
[07:36:18] (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376)
[07:36:20] (CR) TrainBranchBot: [C:+2] testwikis to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[07:37:14] (Merged) jenkins-bot: testwikis to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[07:37:36] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.15 refs T396376
[07:37:40] T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376
[07:41:54] SRE, Ganeti, Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259 (MoritzMuehlenhoff) NEW
[07:44:05] (CR) Vgutierrez: [C:+1] "verified against ops members in modules/admin/data/data.yaml" [labs/private] - https://gerrit.wikimedia.org/r/1179986 (owner: Slyngshede)
[07:44:43] (CR) Slyngshede: [C:+2] hiddenparma::api_tokens add dummy tokens [labs/private] - https://gerrit.wikimedia.org/r/1179986 (owner: Slyngshede)
[07:44:46] (CR) Slyngshede: [V:+2 C:+2] hiddenparma::api_tokens add dummy tokens [labs/private] - 
10https://gerrit.wikimedia.org/r/1179986 (owner: 10Slyngshede) [07:44:49] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260 (10ABran-WMF) 03NEW [07:45:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#11097039 (10ABran-WMF) [07:46:13] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6630/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:51:18] (03CR) 10Stevemunene: [C:03+1] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [07:55:45] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11097062 (10Josve05a) > Follow-up from ticket #2025081710002753: > > - OS: Windows 11 > - Browser: Chromium v126.0.6478.251 > - Browser add-ons: uBlock Origin, Shazam, Don't f***... 
[07:58:30] (03PS1) 10Ayounsi: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259) [08:03:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097070 (10ayounsi) [08:04:00] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11097073 (10hashar) The images overflowing the disk on deploy was previously filed as {T387796} and a follow up action was to have the images to be garbage collected: T387927 We had the issue earlier in Augus... [08:04:19] jnuche: did it fail overnight? [08:04:41] hashar: yeah, failed patch as usual [08:08:42] :( [08:09:47] I should make a self note to verify them on Monday morning [08:11:12] (03PS1) 10AikoChou: ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) [08:12:38] (03PS2) 10Ayounsi: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259) [08:13:22] (03CR) 10Ozge: [C:03+1] ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou) [08:15:10] (03CR) 10AikoChou: [C:03+2] ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou) [08:16:04] (03PS1) 10Ayounsi: Add esams routed ganeti VM ranges to network/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1180083 (https://phabricator.wikimedia.org/T402259) [08:16:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097114 
(10ayounsi) [08:16:40] (03Merged) 10jenkins-bot: ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou) [08:19:13] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376) [08:19:15] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot) [08:20:10] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot) [08:21:51] (03PS1) 10Ayounsi: Remove esams RIPE Atlas measurements [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) [08:22:13] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[08:23:47] (03PS20) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [08:24:15] (03CR) 10Phuedx: [C:04-1] MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [08:24:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097122 (10ayounsi) [08:25:04] (03CR) 10Phuedx: [C:04-1] MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [08:25:12] (03CR) 10Federico Ceratto: [C:03+1] python-webapp: add external-services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) (owner: 10Elukey) [08:25:18] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6631/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:27:41] (03CR) 10Huei Tan: MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [08:28:37] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [08:28:47] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board), 13Patch-For-Review: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. 
- https://phabricator.wikimedia.org/T345627#11097126 (10Mvolz) [08:28:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:28:57] (03PS5) 10Huei Tan: MinT: Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) [08:29:21] (03CR) 10Huei Tan: MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [08:30:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1179711 (owner: 10Ayounsi) [08:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:37:39] (03CR) 10Ayounsi: [C:03+2] Add all Nokia switches to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/1179711 (owner: 10Ayounsi) [08:38:48] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11097142 (10cmooney) >>! In T400783#11096965, @ayounsi wrote: > @Jclark-ctr could you ask them if a device reboot would clear the alarm ?+ Might just be worth giving it a shot. We know we didn't h... 
[08:39:02] (03CR) 10Cathal Mooney: [C:03+1] doh: Enable bird component in magru [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [08:39:30] (03CR) 10Cathal Mooney: [C:03+1] "but per Arzhel's comments prob best wait on Sukhbir" [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [08:50:21] (03PS1) 10Filippo Giunchedi: pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) [08:52:33] (03CR) 10CI reject: [V:04-1] pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: 10Filippo Giunchedi) [08:52:43] (03CR) 10Lucas Werkmeister (WMDE): "FYI I got a user report that this interacts badly with the mobile redirect, constantly going between `m.` and `www.` until the browser abo" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [08:53:58] (03CR) 10FNegri: [C:03+2] aptrepo: import wikireplicas-utils from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [08:56:23] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.15 refs T396376 [08:56:28] T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376 [09:33:15] (03CR) 10Btullis: [C:03+1] dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [09:33:56] (03CR) 10Btullis: [C:03+1] dns: Define DNS records for the dse-k8s-codfw ingress [dns] - 10https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [09:43:22] (03CR) 10Tiziano Fogli: [C:03+1] logstash: remove udp in error 
alerts [alerts] - 10https://gerrit.wikimedia.org/r/1179221 (owner: 10Cwhite) [09:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:00:05] claime and hnowlan: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1000). [10:00:55] 😎 [10:05:26] Let's go [10:05:30] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Add fractional routing to gateway-check [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [10:11:57] (03CR) 10Tiziano Fogli: "Quick question since we’re using a lot of ext filesystems: 5% of the space is reserved for root, right?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [10:13:41] claime: tests look good to me [10:13:48] hnowlan: same [10:13:52] \o/ [10:16:04] jmm@cumin2002 reimage (PID 3139755) is awaiting input [10:18:10] (03CR) 10Btullis: [C:03+1] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [10:18:22] (03PS1) 10AikoChou: ml-services: update image for readability model on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) [10:18:22] (03CR) 10Btullis: [C:03+1] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [10:20:05] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1180099 (https://phabricator.wikimedia.org/T402275) [10:20:17] (03CR) 10Btullis: [C:03+1] "It also enables *creating* new ingestion recipes, even though they are unlikely to work, doesn't it?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [10:20:32] !log Fractional routing support for rest API deployed - T400131 [10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:37] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [10:20:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie [10:21:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:21:19] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:21:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81485 and previous config saved to /var/cache/conftool/dbconfig/20250819-102126-ladsgroup.json [10:21:30] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [10:22:05] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1180100 (https://phabricator.wikimedia.org/T402276) [10:23:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T402276 [10:23:46] T402276: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T402276 [10:23:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81486 and previous config saved to /var/cache/conftool/dbconfig/20250819-102346-ladsgroup.json [10:24:15] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Set db2207 with weight 0 T402276', diff saved to https://phabricator.wikimedia.org/P81487 and previous config saved to /var/cache/conftool/dbconfig/20250819-102414-fceratto.json [10:25:20] (03CR) 10Tiziano Fogli: [C:03+1] resources: Exclude docker|containerd|kubelet mounts from alerts [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [10:25:31] (03PS1) 10Ayounsi: gNMI collect more metrics [puppet] - 10https://gerrit.wikimedia.org/r/1180101 (https://phabricator.wikimedia.org/T395998) [10:28:16] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1180100 (https://phabricator.wikimedia.org/T402276) (owner: 10Gerrit maintenance bot) [10:30:47] (03CR) 10Stevemunene: [C:03+2] dns: Define DNS records for the dse-k8s-codfw ingress [dns] - 10https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [10:31:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11097572 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm executed w... 
[10:31:56] !log stevemunene@dns1004 START - running authdns-update [10:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:32:48] (03PS1) 10Mvolz: Remove all references to deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180103 (https://phabricator.wikimedia.org/T361576) [10:33:09] !log stevemunene@dns1004 END - running authdns-update [10:33:16] !log Starting s2 codfw failover from db2204 to db2207 - T402276 [10:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:20] T402276: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T402276 [10:34:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T402276', diff saved to https://phabricator.wikimedia.org/P81488 and previous config saved to /var/cache/conftool/dbconfig/20250819-103402-fceratto.json [10:34:25] (03CR) 10Stevemunene: "Thanks, the dns change is merged and deployed via authdns-update we are ready to proceed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [10:38:15] (03PS2) 10Muehlenhoff: apereo_cas: Remove some obsolete version checks [puppet] - 10https://gerrit.wikimedia.org/r/1125094 [10:38:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P81489 and previous config saved to /var/cache/conftool/dbconfig/20250819-103854-ladsgroup.json [10:39:33] !log installing openjdk-17 security updates [10:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [10:42:04] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097645 (10VRiley-WMF) 05Resolved→03Open [10:42:19] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097646 (10VRiley-WMF) a:03VRiley-WMF [10:42:30] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097648 (10VRiley-WMF) 05Open→03Resolved [10:44:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11097667 (10VRiley-WMF) @MatthewVernon I wanted to check in and see if any of these are ready for the install of the new card? Thank you! 
[10:44:49] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1044 - vriley@cumin1003" [10:45:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1044 - vriley@cumin1003" [10:45:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:45:19] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2204.codfw.wmnet [10:45:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2204 - Upgrading db2204.codfw.wmnet [10:45:37] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2204 - Upgrading db2204.codfw.wmnet [10:45:37] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044 [10:45:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044 [10:46:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [10:46:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [10:47:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:50:42] (03CR) 10Tiziano Fogli: "I’m thinking about this because it seems that node_exporter itself avoids exporting these filesystems:" [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [10:51:20] (03PS2) 10Hnowlan: trafficserver: simplify gateway-check path globs [puppet] - 10https://gerrit.wikimedia.org/r/1176473 (https://phabricator.wikimedia.org/T400131) [10:51:49] !log 
fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2204.codfw.wmnet [10:52:32] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:53:30] (03CR) 10Tiziano Fogli: "For me, it’s a +1, since I think these would also be better thresholds for the global alert, as reported in https://gerrit.wikimedia.org/r" [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [10:54:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P81491 and previous config saved to /var/cache/conftool/dbconfig/20250819-105401-ladsgroup.json [10:54:26] (03CR) 10Tiziano Fogli: "Unresolving the previous comment." [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [10:56:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:57:05] (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Enable debug logging for the rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/1179753 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [11:00:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2204* gradually with 4 steps - Upgraded MariaDB [11:01:36] (03PS21) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:01:45] vriley@cumin1003 provision (PID 2748538) is awaiting input [11:02:28] 06SRE, 06Infrastructure-Foundations: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284 (10fnegri) 03NEW [11:02:40] !log 
vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:03:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS trixie [11:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:06:06] (03CR) 10STran: [C:03+1] Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders) [11:08:03] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [11:08:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097809 (10VRiley-WMF) Setting up cloudcesphosd1044, having to decom and provision it again [11:09:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81493 and previous config saved to /var/cache/conftool/dbconfig/20250819-110909-ladsgroup.json [11:09:14] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [11:09:24] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:09:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81494 and previous config saved to /var/cache/conftool/dbconfig/20250819-110931-ladsgroup.json [11:10:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097818 (10VRiley-WMF) [11:10:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:11:03] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044 [11:11:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044 [11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:13:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff) [11:13:15] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:13:35] (03CR) 10Vgutierrez: dse-k8s: add dse-k8s-codfw hosts to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:13:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81495 and previous config saved to /var/cache/conftool/dbconfig/20250819-111353-ladsgroup.json [11:17:33] RESOLVED: 
KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:19:01] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6632/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:19:38] (03PS4) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) [11:20:35] (03PS1) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) [11:21:10] (03CR) 10CI reject: [V:04-1] Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [11:23:42] (03PS5) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) [11:23:42] (03PS1) 10Stevemunene: dse-k8s: Add dse-k8s-codfw to service list [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) [11:27:02] (03CR) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:29:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P81496 and previous config saved to /var/cache/conftool/dbconfig/20250819-112900-ladsgroup.json 
[11:33:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11097924 (10Jclark-ctr) a:03Jclark-ctr [11:33:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11097926 (10ABran-WMF) Our current setup trains spamassassin by exporting to mails to the mbox format. This format is not supp... [11:34:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11097929 (10ABran-WMF) p:05Triage→03Medium [11:35:43] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:37:30] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm [11:37:45] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm [11:38:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097941 (10VRiley-WMF) [11:38:59] !log uploaded openjdk-21 21.0.8+9-1~deb12u1 to bookworm-wikimedia (backport of latest security release) [11:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:01] (03Abandoned) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [11:42:10] (03CR) 10Muehlenhoff: [C:03+2] apereo_cas: Remove some obsolete 
version checks [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff) [11:44:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P81497 and previous config saved to /var/cache/conftool/dbconfig/20250819-114407-ladsgroup.json [11:45:29] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2204* gradually with 4 steps - Upgraded MariaDB [11:45:35] (03PS1) 10FNegri: wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) [11:47:06] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [11:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:49:33] (03Restored) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [11:51:42] !log installing openjdk-21 security updates [11:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:28] (03PS2) 10Filippo Giunchedi: pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) [11:56:07] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1180122 [11:56:15] (03PS1) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [11:56:56] (03CR) 10CI reject: 
[V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [11:59:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81498 and previous config saved to /var/cache/conftool/dbconfig/20250819-115915-ladsgroup.json [11:59:20] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:59:20] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [11:59:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81499 and previous config saved to /var/cache/conftool/dbconfig/20250819-115926-ladsgroup.json [11:59:49] (03PS2) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1200) [12:00:54] (03PS22) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [12:01:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81500 and previous config saved to /var/cache/conftool/dbconfig/20250819-120147-ladsgroup.json [12:03:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 
(https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [12:03:41] (03CR) 10Filippo Giunchedi: wikireplicas: install scripts from deb package (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [12:05:11] (03CR) 10Filippo Giunchedi: "See also task for more context" [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: 10Filippo Giunchedi) [12:06:41] !log Restarting Jenkins [12:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:42] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:15:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098055 (10Jclark-ctr) [12:15:46] !log installing gnutls28 security updates on bullseye [12:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:16:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P81501 and previous config saved to /var/cache/conftool/dbconfig/20250819-121654-ladsgroup.json [12:17:03] (03PS2) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [12:17:47] (03CR) 10CI reject: [V:04-1] admin/data: add the 
analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:17:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:18:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for - jclark@cumin1002" [12:18:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for - jclark@cumin1002" [12:18:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:02] (03PS3) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [12:19:45] (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:20:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:20:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:20:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:20:58] (03PS4) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users 
[puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [12:20:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:24:54] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11098067 (10ABran-WMF) [12:28:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:28:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:32:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P81502 and previous config saved to /var/cache/conftool/dbconfig/20250819-123201-ladsgroup.json [12:32:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:34:23] (03PS12) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [12:34:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:35:52] (03CR) 10Giuseppe Lavagetto: varnish: refactor inclusion of 
requestctl rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [12:36:10] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [12:37:25] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:37:41] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm [12:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:37:54] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm [12:38:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:38:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:39:09] (03CR) 10Brouberol: [C:03+1] dse-k8s: Add dse-k8s-codfw to service list [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [12:40:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:57] (03CR) 10Majavah: [C:03+2] hieradata: Remove old ENC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1179127 (https://phabricator.wikimedia.org/T401986) (owner: 10Majavah) [12:43:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:44:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:45:36] (03CR) 10Vgutierrez: "looks good, I'll +1ed after the previous CR has been merged and the service has some realservers pooled" [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [12:45:37] (03CR) 10Majavah: [V:03+1 C:03+2] P:mariadb::cloudinfra: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1179119 (owner: 10Majavah) [12:45:54] (03CR) 10Vgutierrez: [C:03+1] dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [12:47:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81503 and previous config saved to /var/cache/conftool/dbconfig/20250819-124709-ladsgroup.json [12:47:14] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [12:47:24] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:47:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81504 
and previous config saved to /var/cache/conftool/dbconfig/20250819-124731-ladsgroup.json [12:49:42] jclark@cumin1002 provision (PID 1993005) is awaiting input [12:49:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81505 and previous config saved to /var/cache/conftool/dbconfig/20250819-124952-ladsgroup.json [12:51:15] 07sre-alert-triage, 06Discovery-Search, 06serviceops: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292 (10tappof) 03NEW [12:52:27] 07sre-alert-triage, 06Discovery-Search, 06serviceops: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098148 (10tappof) I’m not entirely sure about the tags assigned to the task, so please feel free to re... [12:53:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:53:57] 07sre-alert-triage, 06Data-Platform-SRE, 10Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098152 (10tappof) [12:54:40] 07sre-alert-triage, 06Data-Platform-SRE, 10Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098166 (10tappof) I’ve adjusted the tags. 
[12:56:06] (03PS1) 10Ayounsi: esams routed ganeti: add v4 and v6 IP/range [puppet] - 10https://gerrit.wikimedia.org/r/1180130 (https://phabricator.wikimedia.org/T402259) [12:56:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:57:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:58:19] (03CR) 10TChin: [C:03+1] eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: 10DCausse) [12:59:19] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw [12:59:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw [13:00:56] nothing to deploy :) [13:02:05] !log restart Exim on Phabricator hosts to pick up GNU TLS security updates [13:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:58] !log restart FPM on Phabricator hosts to pick up GNU TLS security updates [13:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:46] jclark@cumin1002 reimage (PID 2030749) is awaiting input [13:04:55] 06SRE, 06Infrastructure-Foundations: Move RPKI hosts to Bookworm - https://phabricator.wikimedia.org/T359502#11098203 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff These are already running on Bookworm since last year. [13:05:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P81506 and previous config saved to /var/cache/conftool/dbconfig/20250819-130500-ladsgroup.json [13:05:23] (03CR) 10Brouberol: [C:03+2] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:05:26] (03CR) 10Brouberol: [C:03+2] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:06:03] (03CR) 10Brouberol: [C:03+2] "That's right. We're lacking the proper infrastructure to run the UI-defined ingestion runs." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:06:51] !log restart slapd on main LDAP r/w servers hosts to pick up GNU TLS security updates [13:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1019.eqiad.wmnet with OS bullseye [13:07:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:02] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-07-25-064834-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179661 (https://phabricator.wikimedia.org/T399117) (owner: 10KartikMistry) [13:08:07] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1019.eqiad.wmnet with OS bullseye [13:08:10] (03Merged) 10jenkins-bot: airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:08:11] (03Merged) 10jenkins-bot: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:08:25] (03Merged) 10jenkins-bot: Enable visibility of ingestion runs in the datahub UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [13:08:47] jclark@cumin1002 provision (PID 2028919) is awaiting input [13:10:02] (03Merged) 10jenkins-bot: Update 
Recommendation API to 2025-07-25-064834-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179661 (https://phabricator.wikimedia.org/T399117) (owner: 10KartikMistry) [13:10:23] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage [13:10:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:10:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [13:11:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1018.eqiad.wmnet with OS bullseye [13:11:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1018.eqiad.wmnet with OS bullseye [13:11:19] (03PS2) 10Muehlenhoff: profile::docker::firewall: Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1114661 [13:11:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1017.eqiad.wmnet with OS bullseye [13:11:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1017.eqiad.wmnet with OS bullseye [13:12:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:38] (03PS1) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) [13:13:11] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:14:01] (03PS2) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) [13:14:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage [13:14:35] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [13:14:36] (03CR) 10Ssingh: [C:03+1] "Thanks, folks." [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [13:16:16] (03CR) 10Ssingh: [C:03+1] durum: Enable bird component in magru [puppet] - 10https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [13:17:14] (03CR) 10Ssingh: "@bcornwall@wikimedia.org can deploy this from Traffic, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [13:17:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:19:42] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:20:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P81507 and previous config saved to /var/cache/conftool/dbconfig/20250819-132007-ladsgroup.json [13:22:06] (03PS3) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) [13:22:36] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [13:22:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:23:18] (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [13:25:33] (03PS5) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:25:59] (03CR) 10CDobbins: [V:03+1] "How would I add doh1001? I assume that it's not done by adding a file named doh1001.yaml :-p" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:28:10] (03PS4) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) [13:29:20] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:29:42] (03CR) 10Ssingh: "No, that's completely correct, adding a file called doh1001.yaml." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:30:36] PROBLEM - Confd vcl based reload on cp4038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:30:44] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [13:32:00] (03PS6) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:32:07] (03CR) 10Btullis: "nit: The example in the commit message isn't the best, because its not the artifact cache where we need to have user specific access." [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:32:48] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6635/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:32:50] (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:33:11] (03CR) 10Ladsgroup: "https://puppet-compiler.wmflabs.org/output/1180134/7255/config-master1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [13:33:30] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:33:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage [13:33:56] (03Abandoned) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [13:34:15] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [13:34:34] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [13:34:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1044.eqiad.wmnet with OS bookworm [13:34:47] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm completed: - cloudcephosd1044 (**PASS**... 
[13:35:07] !log Updated Recommendation API to 2025-07-25-064834-production (T399117) [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:11] T399117: Support querying "easy" translation recommendations - https://phabricator.wikimedia.org/T399117 [13:35:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81508 and previous config saved to /var/cache/conftool/dbconfig/20250819-133515-ladsgroup.json [13:35:19] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [13:35:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:35:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81509 and previous config saved to /var/cache/conftool/dbconfig/20250819-133537-ladsgroup.json [13:35:40] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098303 (10VRiley-WMF) [13:35:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098306 (10VRiley-WMF) cloudcephosd1044 is completed [13:36:30] (03PS7) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:36:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098310 (10Jclark-ctr) [13:36:40] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1018.eqiad.wmnet with reason: host reimage [13:37:00] (03PS1) 10CDobbins: sre.loadbalancer: add 
cookbook to restart Liberica hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 [13:37:11] (03PS8) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:37:46] (03PS9) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:38:07] (03CR) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:38:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage [13:38:39] (03CR) 10Lucas Werkmeister (WMDE): "FTR, I’ve just tried to [revert](https://sal.toolforge.org/log/xHCLwpgB8tZ8Ohr0SrgO) this commit on Beta in order to unbreak the cluster o" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [13:38:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81510 and previous config saved to /var/cache/conftool/dbconfig/20250819-133858-ladsgroup.json [13:39:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6636/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:39:39] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1017.eqiad.wmnet with reason: host reimage [13:40:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:41:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1020.eqiad.wmnet with OS bullseye [13:41:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1020.eqiad.wmnet with OS bullseye [13:41:46] RECOVERY - Confd vcl based reload on cp4038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:41:51] (03CR) 10Lucas Werkmeister (WMDE): "(Also, that URL I posted earlier should be https://www.wikidata.beta.wmcloud.org/wiki/Q11 of course, with beta and wikidata in the right o" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [13:41:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1018.eqiad.wmnet with reason: host reimage [13:43:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:43:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:44:35] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: add cookbook to restart Liberica hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [13:46:24] (03CR) 10Btullis: admin/data: add the analytics-ml system user to the analytics-privatedata users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:47:59] (03PS10) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) [13:48:03] (03CR) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:48:46] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6637/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:48:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1017.eqiad.wmnet with reason: host reimage [13:49:09] PROBLEM - Confd vcl based reload on cp4039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [13:49:44] <_joe_> !log systemctl reload varnish-frontend.service on cp4039 [13:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] <_joe_> that doesn't have the effect I hoped [13:51:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:51:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81511 and previous config saved to /var/cache/conftool/dbconfig/20250819-135112-fceratto.json [13:51:16] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [13:51:20] (03PS1) 10Ayounsi: magru: add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143 [13:51:25] (03PS1) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 [13:51:36] (03CR) 10CI reject: [V:04-1] flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson) [13:52:09] RECOVERY - Confd vcl based reload on cp4039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [13:52:44] (03CR) 10Ayounsi: magru: add sandbox vlan to routed ganeti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180143 (owner: 10Ayounsi) [13:53:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81513 and previous config saved to /var/cache/conftool/dbconfig/20250819-135340-fceratto.json [13:53:58] (03PS2) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 [13:54:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P81514 and previous config saved to /var/cache/conftool/dbconfig/20250819-135405-ladsgroup.json [13:54:40] (03CR) 10Andrew Bogott: [C:04-1] openstack: acquire cfssl certs for libvirt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [13:55:22] (03CR) 10CI reject: [V:04-1] flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson) [13:56:15] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:56:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:56:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1019.eqiad.wmnet with OS bullseye [13:57:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jclark@cumin1002 for host ms-fe1019.eqiad.wmnet with OS bullseye complete... [13:57:35] (03PS2) 10Ayounsi: magru: add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143 [13:58:07] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [13:59:54] (03PS3) 10Ayounsi: Add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143 [14:00:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:01:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:01:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:03:19] (03PS23) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [14:03:42] (03PS1) 10Aqu: analytics: Refine remove systemd job [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) [14:04:38] (03PS2) 10FNegri: wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) [14:04:52] (03CR) 10FNegri: wikireplicas: install scripts from deb package (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [14:05:22] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:05:55] (03CR) 10CI reject: [V:04-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 
10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:05:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:06:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1018.eqiad.wmnet with OS bullseye [14:06:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1018.eqiad.wmnet with OS bullseye complete... [14:06:20] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [14:06:52] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) (owner: 10Aqu) [14:06:59] (03PS2) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) [14:07:10] (03CR) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:07:23] (03PS1) 10Ayounsi: Add magru sandbox prefixes to routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1180150 [14:07:26] (03PS24) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [14:07:42] (03CR) 10Btullis: [C:03+1] admin/data: add 
the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:08:46] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [14:08:48] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:08:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81515 and previous config saved to /var/cache/conftool/dbconfig/20250819-140848-fceratto.json [14:09:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:09:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1017.eqiad.wmnet with OS bullseye [14:09:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P81516 and previous config saved to /var/cache/conftool/dbconfig/20250819-140913-ladsgroup.json [14:09:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1017.eqiad.wmnet with OS bullseye complete... 
[14:09:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098510 (10Jclark-ctr) [14:10:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:10:51] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:11:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:11:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:12:36] (03PS3) 10Andrew Bogott: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:12:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:13:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:14:40] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 5 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11098531 (10Tgr) The patches are merged, and I added... 
[14:15:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:16:38] (03PS3) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 [14:19:35] (03CR) 10Krinkle: [C:04-1] "OK. I think what's happening here is that Varnish is stripping the "m" from m.wikidata and not turning it into a www because the VCL is ha" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [14:20:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:22:21] (03PS4) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) [14:22:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:22:52] (03PS1) 10Dreamy Jazz: UserInfoCard: Link to metawiki for Special:CentralAuth links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690) [14:23:11] jouncebot: nowandnext [14:23:11] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [14:23:11] In 0 hour(s) and 6 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430) [14:23:30] (03CR) 10Filippo 
Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:23:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81517 and previous config saved to /var/cache/conftool/dbconfig/20250819-142355-fceratto.json [14:24:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81518 and previous config saved to /var/cache/conftool/dbconfig/20250819-142420-ladsgroup.json [14:24:25] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [14:24:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690) (owner: 10Dreamy Jazz) [14:24:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:25:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:25:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81519 and previous config saved to /var/cache/conftool/dbconfig/20250819-142514-ladsgroup.json [14:25:33] (03Merged) 10jenkins-bot: UserInfoCard: Link to metawiki for Special:CentralAuth links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690) (owner: 10Dreamy Jazz) [14:26:07] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]] [14:26:11] T397690: User info card 
global account link should lead to meta - https://phabricator.wikimedia.org/T397690 [14:26:55] (03CR) 10FNegri: [C:03+2] wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [14:27:26] (03PS4) 10Krinkle: mediawiki: Remove unused wikidata.org vhost and fix beta redirect [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) [14:28:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81521 and previous config saved to /var/cache/conftool/dbconfig/20250819-142832-ladsgroup.json [14:29:42] (03CR) 10Brouberol: [V:03+1 C:03+2] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430) [14:30:12] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:30:42] <_joe_> !log running requestctl-admin upgrade-schema pattern on alert1002 [14:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:01] (03PS2) 10Aqu: analytics: Refine remove systemd job [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) [14:31:22] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [14:31:40] (03PS5) 10Krinkle: mediawiki: Remove unused wikidata.org vhost and fix beta redirect [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) [14:31:44] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [14:33:59] (03CR) 10Krinkle: "krinkle@deployment-cache-text08:~$ sudo run-puppet-agent" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [14:34:41] (03CR) 10DCausse: [C:03+1] "lgtm," [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson) [14:37:13] !log Running `/usr/local/bin/foreachwikiindblist group0.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"` [14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:18] !log Running `/usr/local/bin/foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"` [14:37:19] jouncebot: nowandnext [14:37:20] For the next 0 hour(s) and 22 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430) [14:37:20] In 0 hour(s) and 22 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1500) [14:37:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81522 and previous config saved to /var/cache/conftool/dbconfig/20250819-143903-fceratto.json [14:39:08] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [14:39:16] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]] (duration: 13m 09s) [14:39:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:39:20] T397690: User info card global account link should lead to meta - https://phabricator.wikimedia.org/T397690 [14:39:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81523 and previous config saved to /var/cache/conftool/dbconfig/20250819-143926-fceratto.json [14:40:46] 06SRE: Add known-client-ingestion-source objects an logic - https://phabricator.wikimedia.org/T402014#11098722 (10Vgutierrez) p:05Triage→03Medium [14:41:20] (03CR) 10DCausse: [C:03+2] eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: 10DCausse) [14:41:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81524 and previous config saved to /var/cache/conftool/dbconfig/20250819-144158-fceratto.json [14:43:10] (03Merged) 10jenkins-bot: eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: 
10DCausse) [14:43:18] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:43:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P81525 and previous config saved to /var/cache/conftool/dbconfig/20250819-144339-ladsgroup.json [14:44:52] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:45:08] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage [14:45:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:45:36] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:45:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:47:36] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [14:47:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage [14:48:47] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [14:50:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:51:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [14:52:46] 06SRE, 10SRE-Access-Requests: Requesting access to 
analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11098799 (10Vgutierrez) @dang could you create a CR on gerrit with your public SSH key to confirm it? thanks! [14:52:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [14:53:11] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:53:33] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:53:40] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:53:46] <_joe_> the puppet failures come from my changes, will resolve [14:54:41] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:55:31] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:55:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:56:10] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:57:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81526 and previous config saved to /var/cache/conftool/dbconfig/20250819-145706-fceratto.json [14:57:59] (03CR) 10Brennen Bearnes: "Thanks for the digging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [14:58:19] (03CR) 10Ebernhardson: [C:03+2] flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson) [14:58:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P81527 and previous config saved to /var/cache/conftool/dbconfig/20250819-145847-ladsgroup.json [14:59:10] (03PS1) 10FNegri: sre.wikireplicas.add-wiki: update script paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) [14:59:57] (03PS2) 10FNegri: sre.wikireplicas.add-wiki: update script paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) [15:00:05] jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1500). [15:00:15] (03Merged) 10jenkins-bot: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson) [15:01:29] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T402309 [15:01:33] T402309: Deploy Phabricator/Phorge 2025-08-19 - https://phabricator.wikimedia.org/T402309 [15:01:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11098843 (10Vgutierrez) I'm seeing you have 3 LDAP accounts at the moment: * https://ldap.toolforge.org/user/dang * https://ldap.toolforge.org/user/datwmd... 
[15:02:29] (03CR) 10Filippo Giunchedi: [C:03+1] sre.wikireplicas.add-wiki: update script paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [15:02:37] !log brennen@deploy1003 Started deploy [phabricator/deployment@22fcde9]: deploy phab2002 for T402309 [15:03:19] !log brennen@deploy1003 Finished deploy [phabricator/deployment@22fcde9]: deploy phab2002 for T402309 (duration: 00m 42s) [15:03:33] (03PS1) 10Chlod Alejandro: Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) [15:03:35] !log brennen@deploy1003 Started deploy [phabricator/deployment@22fcde9]: deploy phab1004 for T402309 [15:04:14] !log brennen@deploy1003 Finished deploy [phabricator/deployment@22fcde9]: deploy phab1004 for T402309 (duration: 00m 39s) [15:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:05:32] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) (owner: 10Aqu) [15:06:11] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered 
by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:08:02] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:08:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:08:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1020.eqiad.wmnet with OS bullseye [15:08:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098879 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1020.eqiad.wmnet with OS bullseye complete... [15:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:03] (03PS14) 10Brennen Bearnes: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:11:29] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:11:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098896 (10Jclark-ctr) [15:11:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098897 (10Jclark-ctr) 05Open→03Resolved [15:12:14] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81528 and previous config saved to /var/cache/conftool/dbconfig/20250819-151213-fceratto.json [15:13:46] (03PS15) 10Brennen Bearnes: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:13:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81529 and previous config saved to /var/cache/conftool/dbconfig/20250819-151354-ladsgroup.json [15:13:59] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [15:14:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [15:14:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: 10Chlod Alejandro) [15:14:39] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:14:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81530 and previous config saved to /var/cache/conftool/dbconfig/20250819-151446-ladsgroup.json [15:15:26] (03CR) 10FNegri: [C:03+2] sre.wikireplicas.add-wiki: update script paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [15:16:54] (03PS3) 10Aqu: analytics: Refine remove systemd job [puppet] - 
10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) [15:16:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81531 and previous config saved to /var/cache/conftool/dbconfig/20250819-151656-ladsgroup.json [15:17:03] (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:19:10] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1180122 (owner: 10Muehlenhoff) [15:20:00] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [15:21:41] (03Merged) 10jenkins-bot: sre.wikireplicas.add-wiki: update script paths [cookbooks] - 10https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [15:21:48] (03CR) 10Brennen Bearnes: [C:03+1] "Confirmed on phabricator-bullseye.devtools.eqiad1.wikimedia.cloud that it needs the `apc.` prefix. 
With that, it works, as can be seen at " [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [15:21:59] (03CR) 10Dzahn: [C:03+1] nftables: throttle debugging [puppet] - 10https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [15:23:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:33] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81532 and previous config saved to /var/cache/conftool/dbconfig/20250819-152720-fceratto.json [15:27:25] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [15:27:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:27:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81533 and previous config saved to /var/cache/conftool/dbconfig/20250819-152743-fceratto.json [15:27:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [15:30:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved 
to https://phabricator.wikimedia.org/P81534 and previous config saved to /var/cache/conftool/dbconfig/20250819-153015-fceratto.json [15:31:10] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [15:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:32:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81535 and previous config saved to /var/cache/conftool/dbconfig/20250819-153203-ladsgroup.json [15:32:19] (03CR) 10Andrew Bogott: [C:03+1] "I am Ok with merging this based on the previous pcc run" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [15:33:57] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:21] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1042 [15:35:22] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1042 [15:35:55] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:40:46] (03PS5) 10Andrew Bogott: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [15:40:59] (03CR) 10Andrew Bogott: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [15:41:54] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:47:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81536 and previous config saved to /var/cache/conftool/dbconfig/20250819-154711-ladsgroup.json [15:49:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:50:13] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm [15:50:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099094 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm [15:51:13] (03PS1) 10Ahmon Dancy: Allow deployment group to sudo systemctl status spiderpig-{apiserver,jobrunner} [puppet] - 10https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) [15:52:09] (03PS8) 10Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) [15:52:17] (03CR) 10Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [16:00:05] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:15] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6640/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [16:02:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81538 and previous config saved to /var/cache/conftool/dbconfig/20250819-160218-ladsgroup.json [16:02:23] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [16:02:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:02:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81539 and previous config saved to /var/cache/conftool/dbconfig/20250819-160230-ladsgroup.json [16:04:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81540 and previous config saved to /var/cache/conftool/dbconfig/20250819-160439-ladsgroup.json [16:04:47] (03CR) 10Scott French: [C:03+1] haproxy: allow having multiple requestctl scopes [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [16:05:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:07:46] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] haproxy: allow having multiple requestctl scopes [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [16:10:13] (03PS1) 10Krinkle: varnish: Merge 
m-dot and X-Subdomain block in cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) [16:12:03] (03PS1) 10Kosta Harlan: AbuseFilterHooks: Handle IP user performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298) [16:15:08] (03PS1) 10Giuseppe Lavagetto: haproxy: re-add blank line for better readability [puppet] - 10https://gerrit.wikimedia.org/r/1180169 [16:15:49] jouncebot: nowandnext [16:15:50] For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1600) [16:15:50] In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700) [16:16:11] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: re-add blank line for better readability [puppet] - 10https://gerrit.wikimedia.org/r/1180169 (owner: 10Giuseppe Lavagetto) [16:19:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81541 and previous config saved to /var/cache/conftool/dbconfig/20250819-161948-ladsgroup.json [16:20:00] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [16:23:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [16:28:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298) (owner: 10Kosta Harlan) [16:30:13] (03Merged) 10jenkins-bot: AbuseFilterHooks: Handle IP user performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - 
10https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298) (owner: 10Kosta Harlan) [16:30:41] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] [16:30:46] T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298 [16:31:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:32:38] !log mszabo@deploy1003 mszabo, kharlan: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:33:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11099320 (10Vgutierrez) I'm already seeing an account (https://ldap.toolforge.org/user/dang) requested on T288355 with some privileges: ` dang: ensu... [16:34:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81542 and previous config saved to /var/cache/conftool/dbconfig/20250819-163455-ladsgroup.json [16:39:12] !log mszabo@deploy1003 Sync cancelled. 
[16:45:22] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:48:27] vriley@cumin1003 reimage (PID 2781057) is awaiting input
[16:48:44] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:48:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[16:48:53] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099378 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm completed: - cloudcephosd1042 (**PASS**...
[16:50:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81543 and previous config saved to /var/cache/conftool/dbconfig/20250819-165003-ladsgroup.json
[16:50:08] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[16:50:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[16:50:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81544 and previous config saved to /var/cache/conftool/dbconfig/20250819-165015-ladsgroup.json
[16:51:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81545 and previous config saved to /var/cache/conftool/dbconfig/20250819-165124-ladsgroup.json
[16:53:14] (PS1) Zabe: Stop writing
to cl_to and cl_collation on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579) [16:53:24] 10SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099398 (10Eevans) rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which apparently was corrupted. [16:57:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099403 (10VRiley-WMF) [16:57:32] (03CR) 10Dzahn: [C:03+1] "thanks for pointing out it needed to be a hash, makes sense. should have tried that. also thanks for testing it. I was about to comment li" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [17:00:08] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700). [17:00:13] o/ [17:00:24] mszabo: I see your cancelled deployment in the scroll back. what's the status? does that patch need reverted before deployments can safely proceed? [17:03:07] swfrench-wmf: ah sorry, so it should be fine to resume deployments but it needs a followup patch to actually do what it says on the tin [17:03:08] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1175916/6641/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [17:03:36] I can finish the sync, it would merely end up adjusting the error message of a low-rate error on testwiki [17:04:32] mszabo: ah, thanks for the follow-up. so, just to confirm, it's 100% safe for that patch to proceed to the rest of production as-is. 
[17:04:45] yeah, let me sync it quickly to avoid confusion
[17:04:49] (CR) Dzahn: [V:+1 C:+2] "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[17:05:15] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]]
[17:05:19] T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[17:05:46] mszabo: thanks for confirming, and for completing the deployment. I'll wait until you're done to proceed with mine :)
[17:06:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81546 and previous config saved to /var/cache/conftool/dbconfig/20250819-170632-ladsgroup.json
[17:07:11] !log mszabo@deploy1003 kharlan, mszabo: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:07:33] !log mszabo@deploy1003 kharlan, mszabo: Continuing with sync
[17:07:42] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099428 (VRiley-WMF) So, here is what has been completed so far cloudcephosd1042 C8 U12 CableID 5204 Port 29 CableID 20220266 (Not set as of yet) Port 28 cloudcephosd1043 C8 U13 CableID...
[17:08:22] (CR) Dzahn: [C:+1] "it's a sudo privileges change - but allowing to see "status" for something you are already allowed to restart seems very harmless" [puppet] - https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[17:10:43] !log phab2002/phab1004 - systemctl restart php7.4-fpm after we increased APCu shared memory segment size (T401157)
[17:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:48] T401157: Phorge setup check caching is misbehaving, leading to many duck-sound=quack requests - https://phabricator.wikimedia.org/T401157
[17:12:53] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] (duration: 07m 38s)
[17:12:58] T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[17:13:57] proceeding with the infra window
[17:15:10] !log swfrench@deploy1003 Started scap sync-world: No-op deployment to introduce new build report metadata - T401721
[17:15:15] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721
[17:17:13] SRE, LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11099522 (Dzahn) Thanks for confirming that, Katie! I think there is nothing else to do on this task. I verified Chris is in the NDA spreadsheet SRE looks at and that he has the r...
[17:17:25] !log swfrench@deploy1003 Finished scap sync-world: No-op deployment to introduce new build report metadata - T401721 (duration: 02m 52s) [17:17:47] (03CR) 10CDanis: [C:03+2] haproxy: maxconn for varnish threads limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [17:17:53] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11099525 (10Dzahn) 05In progress→03Resolved a:03Dzahn please reopen if you think anything else needs to be done. [17:18:26] mszabo: I don't have anything else planned for the window, so all yours if you're ready for your follow-on patch. [17:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:21:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81548 and previous config saved to /var/cache/conftool/dbconfig/20250819-172139-ladsgroup.json [17:22:00] (03CR) 10BCornwall: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [17:25:33] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11099555 (10ecarg) Thank you so much, @RLazarus [17:27:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11099561 (10cmooney) Probably worth opening a JTAC case to ask about this. 
One thing I note is that only FPCs 1 and 3 are in use on this box... [17:31:02] (03PS1) 10Btullis: Use a specific image version for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103) [17:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:32:48] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6642/console" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [17:33:48] (03CR) 10Btullis: [C:03+2] Use a specific image version for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [17:34:03] 10SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099594 (10andrea.denisse) >>! In T402247#11099398, @Eevans wrote: > rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which apparently was corrupted. Strange, I cleared... 
[17:35:53] (03Merged) 10jenkins-bot: Use a specific image version for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis) [17:36:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81550 and previous config saved to /var/cache/conftool/dbconfig/20250819-173646-ladsgroup.json [17:36:51] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [17:37:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [17:37:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81551 and previous config saved to /var/cache/conftool/dbconfig/20250819-173709-ladsgroup.json [17:37:30] !log zoe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [17:38:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:38:31] !log zoe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [17:38:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81552 and previous config saved to /var/cache/conftool/dbconfig/20250819-173833-ladsgroup.json [17:39:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:43:59] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [17:46:13] (03PS9) 10RLazarus: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - 
10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: 10Krinkle) [17:46:29] (03PS1) 10Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [17:46:50] jouncebot: nowandnext [17:46:51] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700) [17:46:51] In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800) [17:47:07] (03CR) 10CI reject: [V:04-1] zuul: add systemd service for nodepool (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [17:47:11] swfrench-wmf, mszabo: do you mind if I sneak out Krinkle's patch before the infra window is over? [17:47:22] (03CR) 10Dzahn: [C:03+1] mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - 10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: 10Krinkle) [17:47:33] rzl: no objections on my end! [17:48:32] 🛫 [17:49:22] (03CR) 10RLazarus: [C:03+2] "Thanks! 
I made one adjustment, take a look for future reference (top-level keys are domains, not paths) but I'll get this shipped out duri" [puppet] - 10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: 10Krinkle) [17:53:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81553 and previous config saved to /var/cache/conftool/dbconfig/20250819-175340-ladsgroup.json [17:56:34] (03PS2) 10Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [17:57:46] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1174872 [17:58:44] !log rzl@deploy1003 rzl: https://gerrit.wikimedia.org/r/1174872 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:59:45] !log rzl@deploy1003 rzl: Continuing with sync [18:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800). [18:00:27] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro [18:00:49] jnuche, jeena: ^ above is warpping up shortly, sorry to run over! [18:00:57] *wrapping [18:01:08] (03PS1) 10Dzahn: cloud: add profile::pki::client::ensure for wikistats VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1180189 [18:02:13] rzl: No problem, we already deployed during the European window [18:02:55] 👍 [18:04:56] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1174872 (duration: 07m 51s) [18:06:31] 10SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099728 (10Eevans) >>! 
In T402247#11099594, @andrea.denisse wrote: >>>! In T402247#11099398, @Eevans wrote: >> rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which app... [18:08:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81554 and previous config saved to /var/cache/conftool/dbconfig/20250819-180848-ladsgroup.json [18:09:56] (03CR) 10JHathaway: "Would love a review when you have a moment" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [18:11:06] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [18:15:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:47] ^ it's back but yeah [18:18:35] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [18:18:48] sukhe: noted :/ sigh [18:20:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:20:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:16] !log gerrit - deactivated user 
Keccake256 for spam-like comments and edits on commons [18:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81555 and previous config saved to /var/cache/conftool/dbconfig/20250819-182356-ladsgroup.json [18:24:04] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [18:24:13] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [18:24:17] !log dancy@deploy1003 Installing scap version "4.206.0" for 2 host(s) [18:24:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81556 and previous config saved to /var/cache/conftool/dbconfig/20250819-182419-ladsgroup.json [18:24:46] jouncebot: nowandnext [18:24:46] For the next 1 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800) [18:24:46] In 1 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000) [18:25:13] FYI, since the train already advanced, dancy and I are going to deploy and test a new scap release [18:26:04] !log dancy@deploy1003 Installation of scap version "4.206.0" completed for 2 hosts [18:26:22] swfrench-wmf: Ready for testing [18:26:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81557 and previous config saved to /var/cache/conftool/dbconfig/20250819-182642-ladsgroup.json [18:26:50] dancy: amazing, thank you! 
I'll start with a `--stop-before-sync` run to verify the resulting diffs make sense [18:26:55] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:27:00] (i.e., no diffs, heh) [18:27:03] (03CR) 10Ssingh: "in this commit itself, you should bring in the changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172056/15/modules/dnsrec" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:27:45] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to verify image build and dependent helmfile values - T401721 [18:27:49] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [18:28:34] !log swfrench@deploy1003 Stopping before sync operations [18:30:12] `php.version` is emitted and diffs are clean as expected [18:30:26] Excellent [18:30:57] just for completeness, I'll run through a full should-not-affect-anything sync-world [18:32:43] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:34:41] !log swfrench@deploy1003 Started scap sync-world: No-code-changes scap sync-world with new helmfile values - T401721 [18:34:46] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [18:36:30] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:36:43] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:36:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:38:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [18:39:29] !log swfrench@deploy1003 Finished scap sync-world: No-code-changes scap sync-world with new helmfile values - T401721 (duration: 06m 28s) [18:39:57] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177471 (owner: 10Ncmonitor) [18:40:20] all done. thank you very much, dancy [18:41:21] (03PS8) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [18:41:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81558 and previous config saved to /var/cache/conftool/dbconfig/20250819-184149-ladsgroup.json [18:42:07] (03CR) 10BCornwall: "Removed wikimedia.ee" [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [18:44:47] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [18:46:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:47:45] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:50:08] (03PS1) 10Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - 10https://gerrit.wikimedia.org/r/1180193 [18:51:36] (03CR) 10Dzahn: [C:03+2] gerrit: blocking Huawei cloud Singapore 
subnet [puppet] - 10https://gerrit.wikimedia.org/r/1180193 (owner: 10Dzahn) [18:51:42] (03PS2) 10Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - 10https://gerrit.wikimedia.org/r/1180193 [18:51:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:53:40] (03PS3) 10Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - 10https://gerrit.wikimedia.org/r/1180193 [18:54:08] (03CR) 10Dzahn: [C:03+2] gerrit: blocking Huawei cloud Singapore subnet [puppet] - 10https://gerrit.wikimedia.org/r/1180193 (owner: 10Dzahn) [18:55:37] PROBLEM - HTTPS non-canonical-redirect-19 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo [18:55:37] .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [18:55:37] PROBLEM - HTTPS non-canonical-redirect-19 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo [18:55:37] .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, 
wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [18:56:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81559 and previous config saved to /var/cache/conftool/dbconfig/20250819-185656-ladsgroup.json [18:58:00] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:00:05] (03PS1) 10Dzahn: gerrit: block another Huawei subnet for scraping [puppet] - 10https://gerrit.wikimedia.org/r/1180194 [19:00:54] (03PS2) 10Dzahn: gerrit: block another Huawei subnet for scraping [puppet] - 10https://gerrit.wikimedia.org/r/1180194 [19:02:48] (03CR) 10Dzahn: [C:03+2] gerrit: block another Huawei subnet for scraping [puppet] - 10https://gerrit.wikimedia.org/r/1180194 (owner: 10Dzahn) [19:03:53] PROBLEM - HTTPS non-canonical-redirect-19 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo [19:03:53] .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [19:04:29] (03CR) 10Dzahn: [C:03+2] cloud: add 
profile::pki::client::ensure for wikistats VPS project [puppet] - 10https://gerrit.wikimedia.org/r/1180189 (owner: 10Dzahn) [19:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:04:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:07:37] RECOVERY - HTTPS non-canonical-redirect-19 on ncredir4001 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:07:37] RECOVERY - HTTPS non-canonical-redirect-19 on ncredir4002 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:07:53] RECOVERY - HTTPS non-canonical-redirect-19 on ncredir6002 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:11:10] (03CR) 10Andrea Denisse: [C:03+2] centrallog: Enable debug logging for the rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/1179753 
(https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [19:12:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81560 and previous config saved to /var/cache/conftool/dbconfig/20250819-191204-ladsgroup.json [19:12:09] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [19:12:14] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [19:12:19] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [19:13:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance [19:13:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81561 and previous config saved to /var/cache/conftool/dbconfig/20250819-191311-ladsgroup.json [19:15:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81562 and previous config saved to /var/cache/conftool/dbconfig/20250819-191537-ladsgroup.json [19:16:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:17:06] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [19:22:22] jouncebot: nowandnext [19:22:22] For the next 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800) [19:22:22] In 0 hour(s) and 37 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000) [19:22:56] (03PS1) 10Máté Szabó: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) [19:23:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: 10Máté Szabó) [19:23:10] (03PS2) 10Kosta Harlan: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: 10Máté Szabó) [19:23:21] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [19:24:34] !log Running `/usr/local/bin/foreachwikiindblist group1.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"` [19:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:48] (03CR) 10TrainBranchBot: "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: 10Máté Szabó) [19:25:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [19:26:05] (03CR) 10Dzahn: [C:03+2] Allow deployment group to sudo systemctl status spiderpig-{apiserver,jobrunner} [puppet] - 10https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: 
Ahmon Dancy) [19:26:17] (PS2) Krinkle: [LOCAL HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader) [19:26:27] (PS3) Krinkle: [LOCAL HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader) [19:26:41] (PS4) Krinkle: [BETA HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader) [19:30:28] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1180143 (owner: Ayounsi) [19:30:42] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [19:30:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81563 and previous config saved to /var/cache/conftool/dbconfig/20250819-193045-ladsgroup.json [19:32:19] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T395571 [19:32:23] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [19:32:57] (Merged) jenkins-bot: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: Máté Szabó) [19:33:26] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]] [19:33:30] T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - 
https://phabricator.wikimedia.org/T402298 [19:35:21] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:36:20] (03PS1) 10BCornwall: Remove wikipediamustdie.com [dns] - 10https://gerrit.wikimedia.org/r/1180200 [19:37:33] (03CR) 10Dzahn: [C:03+1] Remove wikipediamustdie.com [dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall) [19:38:05] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571 [19:38:09] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [19:39:35] (03CR) 10BCornwall: [C:03+2] "I'd also remove the NS records" [dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall) [19:39:40] !log mszabo@deploy1003 mszabo: Continuing with sync [19:40:55] (03CR) 10BCornwall: [C:03+2] "Already done!" 
[dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall) [19:44:52] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571 [19:44:56] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [19:45:02] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]] (duration: 11m 36s) [19:45:06] T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298 [19:45:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81564 and previous config saved to /var/cache/conftool/dbconfig/20250819-194552-ladsgroup.json [19:46:08] (03PS1) 10Kosta Harlan: hcaptcha: Unset Referer header [puppet] - 10https://gerrit.wikimedia.org/r/1180204 (https://phabricator.wikimedia.org/T397841) [19:47:20] (03PS1) 10Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) [19:47:33] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [19:50:14] !log import ncmonitor 2.0.0 into bookworm-wikimedia [19:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 
483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:50:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:50:50] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:50:52] on it [19:51:02] !log brett@dns1004 START - running authdns-update [19:51:19] thanks, this alert is very unforgiving on purpose :/ [19:51:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:51:28] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:51:32] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:52:06] !log brett@dns1004 END - running authdns-update [19:55:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:55:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:55:50] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:56:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:56:28] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:56:32] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:59:20] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180207 [19:59:24] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208 [19:59:28] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180209 [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000). Please do the needful. [20:00:06] chlod: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] o/ here [20:01:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81565 and previous config saved to /var/cache/conftool/dbconfig/20250819-200100-ladsgroup.json [20:01:05] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [20:01:15] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance [20:01:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81566 and previous config saved to /var/cache/conftool/dbconfig/20250819-200122-ladsgroup.json [20:03:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81567 and previous config saved to /var/cache/conftool/dbconfig/20250819-200350-ladsgroup.json [20:04:37] I can deploy [20:05:00] yippee [20:05:26] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sretest2001.codfw.wmnet with reason: supermicro [20:05:58] (03CR) 10Zabe: [C:03+2] Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: 10Chlod Alejandro) [20:13:39] (03CR) 10BCornwall: [V:03+2 C:03+2] "All have correct NS records, no dnssec" [dns] - 10https://gerrit.wikimedia.org/r/1180207 (owner: 10Ncmonitor) [20:13:59] !log brett@dns1004 
START - running authdns-update [20:15:07] (03PS1) 10Jdlrobson: Revert "Stop sending more than one og:image to social media platforms" [extensions/PageImages] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521) [20:15:16] !log brett@dns1004 END - running authdns-update [20:15:22] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 16 Sep 2025 07:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:15:45] (03Merged) 10jenkins-bot: Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: 10Chlod Alejandro) [20:16:21] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]] [20:16:25] T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988 [20:17:03] This will take a bit since it rebuilds localisation cache [20:18:05] alrighty, all good with me [20:18:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81568 and previous config saved to /var/cache/conftool/dbconfig/20250819-201858-ladsgroup.json [20:19:33] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208 (owner: 10Ncmonitor) [20:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:25:39] (03PS8) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 
(https://phabricator.wikimedia.org/T397246) [20:29:08] (CR) BCornwall: [V:+2 C:+2] "Valid NS records and no DNSSEC enabled." [puppet] - https://gerrit.wikimedia.org/r/1180209 (owner: Ncmonitor) [20:34:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81569 and previous config saved to /var/cache/conftool/dbconfig/20250819-203405-ladsgroup.json [20:39:45] !log zabe@deploy1003 chlod, zabe: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:39:50] T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988 [20:39:59] finally ready to test [20:40:04] testing now [20:40:19] works perfectly :) [20:40:24] Nice! [20:40:25] !log zabe@deploy1003 chlod, zabe: Continuing with sync [20:41:55] (PS1) Krinkle: varnish: Write docs for some mobile user agent regexen [puppet] - https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [20:49:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81570 and previous config saved to /var/cache/conftool/dbconfig/20250819-204913-ladsgroup.json [20:49:14] (PS9) Bking: opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [20:49:18] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010 [20:49:29] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance [20:49:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T402010)', diff saved to 
https://phabricator.wikimedia.org/P81571 and previous config saved to /var/cache/conftool/dbconfig/20250819-204935-ladsgroup.json [20:50:42] (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:52:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T402010)', diff saved to https://phabricator.wikimedia.org/P81572 and previous config saved to /var/cache/conftool/dbconfig/20250819-205203-ladsgroup.json [20:52:52] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]] (duration: 36m 31s) [20:52:52] (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:52:56] T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988 [20:53:41] thanks for the deploy, zabe! 
:) [20:54:12] yw :) [20:54:53] (PS1) Andrea Denisse: Revert "centrallog: Enable debug logging for the rsyslog-receiver" [puppet] - https://gerrit.wikimedia.org/r/1180224 [20:59:39] (PS3) Bking: golang: add trixie-based golang-1.24 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2100) [21:03:07] (PS1) Eevans: sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - https://gerrit.wikimedia.org/r/1180226 [21:03:08] (PS1) Eevans: sessionstore: upgrade production to Kask v1.0.18 [deployment-charts] - https://gerrit.wikimedia.org/r/1180227 [21:03:37] (CR) Andrea Denisse: [C:+2] Revert "centrallog: Enable debug logging for the rsyslog-receiver" [puppet] - https://gerrit.wikimedia.org/r/1180224 (owner: Andrea Denisse) [21:05:21] (PS1) Krinkle: varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) [21:05:49] Hey I'll be doing a few deploys for the Web Team deployment window [21:05:59] @zabe are you done with everything? [21:06:02] (CR) CI reject: [V:-1] varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle) [21:06:13] yep [21:07:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81573 and previous config saved to /var/cache/conftool/dbconfig/20250819-210710-ladsgroup.json [21:07:17] thanks! 
[21:07:25] (PS1) Arlolra: Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349)
[21:10:32] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/PageImages] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521) (owner: Jdlrobson)
[21:11:43] (Abandoned) Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: Jdlrobson)
[21:12:53] (CR) Eevans: [C:+2] sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - https://gerrit.wikimedia.org/r/1180226 (owner: Eevans)
[21:14:32] (Merged) jenkins-bot: sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - https://gerrit.wikimedia.org/r/1180226 (owner: Eevans)
[21:15:58] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply
[21:16:30] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[21:19:36] (CR) Subramanya Sastry: [C:+1] Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: Arlolra)
[21:22:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81574 and previous config saved to /var/cache/conftool/dbconfig/20250819-212218-ladsgroup.json
[21:22:45] (Merged) jenkins-bot: Revert "Stop sending more than one og:image to social media platforms" [extensions/PageImages] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521) (owner: Jdlrobson)
[21:23:09] !log jdlrobson@deploy1003 Started scap
sync-world: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]]
[21:27:00] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:28:12] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync
[21:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:35:57] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]] (duration: 12m 47s)
[21:37:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T402010)', diff saved to https://phabricator.wikimedia.org/P81575 and previous config saved to /var/cache/conftool/dbconfig/20250819-213725-ladsgroup.json
[21:37:30] T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[21:37:46] (PS1) Dzahn: cache::text: set apt-staging to NOT cache [puppet] - https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284)
[21:38:46] SRE, Infrastructure-Foundations, Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11100438 (Dzahn) How about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180234/1/hieradata/role/common/cache/text.yaml to just turn off the caching a...
[21:42:58] (PS2) Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579)
[21:43:53] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[21:44:02] (CR) Dzahn: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1180208 (owner: Ncmonitor)
[21:44:10] (CR) Mstyles: "We're not actually rolling it out quite yet so leaving it at 0 for now. Happy to still update the commit message" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:44:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance
[21:44:49] (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1180208 (owner: Ncmonitor)
[21:46:28] (CR) Gergő Tisza: "Acknowledged" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:46:41] (CR) Catrope: [C:-1] OATHAuth: Add Config Variable (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:46:45] (CR) Gergő Tisza: [C:+1] OATHAuth: Add Config Variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:47:12] (im done with deploy window)
[21:48:04] (CR) Mstyles: "Going to hold off on deploying/merging this patch until we decide to initiate the actual rollout" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:48:39] (PS3) Mstyles: OATHAuth: Add Config Variable
[mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579)
[21:49:10] (CR) Mstyles: OATHAuth: Add Config Variable (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:49:14] (PS1) JHathaway: an-test-coord1002: switch to efi [puppet] - https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577)
[21:49:31] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577) (owner: JHathaway)
[21:49:45] (CR) Mstyles: [C:-2] "Will not submit until 2FA rollout plan is ready" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: Mstyles)
[21:52:14] (CR) JHathaway: [C:+2] an-test-coord1002: switch to efi [puppet] - https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577) (owner: JHathaway)
[22:03:58] ops-eqiad, SRE, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11100525 (VRiley-WMF) dumpsdata1005 and dumpsdata1006 are completed. Moving onto dumpsdata1007
[22:04:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[22:10:39] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1180239
[22:18:55] (CR) BCornwall: [V:+2 C:+2] "NS records are correct, dnssec disabled."
[puppet] - https://gerrit.wikimedia.org/r/1180239 (owner: Ncmonitor)
[22:20:06] (PS1) Catrope: doc.wikimedia.org CSP: Allow sendBeacon for piwik [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970)
[22:25:18] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[22:29:07] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[22:36:42] (CR) RLazarus: "Hi Roan -- I don't know doc.wm.o well. If you don't mind getting a +1 from someone on your team for the semantics of the change, I can tak" [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: Catrope)
[22:40:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[22:40:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T399249)', diff saved to https://phabricator.wikimedia.org/P81576 and previous config saved to /var/cache/conftool/dbconfig/20250819-224028-fceratto.json
[22:40:33] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:40:47] (PS2) Dzahn: various: fix puppet-lint legacy_fact warnings for collab services [puppet] - https://gerrit.wikimedia.org/r/1178619
[22:45:21] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[22:46:26] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1178619/6650/" [puppet] - https://gerrit.wikimedia.org/r/1178619 (owner: Dzahn)
[22:47:58] (PS3) Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[22:49:04] (PS4) Dzahn: zuul: add
systemd service for nodepool (WIP) [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[22:50:00] (CR) Jdrewniak: [C:+1] doc.wikimedia.org CSP: Allow sendBeacon for piwik [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: Catrope)
[22:50:12] (PS2) VolkerE: doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: Catrope)
[22:50:23] (CR) VolkerE: [C:+1] doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: Catrope)
[22:58:56] (CR) RLazarus: [C:+2] doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: Catrope)
[23:03:00] (Restored) Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: Jdlrobson)
[23:03:49] (PS1) Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180244 (https://phabricator.wikimedia.org/T402050)
[23:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down -
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:06:50] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker10[15-19].eqiad.wmnet} and (A:dse-k8s-master or A:dse-k8s-worker)
[23:08:58] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1180245
[23:10:04] ops-eqiad, SRE, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11100666 (VRiley-WMF) In progress→Resolved dumpsdata1007 is completed
[23:10:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:11:43] (CR) BCornwall: [V:+2 C:+2] "DNS is proper and dnssec is disabled" [puppet] - https://gerrit.wikimedia.org/r/1180245 (owner: Ncmonitor)
[23:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:23:05] jouncebot: nowandnext
[23:23:06] No deployments scheduled for the next 6 hour(s) and 36 minute(s)
[23:23:06] In 6 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0600)
[23:23:15] (CR) Zabe: [C:+2] Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] -
https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579) (owner: Zabe)
[23:24:09] (Merged) jenkins-bot: Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579) (owner: Zabe)
[23:24:38] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]]
[23:24:43] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[23:27:10] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:28:19] !log zabe@deploy1003 zabe: Continuing with sync
[23:33:37] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]] (duration: 08m 58s)
[23:33:41] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[23:38:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246
[23:38:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246 (owner: TrainBranchBot)
[23:45:13] (PS1) Ladsgroup: tables-catalog: Catalog BounceHandler and LoginNotify tables [puppet] - https://gerrit.wikimedia.org/r/1180247 (https://phabricator.wikimedia.org/T399302)
[23:51:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246 (owner: TrainBranchBot)