[00:00:13] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247 (Ladsgroup) NEW
[00:02:06] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096575 (Ladsgroup) rsyslog logs don't give anything useful. It turns on, immediately segfaults, tries again and so on.  This showed up only once: ` Aug 18 23:44:07 ms-be1071 rsyslogd[3526091]: fatal...
[00:06:14] <dancy>	 swfrench-wmf/zabe: The last scap deployment before zabe's was a security patch deployment which uses sync-file which skips l10n stuff.   Zabe things should be normal after this deployment but let me know if not.
[00:08:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765
[00:08:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765 (owner: TrainBranchBot)
[00:08:05] <swfrench-wmf>	 dancy: ah, there we go. yeah, that would be consistent with potentially leaving latent changes that could trigger a full build, which succeeded for -81 on z.abe's first attempt, but failed for -83 due to the disk ussye
[00:08:09] <swfrench-wmf>	 *issue
[00:10:08] <swfrench-wmf>	 yes, `539 languages rebuilt out of 539` at 23:04:15.217 from scap logs
[00:11:30] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] (duration: 37m 15s)
[00:11:35] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[00:12:28] <zabe>	 Ah, so it rebuilt l10n on my try
[00:12:32] <zabe>	 * first try
[00:12:53] <zabe>	 Which is why it did not do it on the current sync
[00:13:02] <swfrench-wmf>	 exactly, yeah
[00:17:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:23:17] <denisse>	 !log Clearing corrupted logs on ms-be1071 - T402247
[00:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:21] <stashbot>	 T402247: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247
[00:30:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179765 (owner: TrainBranchBot)
[00:31:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:33:09] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096653 (andrea.denisse) I think that the drive is failing:  `sudo dmesg | grep -i 'error\|fail\|ata'`: ` [37968478.484217] blk_update_request: I/O error, dev sdg, sector 348391680 op 0x0:(READ) flags...
[00:36:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:40:38] <maryum>	 I need to revert a security patch that was deployed a few hours earlier during the security deployment window
[00:40:48] <maryum>	 Would that conflict with anything anyone is doing?
[00:40:58] <maryum>	 it's causing an unbreak now issue
[00:41:07] <swfrench-wmf>	 jouncebot: nowandnext
[00:41:07] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 18 minute(s)
[00:41:07] <jouncebot>	 In 1 hour(s) and 18 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0200)
[00:41:13] <swfrench-wmf>	 maryum: I think you're clear
[00:41:23] <maryum>	 thanks! 
[00:41:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:42:26] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11096685 (andrea.denisse) >>! In T402247#11096653, @andrea.denisse wrote: > I think that the drive is failing: >  > `sudo dmesg | grep -i 'error\|fail\|ata'`: > ` > [37968478.484217] blk_update_request...
[00:45:39] <wikibugs>	 (PS3) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592)
[00:45:53] <wikibugs>	 (CR) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: Krinkle)
[00:49:42] <maryum>	 okay about to run scap to undeploy security fix
[00:50:37] <maryum>	 scap running now
[01:03:16] <maryum>	 scap finished
[01:03:35] <maryum>	 !log undeploy security fix for T397396
[01:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376)
[01:08:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[01:22:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.15 [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1179769 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[01:24:18] <wikibugs>	 (PS6) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:26:49] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:27:16] <wikibugs>	 (PS7) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:27:42] <wikibugs>	 (PS8) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517)
[01:27:50] <wikibugs>	 (CR) Krinkle: "Done" [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: Krinkle)
[01:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0200)
[02:01:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[02:06:07] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[02:15:28] <perryprog>	 huh, that spike in logic successes is massive
[02:15:51] <perryprog>	 login*
[02:32:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:33:14] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:37:59] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:42:59] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0300)
[03:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:57] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:25:38] <wikibugs>	 (CR) BCornwall: [C:+2] hiera: ncmonitor: add wikimedia.ee to ignored_domains [puppet] - https://gerrit.wikimedia.org/r/1179688 (owner: Ssingh)
[03:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0400)
[04:04:25] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.12 (duration: 04m 23s)
[04:13:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:57] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:28:01] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[04:28:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[04:30:59] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[04:37:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:43:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:43:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:45:39] <jinxer-wm>	 FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:48:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:50:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:08:07] <wikibugs>	 (CR) Muehlenhoff: [C:+2] "Also confirmed out of band" [puppet] - https://gerrit.wikimedia.org/r/1178960 (owner: Jdlrobson)
[05:08:37] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:18:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Also update tracked email address [puppet] - https://gerrit.wikimedia.org/r/1179712 (https://phabricator.wikimedia.org/T401882) (owner: Muehlenhoff)
[05:18:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:20:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:21:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:25:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:42:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:47:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[05:52:49] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:54:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:55:54] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0600)
[06:02:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[06:11:51] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae5 (External: Arelion Transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:16:51] <jinxer-wm>	 RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:ae5 (External: Arelion Transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:18:37] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:39] <wikibugs>	 (CR) Muehlenhoff: [C:+2] ganeti-routed: Enable bird component for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179706 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:32:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:35:49] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[06:37:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:43:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:47:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:50:44] <wikibugs>	 (PS1) Giuseppe Lavagetto: hiddenparma: add policy file [puppet] - https://gerrit.wikimedia.org/r/1179971
[06:53:08] <wikibugs>	 (PS1) Muehlenhoff: durum: Enable bird component in magru [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:38] <kart_>	 I'm here, will start the deployment..
[07:00:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry)
[07:02:25] <wikibugs>	 (Merged) jenkins-bot: Content Translation: Remove unused configuration parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry)
[07:02:56] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]]
[07:03:01] <stashbot>	 T400671: Cleanup unused ContentTranslation configuration parameters - https://phabricator.wikimedia.org/T400671
[07:04:04] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6625/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:04:11] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:04:52] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:49] <wikibugs>	 (PS1) Muehlenhoff: doh: Enable bird component in magru [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392)
[07:10:03] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[07:10:19] <wikibugs>	 (PS3) Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298)
[07:10:51] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:10:56] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-dev: don't report dag runs to datahub [deployment-charts] - https://gerrit.wikimedia.org/r/1178905 (https://phabricator.wikimedia.org/T401932) (owner: Brouberol)
[07:12:47] <wikibugs>	 (PS3) Stevemunene: dns: Define DNS records for the dse-k8s-codfw ingress [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298)
[07:14:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11096965 (ayounsi) Resolved→Open @Jclark-ctr could you ask them if a device reboot would clear the alarm ?  We would ideally need to upgrade all switches of the VXLAN domain (so rows E and...
[07:14:14] <wikibugs>	 (PS1) Slyngshede: hiddenparma::api_tokens add dummy tokens [labs/private] - https://gerrit.wikimedia.org/r/1179986
[07:15:22] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179698|Content Translation: Remove unused configuration parameter (T400671)]] (duration: 12m 26s)
[07:15:26] <stashbot>	 T400671: Cleanup unused ContentTranslation configuration parameters - https://phabricator.wikimedia.org/T400671
[07:18:13] <wikibugs>	 (CR) Huei Tan: "we have postponed this backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan)
[07:22:42] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2025-08-14-134810-production [deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117)
[07:23:03] <kart_>	 Backport done, I'll deploy cxserver as there are no other patches in the window..
[07:24:09] <wikibugs>	 (CR) Ayounsi: "change lgtm but leaving it to Sukhe for the final +1" [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:24:16] <wikibugs>	 (CR) Ayounsi: "change lgtm but leaving it to Sukhe for the final +1" [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[07:24:16] <wikibugs>	 (PS19) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[07:25:01] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6626/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:25:09] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-08-14-134810-production [deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117) (owner: KartikMistry)
[07:26:03] <wikibugs>	 (CR) Stevemunene: "Added the A record for this and updated I83d6df36c9fa08eeabab4b724ed87e9345284175 with the codfw ingress IP as well." [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene)
[07:26:49] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-08-14-134810-production [deployment-charts] - https://gerrit.wikimedia.org/r/1179243 (https://phabricator.wikimedia.org/T399117) (owner: KartikMistry)
[07:27:37] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:27:59] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:29:08] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6628/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:30:38] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:30:41] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6629/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[07:31:36] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:32:40] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:33:13] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:33:26] <jnuche>	 kart_: I would like to start working on the train, please let me know when you're finished
[07:33:45] <wikibugs>	 (03PS1) 10DCausse: eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564)
[07:33:53] <kart_>	 !log Updated cxserver to 2025-08-14-134810-production (T399117, T393705)
[07:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:59] <stashbot>	 T399117: Support querying "easy" translation recommendations - https://phabricator.wikimedia.org/T399117
[07:33:59] <stashbot>	 T393705: Remove CXStats related code - https://phabricator.wikimedia.org/T393705
[07:34:02] <kart_>	 jnuche: I'm done.
[07:34:07] <hashar>	 o/
[07:34:15] <hashar>	 I am floating around if you need assistance :]
[07:34:16] <jnuche>	 kart_: ty
[07:34:25] <jnuche>	 thanks hashar :)
[07:36:18] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376)
[07:36:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot)
[07:37:14] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180080 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot)
[07:37:36] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.15  refs T396376
[07:37:40] <stashbot>	 T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376
[07:41:54] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259 (10MoritzMuehlenhoff) 03NEW
[07:44:05] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "verified against ops members in modules/admin/data/data.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/1179986 (owner: 10Slyngshede)
[07:44:43] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] hiddenparma::api_tokens add dummy tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1179986 (owner: 10Slyngshede)
[07:44:46] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] hiddenparma::api_tokens add dummy tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1179986 (owner: 10Slyngshede)
[07:44:49] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260 (10ABran-WMF) 03NEW
[07:45:02] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#11097039 (10ABran-WMF)
[07:46:13] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6630/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[07:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:51:18] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[07:55:45] <wikibugs>	 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11097062 (10Josve05a) > Follow-up from ticket #2025081710002753: >  > - OS: Windows 11   > - Browser: Chromium v126.0.6478.251   > - Browser add-ons: uBlock Origin, Shazam, Don't f***...
[07:58:30] <wikibugs>	 (03PS1) 10Ayounsi: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259)
[08:03:03] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097070 (10ayounsi)
[08:04:00] <wikibugs>	 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11097073 (10hashar) The images overflowing the disk on deploy was previously filed as {T387796} and a follow up action was to have the images to be garbage collected: T387927  We had the issue earlier in Augus...
[08:04:19] <hashar>	 jnuche: did it fail overnight?
[08:04:41] <jnuche>	 hashar: yeah, failed patch as usual
[08:08:42] <hashar>	 :(
[08:09:47] <hashar>	 I should make a note to self to verify them on Monday morning
[08:11:12] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352)
[08:12:38] <wikibugs>	 (03PS2) 10Ayounsi: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259)
[08:13:22] <wikibugs>	 (03CR) 10Ozge: [C:03+1] ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou)
[08:15:10] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou)
[08:16:04] <wikibugs>	 (03PS1) 10Ayounsi: Add esams routed ganeti VM ranges to network/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1180083 (https://phabricator.wikimedia.org/T402259)
[08:16:40] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097114 (10ayounsi)
[08:16:40] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update image for readability model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180082 (https://phabricator.wikimedia.org/T400352) (owner: 10AikoChou)
[08:19:13] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376)
[08:19:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot)
[08:20:10] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180084 (https://phabricator.wikimedia.org/T396376) (owner: 10TrainBranchBot)
[08:21:51] <wikibugs>	 (03PS1) 10Ayounsi: Remove esams RIPE Atlas measurements [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259)
[08:22:13] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[08:23:47] <wikibugs>	 (03PS20) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[08:24:15] <wikibugs>	 (03CR) 10Phuedx: [C:04-1] MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan)
[08:24:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11097122 (10ayounsi)
[08:25:04] <wikibugs>	 (03CR) 10Phuedx: [C:04-1] MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan)
[08:25:12] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] python-webapp: add external-services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) (owner: 10Elukey)
[08:25:18] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6631/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[08:27:41] <wikibugs>	 (03CR) 10Huei Tan: MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan)
[08:28:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi)
[08:28:47] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board), 13Patch-For-Review: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11097126 (10Mvolz)
[08:28:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:28:57] <wikibugs>	 (03PS5) 10Huei Tan: MinT: Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600)
[08:29:21] <wikibugs>	 (03CR) 10Huei Tan: MinT: Add stream configuration and registration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan)
[08:30:01] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1179711 (owner: 10Ayounsi)
[08:32:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:37:39] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add all Nokia switches to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/1179711 (owner: 10Ayounsi)
[08:38:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11097142 (10cmooney) >>! In T400783#11096965, @ayounsi wrote: > @Jclark-ctr could you ask them if a device reboot would clear the alarm ?+  Might just be worth giving it a shot.  We know we didn't h...
[08:39:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] doh: Enable bird component in magru [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff)
[08:39:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "but per Arzhel's comments prob best wait on Sukhbir" [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff)
[08:50:21] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261)
[08:52:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: 10Filippo Giunchedi)
[08:52:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "FYI I got a user report that this interacts badly with the mobile redirect, constantly going between `m.` and `www.` until the browser abo" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[08:53:58] <wikibugs>	 (03CR) 10FNegri: [C:03+2] aptrepo: import wikireplicas-utils from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[08:56:23] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.15  refs T396376
[08:56:28] <stashbot>	 T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376
[09:33:15] <wikibugs>	 (03CR) 10Btullis: [C:03+1] dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[09:33:56] <wikibugs>	 (03CR) 10Btullis: [C:03+1] dns: Define DNS records for the dse-k8s-codfw ingress [dns] - 10https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[09:43:22] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] logstash: remove udp in error alerts [alerts] - 10https://gerrit.wikimedia.org/r/1179221 (owner: 10Cwhite)
[09:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:00:05] <jouncebot>	 claime and hnowlan: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1000).
[10:00:55] <hnowlan>	 😎
[10:05:26] <claime>	 Let's go
[10:05:30] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] trafficserver: Add fractional routing to gateway-check [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert)
[10:11:57] <wikibugs>	 (03CR) 10Tiziano Fogli: "Quick question since we’re using a lot of ext filesystems: 5% of the space is reserved for root, right?" [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[10:13:41] <hnowlan>	 claime: tests look good to me
[10:13:48] <claime>	 hnowlan: same
[10:13:52] <claime>	 \o/
[10:16:04] <logmsgbot>	 jmm@cumin2002 reimage (PID 3139755) is awaiting input
[10:18:10] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[10:18:22] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update image for readability model on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352)
[10:18:22] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[10:20:05] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1180099 (https://phabricator.wikimedia.org/T402275)
[10:20:17] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "It also enables *creating* new ingestion recipes, even though they are unlikely to work, doesn't it?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[10:20:32] <claime>	 !log Fractional routing support for rest API deployed - T400131
[10:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:37] <stashbot>	 T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131
[10:20:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie
[10:21:00] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[10:21:19] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:21:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81485 and previous config saved to /var/cache/conftool/dbconfig/20250819-102126-ladsgroup.json
[10:21:30] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[10:22:05] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1180100 (https://phabricator.wikimedia.org/T402276)
[10:23:42] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T402276
[10:23:46] <stashbot>	 T402276: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T402276
[10:23:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81486 and previous config saved to /var/cache/conftool/dbconfig/20250819-102346-ladsgroup.json
[10:24:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T402276', diff saved to https://phabricator.wikimedia.org/P81487 and previous config saved to /var/cache/conftool/dbconfig/20250819-102414-fceratto.json
[10:25:20] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] resources: Exclude docker|containerd|kubelet mounts from alerts [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[10:25:31] <wikibugs>	 (03PS1) 10Ayounsi: gNMI collect more metrics [puppet] - 10https://gerrit.wikimedia.org/r/1180101 (https://phabricator.wikimedia.org/T395998)
[10:28:16] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1180100 (https://phabricator.wikimedia.org/T402276) (owner: 10Gerrit maintenance bot)
[10:30:47] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] dns: Define DNS records for the dse-k8s-codfw ingress [dns] - 10https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[10:31:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11097572 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm executed w...
[10:31:56] <logmsgbot>	 !log stevemunene@dns1004 START - running authdns-update
[10:32:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:32:48] <wikibugs>	 (03PS1) 10Mvolz: Remove all references to deprecated parameter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180103 (https://phabricator.wikimedia.org/T361576)
[10:33:09] <logmsgbot>	 !log stevemunene@dns1004 END - running authdns-update
[10:33:16] <federico3>	 !log Starting s2 codfw failover from db2204 to db2207 - T402276
[10:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:20] <stashbot>	 T402276: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T402276
[10:34:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T402276', diff saved to https://phabricator.wikimedia.org/P81488 and previous config saved to /var/cache/conftool/dbconfig/20250819-103402-fceratto.json
[10:34:25] <wikibugs>	 (03CR) 10Stevemunene: "Thanks, the DNS change is merged and deployed via authdns-update; we are ready to proceed." [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[10:38:15] <wikibugs>	 (03PS2) 10Muehlenhoff: apereo_cas: Remove some obsolete version checks [puppet] - 10https://gerrit.wikimedia.org/r/1125094
[10:38:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P81489 and previous config saved to /var/cache/conftool/dbconfig/20250819-103854-ladsgroup.json
[10:39:33] <moritzm>	 !log installing openjdk-17 security updates
[10:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:45] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[10:42:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097645 (10VRiley-WMF) 05Resolved→03Open
[10:42:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097646 (10VRiley-WMF) a:03VRiley-WMF
[10:42:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11097648 (10VRiley-WMF) 05Open→03Resolved
[10:44:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11097667 (10VRiley-WMF) @MatthewVernon I wanted to check in and see if any of these are ready for the install of the new card? Thank you!
[10:44:49] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcephosd1044 - vriley@cumin1003"
[10:45:09] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcephosd1044 - vriley@cumin1003"
[10:45:09] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:45:19] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2204.codfw.wmnet
[10:45:28] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2204 - Upgrading db2204.codfw.wmnet
[10:45:37] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2204 - Upgrading db2204.codfw.wmnet
[10:45:37] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044
[10:45:51] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044
[10:46:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[10:46:40] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[10:47:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:50:42] <wikibugs>	 (03CR) 10Tiziano Fogli: "I’m thinking about this because it seems that node_exporter itself avoids exporting these filesystems:" [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[10:51:20] <wikibugs>	 (03PS2) 10Hnowlan: trafficserver: simplify gateway-check path globs [puppet] - 10https://gerrit.wikimedia.org/r/1176473 (https://phabricator.wikimedia.org/T400131)
[10:51:49] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2204.codfw.wmnet
[10:52:32] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:53:30] <wikibugs>	 (03CR) 10Tiziano Fogli: "For me, it’s a +1, since I think these would also be better thresholds for the global alert, as reported in https://gerrit.wikimedia.org/r" [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[10:54:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P81491 and previous config saved to /var/cache/conftool/dbconfig/20250819-105401-ladsgroup.json
[10:54:26] <wikibugs>	 (03CR) 10Tiziano Fogli: "Unresolving the previous comment." [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[10:56:26] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:57:05] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Enable debug logging for the rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/1179753 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[11:00:02] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2204* gradually with 4 steps - Upgraded MariaDB
[11:01:36] <wikibugs>	 (03PS21) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[11:01:45] <logmsgbot>	 vriley@cumin1003 provision (PID 2748538) is awaiting input
[11:02:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284 (10fnegri) 03NEW
[11:02:40] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:03:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS trixie
[11:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[11:06:06] <wikibugs>	 (03CR) 10STran: [C:03+1] Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) (owner: 10Tchanders)
[11:08:03] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[11:08:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097809 (10VRiley-WMF) Setting up cloudcephosd1044, having to decom and provision it again
[11:09:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T402010)', diff saved to https://phabricator.wikimedia.org/P81493 and previous config saved to /var/cache/conftool/dbconfig/20250819-110909-ladsgroup.json
[11:09:14] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[11:09:24] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[11:09:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81494 and previous config saved to /var/cache/conftool/dbconfig/20250819-110931-ladsgroup.json
[11:10:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097818 (10VRiley-WMF)
[11:10:45] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:11:03] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044
[11:11:17] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044
[11:12:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:13:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff)
[11:13:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:13:35] <wikibugs>	 (03CR) 10Vgutierrez: dse-k8s: add dse-k8s-codfw hosts to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[11:13:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81495 and previous config saved to /var/cache/conftool/dbconfig/20250819-111353-ladsgroup.json
[11:17:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:19:01] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6632/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[11:19:38] <wikibugs>	 (03PS4) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298)
[11:20:35] <wikibugs>	 (03PS1) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103)
[11:21:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis)
[11:23:42] <wikibugs>	 (03PS5) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298)
[11:23:42] <wikibugs>	 (03PS1) 10Stevemunene: dse-k8s: Add dse-k8s-codfw to service list [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298)
[11:27:02] <wikibugs>	 (03CR) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[11:29:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P81496 and previous config saved to /var/cache/conftool/dbconfig/20250819-112900-ladsgroup.json
[11:33:21] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11097924 (10Jclark-ctr) a:03Jclark-ctr
[11:33:26] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11097926 (10ABran-WMF) Our current setup trains spamassassin by exporting mails to the mbox format. This format is not supp...
[11:34:04] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11097929 (10ABran-WMF) p:05Triage→03Medium
[11:35:43] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:37:30] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[11:37:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[11:38:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11097941 (10VRiley-WMF)
[11:38:59] <moritzm>	 !log uploaded openjdk-21 21.0.8+9-1~deb12u1 to bookworm-wikimedia (backport of latest security release)
[11:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:01] <wikibugs>	 (03Abandoned) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis)
[11:42:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] apereo_cas: Remove some obsolete version checks [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff)
[11:44:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P81497 and previous config saved to /var/cache/conftool/dbconfig/20250819-114407-ladsgroup.json
[11:45:29] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2204* gradually with 4 steps - Upgraded MariaDB
[11:45:35] <wikibugs>	 (03PS1) 10FNegri: wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266)
[11:47:06] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[11:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:49:33] <wikibugs>	 (03Restored) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis)
[11:51:42] <moritzm>	 !log installing openjdk-21 security updates
[11:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:28] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: add ability to filter sssd users [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261)
[11:56:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1180122
[11:56:15] <wikibugs>	 (03PS1) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[11:56:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[11:59:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T402010)', diff saved to https://phabricator.wikimedia.org/P81498 and previous config saved to /var/cache/conftool/dbconfig/20250819-115915-ladsgroup.json
[11:59:20] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[11:59:20] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[11:59:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81499 and previous config saved to /var/cache/conftool/dbconfig/20250819-115926-ladsgroup.json
[11:59:49] <wikibugs>	 (03PS2) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1200)
[12:00:54] <wikibugs>	 (03PS22) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[12:01:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81500 and previous config saved to /var/cache/conftool/dbconfig/20250819-120147-ladsgroup.json
[12:03:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan)
[12:03:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: wikireplicas: install scripts from deb package (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[12:05:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See also task for more context" [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: 10Filippo Giunchedi)
[12:06:41] <hashar>	 !log Restarting Jenkins
[12:06:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[12:15:43] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098055 (10Jclark-ctr)
[12:15:46] <moritzm>	 !log installing gnutls28 security updates on bullseye
[12:15:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:51] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:16:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P81501 and previous config saved to /var/cache/conftool/dbconfig/20250819-121654-ladsgroup.json
[12:17:03] <wikibugs>	 (03PS2) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[12:17:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[12:17:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:18:16] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for  - jclark@cumin1002"
[12:18:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for  - jclark@cumin1002"
[12:18:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:19:02] <wikibugs>	 (03PS3) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[12:19:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[12:20:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:20:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:20:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:20:58] <wikibugs>	 (03PS4) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[12:20:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:24:54] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11098067 (10ABran-WMF)
[12:28:44] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:28:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:32:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P81502 and previous config saved to /var/cache/conftool/dbconfig/20250819-123201-ladsgroup.json
[12:32:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:34:23] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058)
[12:34:40] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:35:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto)
[12:36:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto)
[12:37:25] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[12:37:41] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[12:37:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:37:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[12:38:17] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:38:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqiad and Arelion (2001:2035:0:a98::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:39:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s: Add dse-k8s-codfw to service list [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[12:40:27] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:41:57] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Remove old ENC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1179127 (https://phabricator.wikimedia.org/T401986) (owner: 10Majavah)
[12:43:23] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:44:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:45:36] <wikibugs>	 (03CR) 10Vgutierrez: "looks good, I'll +1 it after the previous CR has been merged and the service has some realservers pooled" [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[12:45:37] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:mariadb::cloudinfra: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1179119 (owner: 10Majavah)
[12:45:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene)
[12:47:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T402010)', diff saved to https://phabricator.wikimedia.org/P81503 and previous config saved to /var/cache/conftool/dbconfig/20250819-124709-ladsgroup.json
[12:47:14] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[12:47:24] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[12:47:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81504 and previous config saved to /var/cache/conftool/dbconfig/20250819-124731-ladsgroup.json
[12:49:42] <logmsgbot>	 jclark@cumin1002 provision (PID 1993005) is awaiting input
[12:49:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81505 and previous config saved to /var/cache/conftool/dbconfig/20250819-124952-ladsgroup.json
[12:51:15] <wikibugs>	 07sre-alert-triage, 06Discovery-Search, 06serviceops: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292 (10tappof) 03NEW
[12:52:27] <wikibugs>	 07sre-alert-triage, 06Discovery-Search, 06serviceops: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098148 (10tappof) I’m not entirely sure about the tags assigned to the task, so please feel free to re...
[12:53:02] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1019.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:53:57] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE, 10Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098152 (10tappof)
[12:54:40] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE, 10Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11098166 (10tappof) I’ve adjusted the tags.
[12:56:06] <wikibugs>	 (03PS1) 10Ayounsi: esams routed ganeti: add v4 and v6 IP/range [puppet] - 10https://gerrit.wikimedia.org/r/1180130 (https://phabricator.wikimedia.org/T402259)
[12:56:41] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:57:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:58:19] <wikibugs>	 (03CR) 10TChin: [C:03+1] eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: 10DCausse)
[12:59:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[12:59:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[13:00:56] <Lucas_WMDE>	 nothing to deploy :)
[13:02:05] <moritzm>	 !log restart Exim on Phabricator hosts to pick up GNU TLS security updates
[13:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1018.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:03:58] <moritzm>	 !log restart FPM on Phabricator hosts to pick up GNU TLS security updates
[13:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:46] <logmsgbot>	 jclark@cumin1002 reimage (PID 2030749) is awaiting input
[13:04:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move RPKI hosts to Bookworm - https://phabricator.wikimedia.org/T359502#11098203 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff These have been running on Bookworm since last year.
[13:05:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P81506 and previous config saved to /var/cache/conftool/dbconfig/20250819-130500-ladsgroup.json
[13:05:23] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:05:26] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:06:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] "That's right. We're lacking the proper infrastructure to run the UI-defined ingestion runs." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:06:51] <moritzm>	 !log restart slapd on main LDAP r/w servers to pick up GNU TLS security updates
[13:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1019.eqiad.wmnet with OS bullseye
[13:07:59] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:08:02] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-07-25-064834-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179661 (https://phabricator.wikimedia.org/T399117) (owner: 10KartikMistry)
[13:08:07] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1019.eqiad.wmnet with OS bullseye
[13:08:10] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:08:11] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:08:25] <wikibugs>	 (03Merged) 10jenkins-bot: Enable visibility of ingestion runs in the datahub UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180077 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol)
[13:08:47] <logmsgbot>	 jclark@cumin1002 provision (PID 2028919) is awaiting input
[13:10:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update Recommendation API to 2025-07-25-064834-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179661 (https://phabricator.wikimedia.org/T399117) (owner: 10KartikMistry)
[13:10:23] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage
[13:10:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:10:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:11:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:11:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1018.eqiad.wmnet with OS bullseye
[13:11:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098222 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1018.eqiad.wmnet with OS bullseye
[13:11:19] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::docker::firewall: Remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/1114661
[13:11:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1017.eqiad.wmnet with OS bullseye
[13:11:56] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1017.eqiad.wmnet with OS bullseye
[13:12:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:38] <wikibugs>	 (03PS1) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943)
[13:13:11] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:14:01] <wikibugs>	 (03PS2) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943)
[13:14:08] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage
[13:14:35] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup)
[13:14:36] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks, folks." [puppet] - 10https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff)
[13:16:16] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] durum: Enable bird component in magru [puppet] - 10https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff)
[13:17:14] <wikibugs>	 (03CR) 10Ssingh: "@bcornwall@wikimedia.org can deploy this from Traffic, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[13:17:15] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:19:42] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:20:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P81507 and previous config saved to /var/cache/conftool/dbconfig/20250819-132007-ladsgroup.json
[13:22:06] <wikibugs>	 (03PS3) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943)
[13:22:36] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup)
[13:22:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:23:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto)
[13:25:33] <wikibugs>	 (03PS5) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:25:59] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "How would I add doh1001? I assume that it's not done by adding a file named doh1001.yaml :-p" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[13:28:10] <wikibugs>	 (03PS4) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943)
[13:29:20] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:29:42] <wikibugs>	 (03CR) 10Ssingh: "No, that's completely correct, adding a file called doh1001.yaml." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[13:30:36] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[13:30:44] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup)
[13:32:00] <wikibugs>	 (03PS6) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:32:07] <wikibugs>	 (03CR) 10Btullis: "nit: The example in the commit message isn't the best, because it's not the artifact cache where we need to have user specific access." [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:32:48] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6635/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:32:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:33:11] <wikibugs>	 (03CR) 10Ladsgroup: "https://puppet-compiler.wmflabs.org/output/1180134/7255/config-master1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup)
[13:33:30] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:33:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage
[13:33:56] <wikibugs>	 (03Abandoned) 10Btullis: Add a system user and group for blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1180113 (https://phabricator.wikimedia.org/T401103) (owner: 10Btullis)
[13:34:15] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[13:34:34] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[13:34:35] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[13:34:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm completed: - cloudcephosd1044 (**PASS**...
[13:35:07] <kart_>	 !log Updated Recommendation API to 2025-07-25-064834-production (T399117)
[13:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:11] <stashbot>	 T399117: Support querying "easy" translation recommendations - https://phabricator.wikimedia.org/T399117
[13:35:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81508 and previous config saved to /var/cache/conftool/dbconfig/20250819-133515-ladsgroup.json
[13:35:19] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[13:35:30] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:35:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81509 and previous config saved to /var/cache/conftool/dbconfig/20250819-133537-ladsgroup.json
[13:35:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098303 (10VRiley-WMF)
[13:35:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11098306 (10VRiley-WMF) cloudcephosd1044 is completed
[13:36:30] <wikibugs>	 (03PS7) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:36:35] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098310 (10Jclark-ctr)
[13:36:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1018.eqiad.wmnet with reason: host reimage
[13:37:00] <wikibugs>	 (03PS1) 10CDobbins: sre.loadbalancer: add cookbook to restart Liberica hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137
[13:37:11] <wikibugs>	 (03PS8) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:37:46] <wikibugs>	 (03PS9) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:38:07] <wikibugs>	 (03CR) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:38:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage
[13:38:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "FTR, I’ve just tried to [revert](https://sal.toolforge.org/log/xHCLwpgB8tZ8Ohr0SrgO) this commit on Beta in order to unbreak the cluster o" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[13:38:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81510 and previous config saved to /var/cache/conftool/dbconfig/20250819-133858-ladsgroup.json
[13:39:31] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6636/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:39:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1017.eqiad.wmnet with reason: host reimage
[13:40:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:41:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1020.eqiad.wmnet with OS bullseye
[13:41:14] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ms-fe1020.eqiad.wmnet with OS bullseye
[13:41:46] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[13:41:51] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "(Also, that URL I posted earlier should be https://www.wikidata.beta.wmcloud.org/wiki/Q11 of course, with beta and wikidata in the right o" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[13:41:52] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1018.eqiad.wmnet with reason: host reimage
[13:43:22] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:43:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:44:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.loadbalancer: add cookbook to restart Liberica hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins)
[13:46:24] <wikibugs>	 (03CR) 10Btullis: admin/data: add the analytics-ml system user to the analytics-privatedata users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:47:59] <wikibugs>	 (03PS10) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902)
[13:48:03] <wikibugs>	 (03CR) 10Brouberol: admin/data: add the analytics-ml system user to the analytics-privatedata users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:48:46] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6637/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[13:48:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1017.eqiad.wmnet with reason: host reimage
[13:49:09] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4039 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[13:49:44] <_joe_>	 !log systemctl reload varnish-frontend.service on cp4039
[13:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:11] <_joe_>	 that doesn't have the effect I hoped
[13:51:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
[13:51:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81511 and previous config saved to /var/cache/conftool/dbconfig/20250819-135112-fceratto.json
[13:51:16] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[13:51:20] <wikibugs>	 (03PS1) 10Ayounsi: magru: add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143
[13:51:25] <wikibugs>	 (03PS1) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145
[13:51:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson)
[13:52:09] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4039 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[13:52:44] <wikibugs>	 (03CR) 10Ayounsi: magru: add sandbox vlan to routed ganeti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180143 (owner: 10Ayounsi)
[13:53:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81513 and previous config saved to /var/cache/conftool/dbconfig/20250819-135340-fceratto.json
[13:53:58] <wikibugs>	 (03PS2) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145
[13:54:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P81514 and previous config saved to /var/cache/conftool/dbconfig/20250819-135405-ladsgroup.json
[13:54:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C:04-1] openstack: acquire cfssl certs for libvirt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[13:55:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson)
[13:56:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[13:56:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[13:56:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1019.eqiad.wmnet with OS bullseye
[13:57:04] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1019.eqiad.wmnet with OS bullseye complete...
[13:57:35] <wikibugs>	 (03PS2) 10Ayounsi: magru: add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143
[13:58:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[13:59:54] <wikibugs>	 (03PS3) 10Ayounsi: Add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143
[14:00:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:01:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[14:01:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[14:03:19] <wikibugs>	 (03PS23) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[14:03:42] <wikibugs>	 (03PS1) 10Aqu: analytics: Refine remove systemd job [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698)
[14:04:38] <wikibugs>	 (03PS2) 10FNegri: wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266)
[14:04:52] <wikibugs>	 (03CR) 10FNegri: wikireplicas: install scripts from deb package (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[14:05:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:05:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[14:05:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:06:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1018.eqiad.wmnet with OS bullseye
[14:06:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1018.eqiad.wmnet with OS bullseye complete...
[14:06:20] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[14:06:52] <wikibugs>	 (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) (owner: 10Aqu)
[14:06:59] <wikibugs>	 (03PS2) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145)
[14:07:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[14:07:23] <wikibugs>	 (03PS1) 10Ayounsi: Add magru sandbox prefixes to routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1180150
[14:07:26] <wikibugs>	 (03PS24) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[14:07:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[14:08:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[14:08:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:08:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81515 and previous config saved to /var/cache/conftool/dbconfig/20250819-140848-fceratto.json
[14:09:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:09:08] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1017.eqiad.wmnet with OS bullseye
[14:09:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P81516 and previous config saved to /var/cache/conftool/dbconfig/20250819-140913-ladsgroup.json
[14:09:16] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1017.eqiad.wmnet with OS bullseye complete...
[14:09:35] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098510 (10Jclark-ctr)
[14:10:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:10:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[14:11:32] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:11:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:12:36] <wikibugs>	 (03PS3) 10Andrew Bogott: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[14:12:41] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[14:13:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:14:40] <wikibugs>	 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 5 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11098531 (10Tgr) The patches are merged, and I added...
[14:15:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:16:38] <wikibugs>	 (03PS3) 10Ebernhardson: flink chart: Add a comment label (2nd try) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145
[14:19:35] <wikibugs>	 (03CR) 10Krinkle: [C:04-1] "OK. I think what's happening here is that Varnish is stripping the "m" from m.wikidata and not turning it into a www because the VCL is ha" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[14:20:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:22:21] <wikibugs>	 (03PS4) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145)
[14:22:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:22:52] <wikibugs>	 (03PS1) 10Dreamy Jazz: UserInfoCard: Link to metawiki for Special:CentralAuth links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690)
[14:23:11] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:23:11] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[14:23:11] <jouncebot>	 In 0 hour(s) and 6 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430)
[14:23:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[14:23:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81517 and previous config saved to /var/cache/conftool/dbconfig/20250819-142355-fceratto.json
[14:24:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T402010)', diff saved to https://phabricator.wikimedia.org/P81518 and previous config saved to /var/cache/conftool/dbconfig/20250819-142420-ladsgroup.json
[14:24:25] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[14:24:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690) (owner: 10Dreamy Jazz)
[14:24:36] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:25:08] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[14:25:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81519 and previous config saved to /var/cache/conftool/dbconfig/20250819-142514-ladsgroup.json
[14:25:33] <wikibugs>	 (03Merged) 10jenkins-bot: UserInfoCard: Link to metawiki for Special:CentralAuth links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180151 (https://phabricator.wikimedia.org/T397690) (owner: 10Dreamy Jazz)
[14:26:07] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]]
[14:26:11] <stashbot>	 T397690: User info card global account link should lead to meta - https://phabricator.wikimedia.org/T397690
[14:26:55] <wikibugs>	 (03CR) 10FNegri: [C:03+2] wikireplicas: install scripts from deb package [puppet] - 10https://gerrit.wikimedia.org/r/1180121 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[14:27:26] <wikibugs>	 (03PS4) 10Krinkle: mediawiki: Remove unused wikidata.org vhost and fix beta redirect [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592)
[14:28:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81521 and previous config saved to /var/cache/conftool/dbconfig/20250819-142832-ladsgroup.json
[14:29:42] <wikibugs>	 (03CR) 10Brouberol: [V:03+1 C:03+2] admin/data: add the analytics-ml system user to the analytics-privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/1180123 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430)
[14:30:12] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:30:42] <_joe_>	 !log running requestctl-admin upgrade-schema pattern on alert1002
[14:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:01] <wikibugs>	 (03PS2) 10Aqu: analytics: Refine remove systemd job [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698)
[14:31:22] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[14:31:40] <wikibugs>	 (03PS5) 10Krinkle: mediawiki: Remove unused wikidata.org vhost and fix beta redirect [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592)
[14:31:44] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[14:33:59] <wikibugs>	 (03CR) 10Krinkle: "krinkle@deployment-cache-text08:~$ sudo run-puppet-agent" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle)
[14:34:41] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "lgtm," [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180145 (owner: 10Ebernhardson)
[14:37:13] <Dreamy_Jazz>	 !log Running `/usr/local/bin/foreachwikiindblist group0.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"`
[14:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:18] <Dreamy_Jazz>	 !log Running `/usr/local/bin/foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"`
[14:37:19] <dcausse>	 jouncebot: nowandnext
[14:37:20] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1430)
[14:37:20] <jouncebot>	 In 0 hour(s) and 22 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1500)
[14:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81522 and previous config saved to /var/cache/conftool/dbconfig/20250819-143903-fceratto.json
[14:39:08] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[14:39:16] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180151|UserInfoCard: Link to metawiki for Special:CentralAuth links (T397690)]] (duration: 13m 09s)
[14:39:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
[14:39:20] <stashbot>	 T397690: User info card global account link should lead to meta - https://phabricator.wikimedia.org/T397690
[14:39:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81523 and previous config saved to /var/cache/conftool/dbconfig/20250819-143926-fceratto.json
[14:40:46] <wikibugs>	 SRE: Add known-client-ingestion-source objects an logic - https://phabricator.wikimedia.org/T402014#11098722 (Vgutierrez) p:Triage→Medium
[14:41:20] <wikibugs>	 (CR) DCausse: [C:+2] eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: DCausse)
[14:41:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81524 and previous config saved to /var/cache/conftool/dbconfig/20250819-144158-fceratto.json
[14:43:10] <wikibugs>	 (Merged) jenkins-bot: eventstreams: use stream_config_overrides for rdf update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1180076 (https://phabricator.wikimedia.org/T396564) (owner: DCausse)
[14:43:18] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[14:43:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P81525 and previous config saved to /var/cache/conftool/dbconfig/20250819-144339-ladsgroup.json
[14:44:52] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[14:45:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage
[14:45:31] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:45:36] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[14:45:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:47:36] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[14:47:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage
[14:48:47] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[14:50:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:51:32] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[14:52:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11098799 (Vgutierrez) @dang could you create a CR on gerrit with your public SSH key to confirm it? thanks!
[14:52:48] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[14:53:11] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[14:53:33] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[14:53:40] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[14:53:46] <_joe_>	 the puppet failures come from my changes, will resolve 
[14:54:41] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[14:55:31] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[14:55:45] <jinxer-wm>	 RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:56:10] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[14:57:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81526 and previous config saved to /var/cache/conftool/dbconfig/20250819-145706-fceratto.json
[14:57:59] <wikibugs>	 (CR) Brennen Bearnes: "Thanks for the digging!" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[14:58:19] <wikibugs>	 (CR) Ebernhardson: [C:+2] flink chart: Add a comment label (2nd try) [deployment-charts] - https://gerrit.wikimedia.org/r/1180145 (owner: Ebernhardson)
[14:58:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P81527 and previous config saved to /var/cache/conftool/dbconfig/20250819-145847-ladsgroup.json
[14:59:10] <wikibugs>	 (PS1) FNegri: sre.wikireplicas.add-wiki: update script paths [cookbooks] - https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266)
[14:59:57] <wikibugs>	 (PS2) FNegri: sre.wikireplicas.add-wiki: update script paths [cookbooks] - https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266)
[15:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1500).
[15:00:15] <wikibugs>	 (Merged) jenkins-bot: flink chart: Add a comment label (2nd try) [deployment-charts] - https://gerrit.wikimedia.org/r/1180145 (owner: Ebernhardson)
[15:01:29] <logmsgbot>	 !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T402309
[15:01:33] <stashbot>	 T402309: Deploy Phabricator/Phorge 2025-08-19 - https://phabricator.wikimedia.org/T402309
[15:01:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11098843 (Vgutierrez) I'm seeing you have 3 LDAP accounts at the moment: * https://ldap.toolforge.org/user/dang * https://ldap.toolforge.org/user/datwmd...
[15:02:29] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] sre.wikireplicas.add-wiki: update script paths [cookbooks] - https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[15:02:37] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@22fcde9]: deploy phab2002 for T402309
[15:03:19] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@22fcde9]: deploy phab2002 for T402309 (duration: 00m 42s)
[15:03:33] <wikibugs>	 (PS1) Chlod Alejandro: Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988)
[15:03:35] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@22fcde9]: deploy phab1004 for T402309
[15:04:14] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@22fcde9]: deploy phab1004 for T402309 (duration: 00m 39s)
[15:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[15:05:32] <wikibugs>	 (CR) Aqu: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) (owner: Aqu)
[15:06:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:08:02] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:08:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:08:25] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1020.eqiad.wmnet with OS bullseye
[15:08:33] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098879 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ms-fe1020.eqiad.wmnet with OS bullseye complete...
[15:08:37] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:03] <wikibugs>	 (PS14) Brennen Bearnes: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[15:11:29] <wikibugs>	 (CR) CI reject: [V:-1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[15:11:45] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098896 (Jclark-ctr)
[15:11:56] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11098897 (Jclark-ctr) Open→Resolved
[15:12:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81528 and previous config saved to /var/cache/conftool/dbconfig/20250819-151213-fceratto.json
[15:13:46] <wikibugs>	 (PS15) Brennen Bearnes: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[15:13:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T402010)', diff saved to https://phabricator.wikimedia.org/P81529 and previous config saved to /var/cache/conftool/dbconfig/20250819-151354-ladsgroup.json
[15:13:59] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[15:14:10] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[15:14:21] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Nuke] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: Chlod Alejandro)
[15:14:39] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
[15:14:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81530 and previous config saved to /var/cache/conftool/dbconfig/20250819-151446-ladsgroup.json
[15:15:26] <wikibugs>	 (CR) FNegri: [C:+2] sre.wikireplicas.add-wiki: update script paths [cookbooks] - https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[15:16:54] <wikibugs>	 (PS3) Aqu: analytics: Refine remove systemd job [puppet] - https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698)
[15:16:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81531 and previous config saved to /var/cache/conftool/dbconfig/20250819-151656-ladsgroup.json
[15:17:03] <wikibugs>	 (CR) Brennen Bearnes: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[15:19:10] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1180122 (owner: Muehlenhoff)
[15:20:00] <wikibugs>	 (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[15:21:41] <wikibugs>	 (Merged) jenkins-bot: sre.wikireplicas.add-wiki: update script paths [cookbooks] - https://gerrit.wikimedia.org/r/1180157 (https://phabricator.wikimedia.org/T395266) (owner: FNegri)
[15:21:48] <wikibugs>	 (CR) Brennen Bearnes: [C:+1] "Confirmed on phabricator-bullseye.devtools.eqiad1.wikimedia.cloud that it needs the `apc.` prefix. With that, it works, as can be seen at " [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[15:21:59] <wikibugs>	 (CR) Dzahn: [C:+1] nftables: throttle debugging [puppet] - https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971) (owner: Arnaudb)
[15:23:37] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:24:33] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:27:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81532 and previous config saved to /var/cache/conftool/dbconfig/20250819-152720-fceratto.json
[15:27:25] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[15:27:36] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[15:27:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81533 and previous config saved to /var/cache/conftool/dbconfig/20250819-152743-fceratto.json
[15:27:47] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[15:30:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81534 and previous config saved to /var/cache/conftool/dbconfig/20250819-153015-fceratto.json
[15:31:10] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[15:31:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:32:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81535 and previous config saved to /var/cache/conftool/dbconfig/20250819-153203-ladsgroup.json
[15:32:19] <wikibugs>	 (CR) Andrew Bogott: [C:+1] "I am Ok with merging this based on the previous pcc run" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[15:33:57] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:34:21] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1042
[15:35:22] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1042
[15:35:55] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:40:46] <wikibugs>	 (PS5) Andrew Bogott: openstack: acquire cfssl certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[15:40:59] <wikibugs>	 (CR) Andrew Bogott: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[15:41:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:47:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81536 and previous config saved to /var/cache/conftool/dbconfig/20250819-154711-ladsgroup.json
[15:49:14] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:50:13] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[15:50:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[15:51:13] <wikibugs>	 (PS1) Ahmon Dancy: Allow deployment group to sudo systemctl status spiderpig-{apiserver,jobrunner} [puppet] - https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945)
[15:52:09] <wikibugs>	 (PS8) Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058)
[15:52:17] <wikibugs>	 (CR) Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[16:00:05] <jouncebot>	 jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6640/co" [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[16:02:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T402010)', diff saved to https://phabricator.wikimedia.org/P81538 and previous config saved to /var/cache/conftool/dbconfig/20250819-160218-ladsgroup.json
[16:02:23] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[16:02:23] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
[16:02:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81539 and previous config saved to /var/cache/conftool/dbconfig/20250819-160230-ladsgroup.json
[16:04:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81540 and previous config saved to /var/cache/conftool/dbconfig/20250819-160439-ladsgroup.json
[16:04:47] <wikibugs>	 (CR) Scott French: [C:+1] haproxy: allow having multiple requestctl scopes [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[16:05:20] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[16:07:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1 C:+2] haproxy: allow having multiple requestctl scopes [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[16:10:13] <wikibugs>	 (PS1) Krinkle: varnish: Merge m-dot and X-Subdomain block in cluster_fe_recv_pre_purge [puppet] - https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595)
[16:12:03] <wikibugs>	 (PS1) Kosta Harlan: AbuseFilterHooks: Handle IP user performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298)
[16:15:08] <wikibugs>	 (PS1) Giuseppe Lavagetto: haproxy: re-add blank line for better readability [puppet] - https://gerrit.wikimedia.org/r/1180169
[16:15:49] <mszabo>	 jouncebot: nowandnext
[16:15:50] <jouncebot>	 For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1600)
[16:15:50] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700)
[16:16:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] haproxy: re-add blank line for better readability [puppet] - https://gerrit.wikimedia.org/r/1180169 (owner: Giuseppe Lavagetto)
[16:19:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81541 and previous config saved to /var/cache/conftool/dbconfig/20250819-161948-ladsgroup.json
[16:20:00] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage
[16:23:49] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage
[16:28:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298) (owner: Kosta Harlan)
[16:30:13] <wikibugs>	 (Merged) jenkins-bot: AbuseFilterHooks: Handle IP user performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180167 (https://phabricator.wikimedia.org/T402298) (owner: Kosta Harlan)
[16:30:41] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]]
[16:30:46] <stashbot>	 T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[16:31:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:32:38] <logmsgbot>	 !log mszabo@deploy1003 mszabo, kharlan: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:33:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11099320 (Vgutierrez) I'm already seeing an account (https://ldap.toolforge.org/user/dang) requested on T288355 with some privileges: `   dang:     ensu...
[16:34:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81542 and previous config saved to /var/cache/conftool/dbconfig/20250819-163455-ladsgroup.json
[16:39:12] <logmsgbot>	 !log mszabo@deploy1003 Sync cancelled.
[16:45:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:48:27] <logmsgbot>	 vriley@cumin1003 reimage (PID 2781057) is awaiting input
[16:48:44] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003"
[16:48:45] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[16:48:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099378 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm completed: - cloudcephosd1042 (**PASS**...
[16:50:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T402010)', diff saved to https://phabricator.wikimedia.org/P81543 and previous config saved to /var/cache/conftool/dbconfig/20250819-165003-ladsgroup.json
[16:50:08] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[16:50:08] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[16:50:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81544 and previous config saved to /var/cache/conftool/dbconfig/20250819-165015-ladsgroup.json
[16:51:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81545 and previous config saved to /var/cache/conftool/dbconfig/20250819-165124-ladsgroup.json
[16:53:14] <wikibugs>	 (PS1) Zabe: Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579)
[16:53:24] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099398 (Eevans) rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which apparently was corrupted.
[16:57:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099403 (VRiley-WMF)
[16:57:32] <wikibugs>	 (CR) Dzahn: [C:+1] "thanks for pointing out it needed to be a hash, makes sense. should have tried that. also thanks for testing it. I was about to comment li" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[17:00:08] <jouncebot>	 swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700).
[17:00:13] <swfrench-wmf>	 o/
[17:00:24] <swfrench-wmf>	 mszabo: I see your cancelled deployment in the scrollback. what's the status? does that patch need to be reverted before deployments can safely proceed?
[17:03:07] <mszabo>	 swfrench-wmf: ah sorry, so it should be fine to resume deployments but it needs a followup patch to actually do what it says on the tin
[17:03:08] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1175916/6641/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[17:03:36] <mszabo>	 I can finish the sync, it would merely end up adjusting the error message of a low-rate error on testwiki
[17:04:32] <swfrench-wmf>	 mszabo: ah, thanks for the follow-up. so, just to confirm, it's 100% safe for that patch to proceed to the rest of production as-is?
[17:04:45] <mszabo>	 yeah, let me sync it quickly to avoid confusion
[17:04:49] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[17:05:15] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]]
[17:05:19] <stashbot>	 T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[17:05:46] <swfrench-wmf>	 mszabo: thanks for confirming, and for completing the deployment. I'll wait until you're done to proceed with mine :)
[17:06:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81546 and previous config saved to /var/cache/conftool/dbconfig/20250819-170632-ladsgroup.json
[17:07:11] <logmsgbot>	 !log mszabo@deploy1003 kharlan, mszabo: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:07:33] <logmsgbot>	 !log mszabo@deploy1003 kharlan, mszabo: Continuing with sync
[17:07:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11099428 (VRiley-WMF) So, here is what has been completed so far  cloudcephosd1042 C8 U12 CableID 5204 Port 29 CableID 20220266 (Not set as of yet) Port 28   cloudcephosd1043 C8 U13 CableID...
[17:08:22] <wikibugs>	 (CR) Dzahn: [C:+1] "it's a sudo privileges change - but allowing to see "status" for something you are already allowed to restart seems very harmless" [puppet] - https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[17:10:43] <mutante>	 !log phab2002/phab1004 - systemctl restart php7.4-fpm after we increased APCu shared memory segment size (T401157)
[17:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:48] <stashbot>	 T401157: Phorge setup check caching is misbehaving, leading to many duck-sound=quack requests - https://phabricator.wikimedia.org/T401157
[17:12:53] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180167|AbuseFilterHooks: Handle IP user performers without actor records (T402298)]] (duration: 07m 38s)
[17:12:58] <stashbot>	 T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[17:13:57] <swfrench-wmf>	 proceeding with the infra window
[17:15:10] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: No-op deployment to introduce new build report metadata - T401721
[17:15:15] <stashbot>	 T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721
[17:17:13] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11099522 (Dzahn) Thanks for confirming that, Katie!  I think there is nothing else to do on this task.  I verified Chris is in the NDA spreadsheet SRE looks at and that he has the r...
[17:17:25] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: No-op deployment to introduce new build report metadata - T401721 (duration: 02m 52s)
[17:17:47] <wikibugs>	 (CR) CDanis: [C:+2] haproxy: maxconn for varnish threads limit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) (owner: CDanis)
[17:17:53] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11099525 (Dzahn) In progress→Resolved a:Dzahn please reopen if you think anything else needs to be done.
[17:18:26] <swfrench-wmf>	 mszabo: I don't have anything else planned for the window, so all yours if you're ready for your follow-on patch.
[17:20:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:21:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81548 and previous config saved to /var/cache/conftool/dbconfig/20250819-172139-ladsgroup.json
[17:22:00] <wikibugs>	 (CR) BCornwall: [C:+1] "Done" [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[17:25:33] <wikibugs>	 SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11099555 (ecarg) Thank you so much, @RLazarus
[17:27:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11099561 (cmooney) Probably worth opening a JTAC case to ask about this.  One thing I note is that only FPCs 1 and 3 are in use on this box...
[17:31:02] <wikibugs>	 (PS1) Btullis: Use a specific image version for blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103)
[17:31:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:32:48] <wikibugs>	 (CR) BCornwall: [V:+1 C:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6642/console" [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[17:33:48] <wikibugs>	 (CR) Btullis: [C:+2] Use a specific image version for blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103) (owner: Btullis)
[17:34:03] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099594 (andrea.denisse) >>! In T402247#11099398, @Eevans wrote: > rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which apparently was corrupted.  Strange, I cleared...
[17:35:53] <wikibugs>	 (Merged) jenkins-bot: Use a specific image version for blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1180184 (https://phabricator.wikimedia.org/T401103) (owner: Btullis)
[17:36:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T402010)', diff saved to https://phabricator.wikimedia.org/P81550 and previous config saved to /var/cache/conftool/dbconfig/20250819-173646-ladsgroup.json
[17:36:51] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[17:37:02] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[17:37:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81551 and previous config saved to /var/cache/conftool/dbconfig/20250819-173709-ladsgroup.json
[17:37:30] <logmsgbot>	 !log zoe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[17:38:21] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[17:38:31] <logmsgbot>	 !log zoe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[17:38:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81552 and previous config saved to /var/cache/conftool/dbconfig/20250819-173833-ladsgroup.json
[17:39:17] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[17:43:59] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] varnish: Allow customising "contact noc@" error [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[17:46:13] <wikibugs>	 (PS9) RLazarus: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: Krinkle)
[17:46:29] <wikibugs>	 (PS1) Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[17:46:50] <rzl>	 jouncebot: nowandnext
[17:46:51] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1700)
[17:46:51] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800)
[17:47:07] <wikibugs>	 (CR) CI reject: [V:-1] zuul: add systemd service for nodepool (WIP) [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn)
[17:47:11] <rzl>	 swfrench-wmf, mszabo: do you mind if I sneak out Krinkle's patch before the infra window is over?
[17:47:22] <wikibugs>	 (CR) Dzahn: [C:+1] mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: Krinkle)
[17:47:33] <swfrench-wmf>	 rzl: no objections on my end!
[17:48:32] <rzl>	 🛫
[17:49:22] <wikibugs>	 (CR) RLazarus: [C:+2] "Thanks! I made one adjustment, take a look for future reference (top-level keys are domains, not paths) but I'll get this shipped out duri" [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) (owner: Krinkle)
[17:53:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81553 and previous config saved to /var/cache/conftool/dbconfig/20250819-175340-ladsgroup.json
[17:56:34] <wikibugs>	 (PS2) Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[17:57:46] <logmsgbot>	 !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1174872
[17:58:44] <logmsgbot>	 !log rzl@deploy1003 rzl: https://gerrit.wikimedia.org/r/1174872 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:59:45] <logmsgbot>	 !log rzl@deploy1003 rzl: Continuing with sync
[18:00:05] <jouncebot>	 jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800).
[18:00:27] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro
[18:00:49] <rzl>	 jnuche, jeena: ^ above is warpping up shortly, sorry to run over!
[18:00:57] <rzl>	 *wrapping
[18:01:08] <wikibugs>	 (PS1) Dzahn: cloud: add profile::pki::client::ensure for wikistats VPS project [puppet] - https://gerrit.wikimedia.org/r/1180189
[18:02:13] <jeena>	 rzl: No problem, we already deployed during the European window
[18:02:55] <rzl>	 👍
[18:04:56] <logmsgbot>	 !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1174872 (duration: 07m 51s)
[18:06:31] <wikibugs>	 SRE-swift-storage: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11099728 (Eevans) >>! In T402247#11099594, @andrea.denisse wrote: >>>! In T402247#11099398, @Eevans wrote: >> rsyslog is back up and running after clearing the queue (`/var/spool/rsyslog/*`), which app...
[18:08:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81554 and previous config saved to /var/cache/conftool/dbconfig/20250819-180848-ladsgroup.json
[18:09:56] <wikibugs>	 (CR) JHathaway: "Would love a review when you have a moment" [puppet] - https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: JHathaway)
[18:11:06] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[18:15:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:17:47] <sukhe>	 ^ it's back but yeah
[18:18:35] <logmsgbot>	 !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[18:18:48] <mutante>	 sukhe: noted :/ sigh
[18:20:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:20:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:22:16] <mutante>	 !log gerrit - deactivated user Keccake256 for spam-like comments and edits on commons
[18:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T402010)', diff saved to https://phabricator.wikimedia.org/P81555 and previous config saved to /var/cache/conftool/dbconfig/20250819-182356-ladsgroup.json
[18:24:04] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[18:24:13] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance
[18:24:17] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.206.0" for 2 host(s)
[18:24:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81556 and previous config saved to /var/cache/conftool/dbconfig/20250819-182419-ladsgroup.json
[18:24:46] <swfrench-wmf>	 jouncebot: nowandnext
[18:24:46] <jouncebot>	 For the next 1 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800)
[18:24:46] <jouncebot>	 In 1 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000)
[18:25:13] <swfrench-wmf>	 FYI, since the train already advanced, dancy and I are going to deploy and test a new scap release
[18:26:04] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.206.0" completed for 2 hosts
[18:26:22] <dancy>	 swfrench-wmf: Ready for testing
[18:26:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81557 and previous config saved to /var/cache/conftool/dbconfig/20250819-182642-ladsgroup.json
[18:26:50] <swfrench-wmf>	 dancy: amazing, thank you! I'll start with a `--stop-before-sync` run to verify the resulting diffs make sense
[18:26:55] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:27:00] <swfrench-wmf>	 (i.e., no diffs, heh)
[18:27:03] <wikibugs>	 (CR) Ssingh: "in this commit itself, you should bring in the changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172056/15/modules/dnsrec" [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[18:27:45] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to verify image build and dependent helmfile values - T401721
[18:27:49] <stashbot>	 T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721
[18:28:34] <logmsgbot>	 !log swfrench@deploy1003 Stopping before sync operations
[18:30:12] <swfrench-wmf>	 `php.version` is emitted and diffs are clean as expected
[18:30:26] <dancy>	 Excellent
[18:30:57] <swfrench-wmf>	 just for completeness, I'll run through a full should-not-affect-anything sync-world
[18:32:43] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:34:41] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: No-code-changes scap sync-world with new helmfile values - T401721
[18:34:46] <stashbot>	 T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721
[18:36:30] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:36:43] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:36:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:38:20] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[18:39:29] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: No-code-changes scap sync-world with new helmfile values - T401721 (duration: 06m 28s)
[18:39:57] <wikibugs>	 (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177471 (owner: Ncmonitor)
[18:40:20] <swfrench-wmf>	 all done. thank you very much, dancy
[18:41:21] <wikibugs>	 (PS8) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[18:41:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81558 and previous config saved to /var/cache/conftool/dbconfig/20250819-184149-ladsgroup.json
[18:42:07] <wikibugs>	 (CR) BCornwall: "Removed wikimedia.ee" [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[18:44:47] <logmsgbot>	 !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[18:46:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:47:45] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:50:08] <wikibugs>	 (PS1) Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193
[18:51:36] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193 (owner: Dzahn)
[18:51:42] <wikibugs>	 (PS2) Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193
[18:51:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:53:40] <wikibugs>	 (PS3) Dzahn: gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193
[18:54:08] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193 (owner: Dzahn)
[18:55:37] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-19 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo
[18:55:37] <icinga-wm>	 .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir
[18:55:37] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-19 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo
[18:55:37] <icinga-wm>	 .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir
[18:56:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81559 and previous config saved to /var/cache/conftool/dbconfig/20250819-185656-ladsgroup.json
[18:58:00] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[18:59:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:00:05] <wikibugs>	 (PS1) Dzahn: gerrit: block another Huawei subnet for scraping [puppet] - https://gerrit.wikimedia.org/r/1180194
[19:00:54] <wikibugs>	 (PS2) Dzahn: gerrit: block another Huawei subnet for scraping [puppet] - https://gerrit.wikimedia.org/r/1180194
[19:02:48] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: block another Huawei subnet for scraping [puppet] - https://gerrit.wikimedia.org/r/1180194 (owner: Dzahn)
[19:03:53] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-19 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to verify wikipediasummitindia.com against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, vo
[19:03:53] <icinga-wm>	 .com, voyagewiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir
[19:04:29] <wikibugs>	 (CR) Dzahn: [C:+2] cloud: add profile::pki::client::ensure for wikistats VPS project [puppet] - https://gerrit.wikimedia.org/r/1180189 (owner: Dzahn)
[19:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:04:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:07:37] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-19 on ncredir4001 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:07:37] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-19 on ncredir4002 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:07:53] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-19 on ncredir6002 is OK: SSL OK - Certificate wikipediasummitindia.com valid until 2025-11-17 17:54:53 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir
[19:11:10] <wikibugs>	 (CR) Andrea Denisse: [C:+2] centrallog: Enable debug logging for the rsyslog-receiver [puppet] - https://gerrit.wikimedia.org/r/1179753 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[19:12:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T402010)', diff saved to https://phabricator.wikimedia.org/P81560 and previous config saved to /var/cache/conftool/dbconfig/20250819-191204-ladsgroup.json
[19:12:09] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[19:12:14] <wikibugs>	 (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[19:12:19] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance
[19:13:04] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance
[19:13:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81561 and previous config saved to /var/cache/conftool/dbconfig/20250819-191311-ladsgroup.json
[19:15:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81562 and previous config saved to /var/cache/conftool/dbconfig/20250819-191537-ladsgroup.json
[19:16:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:17:06] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[19:22:22] <mszabo>	 jouncebot: nowandnext
[19:22:22] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T1800)
[19:22:22] <jouncebot>	 In 0 hour(s) and 37 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000)
[19:22:56] <wikibugs>	 (PS1) Máté Szabó: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298)
[19:23:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: Máté Szabó)
[19:23:10] <wikibugs>	 (PS2) Kosta Harlan: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: Máté Szabó)
[19:23:21] <logmsgbot>	 !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[19:24:34] <Dreamy_Jazz>	 !log Running `/usr/local/bin/foreachwikiindblist group1.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101"`
[19:24:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:48] <wikibugs>	 (CR) TrainBranchBot: "Approved by mszabo@deploy1003 using scap backport" [extensions/ORES] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: Máté Szabó)
[19:25:44] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[19:26:05] <wikibugs>	 (CR) Dzahn: [C:+2] Allow deployment group to sudo systemctl status spiderpig-{apiserver,jobrunner} [puppet] - https://gerrit.wikimedia.org/r/1180161 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[19:26:17] <wikibugs>	 (PS2) Krinkle: [LOCAL HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader)
[19:26:27] <wikibugs>	 (PS3) Krinkle: [LOCAL HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader)
[19:26:41] <wikibugs>	 (PS4) Krinkle: [BETA HACK] Hack mw-cli-wrapper to work without conftool [puppet] - https://gerrit.wikimedia.org/r/1058199 (https://phabricator.wikimedia.org/T370792) (owner: Gerrit Patch Uploader)
[19:30:28] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1180143 (owner: Ayounsi)
[19:30:42] <logmsgbot>	 !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad
[19:30:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81563 and previous config saved to /var/cache/conftool/dbconfig/20250819-193045-ladsgroup.json
[19:32:19] <logmsgbot>	 !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T395571
[19:32:23] <stashbot>	 T395571: Verify/fix Logstash pipeline/log rotate  for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571
[19:32:57] <wikibugs>	 (03Merged) 10jenkins-bot: AbuseFilterHooks: Gracefully handle performers without actor records [extensions/ORES] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180198 (https://phabricator.wikimedia.org/T402298) (owner: 10Máté Szabó)
[19:33:26] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]]
[19:33:30] <stashbot>	 T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[19:35:21] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:35:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:36:20] <wikibugs>	 (03PS1) 10BCornwall: Remove wikipediamustdie.com [dns] - 10https://gerrit.wikimedia.org/r/1180200
[19:37:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Remove wikipediamustdie.com [dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall)
[19:38:05] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571
[19:38:09] <stashbot>	 T395571: Verify/fix Logstash pipeline/log rotate  for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571
[19:39:35] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] "I'd also remove the NS records" [dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall)
[19:39:40] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[19:40:55] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] "Already done!" [dns] - 10https://gerrit.wikimedia.org/r/1180200 (owner: 10BCornwall)
[19:44:52] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571
[19:44:56] <stashbot>	 T395571: Verify/fix Logstash pipeline/log rotate  for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571
[19:45:02] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180198|AbuseFilterHooks: Gracefully handle performers without actor records (T402298)]] (duration: 11m 36s)
[19:45:06] <stashbot>	 T402298: Wikimedia\Assert\PostconditionException: Postcondition failed: user_name variable must resolve to a UserIdentity - https://phabricator.wikimedia.org/T402298
[19:45:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81564 and previous config saved to /var/cache/conftool/dbconfig/20250819-194552-ladsgroup.json
[19:46:08] <wikibugs>	 (03PS1) 10Kosta Harlan: hcaptcha: Unset Referer header [puppet] - 10https://gerrit.wikimedia.org/r/1180204 (https://phabricator.wikimedia.org/T397841)
[19:47:20] <wikibugs>	 (03PS1) 10Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050)
[19:47:33] <logmsgbot>	 !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad
[19:50:14] <brett>	 !log import ncmonitor 2.0.0 into bookworm-wikimedia
[19:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:44] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:50:46] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:50:50] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:50:52] <brett>	 on it
[19:51:02] <logmsgbot>	 !log brett@dns1004 START - running authdns-update
[19:51:19] <sukhe>	 thanks, this alert is very unforgiving on purpose :/
[19:51:26] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:51:28] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:51:32] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is a0b2b3cd7b7940929757fa1f23b7ccc72ddf9853, dns.git is 483370c9d1fbd4b4a8cf543b7eae1f0001dce808) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:52:06] <logmsgbot>	 !log brett@dns1004 END - running authdns-update
[19:55:44] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:55:46] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:55:50] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:56:26] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:56:28] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:56:32] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:59:20] <wikibugs>	 (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180207
[19:59:24] <wikibugs>	 (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208
[19:59:28] <wikibugs>	 (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180209
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2000). Please do the needful.
[20:00:06] <jouncebot>	 chlod: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <chlod>	 o/ here
[20:01:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T402010)', diff saved to https://phabricator.wikimedia.org/P81565 and previous config saved to /var/cache/conftool/dbconfig/20250819-200100-ladsgroup.json
[20:01:05] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[20:01:15] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance
[20:01:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81566 and previous config saved to /var/cache/conftool/dbconfig/20250819-200122-ladsgroup.json
[20:03:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81567 and previous config saved to /var/cache/conftool/dbconfig/20250819-200350-ladsgroup.json
[20:04:37] <zabe>	 I can deploy
[20:05:00] <chlod>	 yippee
[20:05:26] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sretest2001.codfw.wmnet with reason: supermicro
[20:05:58] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: 10Chlod Alejandro)
[20:13:39] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] "All have correct NS records, no dnssec" [dns] - 10https://gerrit.wikimedia.org/r/1180207 (owner: 10Ncmonitor)
[20:13:59] <logmsgbot>	 !log brett@dns1004 START - running authdns-update
[20:15:07] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "Stop sending more than one og:image to social media platforms" [extensions/PageImages] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521)
[20:15:16] <logmsgbot>	 !log brett@dns1004 END - running authdns-update
[20:15:22] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 16 Sep 2025 07:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[20:15:45] <wikibugs>	 (03Merged) 10jenkins-bot: Restore inadvertently removed messages [extensions/Nuke] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180159 (https://phabricator.wikimedia.org/T153988) (owner: 10Chlod Alejandro)
[20:16:21] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]]
[20:16:25] <stashbot>	 T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988
[20:17:03] <zabe>	 This will take a bit since it rebuilds localisation cache
[20:18:05] <chlod>	 alrighty, all good with me
[20:18:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81568 and previous config saved to /var/cache/conftool/dbconfig/20250819-201858-ladsgroup.json
[20:19:33] <wikibugs>	 (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208 (owner: 10Ncmonitor)
[20:20:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:25:39] <wikibugs>	 (03PS8) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246)
[20:29:08] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] "Valid NS records and no DNSSEC enabled." [puppet] - 10https://gerrit.wikimedia.org/r/1180209 (owner: 10Ncmonitor)
[20:34:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81569 and previous config saved to /var/cache/conftool/dbconfig/20250819-203405-ladsgroup.json
[20:39:45] <logmsgbot>	 !log zabe@deploy1003 chlod, zabe: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:39:50] <stashbot>	 T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988
[20:39:59] <zabe>	 finally ready to test 
[20:40:04] <chlod>	 testing now
[20:40:19] <chlod>	 works perfectly :)
[20:40:24] <zabe>	 Nice!
[20:40:25] <logmsgbot>	 !log zabe@deploy1003 chlod, zabe: Continuing with sync
[20:41:55] <wikibugs>	 (03PS1) 10Krinkle: varnish: Write docs for some mobile user agent regexen [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595)
[20:49:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T402010)', diff saved to https://phabricator.wikimedia.org/P81570 and previous config saved to /var/cache/conftool/dbconfig/20250819-204913-ladsgroup.json
[20:49:14] <wikibugs>	 (03PS9) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246)
[20:49:18] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[20:49:29] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
[20:49:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T402010)', diff saved to https://phabricator.wikimedia.org/P81571 and previous config saved to /var/cache/conftool/dbconfig/20250819-204935-ladsgroup.json
[20:50:42] <wikibugs>	 (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:52:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T402010)', diff saved to https://phabricator.wikimedia.org/P81572 and previous config saved to /var/cache/conftool/dbconfig/20250819-205203-ladsgroup.json
[20:52:52] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180159|Restore inadvertently removed messages (T153988)]] (duration: 36m 31s)
[20:52:52] <wikibugs>	 (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[20:52:56] <stashbot>	 T153988: Migrate Special:Nuke to Codex - https://phabricator.wikimedia.org/T153988
[20:53:41] <chlod>	 thanks for the deploy, zabe! :)
[20:54:12] <zabe>	 yw :)
[20:54:53] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "centrallog: Enable debug logging for the rsyslog-receiver" [puppet] - 10https://gerrit.wikimedia.org/r/1180224
[20:59:39] <wikibugs>	 (03PS3) 10Bking: golang: add trixie-based golang-1.24 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295)
[21:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250819T2100)
[21:03:07] <wikibugs>	 (03PS1) 10Eevans: sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180226
[21:03:08] <wikibugs>	 (03PS1) 10Eevans: sessionstore: upgrade production to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180227
[21:03:37] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] Revert "centrallog: Enable debug logging for the rsyslog-receiver" [puppet] - 10https://gerrit.wikimedia.org/r/1180224 (owner: 10Andrea Denisse)
[21:05:21] <wikibugs>	 (03PS1) 10Krinkle: varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595)
[21:05:49] <Jdlrobson>	 Hey, I'll be doing a few deploys for the Web Team deployment window
[21:05:59] <Jdlrobson>	 @zabe are you done with everything?
[21:06:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[21:06:13] <zabe>	 yep
[21:07:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81573 and previous config saved to /var/cache/conftool/dbconfig/20250819-210710-ladsgroup.json
[21:07:17] <Jdlrobson>	 thanks!
[21:07:25] <wikibugs>	 (03PS1) 10Arlolra: Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349)
[21:10:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/PageImages] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521) (owner: 10Jdlrobson)
[21:11:43] <wikibugs>	 (03Abandoned) 10Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson)
[21:12:53] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180226 (owner: 10Eevans)
[21:14:32] <wikibugs>	 (03Merged) 10jenkins-bot: sessionstore: upgrade staging to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180226 (owner: 10Eevans)
[21:15:58] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply
[21:16:30] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[21:19:36] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra)
[21:22:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81574 and previous config saved to /var/cache/conftool/dbconfig/20250819-212218-ladsgroup.json
[21:22:45] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Stop sending more than one og:image to social media platforms" [extensions/PageImages] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180211 (https://phabricator.wikimedia.org/T295521) (owner: 10Jdlrobson)
[21:23:09] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]]
[21:27:00] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:28:12] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with sync
[21:35:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:35:57] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180211|Revert "Stop sending more than one og:image to social media platforms"]] (duration: 12m 47s)
[21:37:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T402010)', diff saved to https://phabricator.wikimedia.org/P81575 and previous config saved to /var/cache/conftool/dbconfig/20250819-213725-ladsgroup.json
[21:37:30] <stashbot>	 T402010: Add new rc_name_source_patrolled_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T402010
[21:37:46] <wikibugs>	 (03PS1) 10Dzahn: cache::text: set apt-staging to NOT cache [puppet] - 10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284)
[21:38:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11100438 (10Dzahn) How about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180234/1/hieradata/role/common/cache/text.yaml to just turn off the caching a...
[21:42:58] <wikibugs>	 (03PS2) 10Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579)
[21:43:53] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[21:44:02] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208 (owner: 10Ncmonitor)
[21:44:10] <wikibugs>	 (03CR) 10Mstyles: "We're not actually rolling it out quite yet so leaving it at 0 for now. Happy to still update the commit message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:44:23] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance
[21:44:49] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180208 (owner: 10Ncmonitor)
[21:46:28] <wikibugs>	 (03CR) 10Gergő Tisza: "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:46:41] <wikibugs>	 (03CR) 10Catrope: [C:04-1] OATHAuth: Add Config Variable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:46:45] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:47:12] <Jdlrobson>	 (I'm done with the deploy window)
[21:48:04] <wikibugs>	 (03CR) 10Mstyles: "Going to hold off on deploying/merging this patch until we decide to initiate the actual rollout" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:48:39] <wikibugs>	 (03PS3) 10Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579)
[21:49:10] <wikibugs>	 (03CR) 10Mstyles: OATHAuth: Add Config Variable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:49:14] <wikibugs>	 (03PS1) 10JHathaway: an-test-coord1002: switch to efi [puppet] - 10https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577)
[21:49:31] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577) (owner: 10JHathaway)
[21:49:45] <wikibugs>	 (03CR) 10Mstyles: [C:04-2] "Will not submit until 2FA rollout plan is ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) (owner: 10Mstyles)
[21:52:14] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] an-test-coord1002: switch to efi [puppet] - 10https://gerrit.wikimedia.org/r/1180236 (https://phabricator.wikimedia.org/T387577) (owner: 10JHathaway)
[22:03:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11100525 (10VRiley-WMF) dumpsdata1005 and dumpsdata1006 are completed. Moving onto dumpsdata1007
[22:04:51] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm
[22:10:39] <wikibugs>	 (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180239
[22:18:55] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] "NS records are correct, dnssec disabled." [puppet] - 10https://gerrit.wikimedia.org/r/1180239 (owner: 10Ncmonitor)
[22:20:06] <wikibugs>	 (03PS1) 10Catrope: doc.wikimedia.org CSP: Allow sendBeacon for piwik [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970)
[22:25:18] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[22:29:07] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage
[22:36:42] <wikibugs>	 (03CR) 10RLazarus: "Hi Roan -- I don't know doc.wm.o well. If you don't mind getting a +1 from someone on your team for the semantics of the change, I can tak" [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: 10Catrope)
[22:40:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[22:40:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T399249)', diff saved to https://phabricator.wikimedia.org/P81576 and previous config saved to /var/cache/conftool/dbconfig/20250819-224028-fceratto.json
[22:40:33] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:40:47] <wikibugs>	 (03PS2) 10Dzahn: various: fix puppet-lint legacy_fact warnings for collab services [puppet] - 10https://gerrit.wikimedia.org/r/1178619
[22:45:21] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS bookworm
[22:46:26] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1178619/6650/" [puppet] - 10https://gerrit.wikimedia.org/r/1178619 (owner: 10Dzahn)
[22:47:58] <wikibugs>	 (03PS3) 10Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[22:49:04] <wikibugs>	 (03PS4) 10Dzahn: zuul: add systemd service for nodepool (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614)
[22:50:00] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+1] doc.wikimedia.org CSP: Allow sendBeacon for piwik [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: 10Catrope)
[22:50:12] <wikibugs>	 (03PS2) 10VolkerE: doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: 10Catrope)
[22:50:23] <wikibugs>	 (03CR) 10VolkerE: [C:03+1] doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: 10Catrope)
[22:58:56] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] doc.wikimedia.org CSP: Allow sendBeacon for piwik (Matomo) [puppet] - 10https://gerrit.wikimedia.org/r/1180240 (https://phabricator.wikimedia.org/T368970) (owner: 10Catrope)
[23:03:00] <wikibugs>	 (03Restored) 10Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson)
[23:03:49] <wikibugs>	 (03PS1) 10Jdlrobson: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180244 (https://phabricator.wikimedia.org/T402050)
[23:04:33] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:04:33] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:06:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker10[15-19].eqiad.wmnet} and (A:dse-k8s-master or A:dse-k8s-worker)
[23:08:58] <wikibugs>	 (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180245
[23:10:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11100666 (10VRiley-WMF) 05In progress→03Resolved dumpsdata1007 is completed
[23:10:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:11:43] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] "DNS is proper and dnssec is disabled" [puppet] - https://gerrit.wikimedia.org/r/1180245 (owner: Ncmonitor)
[23:15:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:23:05] <zabe>	 jouncebot: nowandnext
[23:23:06] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 36 minute(s)
[23:23:06] <jouncebot>	 In 6 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0600)
[23:23:15] <wikibugs>	 (CR) Zabe: [C:+2] Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579) (owner: Zabe)
[23:24:09] <wikibugs>	 (Merged) jenkins-bot: Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1180178 (https://phabricator.wikimedia.org/T399579) (owner: Zabe)
[23:24:38] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]]
[23:24:43] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[23:27:10] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:28:19] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[23:33:37] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180178|Stop writing to cl_to and cl_collation on more wikis (T399579)]] (duration: 08m 58s)
[23:33:41] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[23:38:12] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246
[23:38:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246 (owner: TrainBranchBot)
[23:45:13] <wikibugs>	 (PS1) Ladsgroup: tables-catalog: Catalog BounceHandler and LoginNotify tables [puppet] - https://gerrit.wikimedia.org/r/1180247 (https://phabricator.wikimedia.org/T399302)
[23:51:45] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1180246 (owner: TrainBranchBot)