[00:01:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:02:25] RESOLVED: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142 (10thcipriani) 03NEW [00:06:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1053 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:40:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1219659 [00:40:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1219659 (owner: 10TrainBranchBot) [00:49:55] 10ops-eqiad, 06SRE, 
06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474924 (10phaultfinder) [00:51:27] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1219659 (owner: 10TrainBranchBot) [01:00:47] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1219662 [01:10:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1219662 (owner: 10TrainBranchBot) [01:13:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 04s) [01:26:25] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11474945 (10Ladsgroup) This ticket is touching on something that has been bothering me for some time. The thumbnail steps are "20, 40, 60... 
[01:26:59] (03CR) 10RLazarus: [C:03+2] scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) (owner: 10Krinkle) [01:29:23] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.32 - https://phabricator.wikimedia.org/T409510#11474947 (10RLazarus) 05Open→03Resolved [01:29:33] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808#11474949 (10RLazarus) 05Open→03Resolved [01:32:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1219662 (owner: 10TrainBranchBot) [01:33:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::26fc:4eff:fe41:5d10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:37:47] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11474994 (10RLazarus) 05Open→03Resolved Still some hosts remaining to upgrade to 1.35 in T410975, but we don't need this umbrella task open to track the multi-stage upgrade anymore. 
[01:38:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::26fc:4eff:fe41:5d10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:44:58] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475034 (10phaultfinder) [02:04:58] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475061 (10phaultfinder) [02:14:25] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops, and 3 others: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11475082 (10RLazarus) Helm timed out again when I tried to deploy machinetranslation for the next round of envoy upgrades. I'll retitle thi... 
[02:15:20] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11475083 (10RLazarus) [02:45:12] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475119 (10phaultfinder) [03:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:13:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:50:12] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T413083#11475180 (10phaultfinder) [05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:30] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:14:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:14:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:14:40] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [05:15:09] !incidents [05:15:09] 7212 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [05:15:09] 7211 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [05:15:10] 7210 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [05:15:23] !ack 7212 [05:15:23] 7212 (ACKED) 
ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [05:15:24] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:16:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:16:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [05:17:12] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:17:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11475187 (10Marostegui) [05:18:05] what's up [05:18:24] it looks like thanos/swift had a blip? 
[05:18:53] metrics look like it is recovering [05:18:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:18:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11475188 (10Marostegui) [05:19:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11475189 (10Marostegui) p:05Triage→03Medium [05:19:09] wow great [05:19:11] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:20:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11475192 (10Marostegui) Waiting to verify ssh key out of band. 
[05:24:22] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11475193 (10Marostegui) [05:25:26] jelto: I can't find anything obvious in the logs/graphs [05:25:38] I'm going to open a task for observability [05:27:58] (03PS1) 10Marostegui: data.yaml: Add bearloga to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) [05:28:03] I'm also still searching, thanos eqiad was slow for one or two minutes but I just see that in the prometheus probes and in the thanos dashboard: https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview?var-interval=5m&orgId=1&from=now-3h&to=now&timezone=UTC&var-datasource=000000026&refresh=30s [05:28:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11475197 (10Marostegui) p:05Triage→03Medium [05:28:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11475199 (10Marostegui) [05:34:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:04] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475217 (10phaultfinder) [05:35:20] jelto: https://phabricator.wikimedia.org/T413156 [05:35:53] thank you! 
[05:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:29] I guess I'm going to make some coffee [05:45:45] +1 [05:51:30] (03CR) 10Marostegui: [C:03+1] Remove the old non-fido-compliant ssh key for ariel [puppet] - 10https://gerrit.wikimedia.org/r/1219654 (https://phabricator.wikimedia.org/T413019) (owner: 10ArielGlenn) [06:15:01] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475280 (10phaultfinder) [06:39:58] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11475293 (10ABran-WMF) [06:49:17] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11475316 (10Jelto) I moved the GitLab project from https://gitlab.wikimedia.org/toolforge-repos/wikipedia25-years-of-wikipedia to https://git... 
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251219T0700) [07:03:31] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11475366 (10Marostegui) 05Open→03Stalled [07:03:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11475367 (10Marostegui) 05Open→03Stalled [07:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:13:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:22:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1219654 (https://phabricator.wikimedia.org/T413019) (owner: 10ArielGlenn) [07:26:03] (03PS1) 10Muehlenhoff: Record LDAP access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/1219672 [07:39:34] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for akhatun [puppet] - 10https://gerrit.wikimedia.org/r/1219672 (owner: 10Muehlenhoff) [07:45:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1219625 (owner: 10Herron) [07:46:14] PROBLEM - SSH on titan2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:46:47] (03CR) 10Marostegui: "Note: this also requires adding the user to ldap spiderpig-access group" [puppet] - 
10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) (owner: 10Marostegui) [07:48:43] (03CR) 10Muehlenhoff: "The LDAP group for Spiderpig was already granted via Wikimedia IDM" [puppet] - 10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) (owner: 10Marostegui) [07:49:12] FIRING: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:49:15] (03CR) 10Marostegui: "Oh thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) (owner: 10Marostegui) [07:49:30] (03CR) 10Marostegui: "Good to merge @mmuhlenhoff@wikimedia.org?" [puppet] - 10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) (owner: 10Marostegui) [07:49:57] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (since Tyler opened the task it's implicitly acked), but still needs manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/1219668 (https://phabricator.wikimedia.org/T413142) (owner: 10Marostegui) [07:50:24] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:53:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - 
https://phabricator.wikimedia.org/T413142#11475448 (10Marostegui) This requires approval from @kzimmerman as manager. [07:53:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11475451 (10Marostegui) [07:55:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251219T0800) [08:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:19:04] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11475535 (10Marostegui) Looks like es2028 keeps failing after applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219589/2/modules/prof... [08:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:54] (03CR) 10Elukey: [C:03+1] ml: add ml specific config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [08:32:01] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11475596 (10MatthewVernon) This morning, there are 1519692 objects in `docker_registry_codfw` in eqiad, most... 
[08:40:33] o/ Hello friends, I have an UBN I'd like to backport. For security reasons, there's a patch at https://phabricator.wikimedia.org/T413139#11475594 that I don't want to make public unless I can get sign off on getting it in, minimizing the amount of time the issue is visible. I can deploy and test on my own but can someone provide that sign off? [08:41:05] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 8560 [08:44:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8560 [08:44:58] cc jelto XioNoX [08:46:47] Tran: I don't think oncall have the authority for that [08:47:24] sobanski: you around ? maybe you have an answer for the question above ? [08:49:10] Looking [08:49:32] marostegui@cumin1003 reimage (PID 2655611) is awaiting input [08:50:07] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11475652 (10phaultfinder) [08:50:36] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 138881 [08:50:36] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138881 [08:55:24] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:04:12] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:08:08] (03PS3) 10Jelto: miscweb: add wikipedia25 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215231 
(https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:08:15] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005#11475760 (10ayounsi) [09:08:58] (03PS5) 10Daniel Kinzler: rest-gateway: improve structure of end-to-end tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 (https://phabricator.wikimedia.org/T413179) [09:09:12] (03CR) 10Daniel Kinzler: rest-gateway: improve structure of end-to-end tests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 (https://phabricator.wikimedia.org/T413179) (owner: 10Daniel Kinzler) [09:16:42] (03CR) 10Jelto: [C:03+2] miscweb: add wikipedia25 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215231 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:19:02] (03Merged) 10jenkins-bot: miscweb: add wikipedia25 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215231 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:21:17] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:21:47] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:38:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11475938 (10ayounsi) a:03Jhancock.wm Alright, let's go for 1G copper then. Please connect it to ge-0/0/4 on the mr1 side, and ge-0/0/47 on lsw1... 
[09:39:08] RECOVERY - Host lswtest-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:41:51] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:42:56] RECOVERY - Host lswtest-d8-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [09:43:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:47:07] (03PS1) 10Mszwarc: Only show temp accounts on IP if temp accounts are known [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219771 (https://phabricator.wikimedia.org/T413139) [09:48:50] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11476001 (10ABran-WMF) @CDanis explained to me that it was not required to have a private IP to move mailman's frontend behind CDN, so... 
[09:50:24] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:07] 06SRE, 06Infrastructure-Foundations: git::clone can fail to checkout its remote branch, leading to unrecoverable failure - https://phabricator.wikimedia.org/T413193 (10fgiunchedi) 03NEW [09:51:23] (03CR) 10STran: [C:03+1] Only show temp accounts on IP if temp accounts are known [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219771 (https://phabricator.wikimedia.org/T413139) (owner: 10Mszwarc) [09:51:48] PROBLEM - Memcached on titan2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [09:51:51] RESOLVED: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:54:04] RECOVERY - SSH on titan2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:54:12] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:38] RECOVERY - Memcached on titan2001 is OK: TCP OK - 0.030 second response time on 10.192.32.160 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [09:55:24] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:24] RESOLVED: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:51] !log installing Linux 5.10.247 on Bullseye hosts [09:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:53] (03PS1) 10Elukey: Fix frozen-requirements.txt to also work for Trixie [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219789 [09:59:12] FIRING: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:23] (03CR) 10Elukey: [V:03+2 C:03+2] Fix frozen-requirements.txt to also work for Trixie [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219789 (owner: 10Elukey) [09:59:52] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1769f71]: (no justification provided) [10:00:24] RESOLVED: [6x] JobUnavailable: Reduced availability for job pint in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:29] o/ Discussed this off-band and I'll be deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1219771 to fix the UBN outlined above [10:00:36] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1769f71]: (no justification provided) (duration: 00m 44s) [10:01:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 
using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219771 (https://phabricator.wikimedia.org/T413139) (owner: 10Mszwarc) [10:03:00] (03Merged) 10jenkins-bot: Only show temp accounts on IP if temp accounts are known [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219771 (https://phabricator.wikimedia.org/T413139) (owner: 10Mszwarc) [10:03:30] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1219771|Only show temp accounts on IP if temp accounts are known (T413139)]] [10:05:34] !log stran@deploy2002 mszwarc, stran: Backport for [[gerrit:1219771|Only show temp accounts on IP if temp accounts are known (T413139)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:06:50] confirmed this fixes the issue we saw, will continue with deploy [10:07:00] !log stran@deploy2002 mszwarc, stran: Continuing with sync [10:10:22] (03PS1) 10Cathal Mooney: Nokia: set FEC mode explicity for all 100G links on v25 [homer/public] - 10https://gerrit.wikimedia.org/r/1219809 (https://phabricator.wikimedia.org/T412157) [10:11:07] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219771|Only show temp accounts on IP if temp accounts are known (T413139)]] (duration: 07m 37s) [10:11:31] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:12:04] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:12:17] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:12:43] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:16:39] (03CR) 10Ayounsi: [C:03+1] Nokia: set FEC mode explicity for all 100G links on v25 [homer/public] - 10https://gerrit.wikimedia.org/r/1219809 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [10:16:44] (03CR) 10Btullis: [C:03+2] Use relative path for "latest" 
symlinks [dumps] - 10https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: 10Jakob) [10:18:43] (03CR) 10Cathal Mooney: [C:03+2] Nokia: set FEC mode explicity for all 100G links on v25 [homer/public] - 10https://gerrit.wikimedia.org/r/1219809 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [10:20:02] (03Merged) 10jenkins-bot: Nokia: set FEC mode explicity for all 100G links on v25 [homer/public] - 10https://gerrit.wikimedia.org/r/1219809 (https://phabricator.wikimedia.org/T412157) (owner: 10Cathal Mooney) [10:21:53] (03CR) 10Dreamy Jazz: [C:03+1] "LGTM, we don't need to have this data for the beta wikis (even if any of the metrics were relevant). Will need a puppet window for this, a" [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders) [10:23:44] (03PS1) 10Elukey: Remove extra wheels from the Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219811 [10:24:18] (03PS2) 10Elukey: Remove extra wheels from the Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219811 [10:24:34] (03CR) 10Elukey: [V:03+2 C:03+2] Remove extra wheels from the Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219811 (owner: 10Elukey) [10:25:21] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@b6cc5ab]: (no justification provided) [10:25:32] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@b6cc5ab]: (no justification provided) (duration: 00m 12s) [10:30:12] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11476154 (10Jelto) > In a next step I'll do a Kubernetes deployment and prepare the miscweb service in wikikube. wikipedia25 was deployed to... 
[10:33:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-build1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:43] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia: how to approach schema differences in SR-Linux versions - https://phabricator.wikimedia.org/T412157#11476193 (10cmooney) p:05High→03Low [10:41:28] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11476200 (10MatthewVernon) ` Dec 19 10:25:35 ms-be2081 container-sync: Since Fri Dec 19 09:25:31 2025: 12 sy... [10:50:31] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Upsert instance data in Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1214083 (https://phabricator.wikimedia.org/T410084) (owner: 10Federico Ceratto) [10:53:25] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-build1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:58] (03PS1) 10Silvan Heintze: Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) [11:00:58] (03CR) 10Silvan Heintze: "After having finished a workshop on awesome code reviews, I am now trying out review descriptions in gerrit. 
😊" [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) (owner: 10Silvan Heintze) [11:08:24] (03PS2) 10Silvan Heintze: Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) [11:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:11:07] (03CR) 10Silvan Heintze: Report progress of Wikibase entity dumps (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) (owner: 10Silvan Heintze) [11:11:26] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [11:11:40] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [11:19:20] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11476299 (10MoritzMuehlenhoff) [11:19:59] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11476300 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All roles were migrated to Puppet 7 (except the remaining puppetmas... [11:26:41] (03PS1) 10Btullis: Add a secret object to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219845 (https://phabricator.wikimedia.org/T406833) [11:48:18] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11476398 (10taavi) >>! 
In T408592#11475316, @Jelto wrote: > I moved the GitLab project from https://gitlab.wikimedia.org/toolforge-repos/wiki... [12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251219T0800) [12:00:04] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251219T1200). [12:05:04] (03PS2) 10Btullis: Add S3 support to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219845 (https://phabricator.wikimedia.org/T406833) [12:08:41] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11476543 (10MoritzMuehlenhoff) [12:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:11:56] (03PS1) 10Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [12:13:13] (03PS2) 10Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [12:13:57] (03PS3) 10Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [12:15:30] (03CR) 10CI reject: [V:04-1] team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [12:18:40] (03PS4) 10Cathal 
Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [12:19:59] (03CR) 10CI reject: [V:04-1] team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [12:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11476585 (10mpopov) [12:27:25] (03PS1) 10Btullis: Allow the analytics-test namespace to access the s3 endpoint in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219855 (https://phabricator.wikimedia.org/T406833) [12:40:11] (03CR) 10Btullis: [C:03+2] Allow the analytics-test namespace to access the s3 endpoint in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219855 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:40:14] (03CR) 10Btullis: [C:03+2] Add S3 support to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219845 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:41:56] (03Merged) 10jenkins-bot: Add S3 support to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219845 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:41:57] (03Merged) 10jenkins-bot: Allow the analytics-test namespace to access the s3 endpoint in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219855 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:44:34] (03PS5) 10Cathal Mooney: team-netops: add rule for packet 
drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [12:46:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:47:18] looking [12:48:15] 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789#11476640 (10BTullis) We have decommissioned this host, so I can remove the pending certificate. ` btullis@pupp... [12:48:18] 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789#11476642 (10BTullis) 05Open→03Resolved a:03BTullis [12:48:35] looking similar to https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin%3A9804&orgId=1&from=2025-12-18T06%3A45%3A02.371Z&to=2025-12-18T15%3A44%3A52.399Z&timezone=utc&var-site=%24__all [12:49:32] !incidents [12:49:33] 7214 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:49:33] 7212 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [12:49:33] 7211 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:50:03] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] 
START helmfile.d/dse-k8s-services/analytics-test: apply [12:50:10] federico3: acking [12:50:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:50:28] !ack 7214 [12:50:28] 7214 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:50:42] ah, you've been quicker :) [12:51:13] !incidents [12:51:13] 7214 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:51:14] 7212 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [12:51:14] 7211 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:56:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturati [13:03:37] (03CR) 10Btullis: [C:03+2] Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) (owner: 10Aqu) [13:05:18] (03Merged) 10jenkins-bot: Allow tests of canary events generation from airflow-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218232 (https://phabricator.wikimedia.org/T411989) (owner: 10Aqu) [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:42] (03PS6) 10Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [13:06:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatura [13:10:20] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:10:53] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:11:58] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:12:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:17:07] (03PS1) 10Muehlenhoff: Remove cookbooks to migrate roles/hosts to Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1219861 (https://phabricator.wikimedia.org/T349619) [13:17:20] (03PS2) 10Muehlenhoff: Remove cookbooks to migrate roles/hosts to Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1219861 (https://phabricator.wikimedia.org/T349619) [13:32:18] (03PS1) 10Aude: Add wordmark logo to beta cluster for Minerva (mobile) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219865 (https://phabricator.wikimedia.org/T413217) [13:39:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11476833 (10Marostegui) 05Open→03Stalled [13:39:51] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [13:42:00] (03PS1) 10Mforns: Bump up version of page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219866 (https://phabricator.wikimedia.org/T405041) [13:42:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11476837 (10Ahoelzl) Approved. 
[13:43:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:44:53] (03CR) 10Santiago Faci: [C:03+2] Bump up version of page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219866 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [13:45:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11476840 (10Marostegui) [13:46:42] (03Merged) 10jenkins-bot: Bump up version of page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219866 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [13:47:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11476857 (10Marostegui) @Ahoelzl you are both the manager and one of the approving parties for this group - so marking both as done. 
[13:47:55] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [13:47:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11476858 (10Marostegui) [13:48:09] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [13:50:19] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [13:50:34] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [13:50:41] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [13:50:53] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [13:52:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11476877 (10Jclark-ctr) Rebalanced power all Well within the 1650 limit per branch power sentry4 Phase, AA:L1-L2, Active Power 1219 power sentry4 Phase, AA:L2-L3,... 
[13:53:03] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11476878 (10Jclark-ctr) 05Open→03Resolved [14:18:57] (03PS1) 10Cathal Mooney: team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 [14:19:14] (03PS1) 10Muehlenhoff: Phabricator: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219871 [14:20:56] (03CR) 10CI reject: [V:04-1] team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 (owner: 10Cathal Mooney) [14:22:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219871 (owner: 10Muehlenhoff) [14:26:16] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 7609.53 ms [14:27:00] RECOVERY - Host wikikube-worker1053 is UP: PING WARNING - Packet loss = 0%, RTA = 630.47 ms [14:29:15] (03PS1) 10Muehlenhoff: KDC: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219872 [14:32:09] (03PS1) 10Muehlenhoff: mediawiki: Remove icu67 [puppet] - 10https://gerrit.wikimedia.org/r/1219873 [14:33:45] (03PS1) 10Muehlenhoff: mariadb::packages_client: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219874 [14:34:49] (03PS1) 10Muehlenhoff: ci/php: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219875 [14:36:12] (03PS1) 10Muehlenhoff: imagecatalog: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219876 [14:36:59] (03CR) 10CI reject: [V:04-1] ci/php: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219875 (owner: 10Muehlenhoff) [14:37:04] (03PS1) 10Muehlenhoff: nftables: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219877 [14:37:39] (03CR) 10CI reject: [V:04-1] nftables: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219877 
(owner: 10Muehlenhoff) [14:38:24] (03PS1) 10Muehlenhoff: opensearch/curator: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219879 [14:39:57] (03PS1) 10Muehlenhoff: graphite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219880 [14:40:46] (03PS1) 10Tiziano Fogli: prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) [14:41:03] (03PS2) 10Muehlenhoff: nftables: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219877 [14:41:10] (03PS1) 10Slyngshede: P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 [14:41:32] (03PS1) 10Slyngshede: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 [14:41:49] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 (owner: 10Slyngshede) [14:42:24] (03CR) 10Herron: [C:03+1] prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:42:47] (03CR) 10Cwhite: prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:43:07] (03CR) 10Cwhite: [C:04-1] prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:43:40] (03CR) 10CI reject: [V:04-1] P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [14:44:09] (03CR) 10Tiziano Fogli: prometheus/k8s-pod: drop 
mediawiki_action_api_modules_lantecy metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:44:11] (03PS2) 10Slyngshede: P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 [14:44:40] (03PS2) 10Tiziano Fogli: prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) [14:45:28] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) (owner: 10Cathal Mooney) [14:45:58] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 (owner: 10Slyngshede) [14:47:09] (03PS3) 10Slyngshede: P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 [14:48:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [14:48:55] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 (owner: 10Slyngshede) [14:49:04] (03PS7) 10Cathal Mooney: team-netops: add rule for packet drops in higher-priority queues [alerts] - 10https://gerrit.wikimedia.org/r/1219852 (https://phabricator.wikimedia.org/T384052) [14:49:08] (03CR) 10Tiziano Fogli: prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219878 
(https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:50:05] (03PS4) 10Slyngshede: P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 [14:50:57] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808#11476968 (10MoritzMuehlenhoff) [14:51:25] !incidents [14:51:26] 7215 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [14:51:26] 7214 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [14:51:26] 7212 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [14:52:52] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [14:53:04] !ack 7215 [14:53:05] 7215 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [14:53:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [14:55:39] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/k8s-pod: drop mediawiki_action_api_modules_lantecy metric [puppet] - 10https://gerrit.wikimedia.org/r/1219878 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [14:58:17] (03PS2) 10Slyngshede: P:cache haproxy support tagging residential proxies [puppet] - 10https://gerrit.wikimedia.org/r/1219882 [14:58:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% 
capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [15:02:16] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11477000 (10MatthewVernon) Regarding the iOS questions, I think larger iPads are something of a concern; T412161 is going to do some test... [15:03:51] RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [15:06:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm [15:07:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS bookworm [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:18:18] !incidents [15:18:18] 7215 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [15:18:19] 7214 (RESOLVED) 
TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [15:18:19] 7212 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:18:39] federico3: that transport link being down is vendor maintenance i think [15:19:21] checking the calendar... [15:20:57] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm [15:21:34] !log restored the correct puppetserver1001's TLS certificate for puppet following https://phabricator.wikimedia.org/T405580#11214327 [15:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:31] (03PS2) 10Tiziano Fogli: team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 (owner: 10Cathal Mooney) [15:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:05] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:57] (03CR) 10Bernard Wang: [C:03+1] Add wordmark logo to beta cluster for Minerva (mobile) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219865 (https://phabricator.wikimedia.org/T413217) (owner: 10Aude) [15:41:28] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [15:45:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [15:55:23] (03CR) 10Dzahn: [C:03+1] ":)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215231 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:05:11] (03PS1) 10Btullis: Update the 
kerberos principal used by spark in the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219886 (https://phabricator.wikimedia.org/T406833) [16:08:54] (03CR) 10Btullis: [C:03+2] Update the kerberos principal used by spark in the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219886 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:09:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:09:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:10:57] (03Merged) 10jenkins-bot: Update the kerberos principal used by spark in the analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219886 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [16:11:37] (03CR) 10CDanis: [C:03+1] team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 (owner: 10Cathal Mooney) [16:16:27] hey ops -- we have an UBN we need to backport a patch for -- on parsoid read view wikis the section links can get stuck in the postprocessing cache with the wrong user language, so folks on itwiki are sometimes seeing UX components in the wrong language [16:16:35] https://phabricator.wikimedia.org/T413227 [16:16:36] (03CR) 10Cathal Mooney: [C:03+2] team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 (owner: 10Cathal Mooney) [16:16:55] there's a one-liner fix for this in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1219884 that we'd like to backport to wmf.7 [16:17:52] (03Merged) 
10jenkins-bot: team-netops: increase duration before alerting on saturation [alerts] - 10https://gerrit.wikimedia.org/r/1219870 (owner: 10Cathal Mooney) [16:18:25] 06SRE, 06Infrastructure-Foundations: git::clone can fail to checkout its remote branch, leading to unrecoverable failure - https://phabricator.wikimedia.org/T413193#11477280 (10dancy) @fnegri wrote in T373815: For future reference, I suspect there was a failure in the bash command substitution that is used to... [16:19:07] federico3, fabfur: are you the SREs on duty? ^ for a backport I'd like to ensure SREs know about. [16:19:37] cscott: yes we are [16:20:15] i can spiderpig deploy the backport, but its out of window so i wanted to make sure you all knew what was going on. [16:28:55] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 63.54 ms [16:44:32] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS bookworm [16:45:53] (03PS1) 10Sbisson: Fix section loading on desktop [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 [16:49:28] Hi SREs, requesting permission to deploy a small but very important Content Translation fix. [16:52:17] (03PS1) 10C. Scott Ananian: Ensure that user interface language is "used" by postprocessing pipeline [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219904 (https://phabricator.wikimedia.org/T413227) [16:57:58] (03PS1) 10Ahmon Dancy: git::clone: Get default branch name a different way [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) [16:59:38] stephanebisson: CTT is also waiting to deploy a fix to prod, but zuul is being slow so if you're ready you're welcome to go first. [16:59:51] federico3 and fabfur are the SREs on duty [17:02:04] Thanks cscott. Can I get a thumbs up from federico3 or fabfur? 
[17:02:50] (PS1) Mforns: Update edit-analytics to use the new image [deployment-charts] - https://gerrit.wikimedia.org/r/1219909 (https://phabricator.wikimedia.org/T405041) [17:04:54] I'm not quite sure what's the process for deciding here, being Friday afternoon. Perhaps doing the deployments on Monday would be an option? [17:07:12] for CTT we're getting invalid cache contents accumulating in the parser cache, so we'd need to either do a config deploy to back out the config change we deployed yesterday, or do the code deploy to fix it. but waiting until monday isn't really an option. [17:09:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [17:09:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [17:11:41] https://wikitech.wikimedia.org/wiki/Deployments/Emergencies says the template is: [17:11:50] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1219904 -- context is T413227, are SRE ok with a deployment? (cc: thcipriani hashar federico3 fabfur). I already have someone to deploy. [17:11:51] T413227: Section edit links appear in wrong user interface language on Parsoid Read View wikis - https://phabricator.wikimedia.org/T413227 [17:12:57] (oh, dancy is the weekly train conductor, i put hashar instead) ^ [17:13:14] o/ [17:13:26] For CX we would prefer to do it today to limit the exposure of the bug but we can wait for Monday. [17:14:39] cscott, it merged. [17:16:43] (CR) Santiago Faci: [C:+2] Update edit-analytics to use the new image [deployment-charts] - https://gerrit.wikimedia.org/r/1219909 (https://phabricator.wikimedia.org/T405041) (owner: Mforns) [17:17:14] cscott: if we are confident that deploying the fix introduces less risk of breakage than not deploying I'd be ok with deploying it now but pending fabfur for his confirmation [17:17:43] cscott: thanks for the heads up.
[17:18:22] now waiting for confirmation from fabfur? [17:18:48] (Merged) jenkins-bot: Update edit-analytics to use the new image [deployment-charts] - https://gerrit.wikimedia.org/r/1219909 (https://phabricator.wikimedia.org/T405041) (owner: Mforns) [17:20:47] (PS1) Cwhite: prometheus/k8s-pods-metrics: drop mediawiki_action_api_modules_latency metric [puppet] - https://gerrit.wikimedia.org/r/1219911 (https://phabricator.wikimedia.org/T410152) [17:21:03] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [17:21:15] the emergency deploy procedure is getting positive confirmation from SRE by messaging SREs on call. Sounds like federico3 is ok, but would like a second opinion. Do I have that right federico3 ? [17:21:16] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [17:21:19] (PS2) Cwhite: prometheus/k8s-pods-metrics: drop mediawiki_action_api_modules_latency metric [puppet] - https://gerrit.wikimedia.org/r/1219911 (https://phabricator.wikimedia.org/T410152) [17:21:42] yes please wait for confirmation from fabfur [17:24:28] (CR) Cwhite: [C:+2] prometheus/k8s-pods-metrics: drop mediawiki_action_api_modules_latency metric [puppet] - https://gerrit.wikimedia.org/r/1219911 (https://phabricator.wikimedia.org/T410152) (owner: Cwhite) [17:29:46] +1 for me (sorry read it now, it's late here) [17:30:42] ok, thanks. i can spiderpig it myself. [17:30:43] (CR) Scott French: [C:+1] "Thanks, Moritz!" [puppet] - https://gerrit.wikimedia.org/r/1219873 (owner: Muehlenhoff) [17:32:00] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219904 (https://phabricator.wikimedia.org/T413227) (owner: C.
Scott Ananian) [17:35:43] (Merged) jenkins-bot: Ensure that user interface language is "used" by postprocessing pipeline [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219904 (https://phabricator.wikimedia.org/T413227) (owner: C. Scott Ananian) [17:36:03] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1219904|Ensure that user interface language is "used" by postprocessing pipeline (T413227)]] [17:36:07] T413227: Section edit links appear in wrong user interface language on Parsoid Read View wikis - https://phabricator.wikimedia.org/T413227 [17:36:39] SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11477686 (MatthewVernon) This has continued to proceed today: ` Dec 19 17:26:37 ms-be2081 container-sync:... [17:38:03] !log cscott@deploy2002 cscott: Backport for [[gerrit:1219904|Ensure that user interface language is "used" by postprocessing pipeline (T413227)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:40:19] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 7397.74 ms [17:40:27] RECOVERY - Host wikikube-worker1053 is UP: PING WARNING - Packet loss = 0%, RTA = 558.23 ms [17:41:05] !log cscott@deploy2002 cscott: Continuing with sync [17:41:09] testing looks good [17:43:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:45:10] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219904|Ensure that user interface language is "used" by postprocessing pipeline (T413227)]] (duration: 09m 07s) [17:45:14] T413227: Section edit links appear in wrong user interface language on Parsoid Read View wikis - https://phabricator.wikimedia.org/T413227 [17:45:52] ok, CTT is done. thanks fabfur federico3 thcipriani dancy ! [17:46:09] thanks! [17:46:17] thanks all <3 [17:48:20] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for bearloga - https://phabricator.wikimedia.org/T413142#11477751 (kzimmerman) Approved as Mikhail's manager [18:02:08] (PS1) Mforns: Bump up page-analytics version to include doc improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1219921 (https://phabricator.wikimedia.org/T405041) [18:09:10] (CR) Santiago Faci: [C:+2] Bump up page-analytics version to include doc improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1219921 (https://phabricator.wikimedia.org/T405041) (owner: Mforns) [18:09:50] (CR) Marostegui: [C:+1] mariadb::packages_client: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219874 (owner: Muehlenhoff) [18:11:06] (Merged) jenkins-bot: Bump up page-analytics version to include doc improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1219921
(https://phabricator.wikimedia.org/T405041) (owner: Mforns) [18:11:19] (PS1) Cwhite: prometheus/k8s-pods-metrics: move drop job to metric_relabel_configs [puppet] - https://gerrit.wikimedia.org/r/1219923 (https://phabricator.wikimedia.org/T410152) [18:15:00] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [18:15:00] (CR) Cwhite: [C:+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1219923/7842/" [puppet] - https://gerrit.wikimedia.org/r/1219923 (https://phabricator.wikimedia.org/T410152) (owner: Cwhite) [18:15:13] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [18:15:46] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [18:16:01] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [18:16:09] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [18:16:21] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [18:17:36] (CR) Dzahn: [C:+2] Phabricator: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219871 (owner: Muehlenhoff) [18:18:01] (CR) Dzahn: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219871 (owner: Muehlenhoff) [18:20:32] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1219871/52/" [puppet] - https://gerrit.wikimedia.org/r/1219871 (owner: Muehlenhoff) [18:24:21] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1219875 (owner: Muehlenhoff) [18:27:10] (CR) Dzahn: [C:-1] "there is an issue with tests for doc hosts here: profile::doc on debian-10-x86_64 is expected to contain Package[php7.4-fpm]" [puppet] - https://gerrit.wikimedia.org/r/1219875 (owner: Muehlenhoff) [18:28:07] (CR) Dzahn: [C:+1] "thanks rzl" [puppet] - https://gerrit.wikimedia.org/r/1219189
(https://phabricator.wikimedia.org/T397017) (owner: Krinkle) [18:28:18] (Restored) Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: Dzahn) [18:37:00] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1219180/7843/" [puppet] - https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: Dzahn) [19:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:15:57] (CR) Ahmon Dancy: [C:+1] "Thank you dzahn!" [puppet] - https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: Dzahn) [19:16:06] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: Ahmon Dancy) [19:19:12] (CR) Dr0ptp4kt: trafficserver: Send /evt-502b/v2/events to intake-analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: Milimetric) [19:23:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:25:02] (CR) Dzahn: [V:+1] "I am waiting a moment to see if we have a replacement instance on newer distro that is in the process of being created..
seeing how that g" [puppet] - https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: Dzahn) [19:27:19] PROBLEM - Host lvs3008 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:53] RECOVERY - Host lvs3008 is UP: PING OK - Packet loss = 0%, RTA = 78.87 ms [19:28:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:39:06] (CR) Dr0ptp4kt: trafficserver: Send /evt-502b/v2/events to intake-analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: Milimetric) [20:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:13:42] (PS1) Thcipriani: Beta: update mx host ip [mediawiki-config] - https://gerrit.wikimedia.org/r/1219939 [20:14:52] (PS2) Thcipriani: Beta: update mx host ip [mediawiki-config] - https://gerrit.wikimedia.org/r/1219939 (https://phabricator.wikimedia.org/T412975) [20:47:25] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [20:50:47] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2249 to codfw - jhancock@cumin1003" [20:50:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2249 to codfw - jhancock@cumin1003" [20:50:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:54:53] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2249 [20:55:04] !log jhancock@cumin1003 END
(PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2249 [20:56:18] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:42] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:59:11] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:59:33] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:00:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:00:28] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:01:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:01:23] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:02:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:02:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:03:07] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:03:18] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:07:53] ops-codfw, SRE, Data-Persistence, DC-Ops:
Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11478521 (Jhancock.wm) @elukey i hit an error running the provisioning script on this one. Could you take a look at it when you have time? Not sure what i missed on it. It is a new... [21:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:00] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11478609 (Jhancock.wm) a:Jhancock.wm→ayounsi patch completed [21:21:48] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11478643 (Jhancock.wm) [21:43:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:52:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:57:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit
failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:01:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:30] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11478804 (Marostegui) Just to double check: this is being provisioned with UEFI right? [22:29:19] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11478955 (Jhancock.wm) yes. it's the default now [22:35:38] (CR) Dzahn: [V:+1] "there is a newer deploy-mx instance now that Tyler created - https://phabricator.wikimedia.org/T412975#11478073 - so this should not be ne" [puppet] - https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: Dzahn) [22:59:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:04:10] FIRING: [3x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:09:10] RESOLVED: [3x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:26:13] (PS2) ArielGlenn: Remove the old non-fido-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219654 (https://phabricator.wikimedia.org/T413019) [23:27:22] (CR) ArielGlenn: [C:+2] Remove the old non-fido-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219654 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [23:39:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:44:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown