[00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:08:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377
[00:08:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377 (owner: TrainBranchBot)
[00:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:31:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:32:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377 (owner: TrainBranchBot)
[00:36:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:54:50] (CR) Btullis: [C:+1] opensearch-operator: fix pod security settings [deployment-charts] - https://gerrit.wikimedia.org/r/1189320
(https://phabricator.wikimedia.org/T362978) (owner: Bking)
[00:56:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:00:44] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:12:30] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 46s)
[01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:55:48] (PS1) KartikMistry: Update Recommendation API to 2025-09-15-194552-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189380 (https://phabricator.wikimedia.org/T404223)
[03:03:18] (PS1) KartikMistry: Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008)
[03:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:41:21] SRE: Update Wikitech "Search Console Data" doc to align with current ITS-first request process - https://phabricator.wikimedia.org/T404927#11192414 (nshahquinn-wmf) Open→Resolved a:nshahquinn-wmf I was actually coincidentally working on search console documentation, so I've gone ahead and mad...
[03:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:56:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:03:17] FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:08:17] RESOLVED: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:37] RECOVERY - mysqld processes on es2027 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:16:43] RECOVERY - MariaDB read only es3 on es2027 is OK: Version 10.11.13-MariaDB-log, Uptime 7s, read_only: True, event_scheduler: True, 4.15 QPS, connection latency: 0.028914s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:19:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning
[05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:32:49] (PS4) Arnaudb: gerrit: toggle mod_qos log_only off [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611)
[05:32:49] (CR) Arnaudb: "I'll send a notice on IRC and slack before merging this" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[05:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:06] FIRING: [2x]
SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:54:58] Deploying cxserver..
[05:56:16] (CR) KartikMistry: [C:+2] Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[05:57:56] (Merged) jenkins-bot: Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[05:59:56] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0600).
[06:00:21] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:04:36] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:05:10] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:05:20] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning
[06:05:21] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2027.codfw.wmnet onto es2050.codfw.wmnet
[06:05:29] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:06:02] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:08:23] !log Updated cxserver to 2025-09-16-161231-production (T394008, T404567, T404298, T404181)
[06:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:32] T394008: CXServer doesn't support section suggestions for "be-tarask" language code - https://phabricator.wikimedia.org/T394008
[06:08:33] T404567: Post-creation work for tokwiki - https://phabricator.wikimedia.org/T404567
[06:08:35] T404298: Can't translate en:Tokyo in Gujarati - https://phabricator.wikimedia.org/T404298
[06:08:35] T404181: When templatedata is missing cxserver fails to extract template params from template source code - https://phabricator.wikimedia.org/T404181
[06:12:23] (CR) Brouberol: opensearch-operator: fix pod security settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: Bking)
[06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:37] !log jynus@cumin1003 dbctl commit (dc=all):
'Depool es2027 T404940', diff saved to https://phabricator.wikimedia.org/P83420 and previous config saved to /var/cache/conftool/dbconfig/20250918-063436-jynus.json
[06:34:42] T404940: es2027 database unhealthy - https://phabricator.wikimedia.org/T404940
[06:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:16] (CR) Muehlenhoff: [C:+2] Apply installserver role to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189169 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[06:38:54] (CR) Majavah: [C:+2] backy2: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188825 (owner: Majavah)
[06:45:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[06:46:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:46:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:33] (PS1) Muehlenhoff: homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487)
[06:56:22] (PS1) Slyngshede: Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691)
[07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning
backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0700)
[07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] o/
[07:00:13] I can deploy
[07:04:21] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[07:05:10] (Merged) jenkins-bot: cirrus: Reduce galleries weight in search on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[07:06:00] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]]
[07:06:04] (PS2) Majavah: P:toolforge::prometheus: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188829
[07:06:05] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:06:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:27] (CR) Giuseppe Lavagetto: [C:+2] Add deprecations to varnish [puppet] - https://gerrit.wikimedia.org/r/1180712 (https://phabricator.wikimedia.org/T398161) (owner: Giuseppe Lavagetto)
[07:12:10] !log dcausse@deploy1003 dcausse, ebernhardson: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:12:15] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:16:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:16:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:17:06] !log dcausse@deploy1003 dcausse, ebernhardson: Continuing with sync
[07:18:55] (CR) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[07:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:20:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:46] (PS4) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576)
[07:22:21] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] (duration: 16m 20s)
[07:22:25] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:26:09] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:27:37] (CR) Muehlenhoff:
[C:+2] Update DHCP server in eqiad to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189170 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:28:53] (CR) Filippo Giunchedi: [C:+1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/1188829 (owner: Majavah)
[07:29:00] (CR) Majavah: [C:+2] P:toolforge::prometheus: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188829 (owner: Majavah)
[07:32:15] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6980/co" [puppet] - https://gerrit.wikimedia.org/r/1188827 (owner: Majavah)
[07:32:45] (CR) Majavah: [V:+1 C:+2] P:toolforge::checker: Remove absent checks [puppet] - https://gerrit.wikimedia.org/r/1188827 (owner: Majavah)
[07:34:33] (CR) Majavah: [V:+2 C:+2] "ignoring typos false positive" [puppet] - https://gerrit.wikimedia.org/r/1188828 (owner: Majavah)
[07:36:12] (PS1) Majavah: openstack: Drop obsolete linuxbridge config files [puppet] - https://gerrit.wikimedia.org/r/1189393
[07:36:12] (PS1) Majavah: P:openstack: nova: Drop obsolete settings [puppet] - https://gerrit.wikimedia.org/r/1189394
[07:38:03] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6981/console" [puppet] - https://gerrit.wikimedia.org/r/1189394 (owner: Majavah)
[07:40:34] (PS1) Majavah: O:aptly::server: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1189395 (https://phabricator.wikimedia.org/T399076)
[07:40:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:41:26] FIRING: [2x] SystemdUnitFailed:
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:05] (CR) Majavah: [C:+2] O:aptly::server: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1189395 (https://phabricator.wikimedia.org/T399076) (owner: Majavah)
[07:42:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:02] (CR) Slyngshede: "Right now the nda group isn't sync'ed because it's not listed as one of the groups Netbox needs. We did talk about it at a previous Infras" [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede)
[07:46:41] (CR) Slyngshede: [C:+2] Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:48:36] (CR) Muehlenhoff: [C:+2] Point webproxy in eqiad to install1005 [dns] - https://gerrit.wikimedia.org/r/1189173 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:48:39] (CR) Jelto: [C:+1] "lgtm but we should closely monitor metrics and user reports. I recall cloning repos over https opens several connections.
So we should mak" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[07:48:50] !log jmm@dns1004 START - running authdns-update
[07:49:20] (Merged) jenkins-bot: Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:50:02] !log jmm@dns1004 END - running authdns-update
[07:50:41] (CR) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[07:55:53] (CR) Arnaudb: [C:+2] "100%! I'll progressively rollout from spare to primary with puppet-agent disabled" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[07:57:07] (CR) Muehlenhoff: [C:+2] Update the proxies used by cloudcumin to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189171 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:08:37] (CR) Elukey: [C:+1] homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:09:21] (CR) Muehlenhoff: [C:+2] homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:12:56] (PS1) Slyngshede: IDM: Failover for 0.1.13 upgrade [dns] - https://gerrit.wikimedia.org/r/1189434
[08:19:25] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: SwitchCoreInterfaceDown (instance ssw1-f1-codfw:9804) - https://phabricator.wikimedia.org/T404946 (LSobanski) NEW
[08:21:22] (CR) Novem Linguae: "As a recently added volunteer NDA, I find these IDP-protected tools a lot like a second Wikitech.
There's great info in some of them, and " [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede)
[08:25:51] (CR) Elukey: "I am super sorry to review this only now, thanks a lot for the patch :) I totally understand that this is a poc and it needs more refineme" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1173425 (https://phabricator.wikimedia.org/T397696) (owner: CDanis)
[08:31:00] (PS1) Gmodena: admin: add sk-ssh-ed25519 key for gmodena [puppet] - https://gerrit.wikimedia.org/r/1189435
[08:32:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:33:14] (CR) Slyngshede: [C:+2] IDM: Failover for 0.1.13 upgrade [dns] - https://gerrit.wikimedia.org/r/1189434 (owner: Slyngshede)
[08:33:35] !log slyngshede@dns1004 START - running authdns-update
[08:34:53] !log slyngshede@dns1004 END - running authdns-update
[08:36:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed