[00:05:25] FIRING: [2x] SystemdUnitFailed: man-db.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:23] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631267 (phaultfinder)
[00:27:27] (CR) Ssingh: [C:+1] "(Trusting the v6 ones with your script!)" [puppet] - https://gerrit.wikimedia.org/r/1127134 (https://phabricator.wikimedia.org/T382017) (owner: Cathal Mooney)
[00:29:04] (CR) Ssingh: [C:+1] Add delegations for aux-k8s POD ranges in codfw [dns] - https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: Cathal Mooney)
[00:38:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127178
[00:38:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127178 (owner: TrainBranchBot)
[00:39:11] (CR) Ssingh: "Deferring this to Jelto who is CCed here as I don't have the full context."
[dns] - https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:39:49] (CR) Ssingh: "IN" [dns] - https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:40:09] (CR) Ssingh: "Superfluous comment, please ignore." [dns] - https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:41:38] (CR) Ssingh: "Adding to the above comments: in short, set up the service first as defined in https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_bala" [dns] - https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:45:08] (CR) Ssingh: "Sorry, one final comment: Keith and I tried to deploy this but didn't finish it; see T381417" [dns] - https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[00:51:27] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1127178 (owner: TrainBranchBot)
[00:56:15] (CR) Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: Andrea Denisse)
[00:56:45] (PS6) Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553)
[00:57:08] (CR) CI reject: [V:-1] grafana: Normalize user fields and validate input in LDAP sync [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: Andrea Denisse)
[01:07:40] (PS7) Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553)
[01:08:59] (PS1)
TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127184
[01:09:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127184 (owner: TrainBranchBot)
[01:13:55] !log Manually fixing 5 bad abuse_filter_log rows in mediawikiwiki for T388732
[01:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:01] T388732: TypeError: MediaWiki\Extension\AbuseFilter\AbuseFilterPermissionManager::canSeeLogDetailsForFilter(): Argument #2 ($privacyLevel) must be of type int, null given, called in /srv/mediawiki/php-1.44.0-wmf.20/extensions/AbuseFilte - https://phabricator.wikimedia.org/T388732
[01:30:05] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1127184 (owner: TrainBranchBot)
[01:38:43] (CR) BCornwall: [C:+1] Add delegations for aux-k8s POD ranges in codfw (1 comment) [dns] - https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: Cathal Mooney)
[01:39:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631417 (phaultfinder)
[01:45:19] (CR) BCornwall: "Looks good, verified with https://noc.wikimedia.org/dbconfig/eqiad.json" [dns] - https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/4c42e1e9a91d8e3f9cb0312049adc201a26c44f906035b87859c668567c38cc1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:07] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[02:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:07:23] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:11:20] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[02:41:19] (PS2) Dzahn: create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns [dns] - https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199)
[02:44:53] SRE, Infrastructure-Foundations, Kubernetes, Patch-For-Review, SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10631457 (Dzahn)
[02:53:25] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[02:55:23] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release
[02:56:36] !log
dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release
[02:58:53] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release
[03:00:40] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release
[03:06:16] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release
[03:10:53] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release
[03:19:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631513 (phaultfinder)
[03:20:45] (PS1) DLynch: Follow-up Ia4b9f65b6: Fix argument order passed to EditCheckFactory#create [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722)
[03:37:26] ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10631522 (Papaul) @elukey @Jclark-ctr because those servers were failing, I tested restbase1043 and I am getting the error below. ` RuntimeError: Error while...
[03:44:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631525 (phaultfinder)
[04:00:25] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:25] FIRING: [8x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:55] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.029e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[05:21:56] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[05:26:25] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10631553 (phaultfinder)
[05:40:00] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[05:41:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:42:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[05:58:18] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[06:13:00] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release
[06:16:14] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10631578 (Marostegui) Thank you @VRiley-WMF!
[06:17:00] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:17:37] (CR) Marostegui: wmnet: update CNAME records for DB masters to eqiad (1 comment) [dns] - https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[06:19:06] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:19:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:20:06] (CR) Marostegui: [C:-1] wmnet: update CNAME records for DB masters to eqiad (1 comment) [dns] - https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[06:23:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74207 and previous config saved to /var/cache/conftool/dbconfig/20250313-062341-root.json
[06:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631594 (phaultfinder)
[06:27:06] (PS1) Marostegui: installserver: Do not reimage db1254 [puppet] - https://gerrit.wikimedia.org/r/1127399
[06:30:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:42] (CR) Marostegui: [C:+2] installserver: Do not reimage db1254 [puppet] - https://gerrit.wikimedia.org/r/1127399 (owner: Marostegui)
[06:33:40]
SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10631602 (DSantamaria)
[06:34:11] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10631603 (DSantamaria) Regenerated, thanks @BCornwall
[06:35:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74208 and previous config saved to /var/cache/conftool/dbconfig/20250313-063624-root.json
[06:38:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74209 and previous config saved to /var/cache/conftool/dbconfig/20250313-063846-root.json
[06:40:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:42:59] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 1459 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[06:46:31] (PS1) Kevin Bazira: ml-services: fix image tags for article-country and articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1127410
(https://phabricator.wikimedia.org/T385970)
[06:51:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74210 and previous config saved to /var/cache/conftool/dbconfig/20250313-065129-root.json
[06:53:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74211 and previous config saved to /var/cache/conftool/dbconfig/20250313-065351-root.json
[06:55:25] FIRING: [9x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631608 (phaultfinder)
[07:05:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74212 and previous config saved to /var/cache/conftool/dbconfig/20250313-070636-root.json
[07:08:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74213 and previous config saved to /var/cache/conftool/dbconfig/20250313-070857-root.json
[07:17:14] (PS1) Brouberol: airflow-test-k8s: render /etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1127417 (https://phabricator.wikimedia.org/T386282)
[07:17:15] (PS1) Brouberol: airflow-main: render
/etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1127418 (https://phabricator.wikimedia.org/T386282)
[07:19:07] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1200-1208].eqiad.wmnet
[07:21:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74214 and previous config saved to /var/cache/conftool/dbconfig/20250313-072141-root.json
[07:22:44] (CR) Abijeet Patro: AX: Add quick survey for MinT for Wikireaders (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: Abijeet Patro)
[07:22:50] (PS6) Abijeet Patro: AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886)
[07:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74215 and previous config saved to /var/cache/conftool/dbconfig/20250313-072403-root.json
[07:24:59] (CR) Muehlenhoff: [C:+1] "Looks good.
But since this changes the sudo rules for a permission group, it first needs approval in the next SRE IF team meeting (on Mond" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[07:28:56] (CR) Krinkle: [C:+2] fatal-error: Ensure action=cache max-age is higher than response time [mediawiki-config] - https://gerrit.wikimedia.org/r/1127164 (owner: Krinkle)
[07:29:03] (CR) Krinkle: fatal-error: Ensure action=cache max-age is higher than response time [mediawiki-config] - https://gerrit.wikimedia.org/r/1127164 (owner: Krinkle)
[07:29:31] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127164 (owner: Krinkle)
[07:29:46] (Merged) jenkins-bot: fatal-error: Ensure action=cache max-age is higher than response time [mediawiki-config] - https://gerrit.wikimedia.org/r/1127164 (owner: Krinkle)
[07:30:38] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1127164|fatal-error: Ensure action=cache max-age is higher than response time]]
[07:32:15] (PS1) Muehlenhoff: Record LDAP access for astein [puppet] - https://gerrit.wikimedia.org/r/1127454
[07:33:57] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1127164|fatal-error: Ensure action=cache max-age is higher than response time]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:34:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631631 (phaultfinder)
[07:35:02] !log krinkle@deploy2002 krinkle: Continuing with sync
[07:36:54] (CR) Muehlenhoff: [C:+2] Record LDAP access for astein [puppet] - https://gerrit.wikimedia.org/r/1127454 (owner: Muehlenhoff)
[07:39:52] (CR) Vgutierrez: site,hiera: Reimage lvs6001 as liberica [puppet] - https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) (owner:
Vgutierrez)
[07:41:51] !log depool lvs6001 before being reimaged - T384477
[07:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:55] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[07:42:07] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127164|fatal-error: Ensure action=cache max-age is higher than response time]] (duration: 11m 28s)
[07:42:19] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs6001.drmrs.wmnet with reason: depooled before reimage
[07:44:49] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:44:50] SRE, Fundraising-Backlog, fundraising-tech-ops, LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10631652 (MoritzMuehlenhoff) Our central SSO for Wikimedia web services (idp.wikimedia.org, running on Apereo CAS) uses LDAP...
[07:45:51] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6001 as liberica [puppet] - https://gerrit.wikimedia.org/r/1127062 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[07:48:10] (CR) Volans: "Sorry for the intrusion, saw this passing by in my inbox and had a question."
[puppet] - https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: Andrea Denisse)
[07:48:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:05] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS4265006001/IPv6: Idle - asw1-b12-drmrs, AS4265006001/IPv4: Idle - asw1-b12-drmrs https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:50:05] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6001.drmrs.wmnet with OS bookworm
[07:50:38] ^^ BGP alert is the lvs6001 reimage
[07:51:05] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:43] (PS1) Volans: sre.switchdc.databases: fix help message [cookbooks] - https://gerrit.wikimedia.org/r/1127455
[07:57:00] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:57:00] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:58:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[07:58:22] (CR) Marostegui: [C:+1] sre.switchdc.databases: fix help message [cookbooks] - https://gerrit.wikimedia.org/r/1127455 (owner: Volans)
[08:00:04] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:05] ops-eqiad, SRE, Data-Persistence, DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10631677 (elukey) @Jclark-ctr feel free to ping me on IRC when you have some provisioning issues with Supermicros, I'll try to help when I am online! Please...
[08:02:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1034.eqiad.wmnet with OS bookworm
[08:02:13] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10631679 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1034.eqiad.wmnet with OS bookworm
[08:03:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:04:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631702 (phaultfinder)
[08:06:20] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6001.drmrs.wmnet with reason: host reimage
[08:07:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:09:39] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1204.eqiad.wmnet
[08:10:12] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on
lvs6001.drmrs.wmnet with reason: host reimage
[08:11:03] (CR) Volans: [C:+2] sre.switchdc.databases: fix help message [cookbooks] - https://gerrit.wikimedia.org/r/1127455 (owner: Volans)
[08:12:07] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1204.eqiad.wmnet
[08:14:30] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1204.eqiad.wmnet
[08:14:51] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:15:35] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1204.eqiad.wmnet
[08:18:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:20:19] Amir1, urbanecm, awight: Are any of you available for a namespace deployment during the current (empty) window, or should I wait for the next window?
[08:20:47] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs6001.drmrs.wmnet with OS bookworm
[08:21:43] (Merged) jenkins-bot: sre.switchdc.databases: fix help message [cookbooks] - https://gerrit.wikimedia.org/r/1127455 (owner: Volans)
[08:22:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[08:22:46] (CR) Ilias Sarantopoulos: [C:+1] ml-services: fix image tags for article-country and articlequality in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127410 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:23:32] (CR) Kevin Bazira: [C:+2] ml-services: fix image tags for article-country and articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1127410 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:24:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage
[08:25:04] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:25:12] (Merged) jenkins-bot: ml-services: fix image tags for article-country and articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1127410 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:25:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631731 (phaultfinder)
[08:26:40] (CR) Ilias Sarantopoulos: [C:+1] ml-services: fix image tags for article-country and articlequality in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1127410
(https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:28:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1034.eqiad.wmnet with reason: host reimage
[08:28:40] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[08:29:58] (CR) Ilias Sarantopoulos: [C:+1] inference-services: Deploy edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1127059 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[08:30:40] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on gerrit2003.wikimedia.org with reason: testing
[08:45:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1034.eqiad.wmnet with OS bookworm
[08:46:07] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10631777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1034.eqiad.wmnet with OS bookworm completed: - ganeti103...
[08:46:11] (CR) JMeybohm: [C:+1] prometheus: move remaining k8s instances to prometheus2007 [puppet] - https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[08:46:18] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs6001.drmrs.wmnet with OS bookworm
[08:47:32] (CR) JMeybohm: [C:+1] admin_ng: use the correct helm version for each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127011 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[08:48:28] (PS1) Volans: sre.hosts.reimage: puppetdb rollback fix [cookbooks] - https://gerrit.wikimedia.org/r/1127460
[08:48:35] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6001.drmrs.wmnet with reason: host reimage
[08:51:42] (CR) Vgutierrez: [C:+1] sre.hosts.reimage: puppetdb rollback fix [cookbooks] - https://gerrit.wikimedia.org/r/1127460 (owner: Volans)
[08:51:57] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6001.drmrs.wmnet with reason: host reimage
[08:55:51] (PS1) Volans: CHANGELOG: add changelogs for release v5.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/1127461
[08:56:06] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v5.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/1127461 (owner: Volans)
[08:57:53] (PS4) Ilias Sarantopoulos: api_gateway: add editcheck experimental to api-gw [deployment-charts] - https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269)
[09:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T0900)
[09:01:16] (PS1) Gergő Tisza: Enable SUL3 signup for everyone [mediawiki-config] - https://gerrit.wikimedia.org/r/1127462 (https://phabricator.wikimedia.org/T384218)
[09:02:48] !log
vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6001.drmrs.wmnet with OS bookworm [09:04:50] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1204.eqiad.wmnet [09:06:46] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1204.eqiad.wmnet [09:06:49] (03CR) 10Gkyziridis: [C:03+2] "Merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127059 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [09:08:14] (03Merged) 10jenkins-bot: inference-services: Deploy edit-check on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127059 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [09:09:33] (03PS15) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [09:10:04] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:10:48] (03PS1) 10Vgutierrez: hiera: Restore lvs6001 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1127464 (https://phabricator.wikimedia.org/T384477) [09:11:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127464 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:12:26] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:12:48] (03CR) 10Elukey: [C:03+1] sre.hosts.reimage: puppetdb rollback fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1127460 (owner: 10Volans) [09:13:49] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v5.1.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1127461 (owner: 10Volans) [09:15:16] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:16:50] (03CR) 10Elukey: [C:03+2] restbase: new hosts (refresh) restbase104[3-5] [puppet] - 10https://gerrit.wikimedia.org/r/1111717 (https://phabricator.wikimedia.org/T383673) (owner: 10Eevans) [09:16:54] (03PS1) 10Volans: Upstream release v5.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1127466 [09:17:24] (03CR) 10Volans: [C:03+2] Upstream release v5.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1127466 (owner: 10Volans) [09:17:54] (03CR) 10Volans: [C:03+2] sre.hosts.reimage: puppetdb rollback fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1127460 (owner: 10Volans) [09:20:49] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1204.eqiad.wmnet [09:22:07] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1204.eqiad.wmnet [09:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631853 (10phaultfinder) [09:24:35] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1043.eqiad.wmnet with OS bullseye [09:24:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10631854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host restbase1043.eqiad.wmnet with OS bullseye [09:25:56] (03Merged) 10jenkins-bot: sre.hosts.reimage: puppetdb rollback fix 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1127460 (owner: 10Volans) [09:29:21] (03CR) 10Volans: [C:03+1] "LGTM, just that one bit TBD inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [09:30:24] (03CR) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [09:32:09] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs6001 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1127464 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:35:33] (03Merged) 10jenkins-bot: Upstream release v5.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1127466 (owner: 10Volans) [09:36:16] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1043.eqiad.wmnet with reason: host reimage [09:37:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs6001.drmrs.wmnet} and A:liberica (T384477) [09:37:24] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [09:37:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs6001.drmrs.wmnet} and A:liberica (T384477) [09:39:02] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:39:04] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10631879 (10Vgutierrez) I've submitted https://gerrit.wikimedia.org/r/1127066 as a first step to switch the web interface of l... 
[09:40:02] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:40:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1043.eqiad.wmnet with reason: host reimage [09:40:30] (03PS1) 10Vgutierrez: cumin: Update (liberica|lvs)-drmrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/1127471 (https://phabricator.wikimedia.org/T384477) [09:42:51] !log uploaded cumin_5.1.0 to apt.wikimedia.org bullseye-wikimedia [09:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:25] (03CR) 10Stevemunene: [C:03+2] hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [09:45:58] (03PS1) 10Jon Harald Søby: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) [09:48:20] (03PS2) 10Jon Harald Søby: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) [09:52:51] (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [09:53:40] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [09:56:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [09:56:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1043.eqiad.wmnet with OS bullseye [09:56:20] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 
10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10631909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host restbase1043.eqiad.wmnet with OS bullseye completed: -... [09:56:50] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1044.eqiad.wmnet with OS bullseye [09:56:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10631910 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host restbase1044.eqiad.wmnet with OS bullseye [10:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T0900) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1000) [10:00:48] (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 [10:02:14] (03PS2) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 [10:04:25] (03CR) 10CI reject: [V:04-1] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 (owner: 10JMeybohm) [10:05:09] (03PS3) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 [10:05:43] (03CR) 10Brouberol: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [10:05:49] (03CR) 10Brouberol: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123688 
(https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [10:07:20] (03CR) 10CI reject: [V:04-1] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 (owner: 10JMeybohm) [10:08:28] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1044.eqiad.wmnet with reason: host reimage [10:08:51] 06SRE, 06Infrastructure-Foundations, 10Nagf: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770 (10cmooney) 03NEW p:05Triage→03Medium [10:09:28] (03PS4) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 [10:09:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [10:11:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1044.eqiad.wmnet with reason: host reimage [10:11:59] (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127474 (owner: 10JMeybohm) [10:12:34] (03PS1) 10Cathal Mooney: Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) [10:14:05] 06SRE, 06Infrastructure-Foundations, 10Nagf, 13Patch-For-Review: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10631948 (10cmooney) I tested the patch on netbox-next, though I had to rig the setup to replicate the scena... 
[10:15:46] (03PS1) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127476 (https://phabricator.wikimedia.org/T383845) [10:15:48] (03PS2) 10Hnowlan: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) [10:16:14] (03CR) 10Marostegui: [C:03+1] wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:16:23] (03CR) 10Hnowlan: wmnet: update CNAME records for DB masters to eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:16:53] 06SRE: Visual editor doesn't work on ca.wikipedia.org - https://phabricator.wikimedia.org/T388772#10631974 (10Peachey88) [10:17:37] (03CR) 10Marostegui: [C:03+1] wmnet: update CNAME records for DB masters to eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:18:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [10:20:08] (03CR) 10Volans: [C:03+1] "LGTM, but I'm not expert on all the cases we can have child objects in netbox, if it always makes sense to delete." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [10:20:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10631978 (10phaultfinder) [10:21:18] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_scholarly@codfw [10:21:21] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123688 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [10:22:24] (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127478 (https://phabricator.wikimedia.org/T383845) [10:23:16] (03PS2) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127476 (https://phabricator.wikimedia.org/T383845) [10:24:18] (03PS1) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127480 [10:24:18] (03PS9) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [10:25:41] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [10:25:53] (03PS10) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) [10:27:24] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:27:43] (03CR) 10Klausman: [C:03+1] 
api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [10:28:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:28:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal_scholarly@codfw [10:29:14] (03CR) 10Hnowlan: [C:03+1] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127476 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:29:19] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) [10:29:21] (03CR) 10Ilias Sarantopoulos: [C:03+2] api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [10:29:33] (03CR) 10Hnowlan: [C:03+1] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127478 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:30:45] (03Merged) 10jenkins-bot: api_gateway: add editcheck experimental to api-gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126985 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [10:31:03] (03PS1) 10Filippo Giunchedi: site: provision prometheus100[78] with role prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) [10:31:15] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_scholarly@eqiad 
[10:31:18] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123689 (https://phabricator.wikimedia.org/T387320) (owner: 10Vgutierrez) [10:31:26] (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1127480 (owner: 10JMeybohm) [10:31:31] (03CR) 10Filippo Giunchedi: "In preparation for eqiad new prometheus hardware" [puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [10:31:53] (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127484 [10:31:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [10:32:22] vgutierrez: okay to merge : hiera,wdqs: Enable IPIP on wdqs-internal-scholarly@eqiad [10:32:57] duh.. I thought I did it, please go ahead jayme [10:33:17] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 10863MiB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [10:33:25] (03CR) 10Hnowlan: [C:03+1] mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:47] jouncebot: refresh [10:33:48] I refreshed my knowledge about deployments. 
[10:33:52] jouncebot: now [10:33:52] For the next 0 hour(s) and 26 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T0900) [10:33:52] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1000) [10:33:56] (03CR) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [10:33:57] vgutierrez: ack, done [10:34:01] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:34:30] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:34:44] hashar: just making sure, any leftovers from the train? the 2 windows are overlapping for the time being [10:34:44] (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1127484 (owner: 10JMeybohm) [10:34:47] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: wdqs::internal_scholarly@eqiad [10:34:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::internal_scholarly@eqiad [10:35:35] hashar: excellent! 
[10:36:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [10:36:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [10:36:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1044.eqiad.wmnet with OS bullseye [10:36:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10632024 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host restbase1044.eqiad.wmnet with OS bullseye completed: -... [10:36:59] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1045.eqiad.wmnet with OS bullseye [10:37:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10632027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host restbase1045.eqiad.wmnet with OS bullseye [10:38:37] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:39:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:39:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::internal_scholarly@eqiad [10:40:09] (03CR) 10Effie Mouzeli: 
[C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127476 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:40:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127462 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [10:41:43] (03CR) 10Vgutierrez: [C:03+1] trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1125461 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [10:42:34] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127486 [10:46:16] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127486 (owner: 10PipelineBot) [10:47:40] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127486 (owner: 10PipelineBot) [10:48:16] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1045.eqiad.wmnet with reason: host reimage [10:48:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:48:59] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:49:15] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:50:48] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:51:09] (03CR) 10Hnowlan: [C:03+2] trafficserver: route citoid via rest-gateway for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1125461 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [10:51:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on restbase1045.eqiad.wmnet with reason: host reimage [10:54:16] (03CR) 10Stevemunene: [C:03+2] hdfs: Assign the right role to new hdfs workers 1[187-208] [puppet] - 10https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [10:56:58] (03PS2) 10Cathal Mooney: Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) [10:57:02] (03PS1) 10Klausman: APIGW: fix wrong host for LW staging service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127487 (https://phabricator.wikimedia.org/T388269) [10:57:36] (03CR) 10Ilias Sarantopoulos: [C:03+1] APIGW: fix wrong host for LW staging service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127487 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [10:58:09] 06SRE, 06Infrastructure-Foundations, 10Nagf, 13Patch-For-Review: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10632060 (10cmooney) Ok the second approach is also working as expected: ` 11 2025-03-13T10:55:26.702680+00... 
[10:58:31] (03CR) 10Cathal Mooney: "Right here it should, it ought to only reach this stage once all interfaces that actually exist on the device have been processed (so if c" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [10:59:08] (03CR) 10CI reject: [V:04-1] Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [10:59:45] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127478 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:00:24] jouncebot: refresh [11:00:26] I refreshed my knowledge about deployments. [11:00:39] (03PS3) 10Cathal Mooney: Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) [11:01:38] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127478 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:03:04] (03CR) 10Klausman: [C:03+2] APIGW: fix wrong host for LW staging service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127487 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [11:04:46] 06SRE, 06Infrastructure-Foundations, 10Nagf, 13Patch-For-Review: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10632073 (10cmooney) [11:04:51] (03Merged) 10jenkins-bot: APIGW: fix wrong host for LW staging service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127487 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [11:05:25] 
FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:40] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [11:05:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10632077 (10MoritzMuehlenhoff) [11:05:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10632078 (10phaultfinder) [11:06:14] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [11:08:19] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:08:35] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:12:06] (03CR) 10Nikerabbit: AX: Add quick survey for MinT for Wikireaders (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [11:15:44] !log jiji@deploy2002 Started scap sync-world: (T383845) mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 [11:15:48] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [11:17:48] (03PS1) 10Ladsgroup: Bump the thumbnail steps ratio to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127490 (https://phabricator.wikimedia.org/T360589) [11:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10632103 (10phaultfinder) [11:20:19] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters 
(exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [11:20:28] something is not going well with scap [11:20:36] expect errors until it rolls back [11:20:57] should we worry or just a normal failure? [11:21:12] there will be a little bit of blood [11:21:28] ok maybe more [11:23:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:24:37] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:41] volans ^ [11:25:15] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 41.17s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:25:37] ES overload [11:25:41] jynus: volans wait for scap to fail [11:25:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:25:53] I think there is something bad with the rollout [11:26:03] that causes a domino effect [11:26:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [11:26:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1045.eqiad.wmnet with OS bullseye [11:26:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10632115 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host restbase1045.eqiad.wmnet with OS bullseye completed: -... [11:27:23] taking IC [11:27:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10632116 (10elukey) @Jclark-ctr I've run provision with the new version of the cookbook that is still being tested (not yet released) and ran reimage, all good... [11:27:26] FIRING: [2x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:40] <_joe_> same explosion of ES, sighwaht [11:28:02] <_joe_> is it that job again? [11:28:10] _joe_: I think it is related to the deployment [11:28:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:28:24] <_joe_> yes, in this case I'd say it is [11:28:24] let's wait again for scap to roll back [11:28:38] <_joe_> what were you rolling to php 8.1? [11:28:46] <_joe_> things seem to be recovering now [11:28:52] mw-parsoid, mw-jobrunner, and -int [11:28:53] !log jiji@deploy2002 scap failed: 'production' (scap version: 4.140.0) (duration: 13m 16s) [11:29:00] how does *just* canaries blow up es [11:29:09] <_joe_> effie: ok let's do -parsoid and -int [11:29:18] <_joe_> claime: was it just canaries? 
[11:29:29] <_joe_> effie: even better, one at a time [11:29:43] <_joe_> claime: I suspect there's some bug that triggers when everything's on 8.1 [11:29:53] _joe_: scap should roll to canaries, test, then rolls forward [11:30:00] _joe_: also, only canaries alert [11:30:01] let me check scap log, 'production' [11:30:15] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 43.57s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:30:19] <_joe_> effie: I don't think that's important rn [11:30:32] <_joe_> effie: has scap rolled back? [11:30:40] 200K errors from overloading es [11:30:46] not ongoing [11:30:51] FIRING: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:30:54] finished at 11:24 [11:30:55] <_joe_> jynus: meaning the errors stopped? [11:30:58] yes [11:31:01] ok [11:31:20] <_joe_> but we still have high throughput on es AFAICT [11:31:33] mw is not reporting any errors [11:31:34] es however seems high loaded (degraded) but not hard down [11:31:36] <_joe_> effie: once the rollback is complete [11:31:46] so things are ongoing, but we are not in an outage [11:31:56] <_joe_> please roll restart changeprop-jobqueue [11:32:00] scap has exited, and looking at the graphs, we are back in the pre-scap state [11:32:01] <_joe_> I have a suspicion here [11:32:05] https://grafana.wikimedia.org/goto/3xtLcO2Ng?orgId=1 [11:32:08] _joe_: via k8s? 
[11:32:12] <_joe_> effie: yes [11:32:14] restart the pods [11:32:15] ok [11:32:17] <_joe_> yes [11:32:22] on it [11:32:28] <_joe_> my suspicion is that some job blows up when running on php 8.1 [11:32:46] <_joe_> and restarting changeprop resets its persistent connections [11:33:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:33:16] <_joe_> so what "solved" the issue when amir reduced the concurrency of that job, was actually us restarting changeprop [11:33:27] <_joe_> uhm things seem far from ok? [11:33:35] effie: helmfile -e $CLUSTER --state-values-set roll_restart=1 sync [11:33:40] for roll restart [11:33:41] claime: yes [11:33:44] tx [11:34:10] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [11:34:37] RESOLVED: [2x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:37] <_joe_> effie: if I'm right, this means we'll have to migrate jobs one by one until we find the culprit :/ [11:35:03] <_joe_> so let's hope I'm wrong [11:35:03] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [11:35:11] _joe_: noooooo [11:35:14] _joe_: there is one way to do so [11:35:15] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 6.894s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:35:20] <_joe_> effie: there is [11:35:25] I can leave jobrunners out [11:35:32] and see what gives [11:35:37] Then we burn them to the stake [11:35:46] !log 
jiji@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [11:35:46] They're haunted [11:35:50] <_joe_> so "good news" [11:35:51] RESOLVED: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:36:01] <_joe_> I don't think the restart in eqiad did anything [11:36:08] <_joe_> let's see codfw, where most jobs are produced [11:36:10] ok let's move this to -sre [11:36:13] too much nois [11:36:14] e [11:36:36] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [11:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:37:26] FIRING: [2x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:31] <_joe_> do we really have no visibility in what queries happen on ES? [11:37:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:37:47] <_joe_> marostegui, Amir1 your help diagnosing what is happening there would be appreciated [11:37:55] _joe_: the queeries are just fetch blob [11:38:04] they are point queries that all look the same [11:38:13] _joe_: anything on trace ? 
[11:38:22] which is "give me one row" so there is very little info on the dbs [11:38:40] <_joe_> claime: not looked rn [11:38:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:47] _joe_: yep [11:38:48] yeah, as I said several times already, calls to ES go through sql blob store [11:38:49] the outage comes from having each db 10K concurrent connections [11:38:57] which strips away all that information [11:39:06] is this the same thing as yesterday? [11:39:17] <_joe_> marostegui: apparently [11:39:29] we saw category membership change job in traces [11:39:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:38] <_joe_> you can at least see which pods make more queries? [11:39:45] Amir1: can you try to decrease it again? which was what worried me yesterday as we never found the issue [11:39:48] should I bring the concurrency down further? 
[11:39:52] yeah, I can [11:40:05] (03PS1) 10Ilias Sarantopoulos: ml-services: apply inference batching on reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127494 (https://phabricator.wikimedia.org/T387019) [11:40:10] But this is the thing I mentioned, that we may reach a point that it has to be 0 if we don't find the root cause [11:40:45] (03CR) 10Fabfur: [C:03+1] cumin: Update (liberica|lvs)-drmrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/1127471 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:40:58] (03PS1) 10Ladsgroup: changeprop-jobqueue: Reduce CategoryMembership change job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127495 [11:41:03] marostegui: Amir1 _joe_ let's speak on -sre [11:41:07] sure [11:41:13] (03PS2) 10Ilias Sarantopoulos: ml-services: apply inference batching on reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127494 (https://phabricator.wikimedia.org/T387019) [11:41:26] (03CR) 10Marostegui: [C:03+1] changeprop-jobqueue: Reduce CategoryMembership change job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127495 (owner: 10Ladsgroup) [11:41:39] (03CR) 10Ladsgroup: [C:03+2] changeprop-jobqueue: Reduce CategoryMembership change job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127495 (owner: 10Ladsgroup) [11:41:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:42:04] !incidents [11:42:04] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [11:42:07] !incidents [11:42:08] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [11:42:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:42:26] RESOLVED: [2x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:26] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:43:15] FIRING: [6x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:43:18] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [11:43:25] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:43:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:43:34] (03Merged) 10jenkins-bot: changeprop-jobqueue: Reduce CategoryMembership change job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127495 (owner: 10Ladsgroup) [11:43:39] !log rolling restarting mw-api-int [11:43:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:20] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:44:28] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1127471 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:45:06] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:45:21] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:45:48] (03PS1) 10Effie Mouzeli: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127496 [11:46:06] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:46:25] (03PS1) 10Effie Mouzeli: Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3)" [puppet] - 10https://gerrit.wikimedia.org/r/1127497 [11:46:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:46:56] !incidents [11:46:56] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [11:46:58] (03CR) 10Effie Mouzeli: [C:03+2] Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127496 (owner: 10Effie Mouzeli) [11:47:05] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:47:15] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:47:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:26] (03CR) 10Effie Mouzeli: [C:03+2] Revert "hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3)" [puppet] - 10https://gerrit.wikimedia.org/r/1127497 (owner: 10Effie Mouzeli) [11:47:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:48:02] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:48:15] FIRING: [6x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:48:53] (03Merged) 10jenkins-bot: Revert "mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127496 (owner: 10Effie Mouzeli) [11:50:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:51:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:52:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:26] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:29] 
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:52:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [11:52:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:52:54] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:53:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:55:01] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [11:55:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 41.81s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:56:02] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old dns entries for lvs6xxx vlan sub-int IPs - cmooney@cumin1002" [11:56:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old dns entries for lvs6xxx vlan sub-int IPs - cmooney@cumin1002" [11:56:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [11:56:14] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:56:31] (03CR) 10Cathal Mooney: [C:03+2] Add delegations for aux-k8s POD ranges in codfw [dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [11:56:42] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:56:48] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:56:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:56:53] !log cmooney@dns2005 START - running authdns-update [11:57:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:58:06] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:58:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:58:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - 
https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [11:58:43] !log cmooney@dns2005 END - running authdns-update [11:58:51] !incidents [11:58:52] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [11:58:52] 5731 (UNACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [11:58:55] !ack 5731 [11:58:56] 5731 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [11:58:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:02] !incidents [11:59:02] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [11:59:02] 5731 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [11:59:02] 5732 (UNACKED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [11:59:06] !ack 5732 [11:59:07] 5732 (ACKED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [11:59:59] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1200) [12:01:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:02:01] !incidents [12:02:02] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [12:02:02] 5731 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [12:02:02] 5732 
(ACKED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [12:03:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:03:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:14] (03PS1) 10Hnowlan: mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127498 [12:04:25] effie: https://gerrit.wikimedia.org/r/1127498 [12:04:37] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:06:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:07:00] (03CR) 10Effie Mouzeli: [C:03+2] mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127498 (owner: 10Hnowlan) [12:07:46] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:07:51] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:08:15] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for 
Mediawiki mw-api-int/canary at codfw: 21.88% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:08:26] (03Merged) 10jenkins-bot: mw-api-int: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127498 (owner: 10Hnowlan) [12:08:45] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:09:14] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:09:49] (03CR) 10Vgutierrez: [C:03+2] haproxy: Don't set h2 initial-window-size on haproxy 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/1125393 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [12:12:26] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10632253 (10phaultfinder) [12:14:56] (03PS1) 10Ladsgroup: changeprop-jobqueue: Fully disable categorymembership job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127500 [12:17:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:18:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:19:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [12:19:44] Deployment mw-parsoid.codfw.main in mw-parsoid at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-parsoid&var-deployment=mw-parsoid.codfw.main - ... [12:19:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:21:36] !incidentes [12:21:38] !incidents [12:21:38] 5730 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [12:21:39] 5733 (ACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [12:21:39] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [12:21:39] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [12:22:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:22:16] (03PS1) 10Effie Mouzeli: mw-jobrunner: remove PHP 8.1 pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127504 [12:23:26] (03PS2) 10Gergő Tisza: Enable SUL3 signup for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127462 (https://phabricator.wikimedia.org/T384218) [12:23:26] (03PS1) 10Gergő Tisza: Set $wgSul3RolloutUserPercentage on some testwikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1127505 (https://phabricator.wikimedia.org/T384153) [12:24:26] (03CR) 10Clément Goubert: [C:03+1] mw-jobrunner: remove PHP 8.1 pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127504 (owner: 10Effie Mouzeli) [12:24:37] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:40] (03CR) 10Effie Mouzeli: [C:03+2] mw-jobrunner: remove PHP 8.1 pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127504 (owner: 10Effie Mouzeli) [12:25:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127505 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [12:25:34] (03PS1) 10Effie Mouzeli: Revert "mw-api-int: bump replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127511 [12:26:04] (03Merged) 10jenkins-bot: mw-jobrunner: remove PHP 8.1 pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127504 (owner: 10Effie Mouzeli) [12:26:57] <_joe_> wait everyone for the backport window [12:27:04] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [12:27:05] <_joe_> we're in an ongoing issue [12:27:26] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - 
https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:28:25] !log installing tiff security updates [12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:31] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [12:31:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:32:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [12:32:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:32:54] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:33:15] RESOLVED: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 8.333% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:33:27] 07sre-alert-triage, 06serviceops: Alert in need of triage: Postgres Replication Lag (instance maps-test2002) - https://phabricator.wikimedia.org/T388782 (10LSobanski) 03NEW [12:34:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[12:34:44] Deployment mw-parsoid.codfw.main in mw-parsoid at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-parsoid&var-deployment=mw-parsoid.codfw.main - ... [12:34:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:35:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 17.76s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:37:26] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:43] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:41:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:41:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:41:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:42:26] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:43:43] PROBLEM - Disk space on kafka-logging1004 is CRITICAL: DISK CRITICAL - free space: /srv 156191 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-logging1004&var-datasource=eqiad+prometheus/ops [12:44:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) (owner: 10Jon Harald Søby) [12:44:37] RESOLVED: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:55] (03PS2) 10Effie Mouzeli: Revert "mw-api-int: bump replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127511 [12:47:36] (03CR) 10Ladsgroup: [C:03+2] changeprop-jobqueue: Fully disable categorymembership job [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1127500 (owner: 10Ladsgroup) [12:48:58] (03Merged) 10jenkins-bot: changeprop-jobqueue: Fully disable categorymembership job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127500 (owner: 10Ladsgroup) [12:49:03] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:49:33] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:49:39] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:50:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722) (owner: 10DLynch) [12:50:31] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:50:39] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:51:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:51:44] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:54:58] (03PS1) 10Lucas Werkmeister (WMDE): Reapply "Make WikibaseQualityConstraints use split-graph query service" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127516 (https://phabricator.wikimedia.org/T374021) [12:56:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127516 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister 
(WMDE)) [12:56:33] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10632525 (10Sreejithk2000) Mathew, I have been using SplitFileHistory.js to split the file histories of overwritten files for a while now. I have... [12:56:54] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10632528 (10cmooney) FWIW should also mention here the info in the below slide deck: https://www.lacnic.net/innovaportal/file/3207/1/lacnog2018-douglasfischer_anal... [12:57:59] (03CR) 10Lucas Werkmeister (WMDE): "port numbers should still be up to date according to https://gerrit.wikimedia.org/g/operations/puppet/+/32db81064e/hieradata/common/profil" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127516 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister (WMDE)) [12:58:33] (03CR) 10Cathal Mooney: [C:03+2] Add cloud IPv6 ranges to Capirca IP block definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1126035 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:59:11] (03Merged) 10jenkins-bot: Add cloud IPv6 ranges to Capirca IP block definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1126035 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:59:40] (03CR) 10Gkyziridis: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127494 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:00:55] PROBLEM - SSH on gerrit2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:00] (03CR) 10Hashar: tox: simplify tox configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [13:08:45] RECOVERY - SSH on gerrit2003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:11:38] (03CR) 10Brouberol: Add initial configmaps for mediawiki-dumps-legacy (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [13:14:13] (03CR) 10Brouberol: Add initial configmaps for mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [13:15:01] (03PS1) 10Cathal Mooney: cr-labs: remove term allowing in from cloud vrf to cloudcontrol [homer/public] - 10https://gerrit.wikimedia.org/r/1127518 (https://phabricator.wikimedia.org/T269457) [13:15:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10632633 (10phaultfinder) [13:17:28] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: apply inference batching on reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127494 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:18:55] (03Merged) 10jenkins-bot: ml-services: apply inference batching on reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127494 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [13:19:16] (03CR) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [13:21:06] (03PS1) 10Hashar: gerrit: ban bad crawler [puppet] - 10https://gerrit.wikimedia.org/r/1127520 [13:21:12] (03CR) 10Abijeet Patro: AX: Add quick survey for MinT for Wikireaders (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [13:21:31] (03PS7) 10Abijeet Patro: AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) [13:21:54] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:22:41] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:22:44] Can we continue with deployments on k8s or is there still work in flight for the incident? [13:22:55] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:23:13] (03CR) 10Arnaudb: [C:03+1] gerrit: ban bad crawler [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [13:23:34] nemo-yiannis: the incident is still ongoing [13:23:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10632688 (10MoritzMuehlenhoff) [13:23:44] ok, thanks jynus [13:26:24] (03Abandoned) 10Cathal Mooney: cr-labs: remove term allowing in from cloud vrf to cloudcontrol [homer/public] - 10https://gerrit.wikimedia.org/r/1127518 (https://phabricator.wikimedia.org/T269457) (owner: 10Cathal Mooney) [13:27:43] (03PS5) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) [13:33:44] (03CR) 10Hashar: "I am confused by how we can ban an IP entirely. Historically we have used Apache to 403 the bad actors which is what this patch is doing." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [13:34:43] (03PS1) 10Cathal Mooney: Cloud-in: Add specific term allowing ICMPv6 from cloud-transports [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) [13:35:26] (03PS2) 10Cathal Mooney: Cloud-in: Add specific term allowing ICMPv6 from cloud-transports [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) [13:35:26] (03CR) 10Majavah: Cloud-in: Add specific term allowing ICMPv6 from cloud-transports (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [13:35:56] (03CR) 10Cathal Mooney: Cloud-in: Add specific term allowing ICMPv6 from cloud-transports (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [13:40:14] (03PS1) 10Arnaudb: nftables: add a newline at the end of GERRIT_ABUSERS_ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) [13:41:59] (03PS2) 10Arnaudb: nftables: add a newline at the end of GERRIT_ABUSERS_ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) [13:44:13] !log bking@cumin2002 conftool action : set/pooled=no; selector: service=cloudelastic,name=cloudelastic1012.eqiad.wmnet [13:45:06] (03PS1) 10Ladsgroup: Revert "changeprop-jobqueue: Fully disable categorymembership job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127528 [13:45:21] (03CR) 10Ladsgroup: [C:03+2] Revert "changeprop-jobqueue: Fully disable categorymembership job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127528 (owner: 10Ladsgroup) [13:45:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10632804 (10phaultfinder) [13:45:36] !log bking@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [13:45:46] (03CR) 10Ladsgroup: Revert "changeprop-jobqueue: Fully disable categorymembership job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127528 (owner: 10Ladsgroup) [13:46:13] (03CR) 10Brouberol: Add initial configmaps for mediawiki-dumps-legacy (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [13:46:33] (03PS3) 10Filippo Giunchedi: pontoon: expand and reformat README.md [puppet] - 10https://gerrit.wikimedia.org/r/1126915 [13:46:49] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1012.eqiad.wmnet with OS bullseye [13:47:11] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125234 (owner: 10Bking) [13:47:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:28] (03PS2) 10Bking: cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125234 [13:47:39] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125234 (owner: 10Bking) [13:47:41] (03CR) 10Bking: [V:03+2 C:03+2] cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125234 (owner: 10Bking) [13:47:42] (03CR) 10Btullis: Add initial configmaps for mediawiki-dumps-legacy (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [13:49:00] (03CR) 10Herron: "thanks for having a look and writing this patch 👍. 
the codfw aux cluster is being built at the moment, but should be ready for these soon." [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [13:50:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [13:50:48] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:55] (03CR) 10Eevans: [C:03+1] pontoon: expand and reformat README.md [puppet] - 10https://gerrit.wikimedia.org/r/1126915 (owner: 10Filippo Giunchedi) [13:52:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:52:21] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: expand and reformat README.md [puppet] - 10https://gerrit.wikimedia.org/r/1126915 (owner: 10Filippo Giunchedi) [13:52:40] (03CR) 10Alexandros Kosiaris: "Yup, it was resolved yesterday due to having new facts, per https://puppet-compiler.wmflabs.org/. 
I noticed that while investigating which" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [13:55:41] (03CR) 10Alexandros Kosiaris: [C:03+1] "I can pick this up and shepherd it in production after the Monday meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [13:55:48] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:31] (03CR) 10Brouberol: [C:03+1] "Let's try that!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [14:00:08] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1400). [14:00:09] tgr, Jhs, Kemayo, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:25] 👋 [14:00:43] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@554407c]: T362615 [14:00:46] T362615: ETL pipeline for flaggedrevs metrics (pending frevs hourly) - https://phabricator.wikimedia.org/T362615 [14:00:49] hiya [14:01:06] o/ [14:01:18] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@554407c]: T362615 (duration: 01m 39s) [14:01:49] I’d say let’s start with Kemayo, because train blocker? [14:02:01] Works for me! 
[14:02:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722) (owner: 10DLynch) [14:02:43] o/ [14:03:51] (03CR) 10Lucas Werkmeister (WMDE): "deployments on hold for now" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722) (owner: 10DLynch) [14:03:55] sorry, deployments are on hold for now [14:03:58] per -sre [14:04:30] Ah, I thought they were done with that incident. Curses. [14:04:34] :/ [14:04:53] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [14:05:42] !log restarting parsoid on codfw [14:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:06:19] It’s the semi-ironic situation where an editing outage is blocking deployment of a separate editing fix. [14:06:51] (03CR) 10Volans: Ensure child interfaces of physicals are removed before physical (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [14:07:16] Kemayo: is that the visual editor issue? [14:07:31] kamila_: yes [14:08:02] It should only be affecting group 0/1 at the moment, fortunately. 
[14:08:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:09:00] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw enable bgp - https://phabricator.wikimedia.org/T388586#10632946 (10herron) 05Open→03Resolved a:03herron Great -- Just re-set BGP true on the aux-k8s-(ctrl|worker)2* nodes in netbox and the homer... [14:09:14] !incidents [14:09:14] 5734 (UNACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [14:09:14] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [14:09:14] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [14:09:15] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [14:09:15] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [14:09:18] !ack 5734 [14:09:19] 5734 (ACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [14:09:45] (03CR) 10Tiziano Fogli: [C:03+1] site: provision prometheus100[78] with role prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:09:50] (03PS1) 10Ilias Sarantopoulos: admin_ng: increase cpu resource_quota for revision-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127530 (https://phabricator.wikimedia.org/T387019) [14:10:52] Kemayo: we (SRE) are hoping to be able to deploy later today, I'll make sure we get the fix out if possible [14:11:01] (03CR) 10Cathal Mooney: Ensure child interfaces of physicals are removed before 
physical (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [14:11:16] kamila_: thanks! [14:11:55] Kemayo: at least we now already know how long the gate-and-submit will take once we can deploy ^^ [14:12:02] ca. 2½ minutes seemingly, nice [14:12:22] !log installing gnutls security updates [14:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:21] (03CR) 10Herron: [C:03+1] prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:19:32] (03CR) 10Herron: [C:03+1] "🧹🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:19:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:20:37] (03CR) 10Herron: [C:03+1] hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:21:28] (03CR) 10Herron: [C:03+1] site: provision prometheus100[78] with role prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:21:45] (03CR) 10Volans: [C:03+1] "If you've tested on netbox-next and works on the specific lvs but also a normal host LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [14:24:21] (03PS1) 10Stevemunene: hdfs: Fix an-worker regex to include more hosts [puppet] - 10https://gerrit.wikimedia.org/r/1127532 (https://phabricator.wikimedia.org/T388512) [14:24:39] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host ms-be2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:25:37] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127418 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:26:25] (03CR) 10Federico Ceratto: "1) This was a bit unexpected and confused me." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:26:25] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127417 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:26:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [14:26:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10633018 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [14:27:22] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [14:27:31] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: render /etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127417 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:27:36] (03CR) 10Brouberol: [C:03+2] airflow-main: render /etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127418 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:28:59] (03Merged) 10jenkins-bot: airflow-test-k8s: render /etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127417 
(https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:29:04] (03Merged) 10jenkins-bot: airflow-main: render /etc/refinery/event_intake_service_urls.yaml in task pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127418 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [14:29:16] (03CR) 10Btullis: [C:03+2] Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [14:29:23] <_joe_> Lucas_WMDE: please deploy [14:29:27] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:29:30] ack [14:29:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722) (owner: 10DLynch) [14:29:42] thanks _joe_, effie and others! [14:29:45] Kemayo: ^ fyi [14:30:03] (03CR) 10Nikerabbit: [C:03+1] AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [14:30:28] (03CR) 10Herron: "thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/1127151 (https://phabricator.wikimedia.org/T381417) (owner: 10Cathal Mooney) [14:30:35] (03Merged) 10jenkins-bot: Add initial configmaps for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127111 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [14:30:57] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:31:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:31:39] tgr_: after that I’d do both of yours together if that’s okay [14:31:45] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1012.eqiad.wmnet with OS bullseye [14:31:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:32:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:32:12] (03Merged) 10jenkins-bot: Follow-up Ia4b9f65b6: Fix argument order passed to EditCheckFactory#create [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127208 (https://phabricator.wikimedia.org/T388722) (owner: 10DLynch) [14:32:25] (03CR) 10Cathal Mooney: [C:03+2] Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [14:32:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:32:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1127208|Follow-up Ia4b9f65b6: Fix argument order passed to EditCheckFactory#create (T388722)]] [14:32:48] T388722: Error: No class registered by that key: null - https://phabricator.wikimedia.org/T388722 [14:33:37] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 
30.30 ms [14:33:45] Lucas_WMDE: I can test it, or it's pretty good to roll out if you'd rather. [14:33:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:33:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:34:06] Kemayo: it’s not quite ready for testing yet ^^ [14:34:55] Kemayo: now you can test on k8s-mwdebug :) [14:35:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [14:35:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [14:35:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kemayo: Backport for [[gerrit:1127208|Follow-up Ia4b9f65b6: Fix argument order passed to EditCheckFactory#create (T388722)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:35:58] (03PS1) 10Volans: puppetdb: add inventory endpoint to the proxy [puppet] - 10https://gerrit.wikimedia.org/r/1127534 (https://phabricator.wikimedia.org/T372666) [14:37:48] Lucas_WMDE: Looks good! [14:37:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, kemayo: Continuing with sync [14:37:53] yay, thanks! [14:38:09] (03CR) 10CI reject: [V:04-1] puppetdb: add inventory endpoint to the proxy [puppet] - 10https://gerrit.wikimedia.org/r/1127534 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [14:38:24] Sorry, took me a minute to find a suitably valid edit to make in the main namespace on one of those wikis. 
😂 [14:39:15] :'D [14:39:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:28] Lucas_WMDE: neither of my patches needs testing, feel free to batch them with whatever [14:40:30] (03PS2) 10Volans: puppetdb: add inventory endpoint to the proxy [puppet] - 10https://gerrit.wikimedia.org/r/1127534 (https://phabricator.wikimedia.org/T372666) [14:40:41] ack [14:40:53] might as well combine them with Jhs then [14:41:39] if they come back, that is [14:42:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:43:24] (03CR) 10Bking: [C:03+2] Require opensearch package to be installed before configuring [puppet] - 10https://gerrit.wikimedia.org/r/1126653 (https://phabricator.wikimedia.org/T387904) (owner: 10Ebernhardson) [14:43:36] FIRING: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:44:16] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127208|Follow-up Ia4b9f65b6: Fix argument order passed to EditCheckFactory#create (T388722)]] (duration: 11m 31s) [14:44:20] T388722: Error: No class registered by that key: null - https://phabricator.wikimedia.org/T388722 [14:44:37] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:42] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127462 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:45:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127505 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [14:45:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127516 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister (WMDE)) [14:45:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [14:46:25] (03Merged) 10jenkins-bot: Enable SUL3 signup for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127462 (https://phabricator.wikimedia.org/T384218) (owner: 10Gergő Tisza) [14:46:27] (03Merged) 10jenkins-bot: Set $wgSul3RolloutUserPercentage on some testwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127505 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [14:46:29] (03Merged) 10jenkins-bot: Reapply "Make WikibaseQualityConstraints use split-graph query service" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127516 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister (WMDE)) [14:47:00] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1127462|Enable SUL3 signup for everyone (T384218)]], [[gerrit:1127505|Set $wgSul3RolloutUserPercentage on some testwikis (T384153)]], [[gerrit:1127516|Reapply "Make WikibaseQualityConstraints use split-graph query service" (T374021)]] [14:47:05] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [14:47:06] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [14:47:06] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [14:48:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [14:49:43] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10633174 (10TheDJ) Where do we find SplitFileHistory.js ? [14:50:00] !log lucaswerkmeister-wmde@deploy2002 tgr, lucaswerkmeister-wmde: Backport for [[gerrit:1127462|Enable SUL3 signup for everyone (T384218)]], [[gerrit:1127505|Set $wgSul3RolloutUserPercentage on some testwikis (T384153)]], [[gerrit:1127516|Reapply "Make WikibaseQualityConstraints use split-graph query service" (T374021)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:08] testing… [14:50:27] (03CR) 10Btullis: [C:03+1] "Apologies for missing this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127532 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:50:44] !log restarting slapd on serpens/seaborgium to pick up gnutls updates [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] WBQC change seems to be working \o/ [14:50:56] !log lucaswerkmeister-wmde@deploy2002 tgr, lucaswerkmeister-wmde: Continuing with sync [14:51:27] (FTR, I skipped the kaawiki change based on a quick Telegram chat with Jhs, it’ll be rescheduled later) [14:52:27] (03PS3) 10JMeybohm: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [14:52:27] (03PS1) 10JMeybohm: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127536 (https://phabricator.wikimedia.org/T388390) [14:53:36] FIRING: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:54:25] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [14:54:40] !log restarting FPM on Phabricator to pick up gnutls security updates [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [14:56:26] (03CR) 10Stevemunene: [C:03+2] hdfs: Fix an-worker regex to include more hosts [puppet] - 10https://gerrit.wikimedia.org/r/1127532 (https://phabricator.wikimedia.org/T388512) (owner: 10Stevemunene) [14:56:52] (03CR) 10Elukey: [C:03+1] puppetdb: add inventory endpoint to the proxy [puppet] - 10https://gerrit.wikimedia.org/r/1127534 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [14:57:24] !log 
lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127462|Enable SUL3 signup for everyone (T384218)]], [[gerrit:1127505|Set $wgSul3RolloutUserPercentage on some testwikis (T384153)]], [[gerrit:1127516|Reapply "Make WikibaseQualityConstraints use split-graph query service" (T374021)]] (duration: 10m 24s) [14:57:29] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [14:57:30] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [14:57:30] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [14:57:48] !log UTC afternoon backport+config window done [14:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:54] just in time ^^ [14:58:36] (03Merged) 10jenkins-bot: Ensure child interfaces of physicals are removed before physical [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1127475 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [15:00:05] jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1500) [15:01:14] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:01:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:01:38] (03PS1) 10Btullis: Update the php version used for running mediawiki-dumps-legacy to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127539 (https://phabricator.wikimedia.org/T388707) [15:01:59] (03PS3) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) [15:02:12] (03PS2) 10Btullis: Update the php version used for running 
mediawiki-dumps-legacy to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127539 (https://phabricator.wikimedia.org/T388707) [15:03:29] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:03:36] RESOLVED: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:03:45] !incidents [15:03:45] 5734 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:03:45] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:03:46] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [15:03:46] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [15:03:46] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:03:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:05:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:26] 06SRE, 06Infrastructure-Foundations, 10Nagf, 13Patch-For-Review: LVS: Error with Netbox PuppetDB import script after device moved to Liberica and upgraded - https://phabricator.wikimedia.org/T388770#10633269 (10cmooney) 05Open→03Resolved [15:05:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10633273 (10phaultfinder) [15:06:08] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 
(https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:06:13] (03PS1) 10Klausman: api-gw: Add inference-staging.svc.codfw.wmnet/10.2.1.58 to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127540 (https://phabricator.wikimedia.org/T388269) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:08] (03PS1) 10Ilias Sarantopoulos: ml-services: increase ref-risk autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127541 (https://phabricator.wikimedia.org/T387019) [15:09:41] (03CR) 10Hnowlan: [C:03+1] api-gw: Add inference-staging.svc.codfw.wmnet/10.2.1.58 to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127540 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [15:09:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS bullseye [15:10:02] (03PS1) 10Majavah: Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 [15:10:25] (03CR) 10Klausman: [C:03+2] api-gw: Add inference-staging.svc.codfw.wmnet/10.2.1.58 to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127540 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [15:11:16] (03CR) 10Volans: [C:03+2] puppetdb: add inventory endpoint to the proxy [puppet] - 10https://gerrit.wikimedia.org/r/1127534 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [15:12:03] (03Merged) 10jenkins-bot: api-gw: Add inference-staging.svc.codfw.wmnet/10.2.1.58 to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127540 (https://phabricator.wikimedia.org/T388269) (owner: 10Klausman) [15:12:04] (03CR) 10Ahmon Dancy: "Thanks everyone!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:12:44] !incidents [15:12:44] You're not allowed to perform this action. [15:12:45] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:13:04] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:13:15] !incidents [15:13:16] 5734 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:13:16] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:13:16] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [15:13:16] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [15:13:17] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:13:17] nemo-yiannis: ^^ [15:13:24] thanks [15:13:49] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10633321 (10Sreejithk2000) https://commons.wikimedia.org/wiki/User:Sreejithk2000/SplitFileHistory.js [15:15:26] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:15:35] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:15:38] (03CR) 10Bking: [C:03+1] icinga: route relforge icinga alerts to data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1126486 (https://phabricator.wikimedia.org/T388270) (owner: 10Filippo Giunchedi) [15:16:08] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: route relforge icinga alerts to data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1126486 (https://phabricator.wikimedia.org/T388270) (owner: 10Filippo Giunchedi) [15:17:10] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:17:33] !log 
klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:21:25] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:21:34] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:21:45] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:22:25] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:23:28] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:23:55] (03CR) 10BryanDavis: [C:03+1] tox: simplify tox configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [15:24:04] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10633381 (10phaultfinder) [15:24:51] 15 [15:24:52] uff [15:30:26] (03CR) 10Brouberol: [C:03+1] Update the php version used for running mediawiki-dumps-legacy to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127539 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [15:30:32] (03PS2) 10Hashar: gerrit: ban bad crawler [puppet] - 10https://gerrit.wikimedia.org/r/1127520 [15:30:43] (03CR) 10Btullis: global_config: register the analytics wikireplica in external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:30:47] (03CR) 10Kamila Součková: [C:03+1] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127536 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [15:31:50] (03CR) 10Btullis: [C:03+2] Update the php version used for running mediawiki-dumps-legacy to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127539 
(https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [15:32:13] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10633440 (10TheDJ) So in quick succession generally what happens with the script is: - delete all current and old file versions - restore specific... [15:33:15] (03Merged) 10jenkins-bot: Update the php version used for running mediawiki-dumps-legacy to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127539 (https://phabricator.wikimedia.org/T388707) (owner: 10Btullis) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:52] (03CR) 10Kamila Součková: [C:03+2] admin_ng: use the correct helm version for each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127011 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:40:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:40:43] (03CR) 10Klausman: [C:03+2] ml-services: increase ref-risk autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127541 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:40:51] !incidents [15:40:51] 5735 (UNACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:40:52] 5734 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) 
[15:40:52] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [15:40:52] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [15:40:52] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [15:40:52] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:40:55] !ack 5735 [15:40:55] 5735 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [15:42:06] (03PS2) 10Ilias Sarantopoulos: admin_ng: increase cpu resource_quota for revision-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127530 (https://phabricator.wikimedia.org/T387019) [15:43:04] (03PS1) 10Volans: puppetdb: fix inventory query with quoted part [software/cumin] - 10https://gerrit.wikimedia.org/r/1127548 [15:43:34] PROBLEM - Hadoop DataNode on an-worker1191 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [15:43:56] (03PS1) 10Jsn.sherman: Revert "Add MP event stream for MassDelete workflows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127549 [15:43:57] (03Merged) 10jenkins-bot: admin_ng: use the correct helm version for each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127011 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:44:27] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[15:44:34] RECOVERY - Hadoop DataNode on an-worker1191 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [15:45:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from wikifeeds_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:46:55] (03Merged) 10jenkins-bot: ml-services: increase ref-risk autoscaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127541 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:47:59] (03PS2) 10Jsn.sherman: Revert "Add MP event stream for MassDelete workflows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127549 (https://phabricator.wikimedia.org/T382147) [15:48:10] !log klausman@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:48:21] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:48:26] (03CR) 10Klausman: [C:03+2] admin_ng: increase cpu resource_quota for revision-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127530 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:48:40] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[15:49:17] 06SRE: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809 (10Reedy) 03NEW [15:51:43] (03PS1) 10Reedy: certificates.yaml: Add pywikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) [15:52:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127549 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [15:52:12] FIRING: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:53:32] (03Merged) 10jenkins-bot: admin_ng: increase cpu resource_quota for revision-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127530 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:53:42] (03PS4) 10JMeybohm: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:53:57] (03PS2) 10Reedy: certificates.yaml: Add pywikipedia.org to non-canonical-redirect [puppet] - 10https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) [15:54:27] (03CR) 10Kamila Součková: [C:03+2] Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127536 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [15:54:31] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10633569 (10phaultfinder) [15:55:03] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:56:09] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:57:19] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:57:57] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:05] (03CR) 10Reedy: "Or does this want to go into non-canonical-redirect-7? I'm guessing there's some limit, but nothing mentioned in the file (nor does CI co" [puppet] - 10https://gerrit.wikimedia.org/r/1127551 (https://phabricator.wikimedia.org/T388809) (owner: 10Reedy) [16:08:41] (03CR) 10BCornwall: [C:03+1] wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [16:13:06] (03Merged) 10jenkins-bot: Update admin_ng fixtures to reflect puppet changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127536 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [16:14:20] (03CR) 10Ottomata: [C:03+1] Revert "Add MP event stream for MassDelete workflows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127549 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [16:15:40] (03PS13) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) 
[16:17:21] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127553 [16:18:30] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp[3073,3081].esams.wmnet} and A:cp for 9.2.9-1wm1 [16:18:43] (03CR) 10Ssingh: [C:03+1] "Looks good, nice work, let's merge this today." [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [16:19:09] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127553 (owner: 10PipelineBot) [16:19:24] 06SRE, 06Data-Persistence, 06DBA, 07Sustainability: Setup processlist monitoring for MySQL instances outside of public prometheus - https://phabricator.wikimedia.org/T388813 (10jcrespo) 03NEW [16:19:38] (03PS5) 10JMeybohm: shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:20:41] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127553 (owner: 10PipelineBot) [16:21:02] (03CR) 10CI reject: [V:04-1] shellbox-video: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [16:21:40] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10633728 (10BCornwall) > In my limited testing, I saw this error mostly in codfw... 
[16:21:56] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:22:18] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:23:41] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:23:45] 06SRE, 06Data-Persistence, 06DBA, 07Sustainability: Setup processlist monitoring for MySQL instances outside of public prometheus - https://phabricator.wikimedia.org/T388813#10633741 (10Marostegui) [16:24:16] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:24:28] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10633748 (10phaultfinder) [16:25:00] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:27:50] (03CR) 10Jforrester: [C:03+1] "CI usages migrated, good to go from our end. I think we're the only people outside SRE using this component." [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:28:27] (03PS3) 10Herron: add aux-k8s-codfw to environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 (https://phabricator.wikimedia.org/T381417) [16:28:32] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "good catch. This is the traditional surprise with all firewall engines." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [16:28:52] (03CR) 10Cathal Mooney: [C:03+2] Cloud-in: Add specific term allowing ICMPv6 from cloud-transports [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [16:29:35] (03Merged) 10jenkins-bot: Cloud-in: Add specific term allowing ICMPv6 from cloud-transports [homer/public] - 10https://gerrit.wikimedia.org/r/1127526 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [16:29:50] (03PS15) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) [16:29:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10633773 (10MoritzMuehlenhoff) >>! In T381576#10631074, @wiki_willy wrote: > Hi @MoritzMuehlenhoff - the normal hardware specs for Config C is actually 2x... 
[16:30:06] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp[3073,3081].esams.wmnet} and A:cp for 9.2.9-1wm1 [16:30:24] (03CR) 10Brouberol: [V:03+1] global_config: register the analytics wikireplica in external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:30:55] (03CR) 10Muehlenhoff: nftables: add a newline at the end of GERRIT_ABUSERS_ipv4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [16:33:12] (03PS16) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) [16:33:22] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10633796 (10MoritzMuehlenhoff) Can you please try updating the idrac, maybe it helps? If this fails, then maybe we'll simply have to live without idrac. We're migrating away from these servers, but they are really tri... 
[16:34:25] (03PS17) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) [16:34:41] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10633803 (10BCornwall) [16:35:48] (03CR) 10Herron: [C:03+2] add aux-k8s-codfw to environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:36:04] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5076/co" [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:36:56] !log installing gunicorn security updates [16:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] bd808: dancy: I have mediawiki-config patches that are solely touching tox.ini, I imagine I can get them merged, pull on deployment server and then don't need to deploy? [16:39:10] (03CR) 10Jforrester: "Nice work. >70% improvement in some file sizes!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [16:39:24] hashar: Yep, that should be fine. [16:39:30] I think for patches that only act on beta, scap is smart enough to skip [16:39:37] Correct. 
[16:39:40] cool I am doing it now :) [16:39:43] (03PS1) 10Scott French: mediawiki: enable udp2log forwarding on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127556 (https://phabricator.wikimedia.org/T388799) [16:39:47] (03CR) 10Hashar: [C:03+2] tox: remove never used "doc" environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127131 (owner: 10Hashar) [16:39:52] (03CR) 10Hashar: [C:03+2] tox: extend flake8 ignore list instead of overriding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127135 (owner: 10Hashar) [16:40:33] (03Merged) 10jenkins-bot: tox: remove never used "doc" environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127131 (owner: 10Hashar) [16:40:42] (03Merged) 10jenkins-bot: tox: extend flake8 ignore list instead of overriding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127135 (owner: 10Hashar) [16:40:56] (03CR) 10Hashar: [C:03+2] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [16:40:59] (03Merged) 10jenkins-bot: add aux-k8s-codfw to environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126568 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:41:04] (03CR) 10CI reject: [V:04-1] tox: simplify tox configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [16:41:11] (03PS2) 10JHathaway: puppetserver: add an option to set a git directory as private [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) [16:41:20] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [16:41:24] (03PS5) 10Hashar: tox: simplify tox configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 [16:41:24] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[16:41:32] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [16:41:34] !log restart swift-proxy on ms-fe2009 [16:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:59] jouncebot: nowandnext [16:42:00] For the next 0 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1600) [16:42:00] In 0 hour(s) and 18 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700) [16:42:00] In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700) [16:42:31] (03CR) 10Hashar: [C:03+2] tox: simplify tox configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [16:42:47] !log Upgrading cp3074 to Varnish 7 (T378737) [16:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:51] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [16:43:01] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3074.esams.wmnet [16:43:13] (03Merged) 10jenkins-bot: tox: simplify tox configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127132 (owner: 10Hashar) [16:43:25] (03CR) 10Effie Mouzeli: [C:03+1] mediawiki: enable udp2log forwarding on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127556 (https://phabricator.wikimedia.org/T388799) (owner: 10Scott French) [16:43:59] (03CR) 10Stoyofuku-wmf: "Does this still need to be backported, or is it no longer necessary given the less change was reverted?" 
[skins/Vector] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126550 (https://phabricator.wikimedia.org/T388475) (owner: 10Jforrester) [16:44:02] !log deployment server: rebased /srv/mediawiki-staging for 3 noop changes (d4e1c561e..a66406939) [16:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] (03PS1) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127557 (https://phabricator.wikimedia.org/T383845) [16:46:46] (03PS1) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127558 (https://phabricator.wikimedia.org/T383845) [16:46:57] (03PS5) 10Hashar: logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) [16:47:12] RESOLVED: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [16:47:44] (03CR) 10CI reject: [V:04-1] logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [16:48:08] FIRING: [6x] KubernetesCalicoDown: aux-k8s-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:48:20] (03PS6) 10Hashar: logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) [16:49:02] (03PS1) 10BCornwall: upgrade cp3074 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127559 
(https://phabricator.wikimedia.org/T378737) [16:49:27] (03PS2) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127558 (https://phabricator.wikimedia.org/T383845) [16:49:51] (03CR) 10Ssingh: [C:03+1] upgrade cp3074 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127559 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:49:56] (03PS2) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127557 (https://phabricator.wikimedia.org/T383845) [16:50:05] (03CR) 10Hashar: "When merging this change, one can also Code-Review +2 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1127110 which causes " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [16:50:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1127559 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:51:42] (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp3074 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127559 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:51:49] (03CR) 10Jforrester: [C:03+1] logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [16:52:41] (03CR) 10Scott French: [C:03+1] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127558 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [16:53:37] (03CR) 10Scott French: [C:03+1] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127557 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [16:54:49] (03CR) 10Hashar: "Scheduled for deployment in the [Thursday, March 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [16:56:42] (03Abandoned) 10Brouberol: global_config: register the analytics wikireplica in external services [puppet] - 10https://gerrit.wikimedia.org/r/1127538 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:57:45] (03CR) 10Pppery: "The reason for that being that the person who added these somehow did it manually without running the optimization tools that the instruct" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [16:57:55] (03PS3) 10Pppery: Rebuild logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) [16:58:36] (03CR) 10Elukey: [C:03+1] puppetdb: fix inventory query with quoted part [software/cumin] - 10https://gerrit.wikimedia.org/r/1127548 (owner: 10Volans) [16:58:45] jouncebot: refresh [16:58:46] I refreshed my knowledge about deployments. [16:59:09] (03CR) 10Jforrester: "Oh indeed, I meant more that it was good to spot that they weren't being run. The CI change to enforce this in future will avoid needing t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [17:00:05] bd808: May I have your attention please! Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700) [17:00:05] effie and swfrench-wmf: It is that lovely time of the day again! 
You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700). [17:01:10] PROBLEM - Webrequests Varnishkafka log producer on cp3074 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:01:17] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/1127558 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [17:01:39] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127557 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [17:03:25] nothing for my window today [17:03:39] (03PS1) 10Bking: deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) [17:04:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:05:10] RECOVERY - Webrequests Varnishkafka log producer on cp3074 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:05:35] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3074.esams.wmnet [17:05:58] (03PS2) 10Bking: deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) [17:06:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:06:25] FYI, we're paused while 
investigating an unrelated puppet issue. it is currently *not safe* to deploy. [17:07:06] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127557 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [17:08:56] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:09:33] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Taking scap lock while awaiting coordinated puppet change [17:10:08] ^ took the scap lock to prevent surprises [17:10:15] (03PS3) 10Bking: deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) [17:10:17] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [17:10:21] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[17:10:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:10:37] (03CR) 10CI reject: [V:04-1] deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:11:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [17:11:39] (03PS4) 10Bking: deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) [17:12:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [17:12:42] (03CR) 10Volans: [C:03+2] puppetdb: fix inventory query with quoted part [software/cumin] - 10https://gerrit.wikimedia.org/r/1127548 (owner: 10Volans) [17:15:07] !incidents [17:15:07] 5736 (ACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:15:07] 5735 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [17:15:07] 5734 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:15:08] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:15:08] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [17:15:08] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [17:15:08] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster 
rest-gateway eqiad) [17:15:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634002 (10phaultfinder) [17:16:32] !incidents [17:16:33] 5736 (ACKED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:16:33] 5735 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [17:16:33] 5734 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:16:33] 5733 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [17:16:34] 5730 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [17:16:34] 5732 (RESOLVED) ProbeDown sre (10.2.1.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 codfw) [17:16:34] 5731 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad) [17:16:36] (03Abandoned) 10Jsn.sherman: Revert "Add MP event stream for MassDelete workflows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127549 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [17:18:12] (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [17:20:15] (03PS1) 10Hnowlan: api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 [17:21:25] (03PS1) 10Bking: Revert "cloudelastic: migrate cloudelastic1012 to opensearch role" [puppet] - 10https://gerrit.wikimedia.org/r/1127565 [17:21:42] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cloudelastic: migrate cloudelastic1012 to opensearch role" [puppet] - 10https://gerrit.wikimedia.org/r/1127565 (owner: 10Bking) [17:21:52] (03CR) 10CI reject: [V:04-1] api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [17:22:20] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1127565 (owner: 10Bking) [17:24:03] effie: check experimental without hosts specified will run PCC on all hosts ^ [17:24:22] (03PS8) 10Vgutierrez: sre.loadbalancer: upgrade/restart cookbook for liberica [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) [17:24:32] probably needs Hosts: there or a specific host provided to PCC [17:24:32] sukhe: I assumed there was one, my bad [17:25:05] effie: yeah I would say this is a failure on the part of the model itself but maybe also intentional (though I see no use case for running it on all 2100 hosts) [17:26:21] if you use the web UI at https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/ directly, it will be smarter [17:26:35] as in "one host per regex in site.pp" if you supply no list or * [17:26:55] (03PS1) 10Reedy: PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false [extensions/TheWikipediaLibrary] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127566 (https://phabricator.wikimedia.org/T388073) [17:26:56] only advantage of check experimental is not having to manually go to that form [17:27:08] but this one is a disadvantage [17:27:21] (03PS2) 10Hnowlan: api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 [17:27:33] because then it often makes the compiler run out of disk and become unusable [17:28:33] (03CR) 10CI reject: [V:04-1] api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 (owner: 10Hnowlan) [17:30:13] (03Merged) 10jenkins-bot: puppetdb: fix inventory query with quoted part [software/cumin] - 10https://gerrit.wikimedia.org/r/1127548 (owner: 10Volans) [17:34:55] (03PS1) 10Dzahn: gerrit: drop traffic from abusive scraper IP [puppet] - 10https://gerrit.wikimedia.org/r/1127568 [17:35:49] (03PS2) 10Dzahn: gerrit: drop traffic from abusive scraper IP [puppet] - 
10https://gerrit.wikimedia.org/r/1127568 [17:36:15] (03CR) 10Dzahn: [C:03+2] gerrit: drop traffic from abusive scraper IP [puppet] - 10https://gerrit.wikimedia.org/r/1127568 (owner: 10Dzahn) [17:37:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1012* for ban host prior to reimage - bking@cumin2002 - T387904 [17:37:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1012* for ban host prior to reimage - bking@cumin2002 - T387904 [17:37:05] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [17:37:55] (03Abandoned) 10David Caro: wmcs.labstore: add some alerts for labstore [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [17:38:19] (03Abandoned) 10David Caro: Revert "toolforge k8s: add a PodSecurityPolicy to be used by buildpacks" [puppet] - 10https://gerrit.wikimedia.org/r/853539 (owner: 10David Caro) [17:38:54] (03Abandoned) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [17:39:11] (03Abandoned) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (https://phabricator.wikimedia.org/T313444) (owner: 10David Caro) [17:39:20] (03CR) 10Dzahn: "I blocked this IP using this method: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127568" [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [17:39:52] (03Abandoned) 10David Caro: puppet-enc: added some tests for the api [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro) [17:39:53] FYI (status update), we're still working on troubleshooting puppet issues on the deployment server, still holding the scap lock [17:40:09] (03CR) 10Dzahn: "nothing wrong about this patch except now it should not do anything extra and we would have an IP in 2 
places to maintain" [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [17:40:10] (03Abandoned) 10David Caro: puppet-enc: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 (owner: 10David Caro) [17:40:16] (03Abandoned) 10David Caro: puppet-enc: add tests to check if add_git_commit is called [puppet] - 10https://gerrit.wikimedia.org/r/875825 (owner: 10David Caro) [17:40:23] (03Abandoned) 10David Caro: puppet-enc: add tests for add_git_commit [puppet] - 10https://gerrit.wikimedia.org/r/875866 (owner: 10David Caro) [17:40:43] (03PS1) 10Reedy: SidebarBeforeOutputHookHandler::getItemId: Bail early if Title is null [extensions/ArticlePlaceholder] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127570 (https://phabricator.wikimedia.org/T388474) [17:40:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634107 (10phaultfinder) [17:42:16] (03CR) 10Dzahn: "fwiw, I just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127568 and nftables refreshed and I saw no issue" [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [17:42:33] jouncebot: nowandnext [17:42:33] For the next 0 hour(s) and 17 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700) [17:42:33] For the next 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1700) [17:42:33] In 0 hour(s) and 17 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1800) [17:43:11] (03CR) 10Dzahn: "oh yea, because that is 0.9.8 and not 1.0.6 as on the new host, gotcha!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [17:43:53] puppet issues on deployment hosts resolved [17:44:00] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Taking scap lock while awaiting coordinated puppet change (duration: 34m 27s) [17:46:50] (03Restored) 10Jforrester: Process strip markers recursively in split [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [17:46:56] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [17:47:03] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [17:47:06] (03PS2) 10Jforrester: Fixes to "Parsoid Fragment Support v2" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [17:47:08] !log jiji@deploy2002 Started scap sync-world: No-sync scap run to switch image flavours to PHP 8.1 - T383845 [17:47:11] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:47:15] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127571 [17:47:25] (03CR) 10Anzx: [C:03+1] "looks good, thanks for working on this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [17:47:28] FIRING: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cloudelastic1012:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:48:23] !log jiji@deploy2002 Stopping before sync operations [17:48:25] FIRING: [5x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
- https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:14] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [17:49:53] !log Upgrading cp3066 to Varnish 7 (T378737) [17:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:56] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [17:49:59] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [17:50:10] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [17:50:42] (03PS1) 10Reedy: CharInsert: If $data is null, bail out [extensions/CharInsert] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127572 (https://phabricator.wikimedia.org/T388820) [17:50:48] (03PS1) 10Reedy: Score: Handle parser passing $code of null and bail out [extensions/Score] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127573 (https://phabricator.wikimedia.org/T388821) [17:51:35] (03PS1) 10BCornwall: upgrade cp3066 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127575 (https://phabricator.wikimedia.org/T378737) [17:52:28] RESOLVED: SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cloudelastic1012:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:52:53] (03CR) 10Ssingh: [C:03+1] upgrade cp3066 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127575 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:53:08] RESOLVED: [6x] KubernetesCalicoDown: aux-k8s-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:53:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Relabel Relforge hosts to 
Elastic hosts - https://phabricator.wikimedia.org/T388133#10634169 (10VRiley-WMF) 05Open→03Resolved This is completed [17:54:03] (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp3066 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1127575 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:54:23] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:55:08] (03PS3) 10Hnowlan: api-gateway: p.age on high errors, alert on lower [alerts] - 10https://gerrit.wikimedia.org/r/1127564 [17:55:14] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:56:25] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:56:25] (03PS1) 10Reedy: FilterEvaluator::rmdoubles: Disable PCRE JIT for this call [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127582 (https://phabricator.wikimedia.org/T385452) [17:56:55] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T1800) [18:00:10] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:01:23] jeena: apologies, we're running a bit behind on the infra window due to an issue on the deployment server, could you please wait a couple of minutes before moving the train forward? [18:01:30] (03CR) 10CI reject: [V:04-1] Fixes to "Parsoid Fragment Support v2" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [18:01:42] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [18:01:44] np i am in a meeting rn [18:01:56] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [18:02:06] jeena: great, thank you! we'll keep you posted [18:02:14] thanks! 
[18:02:51] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:04:14] jeena: When it's all settled down, and the train can move... I wouldn't mind https://gerrit.wikimedia.org/r/q/branch:wmf/1.44.0-wmf.20+status:open+owner:reedy@wikimedia.org deploying to clear up some logspam over the weekend (I'm going out in the next half an hour or so... but can potentially do them myself later tonight too) [18:04:32] Reedy: before or after the train? [18:04:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634236 (10phaultfinder) [18:04:41] (03CR) 10Hnowlan: changeprop: Rollout more wikis for PCS/RESTBase sunset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: 10Jgiannelos) [18:05:17] (03CR) 10Hnowlan: changeprop: Rollout more wikis for PCS/RESTBase sunset (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: 10Jgiannelos) [18:05:37] jeena: shouldn't really make much difference, it's PHP 8.1 logspam more than anything [18:06:00] (03CR) 10Hnowlan: [C:03+1] changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: 10Jgiannelos) [18:06:01] But with the increased rollout, wanting to get ahead of them a little so we can continue to find new ones easily [18:06:22] Certainly the act of moving the train itself in .20 isn't going to make any of those worse [18:06:27] (ie not new code in .20) [18:06:39] okay, since there are a few of them I'll roll the train and then you can do those if you want [18:07:25] I'll be out in ~20 mins and not back till after the window [18:07:33] (03CR) 10JHathaway: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 
(https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [18:07:34] I can always do them after the backport window etc [18:07:37] that shouldn't be an issue :) [18:08:01] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [18:08:15] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [18:09:15] !log jiji@deploy2002 Started scap sync-world: scap run to deploy switch to PHP 8.1 images - T383845 [18:09:18] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:09:49] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [18:09:52] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [18:10:14] !log root@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [18:10:15] !log root@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [18:11:31] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [18:11:35] reedy: 👍 [18:13:25] FIRING: [5x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:30] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [18:17:38] !log jiji@deploy2002 Finished scap sync-world: scap run to deploy switch to PHP 8.1 images - T383845 (duration: 10m 28s) [18:17:42] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:18:25] FIRING: [5x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:58] (03PS3) 10Effie Mouzeli: Revert "mw-api-int: bump replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127511 [18:24:44] (03PS1) 10Ebernhardson: Drop elasticsearch from external services definition [puppet] - 10https://gerrit.wikimedia.org/r/1127588 [18:24:49] jeena: all clear on our end, and thank you for your patience. as yesterday, if there's time later in your window for me to apply some cleanups, that would be greatly appreciated, but no worries if not (I can find time later today) [18:26:29] thanks swfrench-wmf ! 
[18:26:42] (03Abandoned) 10Ebernhardson: Drop elasticsearch from external services definition [puppet] - 10https://gerrit.wikimedia.org/r/1127588 (owner: 10Ebernhardson) [18:26:56] (03CR) 10Ebernhardson: [C:03+1] "seems plausible" [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [18:27:43] There will probably be time after I roll train [18:28:20] (03CR) 10Bking: [C:03+2] deployment_server: remove elasticsearch external services config [puppet] - 10https://gerrit.wikimedia.org/r/1127560 (https://phabricator.wikimedia.org/T388607) (owner: 10Bking) [18:28:30] swfrench-wmf: BTW, if you want to roll mw-wikifunctions whenever, please go ahead; it's a trivial installation in practice, and we're watching it quite closely. [18:29:19] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127589 (https://phabricator.wikimedia.org/T386215) [18:29:21] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127589 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:29:37] James_F: thank you! I have a note to follow up with you on that and wanted to get your take. sounds like a "flag day" kind of switch (as long as we monitor it and coordinate with y'all) should be alright? [18:29:49] Sure, go for it. [18:30:03] awesome, I'll prep some patches and follow up with you [18:30:06] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127589 (https://phabricator.wikimedia.org/T386215) (owner: 10TrainBranchBot) [18:30:09] We'll definitely shout if there are any issues. 
:-) [18:39:16] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.20 refs T386215 [18:39:20] T386215: 1.44.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T386215 [18:39:27] (03PS1) 10Cathal Mooney: Support setting custom arp-policer on CR interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1127592 (https://phabricator.wikimedia.org/T384774) [18:39:28] (03PS1) 10Reedy: ApiLogin: Don't break BotPasswords if password or user is blank, just error [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127591 (https://phabricator.wikimedia.org/T388255) [18:44:54] 10ops-magru, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10634362 (10cmooney) FWIW the router CPU is still fine with the arp policer set to 2MB size, which is how high I had to go before it stopped i... [18:46:03] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127595 (https://phabricator.wikimedia.org/T385571) [18:48:55] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127595 (https://phabricator.wikimedia.org/T385571) (owner: 10Ebernhardson) [18:49:08] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.029e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [18:50:00] (03CR) 10Subramanya Sastry: [C:04-1] "We aren't cherry-picking this today. 
The patch has gotten much bigger and having it ride the train and having the opportunity to play with" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [18:50:27] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127595 (https://phabricator.wikimedia.org/T385571) (owner: 10Ebernhardson) [18:50:50] (03CR) 10Subramanya Sastry: [C:04-1] "A very small chance we may do it Monday, but unlikely." [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [18:51:31] swfrench-wmf: I've done the train deploy, all ready for you now [18:51:43] (03CR) 10Jforrester: "Ack." [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [18:53:01] jeena: great, thank you! I'll start preparing my changes [18:56:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdd) failed on ms-be1081 - https://phabricator.wikimedia.org/T388697#10634418 (10VRiley-WMF) created a Service Request Number: 206964425 This hard drive has been ordered. As it turns out, we don't have any spares onsite. The only spares we might hav... 
[18:58:43] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:58:48] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634433 (10phaultfinder) [19:00:22] (03PS11) 10Scott French: mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [19:01:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1031:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1031 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:03:48] FIRING: PuppetFailure: Puppet has failed on cloudelastic1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:04:34] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:04:41] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:04:47] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:04:54] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:05:20] (03CR) 10Scott French: "Thanks for preparing this, Effie! 
Still LGTM, and I've rebased to address merge conflicts and undo the mw-api-int resize (no longer needed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [19:05:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:40] (03CR) 10Scott French: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [19:06:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1031:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1031 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:07:05] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all pods to -main (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [19:08:34] starting my cleanups now [19:08:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10634465 (10VRiley-WMF) [19:08:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [19:08:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [19:09:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [19:09:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [19:09:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: 
decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10634466 (10VRiley-WMF) 05Open→03Resolved [19:10:27] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [19:10:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [19:11:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [19:11:21] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [19:12:40] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [19:12:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [19:13:00] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [19:13:08] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [19:13:25] FIRING: [5x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:15:17] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:15:44] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:15:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:16:22] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:16:40] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:16:45] (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:17:17] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply 
[19:18:25] FIRING: [5x] SystemdUnitFailed: elasticsearch-disable-readahead.service on cloudelastic1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:55] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:19:03] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:21:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [19:24:04] !log mw-(api-int|jobrunner|parsoid): reverted all traffic back to 'main' release - T383845 [19:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:08] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [19:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634493 (10phaultfinder) [19:26:41] (03CR) 10Scott French: "Thanks for the review, Effie!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127556 (https://phabricator.wikimedia.org/T388799) (owner: 10Scott French) [19:26:44] (03CR) 10Scott French: [C:03+2] mediawiki: enable udp2log forwarding on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127556 (https://phabricator.wikimedia.org/T388799) (owner: 10Scott French) [19:28:02] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1127607 [19:28:49] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1127607 (owner: 10Bking) [19:29:01] (03Merged) 10jenkins-bot: mediawiki: enable udp2log forwarding on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127556 (https://phabricator.wikimedia.org/T388799) (owner: 10Scott French) [19:29:25] (03CR) 10Bking: [C:03+2] "self-merging , as the previously-reverted patch is known to be safe" [puppet] - 10https://gerrit.wikimedia.org/r/1127607 (owner: 10Bking) [19:30:09] (03CR) 10Scardenasmolinar: [C:03+1] PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false [extensions/TheWikipediaLibrary] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127566 (https://phabricator.wikimedia.org/T388073) (owner: 10Reedy) [19:32:37] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [19:33:55] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:34:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:36:12] PROBLEM - Host db1248 #page is DOWN: PING CRITICAL - Packet loss = 100% [19:36:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [19:37:35] I think it just died [19:38:02] looks like one of two vslow group hosts on s4? 
[19:40:07] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [19:40:17] jynus: appropriate to depool db1248? [19:40:21] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1118797544 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:40:36] looks like there's at least one other vslow group host pooled, so I'm thinking yes? [19:40:57] cannot connect from series [19:41:06] but if a DBA has a more informed opinion, then I defer :) [19:41:08] can you depool while I force it power [19:41:16] ack, depooling [19:41:21] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 132144 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:41:36] beat me to it, thanks cwhite ! [19:42:06] !log cwhite@cumin2002 dbctl commit (dc=all): 'depool db1245', diff saved to https://phabricator.wikimedia.org/P74224 and previous config saved to /var/cache/conftool/dbconfig/20250313-194204-cwhite.json [19:42:35] sorry, I had to abort a dog walk, but I'm here [19:42:47] gah, fat-fingered the db number on the commit, it was db1248 [19:42:47] cwhite: wrong host? [19:43:02] repoool and depool the other better np [19:43:16] jynus: nope, was a typo [19:43:24] db1248 is depooled [19:43:35] ah, I see [19:43:41] only the comment was typoed [19:43:48] I missunderstood [19:44:10] then just send an extra log in IRC just so manuel doesn't get worried tomorrow [19:44:19] cwhite: I'm about to apply some helmfile changes to mediawiki's rsyslog, but paused when I saw the page. 
any objections if I go ahead with that now that the host is depooled? [19:44:40] I am trying to get it up, but it doesn't repond on console [19:44:42] !log depooled db1248, unchanged db1245 [19:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:05] swfrench-wmf: I see no issue with that [19:45:19] cwhite: awesome, thanks [19:46:10] created https://phabricator.wikimedia.org/T388837 [19:47:45] !log swfrench@deploy2002 Started scap sync-world: apply rsyslog config changes - T388799 [19:47:47] !log forcing a reboot of db1248 from console T388837 [19:47:49] T388799: php-wmerrors rsyslog rule selects on php7 only - https://phabricator.wikimedia.org/T388799 [19:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:53] T388837: db1248 crash (?) - https://phabricator.wikimedia.org/T388837 [19:48:39] typically it is because some memory corruption that it crashes [19:51:55] hard reset did it! yay [19:52:32] RECOVERY - Host db1248 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [19:52:38] \o/ [19:52:53] do not pool it, dbas will handle it tomorrow [19:53:04] ack, thanks jynus! [19:53:10] normally it has to be recloned, or it may have hw issues ongoing [19:54:36] !log swfrench@deploy2002 Finished scap sync-world: apply rsyslog config changes - T388799 (duration: 08m 09s) [19:54:40] T388799: php-wmerrors rsyslog rule selects on php7 only - https://phabricator.wikimedia.org/T388799 [19:56:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS bullseye [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T2000). [20:00:05] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] here [20:01:09] swfrench-wmf: do we need to pause the backport window? [20:01:34] jeena: apologies, forgot to say explicitly - all done on my end, and thanks again [20:02:17] oh okay. It seemed like there was an incident above so I wasn't sure if it was okay to continue? [20:03:02] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [20:03:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [20:03:08] that latest incident is resolved jeena, feel free to continue :) [20:03:17] thanks cwhite ! [20:03:39] Pppery: I'll start your config changes now [20:03:44] OK [20:03:53] the first change is a no-op. Only the second one does anything [20:03:58] And make sure to purge all of the logo URLs [20:04:05] And then the third one is CI-only [20:04:14] I am supposed to purge logo URLS? [20:04:38] (03PS4) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [20:04:51] I'm not sure how to do that [20:04:57] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [20:05:01] oh thanks [20:09:50] Pppery: just to confirm, I should do the sync (skip check on mwdebug?), then run the purge script (any url is fine, such as https://en.wikipedia.org/static/images/project-logos/newikibooks.png?), and then you will confirm it's working? [20:10:32] (Sorry, I somehow got disconnected) [20:10:45] jeena: hi, please let me know once you're done [20:11:00] You have to pass the specific URL to purgeList (for example https://en.wikipedia.org/static/images/mobile/copyright/wikimedia-wordmark-co.svg) [20:11:39] And I don't know when the right step for this is [20:11:52] oh, so a url for every file in that change? 
[20:12:02] Yeah [20:12:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:26] okay, i'll double check, give me a minute or two [20:12:27] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:45] Amir1: will do. I was also going to backport some changes for reedy [20:13:12] I would be fine with postponing this to next week if you don't want to have to deal with the complicated deployment process [20:13:19] no worries. Thanks for doing the deploys [20:14:27] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:19] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:47] pppery: I think we can do it now [20:15:59] I'm just gonna prepare a list of the files to purge [20:16:32] (03Abandoned) 10Bking: cirrus: Ensure opensearch rundirs are created [puppet] - 10https://gerrit.wikimedia.org/r/1126643 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:20:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634625 (10phaultfinder) [20:21:36] Also please don't skip mwdebug for my changes - I think there's a possibility that Varnish skips caching if mwdebug is on so I will be able to test it [20:21:37] (03CR) 10Hashar: "I am still not sure whether gerrit_abusers ban/block or whether it merely throttles and then block for X minutes." [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [20:21:46] (even though you will need to do the purge after full deploy) [20:23:29] yeah I won't skip it [20:23:38] I think you can test it on mwdebug [20:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [20:24:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127156 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [20:24:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [20:24:42] oh already! 
:) [20:25:05] (03Merged) 10jenkins-bot: Logos: Fix order of guwwikinews in yaml file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127156 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [20:25:06] (03Merged) 10jenkins-bot: Rebuild logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127160 (https://phabricator.wikimedia.org/T387448) (owner: 10Pppery) [20:25:15] (03Merged) 10jenkins-bot: logos: have CI fail on uncommited logos.php changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127110 (https://phabricator.wikimedia.org/T341412) (owner: 10Hashar) [20:25:34] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1127160|Rebuild logo files (T387448)]], [[gerrit:1127156|Logos: Fix order of guwwikinews in yaml file (T387448)]], [[gerrit:1127110|logos: have CI fail on uncommited logos.php changes (T341412)]] [20:25:39] T387448: Logo update script tries to make some logos huge, puts things in non-alphabetic order - https://phabricator.wikimedia.org/T387448 [20:25:39] T341412: CI on mediawiki-config should assert that the logos.php is generated by logos/manage.py - https://phabricator.wikimedia.org/T341412 [20:28:28] !log jhuneidi@deploy2002 hashar, pppery, jhuneidi: Backport for [[gerrit:1127160|Rebuild logo files (T387448)]], [[gerrit:1127156|Logos: Fix order of guwwikinews in yaml file (T387448)]], [[gerrit:1127110|logos: have CI fail on uncommited logos.php changes (T341412)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:36] looking [20:30:02] This is in fact testable (I confirmed I can see the new logo files with k8s-mwdebug). I still have more testing to do [20:30:28] i'll wait for your signal to continue sync [20:30:47] I'm loading every affected wiki and confirming that the logo still looks reasonable [20:33:31] Amir1: were you going to do a backport? [20:33:42] yup [20:33:42] Still looking, about half-way done [20:34:10] Amir1: is it a config one or another repo? 
[20:34:41] I have a couple actually but I think all are mw config (one of them is somewhat important the rest can wait) [20:35:10] your should do yours after this then since the config ones should be quick [20:35:28] then I'll continue from there [20:35:35] sure, thanks! [20:35:40] ping me once I can move forward [20:35:46] yup [20:37:04] OK, tested, and ready to proceed [20:37:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10634653 (10Jclark-ctr) @elukey thanks looks to be right to me! I do have another ticket with same issue T384979 would you be able to assist? when will t... [20:37:32] The changes to new Vector logos are noticable but minor enough that they're fine. The changes to legacy Vector logos are completely imperceptible [20:37:46] (but I did download the new version and compare the file size to make sure that it really was the new logo) [20:37:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10634655 (10Jclark-ctr) a:05Papaul→03Jclark-ctr [20:37:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10634657 (10Jclark-ctr) 05Open→03Resolved [20:38:26] continuing with sync [20:38:32] !log jhuneidi@deploy2002 hashar, pppery, jhuneidi: Continuing with sync [20:40:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634666 (10phaultfinder) [20:40:48] pppery: thank you so much for your assistance! :) [20:41:06] pppery: cause I don't know much about how logos are managed nowadays! 
[20:41:28] I didn't want to know, but I found myself involved in post-creation for new wikis which often involves a logo change [20:41:32] and so I had to learn [20:44:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634668 (10Jclark-ctr) [20:44:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634669 (10Jclark-ctr) [20:44:53] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127160|Rebuild logo files (T387448)]], [[gerrit:1127156|Logos: Fix order of guwwikinews in yaml file (T387448)]], [[gerrit:1127110|logos: have CI fail on uncommited logos.php changes (T341412)]] (duration: 19m 18s) [20:44:58] T387448: Logo update script tries to make some logos huge, puts things in non-alphabetic order - https://phabricator.wikimedia.org/T387448 [20:44:58] T341412: CI on mediawiki-config should assert that the logos.php is generated by logos/manage.py - https://phabricator.wikimedia.org/T341412 [20:45:04] about to run the purge script [20:45:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634672 (10Jclark-ctr) [20:45:18] pppery: done [20:45:39] Amir1: ready for you [20:45:45] awesome [20:45:56] (03CR) 10Ladsgroup: [C:03+2] Bump the thumbnail steps ratio to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127490 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [20:46:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127490 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [20:46:44] (03Merged) 
10jenkins-bot: Bump the thumbnail steps ratio to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127490 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [20:46:58] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1127490|Bump the thumbnail steps ratio to 10% (T360589)]] [20:47:02] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [20:48:35] (03CR) 10Ladsgroup: [C:03+2] Revert "changeprop-jobqueue: Fully disable categorymembership job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127528 (owner: 10Ladsgroup) [20:48:45] Thanks Jeena, and you're welcome hashar [20:49:17] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [20:49:23] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [20:49:56] (03Merged) 10jenkins-bot: Revert "changeprop-jobqueue: Fully disable categorymembership job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127528 (owner: 10Ladsgroup) [20:50:00] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1127490|Bump the thumbnail steps ratio to 10% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:52:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634686 (10Jclark-ctr) @RKemper Hey node /^elastic11(0[8-9]|1[0-9]|2[0-2])\.eqiad\./ { is incorrect for servers i a... 
[20:52:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634687 (10Jclark-ctr) a:03RKemper [20:53:11] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [20:54:11] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [20:54:26] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [20:54:36] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [20:56:11] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [20:56:22] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [20:57:21] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [20:59:55] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127490|Bump the thumbnail steps ratio to 10% (T360589)]] (duration: 12m 56s) [20:59:58] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250313T2100) [21:00:06] jeena: I'm done for now, back to you [21:00:11] Thanks Amir1 [21:00:29] I'll wait 5 minutes in case there are any web team deploys [21:00:37] We're gonna be using it today! [21:00:41] sorry [21:00:54] Let me see what we're deploying, but I'll try to make it as quick as possible [21:03:21] okay, thanks [21:07:57] jeena: we're gonna hold off on our deploy so [21:07:59] go ahead! [21:08:12] alright, thanks toyofuku [21:08:21] np!! 
[21:08:26] :) [21:10:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/TheWikipediaLibrary] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127566 (https://phabricator.wikimedia.org/T388073) (owner: 10Reedy) [21:10:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127582 (https://phabricator.wikimedia.org/T385452) (owner: 10Reedy) [21:10:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/Score] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127573 (https://phabricator.wikimedia.org/T388821) (owner: 10Reedy) [21:10:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/ArticlePlaceholder] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127570 (https://phabricator.wikimedia.org/T388474) (owner: 10Reedy) [21:10:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/CharInsert] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127572 (https://phabricator.wikimedia.org/T388820) (owner: 10Reedy) [21:10:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127591 (https://phabricator.wikimedia.org/T388255) (owner: 10Reedy) [21:11:18] (03PS1) 10Eevans: data-gateway: update image to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127640 (https://phabricator.wikimedia.org/T370470) [21:11:43] (03Merged) 10jenkins-bot: PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false [extensions/TheWikipediaLibrary] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127566 (https://phabricator.wikimedia.org/T388073) (owner: 10Reedy) [21:12:00] (03Merged) 10jenkins-bot: 
FilterEvaluator::rmdoubles: Disable PCRE JIT for this call [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127582 (https://phabricator.wikimedia.org/T385452) (owner: 10Reedy) [21:12:58] (03PS2) 10Eevans: data-gateway: update image to v1.0.12 (staging-only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127640 (https://phabricator.wikimedia.org/T370470) [21:13:39] (03CR) 10Eevans: [C:03+2] data-gateway: update image to v1.0.12 (staging-only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127640 (https://phabricator.wikimedia.org/T370470) (owner: 10Eevans) [21:13:44] (03CR) 10Eevans: [V:03+2 C:03+2] data-gateway: update image to v1.0.12 (staging-only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127640 (https://phabricator.wikimedia.org/T370470) (owner: 10Eevans) [21:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634740 (10phaultfinder) [21:14:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudelastic1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:16:21] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [21:16:41] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [21:23:53] (03PS1) 10Reedy: FilterEvaluator::rmspecials: Disable PCRE JIT for this call too [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127646 (https://phabricator.wikimedia.org/T385452) [21:24:45] (03Merged) 10jenkins-bot: Score: Handle parser passing $code of null and bail out [extensions/Score] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127573 (https://phabricator.wikimedia.org/T388821) (owner: 10Reedy) [21:24:47] (03Merged) 10jenkins-bot: SidebarBeforeOutputHookHandler::getItemId: Bail early if Title 
is null [extensions/ArticlePlaceholder] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127570 (https://phabricator.wikimedia.org/T388474) (owner: 10Reedy) [21:24:48] (03Merged) 10jenkins-bot: CharInsert: If $data is null, bail out [extensions/CharInsert] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127572 (https://phabricator.wikimedia.org/T388820) (owner: 10Reedy) [21:24:52] (03Merged) 10jenkins-bot: ApiLogin: Don't break BotPasswords if password or user is blank, just error [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127591 (https://phabricator.wikimedia.org/T388255) (owner: 10Reedy) [21:25:18] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1127566|PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false (T388073)]], [[gerrit:1127582|FilterEvaluator::rmdoubles: Disable PCRE JIT for this call (T385452)]], [[gerrit:1127573|Score: Handle parser passing $code of null and bail out (T388821)]], [[gerrit:1127570|SidebarBeforeOutputHookHandler::getItemId: Bail early if [21:25:18] Title is null (T388474)]], [[gerrit:1127572|CharInsert: If $data is null, bail out (T388820)]], [[gerrit:1127591|ApiLogin: Don't break BotPasswords if password or user is blank, just error (T388255)]] [21:25:25] T388073: PHP Deprecated: Automatic conversion of false to array is deprecated - https://phabricator.wikimedia.org/T388073 [21:25:25] T385452: PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385452 [21:25:25] T388821: PHP Deprecated: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T388821 [21:25:26] T388474: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T388474 [21:25:26] T388820: PHP Deprecated: preg_replace_callback(): Passing null to 
parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T388820 [21:25:27] T388255: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T388255 [21:28:03] !log jhuneidi@deploy2002 reedy, jhuneidi: Backport for [[gerrit:1127566|PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false (T388073)]], [[gerrit:1127582|FilterEvaluator::rmdoubles: Disable PCRE JIT for this call (T385452)]], [[gerrit:1127573|Score: Handle parser passing $code of null and bail out (T388821)]], [[gerrit:1127570|SidebarBeforeOutputHookHandler::getItemId: Bail early if Title i [21:28:03] s null (T388474)]], [[gerrit:1127572|CharInsert: If $data is null, bail out (T388820)]], [[gerrit:1127591|ApiLogin: Don't break BotPasswords if password or user is blank, just error (T388255)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:10] !log jhuneidi@deploy2002 reedy, jhuneidi: Continuing with sync [21:29:56] (03PS8) 10Andrea Denisse: grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) [21:30:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:25] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127566|PreferenceHelper: Handle another case of getGlobalPreferencesValues returning false (T388073)]], [[gerrit:1127582|FilterEvaluator::rmdoubles: Disable PCRE JIT for this call (T385452)]], [[gerrit:1127573|Score: Handle parser passing $code of null and bail out (T388821)]], [[gerrit:1127570|SidebarBeforeOutputHookHandler::getItemId: Bail early i [21:34:25] f Title is null 
(T388474)]], [[gerrit:1127572|CharInsert: If $data is null, bail out (T388820)]], [[gerrit:1127591|ApiLogin: Don't break BotPasswords if password or user is blank, just error (T388255)]] (duration: 09m 06s) [21:34:31] T388073: PHP Deprecated: Automatic conversion of false to array is deprecated - https://phabricator.wikimedia.org/T388073 [21:34:31] T385452: PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385452 [21:34:31] T388821: PHP Deprecated: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T388821 [21:34:32] T388474: PHP Deprecated: str_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T388474 [21:34:32] T388820: PHP Deprecated: preg_replace_callback(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T388820 [21:34:33] T388255: PHP Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T388255 [21:35:15] (03PS1) 10Gergő Tisza: Fix some SUL3 shared domain settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T375796) [21:35:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:31] !log lists1004 - systemctl start wmf_auto_restart_exim4 which was failed for some reason [21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:58] backports finished [21:37:32] thanks jeena [21:37:37] gonna do another one now :D [21:37:43] (03CR) 10Reedy: [C:03+2] FilterEvaluator::rmspecials: Disable PCRE 
JIT for this call too [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127646 (https://phabricator.wikimedia.org/T385452) (owner: 10Reedy) [21:37:45] you're welcome! 😆 [21:40:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [21:45:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [21:46:37] (03PS4) 10Ryan Kemper: elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) [21:47:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634865 (10RKemper) Pushed out https://gerrit.wikimedia.org/r/c/operations/puppet/+/1115122 which should fix the afor... 
[21:48:36] (03Merged) 10jenkins-bot: FilterEvaluator::rmspecials: Disable PCRE JIT for this call too [extensions/AbuseFilter] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127646 (https://phabricator.wikimedia.org/T385452) (owner: 10Reedy) [21:49:25] (03CR) 10Bking: [C:03+1] elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: 10Ryan Kemper) [21:49:36] (03CR) 10Ryan Kemper: [C:03+2] elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: 10Ryan Kemper) [21:50:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634871 (10phaultfinder) [21:50:45] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1127646|FilterEvaluator::rmspecials: Disable PCRE JIT for this call too (T385452)]] [21:50:49] T385452: PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385452 [21:51:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10634875 (10RKemper) a:05RKemper→03Jclark-ctr Okay, we should be good on our end [21:52:59] (03Abandoned) 10Subramanya Sastry: Fixes to "Parsoid Fragment Support v2" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127076 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [21:53:33] !log reedy@deploy2002 reedy: Backport for [[gerrit:1127646|FilterEvaluator::rmspecials: Disable PCRE JIT for this call too (T385452)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:00:11] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 1040 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [22:00:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:25] FIRING: [10x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:29] (03CR) 10Dzahn: "if profile::gerrit:gerrit_abusers is used then it's a permanent drop that does not expire." [puppet] - 10https://gerrit.wikimedia.org/r/1127520 (owner: 10Hashar) [22:10:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10634948 (10phaultfinder) [22:12:06] (03PS1) 10Ladsgroup: Temporarily enable mobile sitenotice for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127654 [22:12:20] (03CR) 10Dzahn: "if the comment of Moritz is addressed then this is a +1. can be tested on gerrit2003 without affecting servers in production" [puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [22:12:49] (03CR) 10Cwhite: grafana: Normalize user fields and validate input in LDAP sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [22:13:10] (03CR) 10Dzahn: "I'd disable puppet on gerrit1004 and test it on gerrit2003 and gerrit2002. then enabled puppet on gerrit1004." 
[puppet] - 10https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [22:52:15] Reedy: let me know when you're done [22:52:34] !log reedy@deploy2002 reedy: Continuing with sync [22:52:37] Amir1: bahh [22:52:39] forgot about that [22:52:46] been sat for an hour [22:53:05] I was like "the test for this seems to be quite complex" [22:53:20] this is why I usually use sync-dir :P [22:58:51] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127646|FilterEvaluator::rmspecials: Disable PCRE JIT for this call too (T385452)]] (duration: 68m 05s) [22:58:56] T385452: PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated - https://phabricator.wikimedia.org/T385452 [22:58:58] Amir1: there you go, sorry [22:59:43] thanks! no worries [23:00:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127654 (owner: 10Ladsgroup) [23:01:14] (03Merged) 10jenkins-bot: Temporarily enable mobile sitenotice for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127654 (owner: 10Ladsgroup) [23:01:34] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1127654|Temporarily enable mobile sitenotice for fawiki]] [23:04:16] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1127654|Temporarily enable mobile sitenotice for fawiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:05:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [23:11:42] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127654|Temporarily enable mobile sitenotice for fawiki]] (duration: 10m 07s) [23:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10635100 (10phaultfinder) [23:29:34] Reedy: I think you can skip the prompt if you add '--yes' to your scap backport 
command [23:36:37] (03PS1) 10Jdlrobson: Enable Vector 2022 on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) [23:38:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson)