[00:44:27] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:31] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.07% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:59:17] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:41] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:47] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:37] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:35] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:51] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:51] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:43] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:49] PROBLEM - 
Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:11] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:19] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:13:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [04:03:59] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:59] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:04:11] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:17] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:11] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:05] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Joe) [05:29:02] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team 
(Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) @Joe did so, thanks. [05:31:15] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:57] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) I run an initial test running some 1000s of production URLs. It appears that we are about to hit max_accelerated_files (curren... [05:34:35] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:19] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 13.73 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [05:37:31] (03PS1) 10Effie Mouzeli: mwdebug: bump max_accelerated_files [deployment-charts] - 10https://gerrit.wikimedia.org/r/725500 (https://phabricator.wikimedia.org/T280497) [05:40:33] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:49] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [05:44:54] (03Merged) 10jenkins-bot: changeprop-jobqueue: Make new jobs of Wikidata dispatcher high priority [deployment-charts] - 10https://gerrit.wikimedia.org/r/725287 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [05:45:47] (03PS1) 10Ladsgroup: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 
(https://phabricator.wikimedia.org/T48643) [05:46:44] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for TTaylor - https://phabricator.wikimedia.org/T292299 (10Joe) Hi @ttaylor, I guess in your case we don't need signoff from the analytics team :) I don't know how we should proceed re: manager approval for access though. [05:47:28] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Joe) [05:47:39] !log ladsgroup@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [05:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:06] !log ladsgroup@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [05:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:10] _joe_: deploying btw [05:50:45] <_joe_> Amir1: ack, I'm around in case of need [05:50:46] !log ladsgroup@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[05:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:52] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10Joe) [05:51:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10Joe) p:05Triage→03Medium [05:51:54] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Joe) p:05Triage→03Medium [05:52:37] 10SRE, 10SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10Joe) [05:53:58] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Joe) [05:54:19] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Joe) Hi @DannyH can you please approve this access request? [05:55:11] 10SRE, 10ops-eqiad: Degraded RAID on db1126 - https://phabricator.wikimedia.org/T292325 (10Joe) p:05Triage→03High [05:56:14] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Joe) p:05Triage→03Medium [05:56:51] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Joe) p:05Triage→03High [06:02:07] (03PS2) 10Ladsgroup: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) [06:02:14] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) Hi @SWakiyama once I have approval from your manager, I'll enable your account. Thanks for your patience. 
[06:03:31] <_joe_> <3 [06:04:51] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:39] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 12.35 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [06:10:57] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:43] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 2.268 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [06:22:39] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:37] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:39] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:59] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:45] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 11.29 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [06:38:58] (03Abandoned) 10Elukey: helmfile: add secrets for the admin_ng configs [puppet] - 10https://gerrit.wikimedia.org/r/722877 (owner: 10Elukey) [06:40:59] PROBLEM - Check 
systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:20] !log depool + restart blazegraph + restart updater on wdqs1004 [06:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:55] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:57] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 9.409e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [06:45:02] <_joe_> elukey: all of the nodes seem in a crisis [06:45:33] _joe_ I saw only two of them, 1004 and 1006, with the updater stopped [06:45:35] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:46:12] 1004 looks ok atm, but it needs to catch up IIRC [06:46:56] blazegraph was not feeling well [06:48:34] (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti216 also for ganeti2025 [puppet] - 10https://gerrit.wikimedia.org/r/725327 (owner: 10Muehlenhoff) [06:48:57] yep https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8&refresh=1m&from=now-3h&to=now [06:49:33] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 1.722 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [06:50:31] 1006 stopped sending metrics a day ago [06:55:31] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:57:43] mmmmm [06:57:51] I haven't touched 1006 
[06:58:15] yeah ok the updater is still down [06:59:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31469/console" [puppet] - 10https://gerrit.wikimedia.org/r/725326 (owner: 10Elukey) [07:01:22] (03CR) 10Elukey: Create new deploy group for k8s ML services [puppet] - 10https://gerrit.wikimedia.org/r/725326 (owner: 10Elukey) [07:01:31] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:57] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 8 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) We have gone from ~200 to... [07:02:06] !log swift eqiad-prod: add weight to ms-be10[64-67] - T290546 [07:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:13] T290546: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 [07:06:20] (03PS3) 10Ladsgroup: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) [07:08:30] (03PS1) 10Muehlenhoff: Remove Parsoid jessie debs [puppet] - 10https://gerrit.wikimedia.org/r/725670 [07:10:31] (03PS1) 10Muehlenhoff: Remove stray comment [puppet] - 10https://gerrit.wikimedia.org/r/725671 [07:10:38] !log joal@deploy1002 Started deploy [analytics/refinery@38f3adc]: Hotfix analytics deploy [analytics/refinery@38f3adc] [07:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Drop absented systemd timers of test wikidata change dispatching [puppet] - 10https://gerrit.wikimedia.org/r/725261 
(https://phabricator.wikimedia.org/T291610) (owner: 10Ladsgroup) [07:13:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:16:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [07:18:02] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1004.wmnet [07:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:11] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1006.wmnet [07:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:27] !log depool + restart blazegraph + restart updater for wdqs1006 [07:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:39] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [07:19:49] !log restarting blazegraph on wdqs2001 & wdqs2004 (allocators burning too quickly) [07:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:31] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 1.678e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [07:21:04] (03CR) 10Elukey: [C: 03+2] Remove stray comment [puppet] - 10https://gerrit.wikimedia.org/r/725671 (owner: 10Muehlenhoff) [07:21:33] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:02] (03PS3) 10Elukey: Create new deploy group for k8s ML services [puppet] - 10https://gerrit.wikimedia.org/r/725326 [07:26:59] 10SRE, 10ops-eqiad, 10Analytics-Clusters: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10elukey) @BTullis @razzi can you sync with Chris to perform this maintenance during the next days? [07:27:35] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [07:27:37] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [07:28:35] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:29:56] !log joal@deploy1002 Finished deploy [analytics/refinery@38f3adc]: Hotfix analytics deploy [analytics/refinery@38f3adc] (duration: 19m 18s) [07:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:14] !log joal@deploy1002 Started deploy [analytics/refinery@38f3adc] (thin): Hotfix analytics deploy THIN [analytics/refinery@38f3adc] [07:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:21] !log joal@deploy1002 Finished deploy [analytics/refinery@38f3adc] (thin): Hotfix analytics deploy THIN [analytics/refinery@38f3adc] (duration: 00m 06s) [07:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:03] !log joal@deploy1002 Started deploy 
[analytics/refinery@38f3adc] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@38f3adc] [07:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:35] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [07:31:35] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [07:32:35] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:32:46] (03PS1) 10Ladsgroup: mediawiki: Stop wikidata dispatching via systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/725673 (https://phabricator.wikimedia.org/T48643) [07:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2001:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:34:30] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [07:34:48] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724838 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [07:37:17] !log joal@deploy1002 Finished deploy [analytics/refinery@38f3adc] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@38f3adc] (duration: 06m 14s) [07:37:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:05] 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Joe) p:05Triage→03Medium [07:38:54] kormat: I'm going to resolve the db lag incident in VO, FYI [07:39:10] it never recovered of course [07:40:39] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:41:59] godog: i'll admit i've never understood the VO model [07:44:27] kormat: heheh I think in this case icinga never sent a recovery to VO [07:44:45] I should have resolved the incident yesterday heh [07:57:30] (03PS1) 10Elukey: Move analytics-hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/725687 (https://phabricator.wikimedia.org/T288625) [08:03:56] 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10JMeybohm) You think we can piggyback the necessary helmfile.yaml changes for the helm3 migration (T251305) with this @Jelto ? [08:06:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:06:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:06:53] looking [08:06:59] * volans here if needed [08:07:58] looks like an analytics job [08:08:02] is saturating row A [08:09:31] lovely [08:09:41] lemme check [08:10:13] not 100% sure about analytics yet [08:10:16] still digging [08:10:35] if you can give me a host etc.. 
I can quickly check [08:10:38] and kill the job (in case) [08:10:48] yeah, all the an-workers [08:11:00] an-worker1103, an-worker1139 [08:11:08] an-worker1118 [08:11:11] gmodena: around? [08:11:12] and many others [08:11:23] step 1: blame elukey. step 2: investigate [08:11:29] elukey yup [08:11:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:11:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:12:09] gmodena: do you mind if I kill your spark job? [08:12:39] elukey killed [08:12:56] and looking into it [08:13:02] gmodena: <3 not sure if it is yours, let's see [08:13:05] it was the biggest [08:13:06] it should have been running with conservative settings [08:13:23] thanks [08:14:25] gmodena: yes probably it is something else, I see two other big jobs [08:14:28] in case I'll try to kill them [08:15:35] elukey ack. I'll hold off from submitting new jobs till you give a green light [08:15:36] XioNoX: still saturated? [08:15:39] gmodena: <3 [08:15:50] let me know if there's anything I can do to help investigate [08:16:09] checking [08:16:59] ah I didn't see the workers above, checking as well [08:17:00] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10jcrespo) @cmjohnson or @Jclark-ctr can we get a request for a disk replacement sent to Dell? This host was bought last year. [08:17:11] elukey: recovering [08:17:41] see https://librenms.wikimedia.org/graphs/to=1633335300/id=14308/type=port_bits/from=1633313700/ for example [08:18:23] ok let's wait a bit more and see [08:19:30] gmodena: to fill you in - we have 10g NICs for most of the hadoop worker nodes, and they share switches / routers infrastructure with the rest of the nodes in production. 
When two or more an-workers start to push a lot of data over the network they can saturate (partially) shared links [08:20:16] so even if the spark job is not consuming a ton of hadoop resources, if it causes a lot of data shuffling around the network it may impact other services [08:20:40] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10fgiunchedi) @papaul is there anything needed from us at this time? thank you! [08:20:47] elukey ah! This is really good to know. [08:20:57] (03PS1) 10Ladsgroup: Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) [08:21:11] elukey FWIW the job I just killed was submitted from stat1005 [08:22:50] ^ clarakosi: [08:31:42] cc topranks for visibility too ^ [08:32:23] as follow up we can discuss with Joseph about Spark data shuffling etc.., there may be some good best practices to use to avoid some use cases [08:32:28] but it is difficult [08:33:06] long term we may want to have a separate set of analytics racks/switches/etc.. but not sure how horrible this could be in terms of engineering hours :D [08:33:40] Thanks XioNox, sry had missed busy fighting gmail filters.... [08:33:52] elukey: I think if we take that path we end up managing 100 different networks. [08:34:27] topranks: yeah I can imagine, but the amount of data shuffled in hadoop is going to get worse over time/years [08:34:31] Probably some kind of traffic classification / QoS priority rules on the switches is the answer, so that jobs like this can be marked non-essential, and more important traffic is sent first if something has to be dropped. [08:34:32] this is why I was saying that [08:34:47] ah nice! Yes even something like that would be great [08:35:08] qos is famously easy to configure and use [08:35:09] I'd be interested to know how to do it.. is it ok to open a task about it? 
Then we may add some thoughts in there [08:35:24] kormat: ah I missed your hilarious jokes ;) [08:35:28] :D [08:35:41] <_joe_> topranks: I doubt that to be true (we end up managing 100 different networks) [08:36:07] elukey: I'll be adding something to our design doc for eqiad in coming weeks about it, happy to discuss then. [08:36:22] topranks: sure ping me anytime [08:36:34] <_joe_> analytics has specific needs and yes, the alternative is to make it logically segregated (including bandwidth limits/QoS) or physically segregated, but the ship for the latter has sailed [08:36:43] _joe_: clearly I was exaggerating, but I'll stick to my guns and say "separate network" is not the way to deal with this kind of issue every time it comes up. [08:36:56] yup, QoS or dedicated switches are the 2 main options [08:37:45] analytics has the advantage of being already quite separated, as they use different vlans [08:38:03] "ship has already sailed" I've heard a few times since I started, but others have said keep an open mind ¯\_(ツ)_/¯ [08:38:36] other factors like how redundant analytics needs to be are at play here [08:38:42] topranks: the ship sailed, but sank before it left the harbour [08:38:56] <_joe_> topranks: oh I agree in general :) [08:39:27] <_joe_> topranks: well I don't think moving 100s of servers across racks/rows would be feasible. I mean, it is, but the cost/benefit seems thin [08:40:00] I'm not sure anyone is suggesting that, but maybe there are things I don't know going on. [08:40:08] if capacity wise they only need 80 ports (2 switches), but can't risk losing half their servers with a switch failure, it's better to mutualise them [08:40:42] On paper it's not _that_ tricky to classify certain kinds of traffic. That doesn't mean if you have known high bandwidth generators of traffic you wouldn't strategically place them. 
[08:40:48] the more we are spread across rows/racks, the better it generally is for analytics [08:40:54] it also depends on how well the analytics software can classify their traffic [08:42:43] (03PS4) 10Muehlenhoff: ssh: Puppetize GatewayPorts config option for sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [08:43:12] prerequisite to QoS is also deploying internal netflow [08:45:44] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [08:48:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10jcrespo) [08:49:27] (03CR) 10Muehlenhoff: [C: 03+2] ssh: Puppetize GatewayPorts config option for sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [08:49:34] (03CR) 10Elukey: [C: 03+2] "pcc looks good, I think it is worth testing more. Proceeding, please ping me if there is something that should be followed up!" [puppet] - 10https://gerrit.wikimedia.org/r/725326 (owner: 10Elukey) [08:50:14] moritzm: puppet-merge when you prefer [08:50:21] elukey: ack, doing that now [08:50:35] done [08:51:26] thanks! 
[08:56:52] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Kormat) [08:58:05] (03CR) 10Muehlenhoff: "Jack/Elliott: This is now merged and can be used in your project (along with a minor change to avoid a new newline in the case no gateway " [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [08:58:08] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@071f7c3] (eqiad): Increase mirrored traffic to 100% for eqiad [08:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:02] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@071f7c3] (eqiad): Increase mirrored traffic to 100% for eqiad (duration: 00m 54s) [08:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:06] FYI mbsantos ^ [09:04:00] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Kormat) Updated description with idrac output. [09:13:07] !log hbal -L -G row_C -X on ganeti01.svc.eqiad.wmnet [09:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:03] (03CR) 10Volans: "LGTM, one inaccuracy in a docstring and a nit in the tests." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [09:17:09] (03CR) 10Jelto: [C: 03+2] aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.3 [puppet] - 10https://gerrit.wikimedia.org/r/725303 (https://phabricator.wikimedia.org/T292256) (owner: 10Jelto) [09:25:21] (03PS7) 10Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) [09:25:48] (03CR) 10Vgutierrez: haproxy: Basic TLS terminator based on HAProxy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:27:40] (03PS28) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [09:27:52] (03PS29) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [09:28:02] (03PS1) 10Urbanecm: dewiki, nlwiki: Bump Growth features to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725749 (https://phabricator.wikimedia.org/T288420) [09:28:08] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [09:32:49] (03CR) 10Vgutierrez: haproxy: Allow configuring TLS options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:34:23] (03CR) 10Urbanecm: "PCC still happy: https://puppet-compiler.wmflabs.org/compiler1001/31472/mwmaint1002.eqiad.wmnet/fulldiff.html." 
[puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [09:40:45] (03CR) 10Michael Große: [C: 03+1] Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [09:40:50] (03CR) 10Jbond: [C: 03+1] "LGTM i made a comment but decided to resolve it as its so minor" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [09:43:19] (03PS6) 10Vgutierrez: haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) [09:47:32] (03CR) 10Jbond: "thanks for the review, updated" [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [09:47:34] (03PS7) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [09:55:39] (03PS8) 10Vgutierrez: haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) [09:55:41] (03CR) 10Jbond: [C: 03+1] debdeploy/base: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [09:56:27] (03CR) 10Vgutierrez: haproxy: STEK support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:06:54] (03PS3) 10Vgutierrez: cache::haproxy: Configure sslcert::ocsp [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) [10:06:57] (03PS1) 10Jbond: nagios_common: add SSL certificate validation to remaining http checks [puppet] - 10https://gerrit.wikimedia.org/r/725766 [10:15:41] (03PS4) 10Jbond: icinga: add recheck_failed_services 
function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [10:15:45] (03CR) 10Jbond: "updated thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [10:20:18] (03CR) 10Volans: [C: 03+1] "LGTM, reply with optional nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [10:27:34] 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Jelto) Should be possible and sounds like a good idea to piggyback this with T251305 if we are going to re-deploy all services anyway. [10:28:05] (03PS1) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 [10:28:47] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [10:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T1030). 
[10:37:54] (03PS4) 10Michael Große: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [10:37:56] (03CR) 10Michael Große: Enable dispatching via jobs everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [10:37:58] (03PS2) 10Michael Große: Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [10:45:57] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: bump max_accelerated_files [deployment-charts] - 10https://gerrit.wikimedia.org/r/725500 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [10:47:21] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725785 (https://phabricator.wikimedia.org/T292088) (owner: 10Michael Große) [10:49:53] (03Merged) 10jenkins-bot: mwdebug: bump max_accelerated_files [deployment-charts] - 10https://gerrit.wikimedia.org/r/725500 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [10:54:59] PROBLEM - ganeti-wconfd running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:55:38] <_joe_> uhh [10:56:49] isn't that one of the test hosts? [10:56:57] moritzm: ^ [10:57:11] <_joe_> it's indeed not running [10:57:47] yep, that's (staging/test) (ganeti_test), not really critical [10:58:23] <_joe_> ok [10:58:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/725328 (owner: 10PipelineBot) [10:59:03] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T1100). [11:00:04] Juan90264, Inductiveload, and kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] I can deploy in a few [11:00:20] o/ [11:00:30] Or Lucas can [11:00:37] sure, I can start if you want [11:00:43] Feel free to. [11:00:56] I'm in an elevator. [11:00:57] I am around ^_^ [11:01:04] (03PS4) 10Jgiannelos: Add script to send tile invalidation events [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) [11:01:13] ok, I’m looking at the calendar [11:01:20] Sorry, bit late. Around now. [11:01:45] Juan90264 isn’t here yet it seems, so let’s star twith inductiveload [11:02:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 from my side. Don't forget to Bump the version in Chart.yaml to pick up the changes introduced in this PR and it is good to go." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [11:02:54] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/725328 (owner: 10PipelineBot) [11:02:57] (03CR) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:03:01] one very minor comment [11:03:02] * urbanecm is at laptop now, if needed [11:03:20] (03PS2) 10Hashar: gitlab: enable Content-Security-Policy reporting [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) [11:04:01] !log pool wtp1025 [11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:08] !log depool wtp1026 for tests [11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:43] (03CR) 10Urbanecm: [C: 04-1] Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:04:50] Lucas_WMDE: i have a bigger comment :/ [11:04:58] mh ok [11:05:23] (03PS7) 10Inductiveload: Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [11:05:39] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:40] I was also just about to ask – maybe the +wikisource section should go above testwiki and commonswiki, since it’s a dblist? 
but idk how strict we usually are about this [11:05:52] I think it should be deleted at this point [11:06:00] ok well let's forget the wiksource bit [11:06:10] ok let’s just deploy the Commons part for now [11:06:11] (03CR) 10Hashar: "I have copied the upstream configuration at https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/gitlab.yml.example . Empty settings " [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [11:06:13] as i said in CR, it won't work -- and I'd prefer checking why is wgAllowCopyUploads false at most wikis before blindly enabling it [11:06:15] task can remain open for the other part [11:06:17] yeah [11:06:19] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [11:06:29] shall i make the change? [11:06:33] yes please [11:06:43] same for the other one (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/725042) [11:06:47] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:57] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:24] (03PS8) 10Inductiveload: Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [11:07:29] Lucas_WMDE: fyi I'd like to undeploy GettingStarted if time permits -- but of course, no rush. I can do it later, too. [11:07:35] ok [11:07:42] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:18] (03CR) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:09:32] (03PS3) 10Inductiveload: Add wikisource-bot.toolforge.org to Commons/Wikisource copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) [11:09:45] (03CR) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:09:48] (03PS2) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 [11:10:03] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[11:10:06] ok both changed [11:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove deprecated SectionTranslationTargetLanguage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:10:37] (03PS9) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:10:44] (03CR) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:10:48] (03PS10) 10Lucas Werkmeister (WMDE): Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:10:53] rebased [11:11:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:11:10] let’s do the first one [11:11:32] (03PS4) 10Inductiveload: Add wikisource-bot.toolforge.org to Commons copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) [11:11:56] (03Merged) 10jenkins-bot: Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:12:25] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[11:12:29] inductiveload: the IA-Upload change should be on mwdebug1002, can you test it? [11:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:56] does that mean beta commons? [11:13:01] or commons commons? [11:13:21] Commons commons, but only one one backend server, which you can select using the WikimediaDebug extension https://wikitech.wikimedia.org/wiki/WikimediaDebug [11:13:28] inductiveload: you need to use https://wikitech.wikimedia.org/wiki/WikimediaDebug to connect to commons commons via a debug server [11:13:38] (or i can if you don't have upload_by_url at commons) [11:13:51] I’ll rebase the other change in the meantime, it probably has a merge conflict [11:14:31] (03PS5) 10Lucas Werkmeister (WMDE): Add wikisource-bot.toolforge.org to Commons copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) (owner: 10Inductiveload) [11:14:41] Lucas_WMDE: works for me (domain is listed at https://commons.wikimedia.org/wiki/Special:GWToolset) [11:14:51] ok, syncing [11:16:21] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720058|Add IA-Upload tool domains to Commons wgCopyUploadsDomains (T287241)]] (duration: 00m 59s) [11:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:29] T287241: Add https://ia-upload.wmcloud.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T287241 [11:16:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add wikisource-bot.toolforge.org to Commons copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) (owner: 10Inductiveload) [11:16:50] and now the other one [11:17:06] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/725799 (owner: 10L10n-bot) [11:17:12] (03CR) 10Hashar: "I am missing something, https://puppet-compiler.wmflabs.org/compiler1002/1002/ reports there are no differences:" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [11:17:53] worked: https://commons.wikimedia.org/wiki/File:F%C3%A9val_-Le_poisson_d%27or(1863).djvu [11:18:03] (03Merged) 10jenkins-bot: Add wikisource-bot.toolforge.org to Commons copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) (owner: 10Inductiveload) [11:18:10] ^_^ [11:18:22] yay [11:18:27] and that was super quick to upload [11:18:30] inductiveload: wanna test the second patch yourself? 🙂 [11:18:33] yeah, that's a an artefact from a master failover I did earlier [11:18:37] can you test wikisource-bot now? should also be on mwdebug1002 [11:18:58] (I don’t think I can do it myself, at least the GWToolset special page tells me I don’t have permission) [11:19:14] (I could createAndPromote.php myself but I don’t think that would be appropriate here) [11:19:31] yeah, it likely lets me through via one of my global flags [11:19:45] domain listed too [11:19:59] gwtoolset comes via global +steward [11:20:26] yeah [11:20:41] ok, then I’ll sync this one too [11:21:27] (03PS1) 10Filippo Giunchedi: pontoon: add acme_chief host authorizations [puppet] - 10https://gerrit.wikimedia.org/r/725808 [11:21:29] (03PS1) 10Filippo Giunchedi: pontoon: move acmechief_host to settings [puppet] - 10https://gerrit.wikimedia.org/r/725809 [11:21:36] ok uploading now [11:21:43] still no Juan90264, so let’s proceed with kart_ [11:22:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725042|Add wikisource-bot.toolforge.org to Commons copy upload list (T292213)]] (duration: 00m 59s) [11:22:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:20] T292213: Add https://wikisource-bot.wmcloud.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons and enWikisource - https://phabricator.wikimedia.org/T292213 [11:22:49] (03PS3) 10Lucas Werkmeister (WMDE): Remove deprecated SectionTranslationTargetLanguage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:22:55] hmm Error Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes. [11:23:09] that's likely a timeout [11:23:19] but it did actually work: https://commons.wikimedia.org/wiki/File:The_Strand_Magazine_(Volume_15).djvu [11:23:26] it does that occasionally with upload-by-url uploads [11:23:42] hasn’t shown up in mediawiki-errors on logstash yet [11:25:00] Lucas_WMDE: note that hides timeout by default AFAIK [11:25:05] ah [11:25:44] anyway, it's not related to the change -- similar things happened before 🙂 [11:25:55] kart_: are you still there [11:26:00] *? 
[11:26:46] (03CR) 10Vgutierrez: cache::haproxy: Configure sslcert::ocsp (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:26:58] Portal turrets 🤝 backport+config team [11:27:00] “are you still there?” “deploying?” [11:27:16] Lucas_WMDE: yeah :) [11:27:19] yay [11:27:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove deprecated SectionTranslationTargetLanguage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:27:39] I lost in the logs ;) [11:27:48] I probably would’ve still deployed it since it looks very straightforward, but if you can quickly test it, it’s better ^^ [11:28:02] (03PS1) 10Volans: sre.experimental.reimage: improve --new logic [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 [11:28:11] it’s a config setting that’s no longer used since wmf.2, right? [11:28:14] and wmf.2 is safely rolled out now [11:28:17] (03PS1) 10Volans: remote: reduce wait time for reboot to 20 minutes [software/spicerack] - 10https://gerrit.wikimedia.org/r/725818 [11:28:18] Yes. [11:28:28] makes sense [11:28:29] Can be safely removed now. [11:28:33] (03Merged) 10jenkins-bot: Remove deprecated SectionTranslationTargetLanguage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:28:54] it’s on mwdebug1002, do you want to test it? [11:29:02] Sure. Testing. [11:29:57] Lucas_WMDE: looks good. Please deploy. 
[11:30:01] ack [11:31:21] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724992|Remove deprecated SectionTranslationTargetLanguage config (T290302)]] (duration: 00m 58s) [11:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:28] T290302: Confirm Section Translation can support the new set of languages - https://phabricator.wikimedia.org/T290302 [11:32:05] Thanks Lucas_WMDE [11:32:09] np [11:32:23] urbanecm: I think you can go ahead with undeploying GettingStarted now [11:32:26] thanks! [11:32:50] (03PS2) 10Urbanecm: Undeploy GettingStarted I: Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722574 (https://phabricator.wikimedia.org/T235752) [11:32:54] (03CR) 10Urbanecm: [C: 03+2] Undeploy GettingStarted I: Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722574 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:33:21] (03PS2) 10Urbanecm: Undeploy GettingStarted II: Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722575 (https://phabricator.wikimedia.org/T235752) [11:33:25] (03CR) 10Urbanecm: [C: 03+2] Undeploy GettingStarted II: Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722575 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:33:44] (03Merged) 10jenkins-bot: Undeploy GettingStarted I: Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722574 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:34:37] (03Merged) 10jenkins-bot: Undeploy GettingStarted II: Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722575 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:34:49] (03PS2) 10Urbanecm: Undeploy getting started III: Don't set wmgUseGettingStarted, now ignored [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/722576 (https://phabricator.wikimedia.org/T235752) [11:34:54] (03CR) 10Urbanecm: [C: 03+2] Undeploy getting started III: Don't set wmgUseGettingStarted, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722576 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:35:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1c7405ad1eb323a8da524819f17d6f1a66afaa57: Undeploy GettingStarted I: Disable on all wikis (T235752) (duration: 00m 58s) [11:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:20] T235752: Undeploy the GettingStarted extension - https://phabricator.wikimedia.org/T235752 [11:37:00] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 9eaf960c4b7c304be57dfc8d248aca0c6501d04c: Undeploy GettingStarted II: Dont load regardless of config (T235752) (duration: 00m 58s) [11:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:41] (03PS2) 10Urbanecm: Undeploy GettingStarted IV: Don't build i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722577 [11:39:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d60f332785868797e7ecc9b5e410616d5604b392: Undeploy getting started III: Dont set wmgUseGettingStarted, now ignored (T235752) (duration: 00m 58s) [11:39:48] (03Merged) 10jenkins-bot: Undeploy getting started III: Don't set wmgUseGettingStarted, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722576 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:09] (03PS3) 10Urbanecm: Undeploy GettingStarted IV: Don't build i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722577 (https://phabricator.wikimedia.org/T235752) [11:40:17] (03CR) 10Urbanecm: [C: 03+2] Undeploy GettingStarted IV: Don't build i18n (031 comment) [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/722577 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:40:57] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/725799 (owner: 10L10n-bot) [11:41:04] (03Merged) 10jenkins-bot: Undeploy GettingStarted IV: Don't build i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722577 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:41:39] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/724965 (owner: 10L10n-bot) [11:41:44] (03PS2) 10Urbanecm: Undeploy GettingStarted V: Remove now-obsolete logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722578 (https://phabricator.wikimedia.org/T235752) [11:41:48] (03CR) 10Urbanecm: [C: 03+2] Undeploy GettingStarted V: Remove now-obsolete logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722578 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:42:31] !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: 9709bcfc8dacbcd1704471df08c31cec0711bea6: Undeploy GettingStarted IV: Dont build i18n (T235752) (duration: 00m 58s) [11:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:38] T235752: Undeploy the GettingStarted extension - https://phabricator.wikimedia.org/T235752 [11:42:42] (03Merged) 10jenkins-bot: Undeploy GettingStarted V: Remove now-obsolete logging channels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722578 (https://phabricator.wikimedia.org/T235752) (owner: 10Urbanecm) [11:44:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b0a96bed4562bcc975187b1d34626201d407404b: Undeploy GettingStarted V: Remove now-obsolete logging channels (T235752) (duration: 00m 59s) [11:44:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:49] (still deploying) [11:46:41] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: 5728376: Update T250887 mitigations (duration: 00m 58s) [11:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:33] (03CR) 10Urbanecm: [C: 03+2] dewiki, nlwiki: Bump Growth features to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725749 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:47:40] (03PS2) 10Urbanecm: dewiki, nlwiki: Bump Growth features to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725749 (https://phabricator.wikimedia.org/T288420) [11:47:47] (03CR) 10Urbanecm: [C: 03+2] dewiki, nlwiki: Bump Growth features to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725749 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:48:36] (03Merged) 10jenkins-bot: dewiki, nlwiki: Bump Growth features to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725749 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:50:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a855078cf52d88cc2cd27a0adc7c6a680c80dd39: dewiki, nlwiki: Bump Growth features to 80% (T288420, T285254) (duration: 00m 58s) [11:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:32] T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254 [11:50:33] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [11:52:33] (03PS2) 10Urbanecm: Let DB expressions intersect DB lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725263 (https://phabricator.wikimedia.org/T290609) [11:52:38] (03CR) 10Urbanecm: [C: 03+2] Let DB expressions intersect DB lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725263 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) 
[11:52:42] and one more patch [11:53:26] (03Merged) 10jenkins-bot: Let DB expressions intersect DB lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725263 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [11:55:02] !log urbanecm@deploy1002 Synchronized multiversion/MWWikiversions.php: 508cf5cc6d213373f7c9ba1cdef142ebc8398022: Let DB expressions intersect DB lists (T290609) (duration: 00m 58s) [11:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:09] !log EU B&C window done [11:55:10] T290609: Make mentee overview module's updateMenteeData.php scale better - https://phabricator.wikimedia.org/T290609 [11:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:01] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (10ayounsi) Documenting all the cables make sens, feel free to add the one between the cloudstore hosts (or ask DCops) About the IPs, we decided to not track any of the 192.16... [12:01:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:48] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add acme_chief host authorizations [puppet] - 10https://gerrit.wikimedia.org/r/725808 (owner: 10Filippo Giunchedi) [12:01:53] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: move acmechief_host to settings [puppet] - 10https://gerrit.wikimedia.org/r/725809 (owner: 10Filippo Giunchedi) [12:02:01] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10ayounsi) Note that a few of the durum IPs have both the "DNS name" field set, and "Keep manual DNS" as comment, which I think are mutually exclusive (but not enforced). https://netbo... 
[12:02:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Ganeti tests [12:02:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Ganeti tests [12:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti2026.codfw.wmnet with reason: Ganeti tests [12:02:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti2026.codfw.wmnet with reason: Ganeti tests [12:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:46] (03PS1) 10Filippo Giunchedi: pontoon: auto generate service certificates [puppet] - 10https://gerrit.wikimedia.org/r/725838 [12:05:20] (03CR) 10jerkins-bot: [V: 04-1] pontoon: auto generate service certificates [puppet] - 10https://gerrit.wikimedia.org/r/725838 (owner: 10Filippo Giunchedi) [12:05:48] (03CR) 10Filippo Giunchedi: "Note this soft-depends on I1be58fe082 but can merged independently" [puppet] - 10https://gerrit.wikimedia.org/r/725838 (owner: 10Filippo Giunchedi) [12:07:21] (03PS2) 10Filippo Giunchedi: pontoon: auto generate service certificates [puppet] - 10https://gerrit.wikimedia.org/r/725838 [12:07:31] ;_; [12:08:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:50] (03PS1) 10Filippo Giunchedi: alerts: move alerts-deploy to systemd units [puppet] - 10https://gerrit.wikimedia.org/r/725840 (https://phabricator.wikimedia.org/T292303) [12:12:22] (03PS10) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [12:19:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:06] (03PS1) 10KartikMistry: Enable Content and Section Translation to Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725858 (https://phabricator.wikimedia.org/T290238) [12:24:19] Lucas_WMDE: ok, it works again without pre-selecting the server, so I'm happy ^_^ [12:24:24] yay \o/ [12:24:45] thank you for the merge and comments :-) [12:25:37] and urban too (imagine a ping there :-D) [12:30:01] (03PS1) 10KartikMistry: Update cxserver to use nodejs12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725866 [12:32:06] (03PS2) 10KartikMistry: Update cxserver to use nodejs12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725866 (https://phabricator.wikimedia.org/T290754) [12:32:19] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Ottomata) Thank you!!! 
[12:33:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/725766 (owner: 10Jbond) [12:37:06] (03PS5) 10Juan90264: Add WN as an alias to project namespace in Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725132 (https://phabricator.wikimedia.org/T291344) [12:37:34] (03PS24) 10Juan90264: Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [12:38:28] (03PS7) 10Jelto: profile::gitlab start using gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) [12:39:12] (03CR) 10jerkins-bot: [V: 04-1] Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [12:42:28] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31473/console" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:44:34] (03CR) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [12:44:43] (03CR) 10Juan90264: Add WN as an alias to project namespace in Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725132 (https://phabricator.wikimedia.org/T291344) (owner: 10Juan90264) [12:46:47] Hello. Can someone unabandon a change? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/390303 I'd like to rebase it, as we got news on Phabricator for that one. 
[12:47:37] 10SRE, 10SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10ttaylor) Maybe get manager approval from Corey [12:47:43] (actually I've solved a merge conflict it, and I'd like to send the rebase to Gerrit, then schedule it for deployment) [12:48:48] jouncebot: nowandnext [12:48:48] No deployments scheduled for the next 4 hour(s) and 11 minute(s) [12:48:48] In 4 hour(s) and 11 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T1700) [12:50:03] (03Restored) 10Ladsgroup: Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [12:50:15] Dereckson: done [12:50:19] Thanks [12:50:28] (03PS4) 10Dereckson: Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [12:51:25] (03PS2) 10Ladsgroup: Enable dispatching for wikidatawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725785 (https://phabricator.wikimedia.org/T292088) (owner: 10Michael Große) [12:51:37] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725785 (https://phabricator.wikimedia.org/T292088) (owner: 10Michael Große) [12:52:07] (03CR) 10jerkins-bot: [V: 04-1] Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [12:52:44] (03Merged) 10jenkins-bot: Enable dispatching for wikidatawiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725785 (https://phabricator.wikimedia.org/T292088) (owner: 10Michael Große) [12:54:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:29] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725785|Enable dispatching for wikidatawiki and commonswiki (T292088)]] (duration: 01m 00s) [12:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:35] T292088: Enable new Dispatching on all production wikis for wikidata - https://phabricator.wikimedia.org/T292088 [12:57:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:02] (03PS5) 10Dereckson: Enable local uploads for tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [13:01:39] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (10cmooney) 05Open→03Resolved I've recreated the IP, and put the DNS name in the description with "Keep manual DNS" prefix. It doesn't make much difference, as the Netbox... [13:02:18] (03CR) 10Dereckson: [C: 03+1] "PS4: Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: 10TerraCodes) [13:07:08] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:09:47] (03PS1) 10ArielGlenn: handle large offsets inside of bz2-compressed multistream files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/725875 (https://phabricator.wikimedia.org/T290459) [13:15:09] 10SRE, 10Traffic, 10Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (10Volans) >>! 
In T289536#7398142, @ayounsi wrote: > Note that a few of the durum IPs have both the "DNS name" field set, and "Keep manual DNS" as comment, which I think are mutually ex... [13:21:04] (03CR) 10ArielGlenn: [V: 03+1 C: 03+2] "Tested on the large wikidata file mentioned in the related bug." [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/725875 (https://phabricator.wikimedia.org/T290459) (owner: 10ArielGlenn) [13:21:16] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] handle large offsets inside of bz2-compressed multistream files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/725875 (https://phabricator.wikimedia.org/T290459) (owner: 10ArielGlenn) [13:21:21] (03PS3) 10Hashar: gitlab: enable Content-Security-Policy reporting [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) [13:22:36] (03CR) 10Hashar: "Jelto informed me ::gitlab::init is not used yet (we still rely on Ansible). I have thus rebased my change on top of the change that will " [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [13:25:47] (03PS1) 10Filippo Giunchedi: o11y: port Icinga checks [alerts] - 10https://gerrit.wikimedia.org/r/725884 (https://phabricator.wikimedia.org/T288726) [13:27:00] (03PS1) 10Filippo Giunchedi: alerts: remove icinga overload alert, moved to AM [puppet] - 10https://gerrit.wikimedia.org/r/725885 (https://phabricator.wikimedia.org/T288726) [13:28:02] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:59] (03PS1) 10Elukey: amd_rocm: import ROCm suite 4.3.1 [puppet] - 10https://gerrit.wikimedia.org/r/725887 (https://phabricator.wikimedia.org/T287267) [13:32:37] (03PS4) 10Hashar: gitlab: enable Content-Security-Policy reporting [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) [13:33:32] (03CR) 10Hashar: gitlab: 
enable Content-Security-Policy reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [13:35:01] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm boldly resolving this since we have subtasks open for plaintext t... [13:36:14] (03PS1) 10Giuseppe Lavagetto: mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) [13:36:43] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [13:37:37] (03CR) 10Michael Große: [C: 03+1] Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [13:42:18] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:42:42] (03CR) 10Elukey: [C: 03+2] Move analytics-hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/725687 (https://phabricator.wikimedia.org/T288625) (owner: 10Elukey) [13:45:27] (03PS3) 10Ottomata: EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724380 (https://phabricator.wikimedia.org/T288853) [13:57:51] (03PS5) 10Ladsgroup: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) [13:58:09] (03CR) 10Ladsgroup: [C: 03+2] "deploying, let's get the party started" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [13:58:51] (03CR) 10Jbond: "lgtm see comment/question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 (owner: 10Volans) [13:59:12] (03Merged) 10jenkins-bot: Enable dispatching via jobs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725502 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [13:59:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/725818 (owner: 10Volans) [14:00:49] (03PS1) 10Hashar: Enable Content-Security-Policy reporting [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) [14:01:37] (03CR) 10Hashar: "I have ported it to the Ansible playbook so we can enable CSP before we switch to using the Puppet module. https://gerrit.wikimedia.org/r/" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [14:01:44] !log ladsgroup@deploy1002 Synchronized wmf-config: Config: [[gerrit:725502|Enable dispatching via jobs everywhere (T48643)]] (duration: 01m 00s) [14:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:53] T48643: [Story] Dispatching via job queue (instead of cron script) - https://phabricator.wikimedia.org/T48643 [14:01:55] deployed [14:02:32] _joe_: if you have a minute: https://gerrit.wikimedia.org/r/c/725673 [14:02:43] (03CR) 10Elukey: [C: 03+2] amd_rocm: import ROCm suite 4.3.1 [puppet] - 10https://gerrit.wikimedia.org/r/725887 (https://phabricator.wikimedia.org/T287267) (owner: 10Elukey) [14:02:43] deployed everywhere now [14:03:46] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T292256 [14:03:48] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 
on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikmiedia.org/T292256 [14:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:02] (03CR) 10Hashar: "The CSP config comes straight from the puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/725012" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [14:05:53] (03PS3) 10Ladsgroup: Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) [14:05:58] (03CR) 10Ladsgroup: [C: 03+2] Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:06:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:13] (03Merged) 10jenkins-bot: Disable dispatch lag part of maxlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725705 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:08:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:57] (03PS1) 10Elukey: aptrepo: add missing amd-rocm431 settings [puppet] - 10https://gerrit.wikimedia.org/r/725904 (https://phabricator.wikimedia.org/T287267) [14:09:59] (03PS1) 10Ladsgroup: Explicitly enable dispatching and pruning for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725905 (https://phabricator.wikimedia.org/T48643) [14:10:23] (03CR) 10Herron: [C: 03+1] "LGTM! 
minor comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/725838 (owner: 10Filippo Giunchedi) [14:10:39] (03CR) 10Ladsgroup: [C: 03+2] Explicitly enable dispatching and pruning for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725905 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:10:54] (03CR) 10Elukey: [C: 03+2] aptrepo: add missing amd-rocm431 settings [puppet] - 10https://gerrit.wikimedia.org/r/725904 (https://phabricator.wikimedia.org/T287267) (owner: 10Elukey) [14:10:56] (03CR) 10Michael Große: [C: 03+1] Explicitly enable dispatching and pruning for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725905 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:11:33] (03Merged) 10jenkins-bot: Explicitly enable dispatching and pruning for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725905 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:13:14] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:725905|Explicitly enable dispatching and pruning for wikidata (T48643)]] (duration: 00m 58s) [14:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:23] T48643: [Story] Dispatching via job queue (instead of cron script) - https://phabricator.wikimedia.org/T48643 [14:17:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:57] !log import AMD ROCm 4.3.1 packages in buster-wikimedia's thirdparty/amd-rocm431 - T287267 [14:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] T287267: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 [14:24:19] !log gitlab: downtime for upgrade to 14.3.1 [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:53] !log cleaning up wb_changes_subscription rows from closed wikis (T292440) [14:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] T292440: huwikinews is closed, but still subscribed to some items - https://phabricator.wikimedia.org/T292440 [14:28:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:28:35] dcausse: when we see --^ do we need to do anything? (I saw a lot of them during the weekend) [14:28:36] silencing BlazegraphFreeAllocatorsDecreasingRapidly, wdqs1009 & wdqs2008 are being reloaded [14:28:45] elukey: yes ^ :) [14:29:09] ok ok thanks :) [14:29:40] when the machine is not being reloaded yes it's an indication that a problem will arise (probably next step is disk full) [14:30:00] I think the runbooks have it (checking) [14:30:24] (03PS11) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [14:31:01] dcausse: ah snap I didn't check it yet, will try to document myself! 
[14:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1009:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:37:40] (03CR) 10Herron: [C: 03+1] "really like this approach, looks like a solid improvement!" [puppet] - 10https://gerrit.wikimedia.org/r/725840 (https://phabricator.wikimedia.org/T292303) (owner: 10Filippo Giunchedi) [14:37:43] (03CR) 10Nikki Nikkhoui: [C: 03+1] api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) (owner: 10Hnowlan) [14:37:49] (03PS1) 10Ladsgroup: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725927 (https://phabricator.wikimedia.org/T48643) [14:38:29] (03CR) 10Michael Große: [C: 03+1] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725927 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:39:39] (03CR) 10Ladsgroup: [C: 03+2] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725927 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:39:50] !log gitlab: upgrade to 14.3.2 (note there was an additional patch release on 2021-10-01) complete (T292256) [14:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:56] T292256: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 [14:41:25] o/ I'm going to run a maintenance script that'll delete a handful of rows from a table related to SecurePoll [14:42:28] I've done a dry run and have confirmed the output looks OK [14:42:35] (03PS2) 10Volans: sre.experimental.reimage: improve --new logic [cookbooks] - 
10https://gerrit.wikimedia.org/r/725817 [14:42:42] (03CR) 10Volans: sre.experimental.reimage: improve --new logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 (owner: 10Volans) [14:42:58] (03CR) 10Volans: [C: 03+2] remote: reduce wait time for reboot to 20 minutes [software/spicerack] - 10https://gerrit.wikimedia.org/r/725818 (owner: 10Volans) [14:44:15] (03Merged) 10jenkins-bot: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725927 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:45:42] !log ladsgroup@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] !log uploading scap 4.0.2 - T291095 [14:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:24] T291095: Deploy Scap version 4.0.2 - https://phabricator.wikimedia.org/T291095 [14:46:49] !log ladsgroup@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[14:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:21] (03CR) 10Addshore: [C: 03+1] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725927 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [14:49:23] (03Merged) 10jenkins-bot: remote: reduce wait time for reboot to 20 minutes [software/spicerack] - 10https://gerrit.wikimedia.org/r/725818 (owner: 10Volans) [14:50:02] !log phuedx@mwmaint1002:~$ mwscript extensions/SecurePoll/cli/purgeDecryptionKeys.php --wiki=votewiki --before="20210101000000" [14:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:52] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) @fgiunchedi no nothing needed. I just left the task open to monitor the server. It looks there is no issue yet so I will update Dell and let them close the case. Thank you. [14:52:44] (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [14:57:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sure, done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:00:50] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Cmjohnson) This was saved incomplete on the dell tech website, I must not have submitted it, I submitted it today, and barring any issues or pushback from Dell the disk will be here to... 
[15:01:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 (owner: 10Volans) [15:02:01] (03Merged) 10jenkins-bot: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:02:41] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve --new logic [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 (owner: 10Volans) [15:03:01] (03CR) 10Alexandros Kosiaris: Rename main cluster to services (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [15:05:49] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve --new logic [cookbooks] - 10https://gerrit.wikimedia.org/r/725817 (owner: 10Volans) [15:06:34] (03PS1) 10Ladsgroup: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725936 [15:06:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10RobH) [15:07:24] (03CR) 10Addshore: [C: 03+1] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725936 (owner: 10Ladsgroup) [15:08:54] (03CR) 10Michael Große: [C: 03+1] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725936 (owner: 10Ladsgroup) [15:11:13] (03CR) 10Ladsgroup: [C: 03+2] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725936 (owner: 10Ladsgroup) [15:13:44] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:56] (03Merged) 10jenkins-bot: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/725936 (owner: 10Ladsgroup) [15:16:34] !log ladsgroup@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:40] !log ladsgroup@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:16] !log pool cp5006 [15:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:43] (03CR) 10Ssingh: [C: 03+1] haproxy: Basic TLS terminator based on HAProxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:35:09] (03CR) 10Ssingh: [C: 03+1] haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:35:30] (03CR) 10Ssingh: [C: 03+1] haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:35:51] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Configure sslcert::ocsp [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:37:54] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "[patch merged; trying to comment to get this out of "my turn" :)]" [puppet] - 10https://gerrit.wikimedia.org/r/725036 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:42:47] (03CR) 10MacFan4000: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725912 (owner: 10MacFan4000) [15:43:49] (03CR) 10Jgleeson: "Awesome! I'll try out add the hiera config today and see how we get on. Thanks again!" 
[puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [15:46:07] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) [15:48:39] (03PS2) 10Giuseppe Lavagetto: admin: add Shari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/724946 (https://phabricator.wikimedia.org/T292069) [15:52:10] mutante: which url do you check for wikistats [15:53:03] (03CR) 10Reedy: [C: 03+1] Update ExtensionDistributor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725912 (owner: 10MacFan4000) [15:56:11] (03CR) 10Dzahn: "If you want this to be limited to just wikis and since we won't change to another wiki engine anytime soon, just call it the "mediawiki" c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [16:00:18] RhinosF1: https://meta.miraheze.org/w/api.php?action=wikidiscover&wdstate=public&format=php but also the issue is the update - happens only once a week and deletes existing data to then fetch new data. so if things break for an hour and it's jus during the update.. they wont fix themselves until next Friday [16:00:27] gotta go to a meetin [16:01:17] mutante: nope that's blank [16:01:32] RhinosF1: yea, that is broken. "it wasnt' me(tm)" :) [16:01:47] can we open the ticket on MH side [16:01:50] bbl [16:02:04] mutante: unlikely me either. I'll ping Universal Omega. 
[16:02:33] did not mean to imply that:) thanks [16:04:48] (03CR) 10BryanDavis: "This was reverted as part of Iace8fdf411cd840e3efa2eeef8dc58a20018ed9e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725428 (https://phabricator.wikimedia.org/T292027) (owner: 10BryanDavis) [16:09:21] (03PS3) 10Jforrester: ExtensionDistributor: Add 1.37 as preview branch; remove 1.31 as it's EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725912 (owner: 10MacFan4000) [16:09:51] (03CR) 10Jforrester: "I swore I did the second half of this patch already, but clearly I didn't push it. Whoops, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725912 (owner: 10MacFan4000) [16:09:55] (03CR) 10Jforrester: [C: 03+1] ExtensionDistributor: Add 1.37 as preview branch; remove 1.31 as it's EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725912 (owner: 10MacFan4000) [16:10:37] (03CR) 10Alexandros Kosiaris: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [16:15:02] 10Puppet, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10hashar) That seems to be related to `use open :encoding(UTF-8)` which causes perl to expect valid unicode coming from `cat exception... 
[16:27:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Cmjohnson) Ticket opened with Dell, SR1071934085 [16:31:17] (03PS2) 10BBlack: dotls: Benefit from HAProxy support on acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/720936 (owner: 10Vgutierrez) [16:31:19] (03PS1) 10BBlack: AuthDNS DoTLS - use alternate LE chain [puppet] - 10https://gerrit.wikimedia.org/r/725979 [16:33:52] bblack: that's going to fail ^^ [16:34:16] (03CR) 10BBlack: [C: 03+2] AuthDNS DoTLS - use alternate LE chain [puppet] - 10https://gerrit.wikimedia.org/r/725979 (owner: 10BBlack) [16:34:22] bblack!! [16:34:24] (03CR) 10BBlack: [C: 03+2] dotls: Benefit from HAProxy support on acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/720936 (owner: 10Vgutierrez) [16:34:39] ok [16:34:47] which? [16:35:24] both actually [16:35:29] the combination :) [16:35:32] why? [16:35:46] I've missed the alt.chained.crt.key.ocsp link [16:35:50] ok [16:36:12] I can fix it first thing tomorrow EU morning [16:36:16] sorry :) [16:36:21] uh [16:36:36] don't we already have services using it? they have no ocsp? [16:36:51] I guess only the haproxy stuff that's not in production yet [16:37:36] so the missing part is haproxy + alt chain specific [16:37:57] well we could dump OCSP for DoTLS for now, too [16:38:00] checking docs there [16:38:04] ack [16:38:17] IIRC wikidough doesn't use OCSP [16:38:24] sukhe: ^^ [16:39:07] oh it uses OCSP, but it doesn't use haproxy :) [16:39:12] yeah I don't think DoTLS needs to either, it's not "user" facing [16:40:23] and apparnetly haproxy's OCSP config is "does the file exist" [16:40:34] so, I think I can just push this out as-is, and DoTLS will lose OCSP and be ok [16:41:05] right [16:41:07] checking if it will trip some icinga alert though [16:41:13] vgutierrez: we do, which means we will need to update that as well. 
[16:41:42] so as soon as I fix the acme-chief side, doTLS will recover OCSP [16:42:58] looks good on that front too [16:43:04] (DoT is user facing for Wikidough but I am not sure how many stubs actually verify; unbound doesn't) [16:50:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Cmjohnson) A ticket has been opened with Dell, interesting enough they didn't have HDD as a pre-selected option to replace. Hopefully, having to add this does not delay the p... [16:51:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:55] !log rolling restart of haproxy for DoTLS on dns300[12],authdns1001,authdns2001 to recycle connections [16:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:59] 10SRE, 10ops-eqiad: Degraded RAID on db1126 - https://phabricator.wikimedia.org/T292325 (10Cmjohnson) Ticket opened with Dell, You have successfully submitted request SR1071943805. [16:56:39] to clarify, "we do" meant that we use OCSP but no, we don't use haproxy for Wikidough so this doesn't affect us [16:57:12] (03PS1) 10Vgutierrez: acme_chief,api: Provide .alt.chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/725983 (https://phabricator.wikimedia.org/T290249) [17:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T1700). [17:00:36] >deploy deploy [17:00:36] ? 
[17:01:03] (03CR) 10jerkins-bot: [V: 04-1] acme_chief,api: Provide .alt.chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/725983 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [17:01:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:51] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Cmjohnson) Created a Dell dispatch ticket You have successfully submitted request SR1071944241. [17:15:43] (03PS1) 10Effie Mouzeli: mwdebug: bump envoy memory and cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/726000 (https://phabricator.wikimedia.org/T280497) [17:15:50] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:21] (03PS2) 10Bartosz Dziewoński: Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) [17:26:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: bump envoy memory and cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/726000 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [17:29:10] (03CR) 10Jdlrobson: [C: 03+1] Adding and use wordmark in trwikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [17:40:37] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31475/" [puppet] - 10https://gerrit.wikimedia.org/r/724838 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [17:43:19] (03CR) 10Dzahn: "noop on fe1001 and be1001" [puppet] - 10https://gerrit.wikimedia.org/r/724838 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [17:48:50] (03CR) 
10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/31476/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [17:54:03] jbond: eh.. does anything actually use base::debdeploy? It does not seem to be included nor instantiated in any other class in prod? [17:55:09] (03CR) 10Dzahn: "noop on alert1001" [puppet] - 10https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [17:58:25] mutante: no its not used, the code there is now in debdeploy::client, i think it just never got removed when refactoring, will look tomorrow [17:58:29] jbond: base includes profile::debdeploy::client and that uses debdeploy::client but that base::debdeploy might be unused [17:58:32] heh [17:58:43] ok:) thank you [17:58:48] :) [17:58:50] np [17:59:41] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: bump envoy memory and cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/726000 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T1800). [18:00:04] Juan90264, Dereckson, and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:26] (03CR) 10Dzahn: "Turns out base::debdeploy is not actually used. Will upload separate patch to delete it." 
[puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:00:38] hi [18:01:23] hi, if no one beats me, i can deploy in a few minutes [18:01:33] (03CR) 10Dzahn: "So only affects the debdeploy-server part which is on https://debmonitor.wikimedia.org/packages/debdeploy-server" [puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:02:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31478/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:03:04] Dereckson: hi, around? [18:03:13] still no Juan90264 [18:03:32] I'll let them know on-Phab about the process -- they weren't around during the EU window either [18:03:50] @seen Juan90264 [18:04:49] (03Merged) 10jenkins-bot: mwdebug: bump envoy memory and cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/726000 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [18:04:54] MatmaRex: I find your commit a bit confusing -- the message says "Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis", but it looks to be most of _special_ wikis [18:05:02] urbanecm: I assume that GettingStarted definitely won't get re-enabled? [18:05:16] also was this announced/discussed with the active projects who're getting the change? [18:05:24] James_F: not unless something breaks without it not being there [18:05:33] (I don't expect it, but...surprising things happen) [18:05:54] Right. I'll hold off on merging https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/725980 for this week, then. 
:-) [18:05:54] urbanecm: uh, yes [18:06:14] thanks James_F :) [18:06:29] i'll try to rephrase it [18:06:48] i meant "…on most wikis where it was included" (but that's also maybe not true) [18:07:57] (03PS3) 10Bartosz Dziewoński: Remove NS_MAIN from wgExtraSignatureNamespaces on most 'special' wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) [18:07:59] the community question still stands though -- I don't think it's a good idea to do this out of the blue [18:08:24] oh, i missed it [18:08:36] no problem :) [18:08:51] urbanecm: it wasn't announced, i think it's a low-impact change to config that no one cared about before. but i will post some notes if you think it's necessary [18:09:03] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10DannyH) I approve, thank you. [18:09:21] I'd prefer it, yes. Thank you. [18:10:10] Dereckson: hi, around? [18:13:38] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:00] So, it looks we're done. [18:17:18] (03CR) 10Dzahn: [C: 03+1] "afaik there is nothing jessie anymore. 
but let's have Subbu ACK this as well" [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [18:19:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:45] (03PS1) 10Dzahn: base: delete unused base::debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/726026 [18:27:19] ^ that is me [18:27:28] (deploy1002 alert) [18:30:09] thanks [18:35:03] (03PS2) 10Dzahn: puppetmaster/geoip: do not duplicate pulling of maxmind on all servers [puppet] - 10https://gerrit.wikimedia.org/r/725390 [18:44:21] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Pchelolo) Gosh this it's hard to parse what's going on here and the folk... [18:47:24] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Reedy) a:05holger.knust→03None [18:48:26] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/31479/" [puppet] - 10https://gerrit.wikimedia.org/r/725390 (owner: 10Dzahn) [18:49:36] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:47] (03PS3) 10Dzahn: puppetmaster/geoip: do not duplicate pulling of maxmind on all servers [puppet] - 10https://gerrit.wikimedia.org/r/725390 [18:53:57] (03CR) 10Dzahn: "compiler shows it absents the files on 2001 but keeps them on 1001" [puppet] - 10https://gerrit.wikimedia.org/r/725390 (owner: 10Dzahn) [19:24:55] (LogstashKafkaConsumerLag) firing: Too many 
messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:29:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:56:17] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10KFrancis) @Deniz_WMDE The NDA was sent electronically for you to sign. Thanks! [20:00:04] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T2000). [20:01:50] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.133e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:06:44] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:09] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Pchelolo) > I'll first dry-run the uppercaseTitlesForUnicodeTransition.p...
[20:43:06] 10SRE, 10SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10Fjalapeno) Approved [20:50:47] (03PS1) 10Bartosz Dziewoński: Add explicit config for licensing/copyright message overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) [21:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T2100). [21:01:14] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:32] (03CR) 10Legoktm: [C: 03+1] "Spot-checked the diff and LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) (owner: 10Bartosz Dziewoński) [21:10:39] (03PS2) 10Bartosz Dziewoński: Add explicit config for licensing/copyright message overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) [21:10:54] (03PS11) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [21:11:39] (03CR) 10SBassett: [C: 03+1] "LGTM. Can Ibf2645663 be abandoned now?" 
[gitlab-ansible] - 10https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [21:12:02] (03CR) 10Legoktm: [C: 03+1] Add explicit config for licensing/copyright message overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) (owner: 10Bartosz Dziewoński) [21:12:41] (03PS6) 10Juan90264: Add WN as an alias to project namespace in Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725132 (https://phabricator.wikimedia.org/T291344) [21:13:12] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Yann) One more: https://commons.wikimedi... [21:19:19] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Pchelolo) So, first I've dry-run the script ` foreachwiki uppercaseTitl... 
[21:21:06] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:28] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:14] (03PS1) 10Dzahn: wikistats: add email capability to import jobs [puppet] - 10https://gerrit.wikimedia.org/r/726082 (https://phabricator.wikimedia.org/T292369) [21:37:01] (03CR) 10Dzahn: [C: 03+2] wikistats: add email capability to import jobs [puppet] - 10https://gerrit.wikimedia.org/r/726082 (https://phabricator.wikimedia.org/T292369) (owner: 10Dzahn) [22:09:59] (03PS1) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) [22:13:48] 10SRE, 10Analytics-Radar, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) Currently working on T288844 and added puppet code that allowed us to use a second, new, license for MaxMind geoip databases. So far e... [22:15:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726100 (https://phabricator.wikimedia.org/T128546) [22:20:48] (03PS1) 10Dzahn: puppetmaster::geoip: test if new license lets us download needed databases [puppet] - 10https://gerrit.wikimedia.org/r/726102 (https://phabricator.wikimedia.org/T288844) [22:30:50] (03CR) 10Dzahn: [C: 03+2] puppetmaster::geoip: test if new license lets us download needed databases [puppet] - 10https://gerrit.wikimedia.org/r/726102 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [22:32:33] (03CR) 10Dzahn: "This is a separate path from where clients fetch the DBs from." 
[puppet] - 10https://gerrit.wikimedia.org/r/726102 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [22:33:40] (03PS1) 10RLazarus: admin: Add xterm-kitty terminfo to ~rzl [puppet] - 10https://gerrit.wikimedia.org/r/726103 [22:34:59] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31481/console" [puppet] - 10https://gerrit.wikimedia.org/r/726103 (owner: 10RLazarus) [22:35:54] (03CR) 10Dzahn: "test shows we CANNOT merge everything into a single license: "Invalid product ID or subscription expired"" [puppet] - 10https://gerrit.wikimedia.org/r/726102 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [22:37:14] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) Tested whether we can download all the existing databases PLUS the new databases using the same license..... 
[22:39:27] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T288844#7400574" [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [22:42:18] (03PS1) 10Dzahn: Revert "puppetmaster::geoip: test if new license lets us download needed databases" [puppet] - 10https://gerrit.wikimedia.org/r/725919 [22:42:20] (03CR) 10RLazarus: [C: 03+2] admin: Add xterm-kitty terminfo to ~rzl [puppet] - 10https://gerrit.wikimedia.org/r/726103 (owner: 10RLazarus) [22:44:18] (03CR) 10Dzahn: [C: 03+2] Revert "puppetmaster::geoip: test if new license lets us download needed databases" [puppet] - 10https://gerrit.wikimedia.org/r/725919 (owner: 10Dzahn) [22:45:42] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:35] that is me [22:50:21] should resolve in a moment [22:51:40] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:56] and also fixing logrotate unit on puppetmaster2001 from the other day [22:54:49] !log puppetmaster2001 - rm /etc/logrotate.d/geoipupdate_ipinfo and geoipupdate_ipinfo ; running puppet, starting logrotate service [22:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:36] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:41] jouncebot: next [22:57:41] In 0 hour(s) and 2 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T2300) [22:57:55] i'm around for the backport, but need to step away for 5m [23:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211004T2300) [23:00:05] Juan90264, MatmaRex, and jan_drewniak: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:24] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:31] !log [deneb:~] $ sudo systemctl start docker-reporter-releng-images [23:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:52] i'm back [23:06:00] any deployers? :D [23:06:44] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:59] MatmaRex: I'm here, but...i planned to sleep :D [23:07:43] heh, well, my change is a no-op, so it can wait [23:07:56] not sure if anyone else is here? [23:08:06] I'll do it -- it appears to not do anything ATM [23:08:09] good night i guess :D [23:08:18] I can deploy my own change, but that's pretty much it. [23:09:01] jan_drewniak: I'll do MatmaRex's and hand over? [23:09:16] (03CR) 10Urbanecm: [C: 03+2] Add explicit config for licensing/copyright message overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) (owner: 10Bartosz Dziewoński) [23:09:23] urbanecm: sounds good [23:10:21] (03Merged) 10jenkins-bot: Add explicit config for licensing/copyright message overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726052 (https://phabricator.wikimedia.org/T284097) (owner: 10Bartosz Dziewoński) [23:10:51] just syncing, hard to test a no-op :) [23:11:44] yeah. 
thanks [23:12:19] np [23:13:15] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 75645c9cc59b37dbf59942eabbc014b7dc147626: Add explicit config for licensing/copyright message overrides (T284097) (duration: 00m 59s) [23:13:16] jan_drewniak: floor is yours :) [23:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:22] T284097: Update editing interface copyright messages to comply with local project policies - https://phabricator.wikimedia.org/T284097 [23:13:27] 10SRE, 10SRE-swift-storage, 10observability: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (10Dzahn) Just wanted to say when swift-drive-audit fails it now causes generic systemd Icinga alerts because we converted it to a service/... [23:13:30] urbanecm: ok thanks [23:13:36] np [23:13:42] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726100 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:14:25] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726100 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:15:45] ACKNOWLEDGEMENT - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service daniel_zahn https://phabricator.wikimedia.org/T292486 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:01] 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10Dzahn) [23:16:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10Dzahn) [23:18:25] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:726084| Bumping portals to master (T128546)]] 
(duration: 00m 59s) [23:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:33] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [23:19:24] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:726084| Bumping portals to master (T128546)]] (duration: 00m 59s) [23:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:52] Ok I'm done. I think that's all for this backport window. [23:20:58] (03PS7) 10Juan90264: Add WN as an alias to project namespace in Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725132 (https://phabricator.wikimedia.org/T291344) [23:21:10] (03PS12) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [23:21:24] Juan's apparently online [23:21:26] but not at IRC [23:22:40] anyway, good night/morning/afternoon/whatever :) [23:24:31] (03PS1) 10Brennen Bearnes: WIP: logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) [23:26:07] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10brennen) a:03brennen Thanks for the reproduction case. I thought this class of error was already hand... 
[23:30:38] !log resetting some emails used for abuse by a globally-banned user [23:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:51] (03PS1) 10Cwhite: logstash: move kubernetes_docker parsing towards the front of the pipeline [puppet] - 10https://gerrit.wikimedia.org/r/726129 (https://phabricator.wikimedia.org/T292099) [23:50:49] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Krinkle) [23:56:25] (03PS2) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) [23:59:38] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/31484/" [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn)