[00:00:45] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:01:45] (PS13) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[00:02:56] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:18] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:31:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:16] (PS14) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[00:49:32] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[00:50:28] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:50:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:32] (PS15) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[00:54:58] RECOVERY - Prometheus
jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:04:18] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (tstarling) p:Unbreak!→High So the wtp* servers were indeed out of memory, as reported at T296098. Ther...
[01:58:12] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:07:54] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:59:18] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:02] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:56] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:24:08] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:03:14] (PS1) Marostegui: db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/746640
[06:03:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 for a restart', diff saved to https://phabricator.wikimedia.org/P18117 and previous config saved to /var/cache/conftool/dbconfig/20211213-060343-marostegui.json
[06:03:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:22] (CR) Marostegui: [C: +2] db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/746640 (owner: Marostegui)
[06:15:17] (PS1) Marostegui: Revert "db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/745882
[06:16:33] (CR) Marostegui: [C: +2] Revert "db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/745882 (owner: Marostegui)
[06:16:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P18118 and previous config saved to /var/cache/conftool/dbconfig/20211213-061652-root.json
[06:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:17:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T277354)', diff saved to https://phabricator.wikimedia.org/P18119 and previous config saved to /var/cache/conftool/dbconfig/20211213-061756-marostegui.json
[06:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:01] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:19:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T277354)', diff saved to https://phabricator.wikimedia.org/P18120 and previous config saved to /var/cache/conftool/dbconfig/20211213-061916-marostegui.json
[06:19:20] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:24:15] RECOVERY - snapshot of s3 in eqiad on alert1001 is OK: Last snapshot for s3 at eqiad (db1145.eqiad.wmnet:3313) taken on 2021-12-13 04:37:43 (1174 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P18121 and previous config saved to /var/cache/conftool/dbconfig/20211213-063156-root.json
[06:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18122 and previous config saved to /var/cache/conftool/dbconfig/20211213-063421-marostegui.json
[06:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:23] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:43:28] (PS1) Marostegui: analytics_multiinstance.my.cnf.erb: Enable full_crc32 [puppet] - https://gerrit.wikimedia.org/r/746654 (https://phabricator.wikimedia.org/T287244)
[06:45:20] (CR) Marostegui: [C: +2] analytics_multiinstance.my.cnf.erb: Enable full_crc32 [puppet] - https://gerrit.wikimedia.org/r/746654 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[06:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P18123 and previous
config saved to /var/cache/conftool/dbconfig/20211213-064700-root.json
[06:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:49:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18124 and previous config saved to /var/cache/conftool/dbconfig/20211213-064926-marostegui.json
[06:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:37] (PS1) Marostegui: dbstore_multiinstance.my.cnf.erb: Add full_crc32 [puppet] - https://gerrit.wikimedia.org/r/746656 (https://phabricator.wikimedia.org/T287244)
[06:51:09] !log run `apt-get clean` on aphlict1001 to free some space
[06:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:36] (CR) Marostegui: [C: +2] dbstore_multiinstance.my.cnf.erb: Add full_crc32 [puppet] - https://gerrit.wikimedia.org/r/746656 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[07:02:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P18125 and previous config saved to /var/cache/conftool/dbconfig/20211213-070204-root.json
[07:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T277354)', diff saved to https://phabricator.wikimedia.org/P18126 and previous config saved to /var/cache/conftool/dbconfig/20211213-070430-marostegui.json
[07:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:36]
T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:06:42] SRE, MW-on-K8s, serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (Joe)
[07:07:39] (CR) Ladsgroup: [C: +2] auto_schema: Add logging on file [software] - https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[07:08:49] (Merged) jenkins-bot: auto_schema: Add logging on file [software] - https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[07:14:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 13 hosts with reason: Maintenance
[07:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 13 hosts with reason: Maintenance
[07:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:14:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18127 and previous config saved to /var/cache/conftool/dbconfig/20211213-071433-marostegui.json
[07:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:38] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:15:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18128 and previous config saved to /var/cache/conftool/dbconfig/20211213-071539-marostegui.json
[07:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:41] (PS1) Ladsgroup: Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745883 (https://phabricator.wikimedia.org/T296380)
[07:21:53] (CR) Ladsgroup: [C: +2] Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745883 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[07:26:10] (Merged) jenkins-bot: Change logic of pruneChange to allow deleting rows more flexibly [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745883 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[07:28:23] (PS1) Ladsgroup: Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745884 (https://phabricator.wikimedia.org/T296380)
[07:28:28] (CR) Ladsgroup: [C: +2] Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/745884 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[07:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P18129 and previous config saved to /var/cache/conftool/dbconfig/20211213-073044-marostegui.json
[07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:09] (Merged) jenkins-bot: Fix the mistake in passing parameter [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) -
https://gerrit.wikimedia.org/r/745884 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[07:35:44] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745883|Change logic of pruneChange to allow deleting rows more flexibly (T296380)]] (duration: 00m 57s)
[07:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:49] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380
[07:36:23] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:39:59] !log start of clean up of flaggedtemplates on all flaggedrevs wikis: T296380
[07:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:00] (CR) Muehlenhoff: [C: +1] "Fine with me" [puppet] - https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: Dzahn)
[07:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P18130 and previous config saved to /var/cache/conftool/dbconfig/20211213-074549-marostegui.json
[07:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:25] !log drain primary/secondary instances off ganeti2021 T296622
[07:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:29] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[07:52:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:02] SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[07:53:42] SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff) Open→Resolved p:Triage→High a:MoritzMuehlenhoff This is complete
[07:55:28] (PS1) Marostegui: replica_set,schema_change: Minor language changes [software] - https://gerrit.wikimedia.org/r/746791 (https://phabricator.wikimedia.org/T288235)
[07:59:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18131 and previous config saved to /var/cache/conftool/dbconfig/20211213-080054-marostegui.json
[08:00:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:00:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:59] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[08:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18132 and previous config saved to /var/cache/conftool/dbconfig/20211213-080101-marostegui.json
[08:01:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18133 and previous config saved to /var/cache/conftool/dbconfig/20211213-080207-marostegui.json
[08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:40] (CR) Ladsgroup: [C: +1] replica_set,schema_change: Minor language changes [software] - https://gerrit.wikimedia.org/r/746791 (https://phabricator.wikimedia.org/T288235) (owner: Marostegui)
[08:04:49] (CR) Marostegui: [C: +2] replica_set,schema_change: Minor language changes [software] - https://gerrit.wikimedia.org/r/746791 (https://phabricator.wikimedia.org/T288235) (owner: Marostegui)
[08:05:43] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The following units failed: swiftrepl-mw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:06:14] (Merged) jenkins-bot: replica_set,schema_change: Minor language changes [software] - https://gerrit.wikimedia.org/r/746791 (https://phabricator.wikimedia.org/T288235) (owner: Marostegui)
[08:12:28] (CR) Ladsgroup: [C: +2] flaggedrevs: Fix idwiki's autoreview config [mediawiki-config] - https://gerrit.wikimedia.org/r/745995 (https://phabricator.wikimedia.org/T288404) (owner: Ladsgroup)
[08:13:59] (Merged) jenkins-bot: flaggedrevs: Fix idwiki's autoreview config [mediawiki-config] - https://gerrit.wikimedia.org/r/745995 (https://phabricator.wikimedia.org/T288404) (owner: Ladsgroup)
[08:15:14] SRE, WMF-Legal, serviceops, Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (Prtksxna) Thanks @Dzahn!
I don't have access to see that ticket, but I'll keep this in mind.
[08:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P18135 and previous config saved to /var/cache/conftool/dbconfig/20211213-081712-marostegui.json
[08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:14] !log ladsgroup@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:745995|flaggedrevs: Fix idwiki's autoreview config (T288404)]] (duration: 00m 56s)
[08:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:19] T288404: Allow pending changes reviewer at idwiki to mark revisions as being "accepted" - https://phabricator.wikimedia.org/T288404
[08:20:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:34] (CR) Elukey: [C: +2] admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[08:26:04] rebase troubl
[08:26:11] (CR) jerkins-bot: [V: -1] admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[08:27:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:31] (PS4) Elukey: admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414)
[08:28:33] (PS2) Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555
[08:31:43] (CR) jerkins-bot: [V: -1] admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[08:32:12] uff
[08:32:16] (CR) jerkins-bot: [V: -1] knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555 (owner: Elukey)
[08:32:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P18136 and previous config saved to /var/cache/conftool/dbconfig/20211213-083217-marostegui.json
[08:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:59] (PS5) Elukey: admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414)
[08:37:00] (PS3) Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555
[08:45:42] !log fixing centralauth grants of wikiuser on all of s7 T296537
[08:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:47] T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537
[08:46:46] SRE, SRE-Access-Requests, WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (MatthewVernon) Hi @EJoseph and @Gehel I'm on clinic duty this
week, and saw that this task remains stalled waiting on input from you. Could you advise, please?
[08:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T277354)', diff saved to https://phabricator.wikimedia.org/P18137 and previous config saved to /var/cache/conftool/dbconfig/20211213-084721-marostegui.json
[08:47:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[08:47:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[08:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:27] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[08:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T277354)', diff saved to https://phabricator.wikimedia.org/P18138 and previous config saved to /var/cache/conftool/dbconfig/20211213-084729-marostegui.json
[08:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:30] !log removing grant of '%a%' on db1123 (s3) T296537
[08:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:08] (CR) Elukey: [C: +2] admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[08:54:41] (PS1) DCausse: flink-session-cluster: manage log4j2 options [deployment-charts] - https://gerrit.wikimedia.org/r/746793 (https://phabricator.wikimedia.org/T297468)
[08:55:42] (PS1) Giuseppe Lavagetto: Remove dead symlinks [mediawiki-config] -
https://gerrit.wikimedia.org/r/746794 (https://phabricator.wikimedia.org/T285232)
[08:55:44] (PS1) Giuseppe Lavagetto: Make symlinks relative so they work on a local checkout too [mediawiki-config] - https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232)
[08:57:21] (PS2) DCausse: flink-session-cluster: manage log4j2 options [deployment-charts] - https://gerrit.wikimedia.org/r/746793 (https://phabricator.wikimedia.org/T297468)
[09:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T277354)', diff saved to https://phabricator.wikimedia.org/P18139 and previous config saved to /var/cache/conftool/dbconfig/20211213-090339-marostegui.json
[09:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:44] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:04:24] (CR) Filippo Giunchedi: [C: +1] "LGTM! Thanks Jaime" [puppet] - https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: Filippo Giunchedi)
[09:04:41] (CR) JMeybohm: [C: +1] flink-session-cluster: manage log4j2 options [deployment-charts] - https://gerrit.wikimedia.org/r/746793 (https://phabricator.wikimedia.org/T297468) (owner: DCausse)
[09:06:44] jynus: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/745838 I'm ok to merge, what do you think ?
[09:06:53] +1
[09:07:46] It would be nice to do a test restore to make sure it works and how much time it takes, I will ping you when I do one
[09:09:13] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/32959/netflow2001.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[09:09:17] (CR) Ayounsi: [C: +2] Pmacct add sflow listener [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[09:12:23] jynus: ack, thanks!
[09:12:33] (CR) Filippo Giunchedi: [C: +2] graphite: backup 'daily' hierarchy, with weekly frequency, every Monday [puppet] - https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: Filippo Giunchedi)
[09:13:35] (CR) Elukey: [C: +1] Remove dead symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/746794 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[09:15:14] (CR) DCausse: [C: +2] flink-session-cluster: manage log4j2 options [deployment-charts] - https://gerrit.wikimedia.org/r/746793 (https://phabricator.wikimedia.org/T297468) (owner: DCausse)
[09:15:49] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:53] (CR) Giuseppe Lavagetto: [C: -1] "One nitpick, but also an important filtering needed. Otherwise LGTM. Big thanks for tackling this."
[puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:18:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P18140 and previous config saved to /var/cache/conftool/dbconfig/20211213-091844-marostegui.json
[09:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:18] (CR) Elukey: "LGTM, checked the paths and found only two extra "/" that we may or may not want to remove." [mediawiki-config] - https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[09:19:26] (Merged) jenkins-bot: flink-session-cluster: manage log4j2 options [deployment-charts] - https://gerrit.wikimedia.org/r/746793 (https://phabricator.wikimedia.org/T297468) (owner: DCausse)
[09:24:25] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[09:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:36] (CR) Filippo Giunchedi: prometheus: add blackbox/discovery jobs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:25:13] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The following units failed: sfacctd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:32] (PS10) Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946)
[09:25:34] (PS10) Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946)
[09:25:36] (PS11) Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[09:25:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:46] (CR) jerkins-bot: [V: -1] prometheus: add blackbox/discovery jobs [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:27:42] (PS11) Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946)
[09:27:43] I two-soft-space sinned
[09:27:44] (PS11) Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946)
[09:27:46] (PS12) Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[09:28:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[09:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:21] !log restarting blazegraph on wdqs1012 (jvm stuck for 6h)
[09:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:10] jynus: ah yeah I forgot to mention, ATM we'll back up both hosts, active and passive; we can switch to using only one though
[09:30:24] perhaps that's better
[09:31:07] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[09:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:10] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[09:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:22] (CR) Filippo Giunchedi: "Thank you for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:31:24] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[09:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:27] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:01] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:05] (03PS4) 10Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 [09:33:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P18141 and previous config saved to /var/cache/conftool/dbconfig/20211213-093348-marostegui.json [09:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:16] godog ideally, we would transition to backup eqiad hosts into codfw and viceversa: https://wikitech.wikimedia.org/wiki/Bacula#Architecture_update_(2020) [09:36:22] (03CR) 10Urbanecm: [C: 03+1] "should work now, but I'm not sure if creating a DBlist is a good idea here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [09:36:26] I think it is ok as it is for now [09:36:38] jynus: ok! thank you [09:36:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [09:37:24] ^ should resolve soon [09:38:01] !log revoking DROP from centralauth grant of wikiadmin (T249683) [09:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [09:38:48] (03CR) 10Awight: [C: 03+2] "Merging just in case this branch gets used after all." 
[extensions/FileImporter] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/745373 (https://phabricator.wikimedia.org/T296605) (owner: 10Thiemo Kreuz (WMDE)) [09:39:07] (03Abandoned) 10WMDE-Fisch: Fix special page displaying unescaped user input [extensions/FileImporter] (wmf/1.38.0-wmf.10) - 10https://gerrit.wikimedia.org/r/745373 (https://phabricator.wikimedia.org/T296605) (owner: 10Thiemo Kreuz (WMDE)) [09:40:43] (03PS1) 10Giuseppe Lavagetto: wmflib: add service::get_services_for function [puppet] - 10https://gerrit.wikimedia.org/r/746801 [09:41:03] (03CR) 10Majavah: [C: 04-1] Clean up readers web team config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [09:41:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [09:42:00] jouncebot: nowandnext [09:42:00] No deployments scheduled for the next 2 hour(s) and 17 minute(s) [09:42:00] In 2 hour(s) and 17 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1200) [09:42:03] !log Stagging at mwdebug1001 [09:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:29] (03CR) 10Urbanecm: Clean up readers web team config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [09:43:04] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[09:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:15] !log pwnwiki: Create DB tables for GrowthExperiments [09:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:21] * urbanecm feels like the extension should be renamed [09:43:24] not an experiment anymore [09:44:21] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:44:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:49] !log Staging at mwdebug1001 ended [09:45:49] (03PS1) 10Muehlenhoff: oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746802 [09:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [09:48:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T277354)', diff saved to https://phabricator.wikimedia.org/P18142 and previous config saved to /var/cache/conftool/dbconfig/20211213-094853-marostegui.json [09:48:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:48:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:59] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:49:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P18143 and previous config saved to /var/cache/conftool/dbconfig/20211213-094900-marostegui.json [09:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:51:41] (03PS5) 10Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 [09:51:43] (03PS1) 10Elukey: custom_deploy.d: add egress gateway settings to the ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/746804 (https://phabricator.wikimedia.org/T294414) [09:54:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:57:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P18144 and previous config saved to /var/cache/conftool/dbconfig/20211213-095728-marostegui.json [09:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:26] (03PS1) 10Ladsgroup: [WIP] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [09:58:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1114.eqiad.wmnet with reason: Maintenance [09:58:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1114.eqiad.wmnet with reason: Maintenance [09:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:52] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T277354)', diff saved to https://phabricator.wikimedia.org/P18145 and previous config saved to /var/cache/conftool/dbconfig/20211213-095851-marostegui.json [09:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:59:07] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2130 and db2074', diff saved to https://phabricator.wikimedia.org/P18146 and previous config saved to /var/cache/conftool/dbconfig/20211213-095949-marostegui.json [09:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:55] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:55] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: add egress gateway settings to the ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/746804 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [10:00:17] 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Marostegui) I have repooled db2130 and db2074 as they were not pooled back. 
[10:01:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1114', diff saved to https://phabricator.wikimedia.org/P18147 and previous config saved to /var/cache/conftool/dbconfig/20211213-100143-marostegui.json [10:01:46] 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) [10:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1116.eqiad.wmnet with reason: Maintenance [10:02:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1116.eqiad.wmnet with reason: Maintenance [10:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1126.eqiad.wmnet with reason: Maintenance [10:02:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1126.eqiad.wmnet with reason: Maintenance [10:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T277354)', diff saved to https://phabricator.wikimedia.org/P18148 and previous config saved to /var/cache/conftool/dbconfig/20211213-100238-marostegui.json [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] (03CR) 10Klausman: [C: 03+1] knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 (owner: 10Elukey) [10:10:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1126', diff 
saved to https://phabricator.wikimedia.org/P18149 and previous config saved to /var/cache/conftool/dbconfig/20211213-101013-marostegui.json [10:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:14:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:14:19] (03PS2) 10Muehlenhoff: oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746802 [10:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:14:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T277354)', diff saved to https://phabricator.wikimedia.org/P18150 and previous config saved to /var/cache/conftool/dbconfig/20211213-101427-marostegui.json [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:34] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:15:45] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10MatthewVernon) @thcipriani are you OK to approve this request, please? [or suggest someone else in releng who might be appropriate to do so?] 
[10:16:10] 10SRE, 10SRE-Access-Requests, 10WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (10Gehel) 05Stalled→03Declined Let's drop this at the moment, we'll see what we do when we have a concrete need for Emmanuel to have additional access [10:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1172', diff saved to https://phabricator.wikimedia.org/P18151 and previous config saved to /var/cache/conftool/dbconfig/20211213-101707-marostegui.json [10:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:02] win 8 [10:18:05] nope [10:18:25] 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) [10:18:29] (03PS1) 10Urbanecm: zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) [10:18:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) p:05Triage→03Medium [10:19:22] 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) p:05Triage→03Medium a:03Papaul [10:19:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [10:27:40] (03CR) 10Arturo Borrero Gonzalez: "thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [10:29:47] (03PS3) 10Muehlenhoff: oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746802 [10:31:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [10:33:15] (03CR) 10Arturo Borrero Gonzalez: "in general LGTM. Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [10:45:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32960/console" [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [10:45:50] (03CR) 10Elukey: [V: 03+1 C: 03+1] oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [10:47:02] (03PS1) 10Muehlenhoff: druid: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746834 [10:49:29] (03PS2) 10Muehlenhoff: druid: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746834 [10:50:14] (03CR) 10Elukey: "This one is incomplete, there are multiple daemons in our Druid setups (broker/historical/..) 
and afaics the new setting went only to the " [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [10:51:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 7 hosts with reason: Maintenance [10:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: Maintenance [10:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [10:53:21] !log mwscript namespaceDupes.php --wiki={mswiki,sqwiki,bclwiki,idwiki,siwiki,tlwiki,rowiki} --add-prefix=BROKEN --fix [10:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:00:05] (03PS2) 10David Caro: pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 [11:00:26] (03CR) 10David Caro: [C: 03+2] pcc: Autoformat with black+isort (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743380 (owner: 10David Caro) [11:00:44] (03CR) 10jerkins-bot: [V: 04-1] pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 (owner: 10David Caro) [11:01:51] (03CR) 10Giuseppe Lavagetto: Make symlinks relative so they work on a local checkout too (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [11:02:02] (03PS2) 10Giuseppe Lavagetto: Make symlinks relative so they work on a local checkout too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232) 
[11:03:46] (03CR) 10Elukey: [C: 03+1] Make symlinks relative so they work on a local checkout too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [11:03:48] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:03:58] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:04:01] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:05:30] (03PS3) 10David Caro: pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 [11:06:13] (03CR) 10David Caro: [C: 03+2] pcc: Autoformat with black+isort [puppet] - 10https://gerrit.wikimedia.org/r/743380 (owner: 10David Caro) [11:10:07] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:55] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:12:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:12:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:12:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:51] (03PS3) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [11:16:59] (03PS1) 10Jbond: P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 [11:17:19] (03CR) 10jerkins-bot: [V: 04-1] P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [11:18:05] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:22:04] (03PS1) 10Filippo Giunchedi: hieradata: add probes for non-critical catalog services [puppet] - 10https://gerrit.wikimedia.org/r/746838 (https://phabricator.wikimedia.org/T291946) [11:22:32] !log Run namespaceDupes.php --wiki=$WIKI --fix --add-prefix=BROKEN for wikis in P18152 [11:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:22:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T277354)', diff saved to https://phabricator.wikimedia.org/P18153 and previous config saved to /var/cache/conftool/dbconfig/20211213-112245-marostegui.json [11:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:49] T277354: "chemical" major mime type was never added to production database 
- https://phabricator.wikimedia.org/T277354 [11:24:19] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32961/console" [puppet] - 10https://gerrit.wikimedia.org/r/746838 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:25:37] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:25] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:31:32] (03CR) 10Btullis: "THe patch itself looks to be correct, but the property `formatMsgNoLookups` wasn't added to log4j until version 2.10 so I'm not sure that " [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [11:32:20] (03CR) 10Muehlenhoff: druid: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [11:32:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T277354)', diff saved to https://phabricator.wikimedia.org/P18154 and previous config saved to /var/cache/conftool/dbconfig/20211213-113249-marostegui.json [11:32:53] (03PS8) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [11:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:55] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:36:19] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:37:40] * Seddon checking in ready for morning backport [11:37:45] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:01] (03CR) 10Elukey: [V: 03+1 C: 03+1] oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [11:39:59] !log deployed patch for T297574 [11:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:25] (03CR) 10Muehlenhoff: oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [11:41:15] (03CR) 10Muehlenhoff: [C: 04-1] "Same issue, the bundled log4j is too old (2.8.2)" [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [11:42:28] (03CR) 10Elukey: druid: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [11:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P18155 and previous config saved to /var/cache/conftool/dbconfig/20211213-114754-marostegui.json [11:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:55] RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:33] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:57:32] (03PS1) 10Jbond: Cas 6.4.4: upgrade to cas 6.4.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746842 [11:57:34] (03PS1) 10Jbond: pmlinks: Add link to account creation process [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746843 (https://phabricator.wikimedia.org/T297524) [11:57:36] (03PS1) 10Jbond: 6.4.4: make new cas release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746844 [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1200). [12:00:05] seddon and samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:17] i can deploy today [12:00:26] Seddon: hey, around? [12:00:28] I'm here [12:00:32] hey samwilson [12:00:32] Here! [12:00:45] and hello Seddon [12:00:59] Hi de hi [12:01:05] Seddon: is only a wmf.12 backport enough? [12:01:08] or do you want to do _both_? [12:01:12] (both wmf.12 and wmf.9) [12:01:21] o/ [12:01:34] urbanecm: as far as I can see from the task the bug is only present on wmf.12 [12:01:35] urbanecm: wmf.12 should be enough. The changes this patch is fixing was only introduced in .12 [12:01:41] ah, great [12:02:20] Seddon: it reports a merge conflict for wmf.12 [12:02:26] can you try to upload the cherry-pick and fix conflict? [12:02:49] samwilson: i see you're a deployer yourself. Do you want to deploy it yourself? 
[12:02:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P18156 and previous config saved to /var/cache/conftool/dbconfig/20211213-120259-marostegui.json [12:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:53] urbanecm: am I? I thought that'd been removed! I've not done it for three years or more. Would rather you do it! [12:04:14] Seddon: actually, never mind -- i was trying to backport it to 1.37's wmf.12 [12:04:16] sorry [12:04:27] (03PS1) 10Urbanecm: Disable event logging for Quickview interactions [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746811 (https://phabricator.wikimedia.org/T297400) [12:04:32] (03CR) 10Urbanecm: [C: 03+2] Disable event logging for Quickview interactions [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746811 (https://phabricator.wikimedia.org/T297400) (owner: 10Urbanecm) [12:04:38] (03PS2) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [12:05:28] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on restbase[2025-2026].codfw.wmnet with reason: New cassandra hosts awaiting syncing [12:05:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase[2025-2026].codfw.wmnet with reason: New cassandra hosts awaiting syncing [12:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:49] samwilson: you appear to be one. 
Might be worth asking for removing the perms :)) [12:05:53] (03PS2) 10Urbanecm: Enable Disambiguator notifications on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745670 (https://phabricator.wikimedia.org/T297175) (owner: 10Samwilson) [12:05:58] (03CR) 10Urbanecm: [C: 03+2] Enable Disambiguator notifications on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745670 (https://phabricator.wikimedia.org/T297175) (owner: 10Samwilson) [12:06:07] yes, for sure, i'll do so. :) [12:06:20] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [12:06:43] (03Merged) 10jenkins-bot: Enable Disambiguator notifications on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745670 (https://phabricator.wikimedia.org/T297175) (owner: 10Samwilson) [12:07:17] samwilson: it's at mwdebug1001. can you test? [12:07:18] !log joining restbase2025-a to cassandra cluster [12:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:39] groovy, thanks, testing now [12:09:26] urbanecm: yep, looks great, go for it [12:09:32] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [12:10:15] samwilson: syncing [12:11:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 361214b50218ed90c0a2ec8194de46b053daa64e: Enable Disambiguator notifications on more wikis (T297175) (duration: 00m 56s) [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:35] T297175: Rollout disambiguation notifications to Group 0 and Group 1 - https://phabricator.wikimedia.org/T297175 [12:13:26] (03PS1) 10Ayounsi: Revert "Pmacct add sflow listener" [puppet] - 10https://gerrit.wikimedia.org/r/746812 [12:14:56] (03PS1) 10Hnowlan: conftool: add restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/746851 (https://phabricator.wikimedia.org/T297282) [12:15:15] (03CR) 10Ayounsi: [C: 03+2] Revert "Pmacct add sflow listener" [puppet] - 10https://gerrit.wikimedia.org/r/746812 (owner: 10Ayounsi) [12:15:46] urbanecm: thanks, all seems good [12:15:51] great [12:17:04] I might do some backports to wmf.12 and wmf.9 soonish (shouldn’t affect very much, preparation for a feature enablement on Wednesday) [12:18:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746843 (https://phabricator.wikimedia.org/T297524) (owner: 10Jbond) [12:18:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T277354)', diff saved to https://phabricator.wikimedia.org/P18157 and previous config saved to /var/cache/conftool/dbconfig/20211213-121803-marostegui.json [12:18:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:18:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:09] T277354: "chemical" major mime type was never added 
to production database - https://phabricator.wikimedia.org/T277354 [12:18:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:18:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T277354)', diff saved to https://phabricator.wikimedia.org/P18158 and previous config saved to /var/cache/conftool/dbconfig/20211213-121811-marostegui.json [12:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:24] Lucas_WMDE: still waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/746811 CI ftr [12:18:30] ack [12:18:38] Seddon: I'll just deploy that one (once it merges), as wmf.12 is not yet deployed anywhere (so hard to test) [12:19:44] Understood! [12:21:27] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:03] (03Merged) 10jenkins-bot: Disable event logging for Quickview interactions [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746811 (https://phabricator.wikimedia.org/T297400) (owner: 10Urbanecm) [12:24:43] Seddon: out of curiosity, when are you planning to deploy the enwiki autopatrol patch? 
[12:25:04] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) [12:25:28] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/MediaSearch/resources/components/SearchResults.vue: 7d0fa97dcfe4f3c0b5a5f98e722f54930c18bad0: Disable event logging for Quickview interactions (T297400) (duration: 00m 56s) [12:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:34] T297400: '.search_result_page_id' should be integer - https://phabricator.wikimedia.org/T297400 [12:25:40] Seddon: deployed, will be live together with wmf.12 [12:27:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T277354)', diff saved to https://phabricator.wikimedia.org/P18159 and previous config saved to /var/cache/conftool/dbconfig/20211213-123014-marostegui.json [12:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:30:22] (03PS1) 10Ayounsi: Make netboot.cfg generic for netflow VMs [puppet] - 10https://gerrit.wikimedia.org/r/746853 (https://phabricator.wikimedia.org/T297595) [12:31:58] (03CR) 10Kormat: [C: 03+1] conftool: add restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/746851 (https://phabricator.wikimedia.org/T297282) (owner: 10Hnowlan) [12:32:47] (03CR) 10Hnowlan: [C: 03+2] conftool: add restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/746851 (https://phabricator.wikimedia.org/T297282) (owner: 10Hnowlan) [12:33:18] (03PS1) 10David Caro: puppet_compiler: avoid duplication of puppet_compiler class def [puppet] - 10https://gerrit.wikimedia.org/r/746854 [12:34:45] (03Abandoned) 10David Caro: p:ceph::client::rbd_cloudcontrol: remove keyring generation [puppet] - 10https://gerrit.wikimedia.org/r/737869 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [12:36:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/746853 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [12:38:07] (03CR) 10Jbond: "LGTM< but think include would be better" [puppet] - 10https://gerrit.wikimedia.org/r/746854 (owner: 10David Caro) [12:39:00] (03CR) 10Ayounsi: [C: 03+2] Make netboot.cfg generic for netflow VMs [puppet] - 10https://gerrit.wikimedia.org/r/746853 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [12:39:59] 10SRE, 10Infrastructure-Foundations, 
10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) [12:43:48] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add probes for non-critical catalog services [puppet] - 10https://gerrit.wikimedia.org/r/746838 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P18160 and previous config saved to /var/cache/conftool/dbconfig/20211213-124519-marostegui.json [12:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:24] (03PS1) 10Lucas Werkmeister (WMDE): Remove most of mw.wikibase.lexeme Lua module [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746813 (https://phabricator.wikimedia.org/T297404) [12:48:28] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow2002.codfw.wmnet [12:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:59] (03PS2) 10Jbond: P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 [12:55:07] (ProbeDown) firing: (2) Service kartotherian-ssl:443 has failed probes - https://alerts.wikimedia.org [12:55:16] (03PS3) 10Jbond: P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 [12:55:46] the ProbeDown alert is me [12:56:03] I'll change it to warning so it doesn't spam here for now [12:57:26] (03CR) 10JMeybohm: P:pki::multirootca: add addtional vhost for k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [12:58:17] (03CR) 10JMeybohm: P:pki::multirootca: add addtional vhost for k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) 
[12:58:24] (03PS1) 10Filippo Giunchedi: prometheus: move probe alerts to severity 'warning' [puppet] - 10https://gerrit.wikimedia.org/r/746858 (https://phabricator.wikimedia.org/T291946) [12:59:23] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move probe alerts to severity 'warning' [puppet] - 10https://gerrit.wikimedia.org/r/746858 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P18161 and previous config saved to /var/cache/conftool/dbconfig/20211213-130024-marostegui.json [13:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32963/console" [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [13:03:54] (03PS4) 10JMeybohm: P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [13:04:49] (03CR) 10JMeybohm: [C: 03+1] "Fixed nits. 
PCC looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [13:04:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow2002.codfw.wmnet [13:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:07] (ProbeDown) resolved: (2) Service kartotherian-ssl:443 has failed probes - https://alerts.wikimedia.org [13:05:46] msg="Error for HTTP request" err="Get https://10.2.2.13:443/osm-intl/6/23/24.png: x509: certificate is valid for kartotherian.svc.eqiad.wmnet, kartotherian.svc.codfw.wmnet, maps.wikimedia.org, not kartotherian.discovery.wmnet" [13:05:52] TIL [13:06:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:07:44] (03PS1) 10Jgiannelos: maps: Install openstack swift client CLI/lib [puppet] - 10https://gerrit.wikimedia.org/r/746860 (https://phabricator.wikimedia.org/T292700) [13:07:55] (03PS1) 10Filippo Giunchedi: hieradata: fix probe url for schema service [puppet] - 10https://gerrit.wikimedia.org/r/746861 (https://phabricator.wikimedia.org/T291946) [13:08:04] (03PS1) 10Ayounsi: Add netflow2002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/746862 (https://phabricator.wikimedia.org/T297595) [13:08:16] (03CR) 10jerkins-bot: [V: 04-1] hieradata: fix probe url for schema service [puppet] - 10https://gerrit.wikimedia.org/r/746861 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:09:01] (03CR) 10Ayounsi: [C: 03+2] Add netflow2002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/746862 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [13:09:20] (03PS2) 10Filippo Giunchedi: hieradata: fix probe url for schema service [puppet] - 10https://gerrit.wikimedia.org/r/746861 (https://phabricator.wikimedia.org/T291946) [13:11:04] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix 
probe url for schema service [puppet] - 10https://gerrit.wikimedia.org/r/746861 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:15:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T277354)', diff saved to https://phabricator.wikimedia.org/P18162 and previous config saved to /var/cache/conftool/dbconfig/20211213-131529-marostegui.json [13:15:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:15:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:35] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T277354)', diff saved to https://phabricator.wikimedia.org/P18163 and previous config saved to /var/cache/conftool/dbconfig/20211213-131538-marostegui.json [13:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:50] (03PS1) 10Ayounsi: Include new netflow VMs in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/746863 (https://phabricator.wikimedia.org/T297595) [13:23:53] (03CR) 10Ayounsi: [C: 03+2] Include new netflow VMs in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/746863 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [13:25:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T277354)', diff saved to https://phabricator.wikimedia.org/P18164 and previous config saved to /var/cache/conftool/dbconfig/20211213-132537-marostegui.json 
[13:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:43] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:28:51] (03PS1) 10Jelto: Rakefile: remove helm2 from Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) [13:31:36] !log installing wireshark security updates [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:52] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [13:32:14] (03CR) 10MSantos: [C: 03+1] maps: Install openstack swift client CLI/lib [puppet] - 10https://gerrit.wikimedia.org/r/746860 (https://phabricator.wikimedia.org/T292700) (owner: 10Jgiannelos) [13:37:07] (03PS2) 10David Caro: puppet_compiler: avoid duplication of puppet_compiler class def [puppet] - 10https://gerrit.wikimedia.org/r/746854 [13:37:09] (03CR) 10David Caro: puppet_compiler: avoid duplication of puppet_compiler class def (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746854 (owner: 10David Caro) [13:38:47] (03PS1) 10JMeybohm: Fix left overs from renaming [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/746865 [13:38:49] (03PS1) 10JMeybohm: Add dedicated endpoints for liveness and readiness probes [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/746866 [13:39:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/746854 (owner: 10David Caro) [13:40:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P18165 and previous config saved to /var/cache/conftool/dbconfig/20211213-134042-marostegui.json [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] (03PS5) 10Jbond: P:pki::multirootca: add addtional vhost for 
k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 [13:42:30] (03CR) 10David Caro: [C: 03+2] puppet_compiler: avoid duplication of puppet_compiler class def [puppet] - 10https://gerrit.wikimedia.org/r/746854 (owner: 10David Caro) [13:42:47] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix left overs from renaming [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/746865 (owner: 10JMeybohm) [13:44:22] (03CR) 10David Caro: [C: 03+1] "LGTM, dumb question, why is this needed? (or maybe better not needed xd)" [puppet] - 10https://gerrit.wikimedia.org/r/745990 (owner: 10Jbond) [13:45:25] (03PS6) 10Jbond: P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 [13:45:34] (03PS1) 10Lucas Werkmeister (WMDE): Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746814 (https://phabricator.wikimedia.org/T297024) [13:45:36] (03PS1) 10JMeybohm: cfssl-issuer: Add liveness and readiness endpoints [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/746867 (https://phabricator.wikimedia.org/T294560) [13:45:40] (03PS4) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [13:46:10] (03PS1) 10Lucas Werkmeister (WMDE): Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746815 (https://phabricator.wikimedia.org/T297478) [13:46:18] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10Gehel) 05Open→03Resolved a:03Gehel [13:46:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32965/console" [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [13:46:28] 
10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10Gehel) [13:47:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: add addtional vhost for k8s [puppet] - 10https://gerrit.wikimedia.org/r/746836 (owner: 10Jbond) [13:48:08] (03CR) 10Filippo Giunchedi: [C: 03+2] maps: Install openstack swift client CLI/lib [puppet] - 10https://gerrit.wikimedia.org/r/746860 (https://phabricator.wikimedia.org/T292700) (owner: 10Jgiannelos) [13:48:40] (03PS5) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [13:52:15] (03PS2) 10JMeybohm: fssl-issuer: Add liveness and readiness endpoints [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/746867 (https://phabricator.wikimedia.org/T294560) [13:52:25] (03PS1) 10Jgiannelos: kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 [13:54:11] (03PS1) 10David Caro: pcc: replace compiler1002 with pcc-worker1002 [puppet] - 10https://gerrit.wikimedia.org/r/746871 (https://phabricator.wikimedia.org/T297356) [13:54:56] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add dedicated endpoints for liveness and readiness probes [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/746866 (owner: 10JMeybohm) [13:55:14] (03CR) 10Jbond: [C: 03+1] pcc: replace compiler1002 with pcc-worker1002 [puppet] - 10https://gerrit.wikimedia.org/r/746871 (https://phabricator.wikimedia.org/T297356) (owner: 10David Caro) [13:55:23] (03CR) 10Jgiannelos: [C: 04-1] "Blocking until deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (owner: 10Jgiannelos) [13:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P18166 and previous config saved to /var/cache/conftool/dbconfig/20211213-135547-marostegui.json 
[13:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] (03PS3) 10JMeybohm: cfssl-issuer: Add liveness and readiness endpoints [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/746867 (https://phabricator.wikimedia.org/T294560) [13:56:23] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Add liveness and readiness endpoints [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/746867 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:57:33] (03PS2) 10Jgiannelos: kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) [13:58:44] (03PS2) 10JMeybohm: cfssl-issuer: Update to new cfss-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745923 (https://phabricator.wikimedia.org/T294560) [13:59:41] (03PS1) 10Ladsgroup: Remove dbtree [software] - 10https://gerrit.wikimedia.org/r/746872 (https://phabricator.wikimedia.org/T297605) [14:00:20] PROBLEM - IPMI Sensor Status on wdqs2003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:01:20] (03CR) 10Marostegui: [C: 03+2] Remove dbtree [software] - 10https://gerrit.wikimedia.org/r/746872 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:01:51] (03CR) 10MSantos: [C: 03+1] kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [14:02:01] (03Merged) 10jenkins-bot: Remove dbtree [software] - 10https://gerrit.wikimedia.org/r/746872 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:03:05] (03CR) 10JMeybohm: [C: 03+2] cfssl-issuer: Update to new cfss-issuer version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/745923 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:03:47] (03CR) 10Jbond: [C: 04-1] "-1: this needs a bit more work from my side" [puppet] - 10https://gerrit.wikimedia.org/r/745990 (owner: 10Jbond) [14:04:19] (03CR) 10jerkins-bot: [V: 04-1] Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746815 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [14:05:12] (03PS3) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [14:05:48] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [14:06:41] (03Merged) 10jenkins-bot: cfssl-issuer: Update to new cfss-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745923 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:06:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [14:08:01] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [14:09:17] (03CR) 10Lucas Werkmeister (WMDE): "(random test failure, let’s not bother with a recheck and just see if the gate-and-submit works)" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746815 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [14:10:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T277354)', diff saved to https://phabricator.wikimedia.org/P18167 and previous config saved to 
/var/cache/conftool/dbconfig/20211213-141052-marostegui.json [14:10:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:10:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:58] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:37] (03PS1) 10JMeybohm: Allow cfssl-issuer to connect to pki.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/746874 (https://phabricator.wikimedia.org/T294560) [14:15:14] (03PS1) 10David Caro: wmcs_backups: Add localdisc of wikiwho vms to exclude list [puppet] - 10https://gerrit.wikimedia.org/r/746875 (https://phabricator.wikimedia.org/T297590) [14:15:40] (03CR) 10Volans: "Couple of random comments from someone with little context, feel free to ignore if offtopic." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [14:18:39] (03PS4) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [14:20:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [14:20:55] (03CR) 10JMeybohm: [C: 04-1] Rakefile: remove helm2 from Rakefile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:21:32] (03CR) 10JMeybohm: [C: 03+2] Allow cfssl-issuer to connect to pki.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/746874 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:21:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:21:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T277354)', diff saved to https://phabricator.wikimedia.org/P18168 and previous config saved to /var/cache/conftool/dbconfig/20211213-142141-marostegui.json [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:46] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:25:05] (03Merged) 10jenkins-bot: Allow cfssl-issuer to connect to pki.discovery.wmnet [deployment-charts] - 
10https://gerrit.wikimedia.org/r/746874 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:26:24] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T277354)', diff saved to https://phabricator.wikimedia.org/P18169 and previous config saved to /var/cache/conftool/dbconfig/20211213-143131-marostegui.json [14:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:37] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:34:33] (03PS1) 10Jbond: P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 [14:34:33] !log imported fastnetmon 1.1.7+deb11u1 for bullseye-wikimedia https://phabricator.wikimedia.org/T297595 [14:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] (03PS2) 10Jbond: P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 [14:39:02] (03CR) 10David Caro: [C: 03+1] P:puppet_compiler: strip worker host name when proxying (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746877 (owner: 10Jbond) [14:39:53] 10SRE, 10Infrastructure-Foundations, 10netops: Increase in prefix announcements from AS15169 - https://phabricator.wikimedia.org/T297609 (10MatthewVernon) [14:42:56] (03PS1) 10JMeybohm: cfssl-issuer: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/746878 (https://phabricator.wikimedia.org/T294560) [14:43:09] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/746878 (https://phabricator.wikimedia.org/T294560) 
(owner: 10JMeybohm) [14:44:28] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:45:02] (03PS3) 10Jbond: P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 [14:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P18170 and previous config saved to /var/cache/conftool/dbconfig/20211213-144636-marostegui.json [14:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:58] (03PS4) 10Jbond: P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 [14:47:31] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 (owner: 10Jbond) [14:48:57] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppet_compiler: strip worker host name when proxying [puppet] - 10https://gerrit.wikimedia.org/r/746877 (owner: 10Jbond) [14:53:24] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:53:37] (03PS1) 10Jbond: puppet_compile: improve rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/746879 [14:53:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compile: improve rewrite rule [puppet] - 10https://gerrit.wikimedia.org/r/746879 (owner: 10Jbond) [14:54:28] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=restbase2024.codfw.wmnet [14:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2024.codfw.wmnet [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] (03PS1) 10Elukey: helmfile.d: add the istio pod security policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/746880 (https://phabricator.wikimedia.org/T297612) [14:55:17] !log upload cas 6.4.4 deb package [14:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:36] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) @aborrero can we test this for now on only one server to see if it works before moving it to the other serve... [15:01:11] 10SRE, 10Infrastructure-Foundations, 10netops: Increase in prefix announcements from AS15169 - https://phabricator.wikimedia.org/T297609 (10jbond) i don't see either of those route servers in our config as such i think there is no action. 
but pinging @ayounsi @cmooney to confirm [15:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P18171 and previous config saved to /var/cache/conftool/dbconfig/20211213-150141-marostegui.json [15:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:30] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:44] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) [15:08:03] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) [15:08:23] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32968/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:12:55] (03CR) 10Jgiannelos: kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [15:14:03] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32970/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:14:09] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32971/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:14:44] 
(03CR) 10David Caro: [C: 03+2] pcc: replace compiler1002 with pcc-worker1002 [puppet] - 10https://gerrit.wikimedia.org/r/746871 (https://phabricator.wikimedia.org/T297356) (owner: 10David Caro) [15:15:33] !log joining restbase2025-b to cassandra cluster [15:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:48] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [15:16:32] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) 05Open→03Resolved [15:16:44] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, and 2 others: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) 05Open→03Resolved This is complete. [15:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T277354)', diff saved to https://phabricator.wikimedia.org/P18172 and previous config saved to /var/cache/conftool/dbconfig/20211213-151645-marostegui.json [15:16:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1112,1154].eqiad.wmnet with reason: Maintenance [15:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:51] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:16:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1112,1154].eqiad.wmnet with reason: Maintenance [15:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P18173 and previous config saved to /var/cache/conftool/dbconfig/20211213-151657-marostegui.json [15:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:16] (03CR) 10Andrew Bogott: "Is this just because the wikiwho VMs are enormous, or are there other reasons why they can't/shouldn't be backed up?" [puppet] - 10https://gerrit.wikimedia.org/r/746875 (https://phabricator.wikimedia.org/T297590) (owner: 10David Caro) [15:19:08] PROBLEM - Host dns6001 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:00] jouncebot: nowandnext [15:21:00] No deployments scheduled for the next 1 hour(s) and 8 minute(s) [15:21:00] In 1 hour(s) and 8 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1630) [15:21:02] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) 05Open→03In progress [15:21:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) [15:21:24] I’ll do some wmf.12 backports that should have no effect yet, in preparation for enabling a feature on Wednesday [15:21:31] (I’ll backport them to wmf.9 too later) [15:21:41] (03CR) 10David Caro: wmcs_backups: Add localdisc of wikiwho vms to exclude list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746875 (https://phabricator.wikimedia.org/T297590) (owner: 10David Caro) [15:21:42] gate-and-submit will take a while anyways but yell at me if you want me to stop :) [15:23:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Backporting – should have no effect yet, for two reasons: Lexeme Lua is not yet enabled in production, and wmf.12 is not the current train" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746813 (https://phabricator.wikimedia.org/T297404) 
(owner: 10Lucas Werkmeister (WMDE)) [15:23:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Backporting – should have no effect yet, for two reasons: Lexeme Lua is not yet enabled in production, and wmf.12 is not the current train" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746814 (https://phabricator.wikimedia.org/T297024) (owner: 10Lucas Werkmeister (WMDE)) [15:23:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Backporting – should have no effect yet, for two reasons: Lexeme Lua is not yet enabled in production, and wmf.12 is not the current train" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746815 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [15:23:56] RECOVERY - Host dns6001 is UP: PING OK - Packet loss = 0%, RTA = 86.21 ms [15:23:57] (03CR) 10Andrew Bogott: [C: 03+1] "Oh, it uses local storage and not ceph? That would do it!" [puppet] - 10https://gerrit.wikimedia.org/r/746875 (https://phabricator.wikimedia.org/T297590) (owner: 10David Caro) [15:24:29] 10SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (10Ade56facc) Around 13..14 (1 .. 2 p.m. UTC) I noticed long delays (6 .. 10 seconds) when trying to "Publish" or "Show preview" (of) an article. In one case the delay was so long t... 
[15:25:26] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) [15:25:35] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) 05In progress→03Resolved [15:26:59] (03CR) 10David Caro: [C: 03+2] wmcs_backups: Add localdisc of wikiwho vms to exclude list [puppet] - 10https://gerrit.wikimedia.org/r/746875 (https://phabricator.wikimedia.org/T297590) (owner: 10David Caro) [15:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T277354)', diff saved to https://phabricator.wikimedia.org/P18174 and previous config saved to /var/cache/conftool/dbconfig/20211213-152859-marostegui.json [15:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:04] (03PS1) 10Filippo Giunchedi: prometheus: pin discovery probes to their site [puppet] - 10https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) [15:29:05] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:30:35] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32972/console" [puppet] - 10https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:32:47] (03PS5) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [15:33:22] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [15:36:04] (03PS6) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 
(https://phabricator.wikimedia.org/T296537) [15:39:32] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [15:41:03] (03PS6) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [15:43:39] (03PS7) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [15:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P18175 and previous config saved to /var/cache/conftool/dbconfig/20211213-154404-marostegui.json [15:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:03] (03Merged) 10jenkins-bot: Remove most of mw.wikibase.lexeme Lua module [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746813 (https://phabricator.wikimedia.org/T297404) (owner: 10Lucas Werkmeister (WMDE)) [15:46:05] (03Merged) 10jenkins-bot: Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746814 (https://phabricator.wikimedia.org/T297024) (owner: 10Lucas Werkmeister (WMDE)) [15:46:08] (03Merged) 10jenkins-bot: Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746815 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [15:46:18] (03CR) 10Ahmon Dancy: [C: 03+1] Make symlinks relative so they work on a local checkout too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746795 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [15:46:23] alright, deploying those three WikibaseLexeme changes [15:47:14] (03PS7) 10Vgutierrez: cache: 
Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [15:47:19] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [15:48:09] (03PS2) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) [15:49:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikibaseLexeme: Backport: [[gerrit:746813|Remove most of mw.wikibase.lexeme Lua module (T297404)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 58s) [15:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:42] T297404: Remove most of mw.wikibase.lexeme module (remove getLemmas, getLanguage, getLexicalCategory; keep splitLexemeId) - https://phabricator.wikimedia.org/T297404 [15:50:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikibaseLexeme: Backport: [[gerrit:746814|Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() (T297024)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 57s) [15:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:31] T297024: Add methods to get lemma, representation, gloss by language code - https://phabricator.wikimedia.org/T297024 [15:51:55] 10SRE, 10Infrastructure-Foundations, 10netops: Increase in prefix announcements from AS15169 - https://phabricator.wikimedia.org/T297609 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks, we're already at 250000 for those. We usually set a high limit from the get go for route servers. [15:52:53] (03PS1) 10Jgiannelos: tegola-vector-tiles: Increase pregeneration parallelism on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/746887 [15:53:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikibaseLexeme: Backport: [[gerrit:746815|Add form:hasGrammaticalFeature() method (T297478)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 57s) [15:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:18] T297478: Add form:hasGrammaticalFeature( itemId ) Lua method - https://phabricator.wikimedia.org/T297478 [15:53:30] alright, I’m done backporting for now [15:54:18] (03PS1) 10Lucas Werkmeister (WMDE): Remove most of mw.wikibase.lexeme Lua module [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746906 (https://phabricator.wikimedia.org/T297404) [15:54:51] (03PS1) 10Lucas Werkmeister (WMDE): Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746907 
(https://phabricator.wikimedia.org/T297024) [15:55:09] (03PS1) 10Lucas Werkmeister (WMDE): Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746908 (https://phabricator.wikimedia.org/T297478) [15:55:24] and by “done backporting”, I mean I created ^ these but I won’t deploy them immediately ;) [15:56:03] might do them between 17:00 and 18:00 UTC [15:56:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:53] (03CR) 10Jgiannelos: "After monitoring tile pregeneration for the past week it looks like the errors caused by lack of PG connections available was only happeni" [deployment-charts] - 10https://gerrit.wikimedia.org/r/746887 (owner: 10Jgiannelos) [15:57:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:51] (03PS1) 10David Caro: set_maintenance: Use non-deprecated IcingaHosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/746889 [15:58:40] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [15:59:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P18176 and previous config saved to /var/cache/conftool/dbconfig/20211213-155909-marostegui.json [15:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] (03PS1) 10Vgutierrez: prometheus::ops: Collect envoy metrics for profile::cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/746890 (https://phabricator.wikimedia.org/T271421) [16:01:24] (03CR) 10Ladsgroup: "So PCC is failing (https://puppet-compiler.wmflabs.org/pcc-worker1001/1133/db1101.eqiad.wmnet/change.db1101.eqiad.wmnet.err) because passw" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [16:02:00] (03PS1) 10Vgutierrez: site: Reimage cp4025 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421) [16:03:26] (03PS2) 10Vgutierrez: site: Reimage cp4025 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421) [16:06:19] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Increase pregeneration parallelism on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/746887 (owner: 10Jgiannelos) [16:06:40] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:22] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 
10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10ayounsi) Could we trunk the new vlan instead of using a 2nd physical port? [16:10:12] (03Merged) 10jenkins-bot: tegola-vector-tiles: Increase pregeneration parallelism on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/746887 (owner: 10Jgiannelos) [16:14:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T277354)', diff saved to https://phabricator.wikimedia.org/P18177 and previous config saved to /var/cache/conftool/dbconfig/20211213-161414-marostegui.json [16:14:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:14:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [16:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) >>! In T297588#7566526, @Papaul wrote: > @aborrero can we test this for now on only one server to see if i... 
[16:20:08] (03PS1) 10Ladsgroup: DeprecationHelper: avoid closures [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746909 (https://phabricator.wikimedia.org/T297236) [16:21:17] 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10mseckington) [16:24:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul) [16:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1630). [16:30:22] (03CR) 10BBlack: [C: 03+1] varnish: add second wikimedia enterprise elastic IP [puppet] - 10https://gerrit.wikimedia.org/r/745560 (https://phabricator.wikimedia.org/T294798) (owner: 10Hnowlan) [16:41:41] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32975/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:41:43] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32973/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:41:53] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32974/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:42:03] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32976/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:42:05] 
(03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32976/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:42:07] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32976/console" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:42:26] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10brennen) p:05High→03Unbreak! > In any case, it's not happening now, so it's not a UBN. We're currently on... [16:43:33] (03PS1) 10David Caro: pcc: replace compiler1001 with pcc-worker1003 [puppet] - 10https://gerrit.wikimedia.org/r/746893 (https://phabricator.wikimedia.org/T297356) [16:44:32] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) This can be similar to {T297236} I suggest we deploy the fix and roll with wmf.12 to see if it stil... [16:46:45] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) This looks good: https://logstash.wikimedia.org/goto/d85f220054b5e2145a56b5fe99c4e653 possibly some... 
[16:48:45] (03CR) 10Andrew Bogott: [C: 03+1] set_maintenance: Use non-deprecated IcingaHosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/746889 (owner: 10David Caro) [16:54:52] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10dancy) >>! In T297517#7566865, @Ladsgroup wrote: > This can be similar to {T297236} I suggest we deploy the fi... [16:55:22] (03PS1) 10Hnowlan: maps: add stub values for tilerator swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T235299) [16:55:45] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:20] (03PS1) 10Hnowlan: maps: write tegola credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) [17:03:24] (03PS1) 10Accraze: ml-services: update revscoring-editquality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/746898 (https://phabricator.wikimedia.org/T293331) [17:07:40] (03PS1) 10MSantos: tegola: increase MVT buffersize in country_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 [17:08:27] (03CR) 10David Caro: [C: 03+2] set_maintenance: Use non-deprecated IcingaHosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/746889 (owner: 10David Caro) [17:09:33] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [17:10:59] (03CR) 10Jgiannelos: [C: 03+1] tegola: increase MVT buffersize in country_label layer [deployment-charts] - 
10https://gerrit.wikimedia.org/r/746899 (owner: 10MSantos) [17:11:17] (03Merged) 10jenkins-bot: set_maintenance: Use non-deprecated IcingaHosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/746889 (owner: 10David Caro) [17:11:53] 10SRE, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10akosiaris) p:05Triage→03Medium [17:12:20] (03CR) 10Jgiannelos: tegola: increase MVT buffersize in country_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 (owner: 10MSantos) [17:12:44] (03CR) 10Jgiannelos: [C: 04-1] "This change needs a chart version bump to be applied" [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 (owner: 10MSantos) [17:12:54] (03PS1) 10Jbond: O:pki::multirootca: make ferm rule ensurable [puppet] - 10https://gerrit.wikimedia.org/r/746900 [17:14:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32978/console" [puppet] - 10https://gerrit.wikimedia.org/r/746900 (owner: 10Jbond) [17:16:29] (03PS1) 10Accraze: ml-services: update revscoring-draftquality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/746901 (https://phabricator.wikimedia.org/T293331) [17:21:00] jouncebot: nowandnext [17:21:01] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [17:21:01] In 0 hour(s) and 38 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1800) [17:21:20] alright, I’ll do the wmf.9 WikibaseLexeme backports [17:21:28] (should still be no-ops because Lexeme Lua access isn’t enabled yet) [17:21:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove most of mw.wikibase.lexeme Lua module [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746906 (https://phabricator.wikimedia.org/T297404) 
(owner: 10Lucas Werkmeister (WMDE)) [17:21:46] PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 537 MB (3% inode=89%): /tmp 537 MB (3% inode=89%): /var/tmp 537 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [17:21:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746907 (https://phabricator.wikimedia.org/T297024) (owner: 10Lucas Werkmeister (WMDE)) [17:21:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746908 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [17:23:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:pki::multirootca: make ferm rule ensurable [puppet] - 10https://gerrit.wikimedia.org/r/746900 (owner: 10Jbond) [17:27:57] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring-editquality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/746898 (https://phabricator.wikimedia.org/T293331) (owner: 10Accraze) [17:28:27] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring-draftquality image [deployment-charts] - 10https://gerrit.wikimedia.org/r/746901 (https://phabricator.wikimedia.org/T293331) (owner: 10Accraze) [17:30:33] (03CR) 10Jgiannelos: "nit: Tilerator is not connected to swift. 
We need swift on maps master nodes for debugging purposes and maintenance of tiles generated by " [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [17:31:31] (03PS2) 10Hnowlan: maps: add stub values for tegola swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T235299) [17:31:56] (03CR) 10Hnowlan: maps: add stub values for tegola swift credentials (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [17:34:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:09] (03CR) 10Jgiannelos: [C: 04-1] maps: write tegola credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [17:37:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:37:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:50] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): conncet 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) [17:39:22] (03Merged) 10jenkins-bot: Remove most of mw.wikibase.lexeme Lua module [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746906 (https://phabricator.wikimedia.org/T297404) (owner: 10Lucas Werkmeister (WMDE)) [17:39:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:56] (03CR) 10Jgiannelos: [C: 04-1] maps: write tegola credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [17:40:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:00] (03Merged) 10jenkins-bot: Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746907 (https://phabricator.wikimedia.org/T297024) (owner: 10Lucas Werkmeister (WMDE)) [17:41:02] (03Merged) 10jenkins-bot: Add form:hasGrammaticalFeature() method [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746908 (https://phabricator.wikimedia.org/T297478) (owner: 10Lucas Werkmeister (WMDE)) [17:41:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2021.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [17:41:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2021.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [17:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:43] alright, I’ll deploy those WikibaseLexeme wmf.9 backports (should all be no-ops) [17:43:49] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[17:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:49] (03PS1) 10Jbond: fail over idp to codfw [dns] - 10https://gerrit.wikimedia.org/r/746904 [17:45:16] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:45:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikibaseLexeme: Backport: [[gerrit:746906|Remove most of mw.wikibase.lexeme Lua module (T297404)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 58s) [17:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:39] T297404: Remove most of mw.wikibase.lexeme module (remove getLemmas, getLanguage, getLexicalCategory; keep splitLexemeId) - https://phabricator.wikimedia.org/T297404 [17:46:08] (03CR) 10Jbond: [C: 03+2] fail over idp to codfw [dns] - 10https://gerrit.wikimedia.org/r/746904 (owner: 10Jbond) [17:47:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:22] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikibaseLexeme: Backport: [[gerrit:746907|Add lexeme:getLemma(), sense:getGloss(), form:getRepresentation() (T297024)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 57s) [17:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:36] T297024: Add methods to get lemma, representation, gloss by language code - https://phabricator.wikimedia.org/T297024 [17:47:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikibaseLexeme: Backport: [[gerrit:746908|Add form:hasGrammaticalFeature() method (T297478)]] (no-op because Lexeme Lua is not yet enabled in prod) (duration: 00m 57s) [17:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:25] T297478: Add form:hasGrammaticalFeature( itemId ) Lua method - https://phabricator.wikimedia.org/T297478 [17:49:52] alright, I’m done backporting again [17:51:05] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [17:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:54] (03PS1) 10Cmjohnson: Add new lvs servers to site.pp role (insetup) [puppet] - 10https://gerrit.wikimedia.org/r/746927 (https://phabricator.wikimedia.org/T295804) [17:52:43] (03PS2) 10MSantos: tegola: increase MVT buffersize in country_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 [17:53:10] (03CR) 10Cmjohnson: [C: 03+2] Add new lvs servers to site.pp role (insetup) [puppet] - 10https://gerrit.wikimedia.org/r/746927 (https://phabricator.wikimedia.org/T295804) (owner: 10Cmjohnson) [17:54:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:27] (03CR) 10Ladsgroup: [C: 03+2] DeprecationHelper: avoid closures [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746909 (https://phabricator.wikimedia.org/T297236) (owner: 10Ladsgroup) [17:58:09] (03PS2) 10Hnowlan: maps: write tegola credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) [17:58:12] (03CR) 10Jgiannelos: [C: 03+2] tegola: increase MVT buffersize in country_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 (owner: 10MSantos) [17:59:27] (03PS2) 10Jbond: Cas 6.4.4: upgrade to cas 6.4.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746842 [17:59:51] (03CR) 10Hnowlan: maps: write tegola credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [17:59:57] (03PS2) 10Jbond: 6.4.4: make new cas release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746844 [18:00:04] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1800). 
[18:00:05] (03PS2) 10Jbond: pmlinks: Add link to account creation process [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746843 (https://phabricator.wikimedia.org/T297524) [18:00:10] (03PS3) 10Jbond: 6.4.4: make new cas release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746844 [18:00:23] (03CR) 10Jbond: [C: 03+2] Cas 6.4.4: upgrade to cas 6.4.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746842 (owner: 10Jbond) [18:00:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] Cas 6.4.4: upgrade to cas 6.4.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746842 (owner: 10Jbond) [18:00:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] pmlinks: Add link to account creation process [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746843 (https://phabricator.wikimedia.org/T297524) (owner: 10Jbond) [18:00:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.4.4: make new cas release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/746844 (owner: 10Jbond) [18:00:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:21] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) 05Open→03Resolved [18:02:53] (03PS1) 10Lucas Werkmeister (WMDE): Enable Lexeme Lua access on first four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) [18:03:03] (03Merged) 10jenkins-bot: tegola: increase MVT buffersize in country_label layer [deployment-charts] - 10https://gerrit.wikimedia.org/r/746899 (owner: 10MSantos) [18:04:06] (03PS1) 10Jgiannelos: maps: Install s3 client cli/lib [puppet] - 10https://gerrit.wikimedia.org/r/746929 [18:05:06] 10SRE, 10Wikimedia-Mailing-lists: Chapter-ThOrg-Applications mailing list request - https://phabricator.wikimedia.org/T297622 (10Legoktm) @DNdubane_WMF to clarify, you'd like this to be a private mailing list like how usergroup-applications@ is configured? [18:05:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [18:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox, 10Patch-For-Review: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye [18:05:40] (03CR) 10Jgiannelos: maps: write tegola credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [18:06:54] (03PS2) 10Lucas Werkmeister (WMDE): Enable Lexeme Lua access on first four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) [18:06:57] (03PS1) 10Lucas Werkmeister (WMDE): Set wgLexemeEnableDataTransclusion 
to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746930 (https://phabricator.wikimedia.org/T294159) [18:06:59] (03CR) 10Jgiannelos: [C: 03+1] maps: write tegola credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [18:07:57] (03CR) 10Jgiannelos: [C: 03+1] maps: add stub values for tegola swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [18:09:00] !log upload cas_6.4.4-1+wmf10u2 [18:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:34] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [18:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:51] (03PS1) 10Bartosz Dziewoński: Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746912 (https://phabricator.wikimedia.org/T296269) [18:12:01] (03PS1) 10Bartosz Dziewoński: Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746913 (https://phabricator.wikimedia.org/T296269) [18:12:34] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[18:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] (03PS1) 10Bartosz Dziewoński: Re-apply "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746914 (https://phabricator.wikimedia.org/T296269) [18:13:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [18:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs... [18:13:59] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [18:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [18:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye [18:15:59] (03CR) 10Hnowlan: [C: 03+2] varnish: add second wikimedia enterprise elastic IP [puppet] - 10https://gerrit.wikimedia.org/r/745560 (https://phabricator.wikimedia.org/T294798) (owner: 10Hnowlan) [18:18:35] (03Merged) 10jenkins-bot: DeprecationHelper: avoid closures [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746909 (https://phabricator.wikimedia.org/T297236) (owner: 10Ladsgroup) [18:20:15] (03CR) 10Michael Große: [C: 03+1] 
Enable Lexeme Lua access on first four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) (owner: 10Lucas Werkmeister (WMDE)) [18:20:37] (03PS1) 10DLynch: Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) [18:21:26] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/: Backport: [[gerrit:746909|DeprecationHelper: avoid closures (T297236)]] (duration: 01m 02s) [18:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:32] T297236: timeouts and memory limits on translatewiki.net - https://phabricator.wikimedia.org/T297236 [18:22:26] dancy: the patch is synced, do you want to try rolling the train? [18:22:34] 10SRE, 10Foundational Technology Requests, 10Traffic, 10Wikimedia Enterprise, and 2 others: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10hnowlan) This has been merged and will come into effect over the next 25 minutes or so. [18:22:40] Can do. [18:22:48] Thanks [18:23:02] (03PS1) 10Ahmon Dancy: all wikis to 1.38.0-wmf.12 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746932 [18:23:04] (03CR) 10Ahmon Dancy: [C: 03+2] all wikis to 1.38.0-wmf.12 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746932 (owner: 10Ahmon Dancy) [18:23:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:01] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.12 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746932 (owner: 10Ahmon Dancy) [18:24:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:25] 10SRE, 10Wikimedia-Mailing-lists: Chapter-ThOrg-Applications mailing list request - https://phabricator.wikimedia.org/T297622 (10DNdubane_WMF) @Legoktm yes please. It must be a private mailing list [18:24:59] !log joining restbase2025-c to cassandra cluster [18:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:18] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.12 refs T293954 [18:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:23] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [18:25:27] Amir1: done [18:25:56] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.12 T293953 [18:25:57] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (10jbond) I have added a link let me know if you want any changes making [18:26:00] fingers crossed [18:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:01] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [18:28:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS bullseye [18:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1018.eqiad.wmnet with OS bullseye [18:28:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS bullseye [18:28:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1019.eqiad.wmnet with OS bullseye [18:29:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS bullseye [18:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye [18:30:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:56] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Chapter-ThOrg-Applications mailing list request - https://phabricator.wikimedia.org/T297622 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done. Login with your mailman3 account and add people in https://lists.wikimedia.org/postorius/lists/chapter-thor... [18:31:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:27] jouncebot: next [18:31:27] In 0 hour(s) and 28 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1900) [18:32:52] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) The underlying memory leak has not been fixed but so far it looks good with the changes being backp... [18:32:57] (03CR) 10Bartosz Dziewoński: [C: 03+1] Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) (owner: 10DLynch) [18:34:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [18:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs... 
[18:35:00] (03PS3) 10Hnowlan: maps: add stub values for tegola swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T292700) [18:45:45] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1020.eqiad.wmnet with OS bullseye [18:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye executed with errors: - lvs... [18:47:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:33] (03CR) 10DLynch: "Actually, wait, this won't work on beta because it's defaulting the new topic tool to `available` -- got to update that so it's not always" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) (owner: 10DLynch) [18:49:02] (03PS2) 10DLynch: Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) [18:50:41] (03PS2) 10Jdlrobson: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) [18:51:10] (03PS4) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [18:51:28] (03PS5) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [18:51:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bullseye [18:51:36] !log 
cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS bullseye [18:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1018.eqiad.wmnet with OS bullseye completed: - lvs1018 (**PAS... [18:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye [18:51:44] (03CR) 10Jdlrobson: Clean up readers web team config (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [18:52:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS bullseye [18:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye [18:53:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS bullseye [18:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1019.eqiad.wmnet with OS bullseye completed: - lvs1019 (**PAS... [18:55:41] (03PS1) 10Jbond: O:pki::root: add new intermidiate CA cloud_wmnet_ca [puppet] - 10https://gerrit.wikimedia.org/r/746935 [18:56:28] (03PS2) 10Jdlrobson: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) [18:56:51] (03CR) 10Jbond: [C: 03+2] O:pki::root: add new intermidiate CA cloud_wmnet_ca [puppet] - 10https://gerrit.wikimedia.org/r/746935 (owner: 10Jbond) [18:57:47] (03CR) 10Jforrester: "recheck" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745249 (owner: 10Kormat) [18:59:01] (03CR) 10jerkins-bot: [V: 04-1] wmfdb/section: Add class for handling of sections. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745249 (owner: 10Kormat) [19:00:04] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T1900). Please do the needful. [19:00:05] cjming, nemo-yiannis, nn1l2, and MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:10] hi [19:00:11] * urbanecm waves [19:00:15] hey [19:00:20] 👋 [19:00:27] hi [19:01:06] i can deploy today [19:01:39] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Chapter-ThOrg-Applications mailing list request - https://phabricator.wikimedia.org/T297622 (10DNdubane_WMF) Thank you so much for the express service! [19:01:41] (03CR) 10Urbanecm: [C: 03+2] Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746912 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:01:47] (03CR) 10Urbanecm: [C: 03+2] Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746913 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:02:13] (03PS4) 10Urbanecm: Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [19:02:18] (03CR) 10Urbanecm: [C: 03+2] Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [19:05:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:09] taking long time to merge [19:08:55] urbanecm: Yeah, someone's written a huge stack of termbox patches which is taking up almost all of CI. :-( [19:09:19] i thought gate-and-submit has higher priority? [19:09:24] anyway, started running now [19:09:30] It does, but we don't cancel running jobs. 
[19:09:46] So if 50 jobs that each take 20 minutes are already started… [19:09:53] (03Merged) 10jenkins-bot: Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [19:09:54] i see [19:10:15] cjming: your patch is at mwdebug1001 [19:10:28] looking [19:10:59] (03PS3) 10Urbanecm: kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:11:03] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bullseye [19:11:03] (03CR) 10Urbanecm: [C: 03+2] kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs... [19:11:11] nemo-yiannis: you're next; will ping you once patch is ready for testing [19:11:20] sounds good [19:11:28] cc mbsantos [19:12:01] (03Merged) 10jenkins-bot: kartographer: Enable tegola on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746869 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:12:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:04] sorry about those termbox changes, unfortunate timing :( but they shouldn’t take as long as Wikibase changes, at least [19:13:06] (03PS1) 10Jbond: O:pki::multirootca: Add new cloud_wmnet_ca [puppet] - 10https://gerrit.wikimedia.org/r/746939 [19:13:12] urbanecm: gtg [19:13:17] thanks, syncing [19:14:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1020.eqiad.wmnet with OS bullseye [19:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye executed with errors: - lvs... [19:15:51] (03CR) 10Jbond: [C: 03+2] O:pki::multirootca: Add new cloud_wmnet_ca [puppet] - 10https://gerrit.wikimedia.org/r/746939 (owner: 10Jbond) [19:16:08] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:21] (03PS6) 10Urbanecm: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [19:16:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bb9894271fb4faff4d95ab3b90398143bc0bfa59: Fix format of VectorWebABTestEnrollment (T295972) (duration: 00m 57s) [19:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:02] T295972: Deploy sticky header to office wiki and test wiki - https://phabricator.wikimedia.org/T295972 [19:17:03] cjming: should be live [19:17:13] nemo-yiannis: your patch is at mwdebug1001 [19:17:15] great - thanks [19:17:15] can you have a 
look? [19:18:51] (03Abandoned) 10Ladsgroup: [WIP] Re-architecture auto_schema [software] - 10https://gerrit.wikimedia.org/r/744042 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [19:18:56] nn1l2: i prefer having more time to review your patch than what i have during a busy B&C like this. Can we leave that for a future, less-crowded window please? [19:18:57] urbanecm: looks ok [19:19:06] (having a +1 would be helpful here, too :)) [19:19:11] ok [19:19:13] thanks nemo-yiannis, syncing [19:19:19] is tomorrow good? [19:19:22] thanks urbanecm [19:19:33] nn1l2: hopefully! :) [19:20:14] MatmaRex: hello, just to double check, are the config patches depending on the backports? [19:20:21] or can we do them before? [19:20:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fd325c55428b54a3c3e6a16bdd1b895d038dbecb: kartographer: Enable tegola on ruwiki (T280767) (duration: 00m 57s) [19:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:59] urbanecm: either order is fine [19:21:00] T280767: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 [19:21:06] thanks MatmaRex [19:21:27] (03PS2) 10Urbanecm: Re-apply "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746914 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:21:44] (03CR) 10Urbanecm: [C: 03+2] Re-apply "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746914 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:22:33] (03Merged) 10jenkins-bot: Re-apply "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746914 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:23:00] MatmaRex: your patch is at mwdebug1001 now, can you test? 
[19:23:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:54] looking [19:24:06] (03PS3) 10Urbanecm: Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) (owner: 10DLynch) [19:24:12] (03CR) 10Urbanecm: [C: 03+2] Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) (owner: 10DLynch) [19:24:21] (03CR) 10Jbond: [C: 03+1] rabbitmq: Add support for listening on TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745199 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [19:24:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:57] (03Merged) 10jenkins-bot: Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/746912 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:25:00] (03Merged) 10jenkins-bot: Check VisualEditorDisableForAnons in getEditPageEditor() [extensions/VisualEditor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746913 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:25:45] urbanecm: hmm actually, i think it actually depends on the backports [19:25:54] (03CR) 10Dzahn: [C: 03+2] "also checked with Rob and Papaul. 
going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [19:25:57] (03Merged) 10jenkins-bot: Enable A/B test for discussiontools new topic tool on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746931 (https://phabricator.wikimedia.org/T291307) (owner: 10DLynch) [19:26:02] which conveniently just merged [19:26:32] can we do them first? [19:27:10] MatmaRex: sure [19:27:26] sorry for the delay, got disconnected for a while [19:27:49] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [19:28:15] MatmaRex: both are at mwdebug1001, together with the config patch [19:28:16] can you test now? [19:28:31] yeah, looking [19:30:14] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Dzahn) [19:31:15] urbanecm: ugh, i think the config patch is actually wrong [19:31:29] MatmaRex: should i revert, or is a fix easy? [19:31:38] or i'm seeing the wrong version of the code [19:31:57] let me double check everything's in order first [19:33:04] it's definitely at mwdebug1001 [19:33:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:33:13] !log mdipietro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1012.eqiad.wmnet [19:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:21] urbanecm: wmf.12? 
[19:34:30] apparently we rolled it forward [19:34:37] news for me [19:34:40] but i checked both [19:35:15] line 362 of includes/VisualEditorHooks.php is definitely `if ( $config->get( 'VisualEditorDisableForAnons' ) && !$user->isRegistered() ) {` [19:35:30] (at mwdebug1001) [19:35:31] okay. there has to be some bug there, i'll need to investigate [19:35:37] okay [19:35:41] do i revert both config and backports? [19:35:43] or just one of them? [19:35:54] just config please, backports should be fine [19:35:57] okay [19:36:49] (fingers were crossed and the train rolled forward an hour ago https://sal.toolforge.org/log/EkIKtX0B1jz_IcWut0J1) [19:36:58] (03PS1) 10Urbanecm: Revert "Re-apply "VE on zh.wiki: Enable single-edit-tab mode"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746942 (https://phabricator.wikimedia.org/T296269) [19:37:00] (03CR) 10Urbanecm: [C: 03+2] Revert "Re-apply "VE on zh.wiki: Enable single-edit-tab mode"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746942 (https://phabricator.wikimedia.org/T296269) (owner: 10Urbanecm) [19:37:29] rolling the backports now [19:37:42] (03Merged) 10jenkins-bot: Revert "Re-apply "VE on zh.wiki: Enable single-edit-tab mode"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746942 (https://phabricator.wikimedia.org/T296269) (owner: 10Urbanecm) [19:37:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:00] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/VisualEditor/includes/VisualEditorHooks.php: 8144ab6577bdc93b4d81d5f8541437b746752610: Check VisualEditorDisableForAnons in getEditPageEditor() (T296269) (duration: 00m 56s)
[19:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:05] wmf.9 live
[19:39:05] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269
[19:39:23] doing wmf.12 now
[19:40:15] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/VisualEditor/includes/VisualEditorHooks.php: fa01addb8ae29a97f6e141ec37a5e1204f0d7810: Check VisualEditorDisableForAnons in getEditPageEditor() (T296269) (duration: 00m 56s)
[19:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:21] and wmf.12 live too
[19:40:34] MatmaRex: backports and the beta config done, reverted the zhwiki config
[19:40:44] thanks. and sorry
[19:40:46] (beta will apply within 30 minutes at most)
[19:40:52] no problem, it's what the testing is for 🙂
[19:41:38] i think everything's done now
[19:42:00] SRE, SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (Aklapper) Adding @komla as some data needs to be filled in above (user account registered on wikitech.wikimedia.org; separate SSH key; etc).
[19:42:27] (PS1) Volans: sre.hosts.reimage: add small sleep after reboot [cookbooks] - https://gerrit.wikimedia.org/r/746943
[19:42:51] !log UTC evening B&C window done
[19:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:33] (CR) Volans: [C: +2] sre.hosts.reimage: add small sleep after reboot [cookbooks] - https://gerrit.wikimedia.org/r/746943 (owner: Volans)
[19:47:38] !log mdipietro@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1012.eqiad.wmnet
[19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:07] !log mdipietro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1013.eqiad.wmnet
[19:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:06] (PS1) BBlack: lvs1017-20: switch to insetup_noferm role [puppet] - https://gerrit.wikimedia.org/r/746946 (https://phabricator.wikimedia.org/T295804)
[19:58:33] (CR) BBlack: [C: +2] lvs1017-20: switch to insetup_noferm role [puppet] - https://gerrit.wikimedia.org/r/746946 (https://phabricator.wikimedia.org/T295804) (owner: BBlack)
[20:00:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS buster
[20:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:03] SRE, ops-eqiad, DC-Ops, Traffic-Icebox, Patch-For-Review: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster
[20:01:26] (CR) Andrew Bogott: [C: +2] Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2] [puppet] - https://gerrit.wikimedia.org/r/745948 (https://phabricator.wikimedia.org/T289888) (owner: Andrew Bogott)
[20:01:34] (PS2)
Andrew Bogott: Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2] [puppet] - https://gerrit.wikimedia.org/r/745948 (https://phabricator.wikimedia.org/T289888)
[20:02:02] (PS1) Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947
[20:02:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS buster
[20:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:30] SRE, ops-eqiad, DC-Ops, Traffic-Icebox, Patch-For-Review: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster
[20:03:20] (CR) jerkins-bot: [V: -1] puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 (owner: Jbond)
[20:05:09] (CR) Jbond: "still wip" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 (owner: Jbond)
[20:11:53] (PS1) Accraze: ml-services: update revscoring-articlequality img [deployment-charts] - https://gerrit.wikimedia.org/r/746949 (https://phabricator.wikimedia.org/T293331)
[20:15:55] (PS1) Andrew Bogott: Added openstack client package for bullseye [puppet] - https://gerrit.wikimedia.org/r/746950
[20:15:55] !log mdipietro@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1013.eqiad.wmnet
[20:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:48] !log mdipietro@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1014.eqiad.wmnet
[20:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:47] (CR) Andrew Bogott: [C: +2] Added openstack client package for bullseye [puppet] -
https://gerrit.wikimedia.org/r/746950 (owner: Andrew Bogott)
[20:22:45] !log deployed patch for T297322
[20:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:07] !log mdipietro@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1014.eqiad.wmnet
[20:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:42] ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudvirt10[2,3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T296792 (mdipietro) a:mdipietro→wiki_willy
[20:43:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:44] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1017.eqiad.wmnet with OS buster
[20:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:48] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster executed with errors: - lvs10...
[20:45:56] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1020.eqiad.wmnet with OS buster
[20:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:00] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster executed with errors: - lvs10...
[20:48:58] SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudvirt10[2,3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T296792 (wiki_willy) a:wiki_willy→Cmjohnson
[20:51:08] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[20:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS buster
[20:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:12] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster
[20:55:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS buster
[20:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:39] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster
[20:57:06] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T2100).
[21:03:27] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (Andrew) update; It seems we aren't ready to run grafana on bullseye yet so I'm rolling these back to Buster
[21:04:27] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1003.eqiad.wmnet with OS buster
[21:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:34] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudme...
[21:06:49] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_sfacctd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:40] (CR) Krinkle: [C: +1] "Result of my dig (also left an inline nit about search/discovery)" [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy)
[21:16:19] (CR) Krinkle: [C: +1] Choose wikiversions.php file relative to MWMultiVersion.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/743038 (owner: Ahmon Dancy)
[21:17:21] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS buster
[21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:45] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster completed: - lvs1017 (**PASS*...
[21:21:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1004.eqiad.wmnet with OS buster
[21:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:17] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudme...
[21:21:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1020.eqiad.wmnet with OS buster
[21:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:28] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster completed: - lvs1020 (**PASS*...
[21:36:48] (PS3) Kosta Harlan: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273)
[21:38:12] (PS5) Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - https://gerrit.wikimedia.org/r/743038
[21:38:14] (PS2) Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836
[21:39:08] (PS3) Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836
[21:44:31] (CR) Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy)
[21:45:39] SRE, ops-codfw, DC-Ops, serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (wiki_willy) Hi @Joe - just following up on this. Can we get any specific racking criteria for you on this install task?
Thanks, Willy
[21:46:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1003.eqiad.wmnet with OS buster
[21:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:45] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetric...
[21:49:16] (CR) MewOphaswongse: [C: +1] wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[21:51:34] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:54:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1004.eqiad.wmnet with OS buster
[21:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:53] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudmetric...
[21:58:24] Hey all - I wanted to get the sec patch for T297571 deployed here. Only one for today's window.
[21:59:10] ok w/ me
[22:00:04] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211213T2200).
[22:00:34] (CR) Andrew Bogott: [C: +2] cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host [puppet] - https://gerrit.wikimedia.org/r/745949 (https://phabricator.wikimedia.org/T289888) (owner: Andrew Bogott)
[22:00:39] (PS2) Andrew Bogott: cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host [puppet] - https://gerrit.wikimedia.org/r/745949 (https://phabricator.wikimedia.org/T289888)
[22:02:02] !log deployed security patch for T297571 (sync-file 1)
[22:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:06] !log deployed security patch for T297571 (sync-file 2)
[22:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:08] PROBLEM - Check systemd state on logstash2004 is CRITICAL: CRITICAL - degraded: The following units failed: logstash.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:26] RECOVERY - Check systemd state on logstash2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:56] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[22:21:43] SRE, ops-codfw, Infrastructure-Foundations, netops, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (Papaul) The 2nd NIC is connected to port ge-1/0/34 only thing left is to do the config in Netbox.
[22:37:40] (PS3) Jdlrobson: Default commons search experience is MediaSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484)
[22:48:30] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:53:55] (PS3) JHathaway: copernicium: rename to mirror1001 [puppet] - https://gerrit.wikimedia.org/r/745920 (https://phabricator.wikimedia.org/T297508)
[22:54:28] (PS4) JHathaway: copernicium: rename to mirror1001 [puppet] - https://gerrit.wikimedia.org/r/745920 (https://phabricator.wikimedia.org/T297508)
[22:56:28] (CR) JHathaway: [C: +2] copernicium: rename to mirror1001 [puppet] - https://gerrit.wikimedia.org/r/745920 (https://phabricator.wikimedia.org/T297508) (owner: JHathaway)
[23:06:42] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye
[23:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:50] SRE, Infrastructure-Foundations, Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye
[23:10:30] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[23:12:42] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[23:28:06] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:28:17] !log jhathaway@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host
mirror1001.wikimedia.org with OS bullseye
[23:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:23] SRE, Infrastructure-Foundations, Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with...
[23:29:13] (PS1) Cwhite: logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468)
[23:34:01] (PS2) Cwhite: logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468)
[23:36:52] (PS3) Cwhite: logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468)
[23:37:35] (PS4) Cwhite: logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468)
[23:38:38] (PS5) Cwhite: logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468)
[23:39:50] (CR) Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/32981/" [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: Cwhite)
[23:43:15] (CR) Andrew Bogott: Add initial script to manage/automate cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[23:44:24] (CR) Cwhite: [C: +1] prometheus: pin discovery probes to their site [puppet] - https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[23:47:02] (CR) Andrew Bogott: Add simple script to backup cinder volumes
according to yaml config (4 comments) [puppet] - https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) (owner: Andrew Bogott)
[23:48:02] (PS3) Andrew Bogott: Add initial script to manage/automate cinder backups [puppet] - https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429)
[23:48:04] (PS2) Andrew Bogott: Add simple script to backup cinder volumes according to yaml config [puppet] - https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429)
[23:52:30] (CR) Herron: [C: +1] logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: Cwhite)
[23:56:30] !log joining restbase2026-a to cassandra cluster
[23:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:29] (CR) Herron: [C: +1] prometheus: pin discovery probes to their site [puppet] - https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[23:58:56] (CR) Herron: [C: +1] prometheus: remove job unavailable alert [puppet] - https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[23:59:44] (CR) Herron: [C: +1] team-sre: port job unavailable alert [alerts] - https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)