[00:41:25] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:51:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:05:47] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [01:07:55] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [01:22:53] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [01:27:13] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:06:19] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:08:33] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:44:01] PROBLEM - 
HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:46:15] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/ [04:32:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:44:39] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743484 (https://phabricator.wikimedia.org/T296715) (owner: 10Herron) [08:13:44] !log installing remaining icu security updates on buster [08:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:18] 10SRE, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10fgiunchedi) >>! In T212231#7546916, @Dzahn wrote: > The unused classes > > diamond::collector::servicestats > diamond::collector::servicestats_lib > > still exist and pop up in T272559 These can be safely... 
[08:36:35] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:43] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:51] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:53:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] role::kafka::main: use fixed uid/gid in the codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/743351 (https://phabricator.wikimedia.org/T296982) (owner: 10Elukey) [08:58:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] rsyslog: allow the spool directory to be world-writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/743189 (owner: 10Giuseppe Lavagetto) [08:58:49] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] rsyslog: allow the spool directory to be world-writable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/743189 (owner: 10Giuseppe Lavagetto) [09:01:01] (03CR) 10Jelto: [C: 03+2] site: use gitlab_runner role on gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/743459 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:03:34] (03PS1) 10Giuseppe Lavagetto: rsyslog: do not chmod a non-existent directory [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/743908 [09:06:02] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] rsyslog: do not chmod a non-existent directory [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/743908 (owner: 10Giuseppe Lavagetto) [09:07:53] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: use fixed uid/gid in the codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/743351 (https://phabricator.wikimedia.org/T296982) (owner: 10Elukey) [09:09:01] !log move kafka main codfw to fixed uid/gid for the kafka user (requires a stop/start of all daemons) 
- T296982 [09:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:06] T296982: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 [09:11:50] 10SRE, 10Observability-Metrics, 10Traffic, 10Patch-For-Review, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) p:05High→03Low In the last 24 hours we had just one overrun on 4 nodes: ` Dec 05 20:59:55 cp3060 varn... [09:12:31] kafka-main2001 done [09:15:31] (will wait for the broker to recover before proceeding) [09:16:37] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 1835 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:17:48] <_joe_> uhm [09:18:01] yeah I was about to say that [09:18:02] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200 [09:18:05] <_joe_> elukey: the coincidence is not great, let me take a look [09:18:47] <_joe_> uhm parsoid eqiad heh [09:18:50] <_joe_> just timeouts? [09:20:06] unrelated timeouts or kafka-related? [09:20:38] <_joe_> let's look at the slowlog [09:21:06] (03PS1) 10Ema: cache: enable single backend experiment on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/743910 (https://phabricator.wikimedia.org/T288106) [09:21:33] <_joe_> seem mostly during parsing [09:21:42] <_joe_> does the start time of the error coincide with your work? 
[09:22:41] <_joe_> a combination of [09:22:51] <_joe_> OOM, timeouts, related exceptions [09:23:00] <_joe_> so no, definitely not tied to your work [09:23:05] yeah it lines up more or less, stop broker --> start broker [09:23:07] <_joe_> also started a few minutes after [09:23:20] it is a very weird coincidence though [09:23:31] <_joe_> it's a specific parsing request I guess [09:23:37] <_joe_> series of requests [09:24:10] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/743910 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [09:35:20] PROBLEM - DPKG on people2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:36:19] proceeding with kafka-main2002 [09:37:10] done, waiting for the broker's recovery [09:41:00] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:42:55] (03PS1) 10Btullis: Configure the kafka jumbo cluster to use a fixed uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/743914 (https://phabricator.wikimedia.org/T296982) [09:45:09] I'm planning on making the same change to the kafka-jumbo cluster today. Would anyone prefer me to defer it until kafka-main in codfw has been completed, or is it OK for me to proceed? [09:52:41] Scratch that last comment. I'll defer the work on Kafka-jumbo until the new year.
[09:55:02] <3 [09:55:10] proceeding with kafka-main2004 [09:58:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS buster [09:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:58] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster [10:02:58] and kafka-main2005 done :) [10:11:33] (03CR) 10Elukey: "LGTM! Left a non-blocking nit :)" [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis) [10:19:40] (03PS1) 10Kormat: cumin: Remove obsolete+redundant alias for db-backup-source [puppet] - 10https://gerrit.wikimedia.org/r/743918 (https://phabricator.wikimedia.org/T296285) [10:21:46] (03CR) 10Kormat: [C: 03+2] cumin: Remove obsolete+redundant alias for db-backup-source [puppet] - 10https://gerrit.wikimedia.org/r/743918 (https://phabricator.wikimedia.org/T296285) (owner: 10Kormat) [10:23:32] !log draining primary/secondary instances off ganeti2015 T296622 [10:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:37] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [10:24:29] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Kormat) 05Open→03Resolved >>! In T296285#7545417, @jcrespo wrote: > ` > from:... 
[10:24:46] (03PS1) 10Giuseppe Lavagetto: deployment_server: update rsyslog image version [puppet] - 10https://gerrit.wikimedia.org/r/743919 [10:26:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: update rsyslog image version [puppet] - 10https://gerrit.wikimedia.org/r/743919 (owner: 10Giuseppe Lavagetto) [10:28:28] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:40] <_joe_> jouncebot: next [10:28:41] In 1 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1200) [10:28:49] (03PS1) 10Filippo Giunchedi: service: add public_aliases list [puppet] - 10https://gerrit.wikimedia.org/r/743921 [10:28:51] (03PS1) 10Filippo Giunchedi: service: add public alias for grafana-rw [puppet] - 10https://gerrit.wikimedia.org/r/743922 [10:31:10] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:41] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) I should have realised this on Friday, but I think we can be sure the DHCP responses from install1003 are making it to the host. Initially the host sends a DH... 
[10:36:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS buster [10:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:23] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster completed: - ganeti2011 (**PASS**) - Removed from Puppet... [10:38:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage [10:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:21] (03PS1) 10Majavah: beta: WRITE_BOTH for centralauth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743923 (https://phabricator.wikimedia.org/T289068) [10:39:54] jouncebot: nowandnext [10:39:54] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [10:39:55] In 1 hour(s) and 20 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1200) [10:40:26] (03CR) 10Ladsgroup: [C: 03+1] beta: WRITE_BOTH for centralauth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743923 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [10:40:55] ^ I'll merge a beta-only change [10:41:08] (03CR) 10Majavah: [C: 03+2] beta: WRITE_BOTH for centralauth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743923 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [10:41:54] (03Merged) 10jenkins-bot: beta: WRITE_BOTH 
for centralauth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743923 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [10:46:39] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) @JAllemandou This is great, thanks! Note that we can tune sampling to adapt. What would be the next steps? [11:12:05] (03PS1) 10Btullis: Increase envoy upstream timeout for superset [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) [11:12:37] !log draining primary/secondary instances off ganeti2016 T296622 [11:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [11:16:42] (03PS1) 10Ladsgroup: wmcs: Change maintain-views to prepare for schema change [puppet] - 10https://gerrit.wikimedia.org/r/743948 (https://phabricator.wikimedia.org/T297094) [11:16:54] (03CR) 10Elukey: [V: 03+1] "Arzhel/Cathal - Ok to proceed? Any special procedure to restart pmacct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [11:18:19] (03CR) 10Elukey: "From pcc: parameter 'upstream_response_timeout' expects a Float value, got Integer :)" [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) (owner: 10Btullis) [11:19:53] (03PS2) 10Btullis: Increase envoy upstream timeout for superset [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) [11:21:46] !log dropping wikiadmin@localhost from all of s2 (T296511) [11:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:52] T296511: Drop wikiadmin@localhost MySQL user from core dbs - https://phabricator.wikimedia.org/T296511 [11:22:15] (03CR) 10Ayounsi: [C: 03+1] netflow: move kafka config to new CA bundle (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [11:22:27] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32819/console" [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) (owner: 10Btullis) [11:23:25] (03CR) 10Elukey: [C: 03+1] Increase envoy upstream timeout for superset [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) (owner: 10Btullis) [11:23:41] (03CR) 10Btullis: [V: 03+1 C: 03+2] Increase envoy upstream timeout for superset [puppet] - 10https://gerrit.wikimedia.org/r/743947 (https://phabricator.wikimedia.org/T294771) (owner: 10Btullis) [11:24:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:02] (03CR) 10Ladsgroup: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/743948 (https://phabricator.wikimedia.org/T297094) (owner: 
10Ladsgroup) [11:28:16] !log dropping wikiadmin@localhost from all of s3 (T296511) [11:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:21] T296511: Drop wikiadmin@localhost MySQL user from core dbs - https://phabricator.wikimedia.org/T296511 [11:28:24] poor wikiadmins :( [11:31:00] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:28] they had their fun, now it's time to go home [11:41:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [11:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [11:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:43] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:56:54] (03CR) 10Btullis: Refactor superset caching to enable dual caches (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1200). [12:00:05] kart_, nn1l2, Juan_90264, and Urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] o/ [12:00:09] hey! 
I can deploy today [12:01:07] oh hey, congrats on becoming a deployer \o/ [12:01:16] thanks :D [12:02:18] kart_: nn1l2: *ping* [12:02:23] hi [12:02:36] hi! let's start with your patch then [12:02:45] (03PS2) 10Majavah: hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [12:03:02] (03CR) 10Majavah: [C: 03+2] hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [12:03:36] * urbanecm around too [12:05:05] (03Merged) 10jenkins-bot: hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [12:05:49] nn1l2: your patch is live on mwdebug1001.eqiad.wmnet, can you test please? [12:06:57] LGTM [12:07:08] thanks, syncing [12:07:45] majavah: Sorry, zoned out and missed ping. But, I'm here. [12:07:48] (Now) [12:07:54] I'm present [12:07:58] majavah: please ping me when you're done (unless you wish to do the aliases too!) [12:08:12] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742833|hewiki: add "templateeditor" permission group (T296769)]] (duration: 00m 57s) [12:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:19] T296769: hewiki: add "templateeditor" permission group - https://phabricator.wikimedia.org/T296769 [12:08:26] nn1l2: your patch is live! [12:09:14] kart_: hi, do you want to self-service or should I deploy the patches for you? [12:09:35] majavah: If you're OK, Please deploy. I'll test. 
[12:09:40] sure [12:09:55] (03PS3) 10Majavah: Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743158 (https://phabricator.wikimedia.org/T285842) (owner: 10KartikMistry) [12:10:02] (03CR) 10Majavah: [C: 03+2] Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743158 (https://phabricator.wikimedia.org/T285842) (owner: 10KartikMistry) [12:10:38] Majavah, are you training? [12:10:55] Thanks, majavah [12:11:23] (03Merged) 10jenkins-bot: Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743158 (https://phabricator.wikimedia.org/T285842) (owner: 10KartikMistry) [12:11:54] kart_: your patch is on mwdebug1001, can you test please? [12:12:00] Sure. Testing. [12:13:07] Juan_90264: not in the official training process (https://wikitech.wikimedia.org/wiki/Deployments/Training), but I wouldn't yet do this without more experienced people around [12:13:16] got access last week [12:14:20] Okay [12:14:43] majavah: looks good! Please deploy. 
[12:14:54] sure [12:15:31] (03PS4) 10Majavah: Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:15:44] (03CR) 10Majavah: [C: 03+2] Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:15:46] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743158|Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs (T285842)]] (duration: 00m 56s) [12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] T285842: Enable Section Translation for Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian Wikipedias - https://phabricator.wikimedia.org/T285842 [12:16:20] Thanks majavah ! [12:17:46] (03Merged) 10jenkins-bot: Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:17:59] Great merged [12:18:22] Juan_90264: the first patch is on mwdebug1001, can you test please? 
[12:18:34] Yes, I can [12:21:35] majavah, I tested and approved [12:21:42] syncing [12:22:16] (03PS4) 10Majavah: Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:22:35] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743528|Enable groups autopatrolled and patroller for bnwikivoyage (T296637)]] (duration: 00m 56s) [12:22:38] (03CR) 10Majavah: [C: 03+2] Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:44] T296637: Requesting features for bnwikivoyage - https://phabricator.wikimedia.org/T296637 [12:24:10] (03Merged) 10jenkins-bot: Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637) (owner: 10Juan90264) [12:24:31] About SandboxLink, I have created https://meta.wikimedia.org/wiki/Requests_for_comment/Enable_sandbox_for_all_Wikipedias Please consider voting. Thanks! [12:24:38] Juan_90264: next one is available for testing on mwdebug1001 [12:24:54] Okay majavah [12:25:30] nn1l2: B&C time is not a good time to advertise RFCs [12:25:44] nn1l2, great initiative, I will consider voting [12:26:56] majavah, I tested and approved [12:27:07] great [12:27:51] urbanecm, I understand. At the moment, SandboxLink is being deployed at bnwikivoyage, and I thought it might be a good idea to tell others about this.
[12:28:00] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743529|Enable SandboxLink extension for bnwikivoyage (T296637)]] (duration: 00m 55s) [12:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] T296637: Requesting features for bnwikivoyage - https://phabricator.wikimedia.org/T296637 [12:29:14] (03CR) 10Majavah: [C: 04-1] "The task description states that "only administrators and autopatrollers [can] edit the protected page", but this patch seems to only add " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580) (owner: 10Juan90264) [12:29:30] Juan_90264: ^ left a comment on the last change, can you have a look? [12:31:14] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10ayounsi) Latest Junos recommended is 20.4R3-S1.3 I downloaded it to apt1001:/srv/junos/jinstall-ex-4300-20.4R3-S1.3-signed.tgz You can also find it on https://webdownload.juniper.net/swdl/dl/secure/si... [12:34:28] majavah, I've read it, and that won't be a problem with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/529043. This change standardizes this protection, leaving it automatically defined for administrators as well. [12:35:22] This change was standardized by Urbanecm [12:35:34] ah, indeed! 
thanks [12:35:56] (03CR) 10Majavah: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/529043 fixed the sysop issue => deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580) (owner: 10Juan90264) [12:36:06] (03PS7) 10Majavah: Enable Autopatroller level page protection for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580) (owner: 10Juan90264) [12:36:34] (03CR) 10Majavah: [C: 03+2] Enable Autopatroller level page protection for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580) (owner: 10Juan90264) [12:37:52] (03Merged) 10jenkins-bot: Enable Autopatroller level page protection for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580) (owner: 10Juan90264) [12:38:26] Juan_90264: the last patch is available for testing on mwdebug1001 [12:38:44] Okay [12:40:29] In testing, the system already automatically allowed administrators to edit as well. https://en.wiktionary.org/wiki/Special:ListGroupRights [12:40:50] syncing [12:41:11] majavah, I tested and approved late [12:41:58] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743533|Enable Autopatroller level page protection for English Wiktionary (T296580)]] (duration: 00m 56s) [12:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:03] T296580: Autopatroller level page protection on English Wiktionary - https://phabricator.wikimedia.org/T296580 [12:42:05] that one is live too [12:42:32] urbanecm: I'm done with everything else, let's do the namespace aliases next? [12:43:17] yup! [12:43:21] majavah: want to do it, or should i? 
[12:44:13] I can do it but you'll need to help with the namespaceDupes.php run [12:44:24] wfm [12:44:31] so feel free to +2 and fetch to mwdebug :) [12:44:38] (03PS2) 10Majavah: Set default two-letter NS_PROJECT aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [12:44:50] (03CR) 10Majavah: [C: 03+2] "deploying!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [12:46:01] (03Merged) 10jenkins-bot: Set default two-letter NS_PROJECT aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734383 (https://phabricator.wikimedia.org/T293839) (owner: 10Urbanecm) [12:46:27] urbanecm: live on mwdebug1001 [12:46:51] Working, thanks majavah! [12:47:02] happy to help :-) [12:47:27] majavah: works, https://sk.wikipedia.org/wiki/WP:K?noredirect=1 now says "page does not exist" (as the page is in NS0) [12:47:31] please sync [12:47:47] sure [12:48:44] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734383|Set default two-letter NS_PROJECT aliases (T293839)]] (duration: 00m 55s) [12:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:48] T293839: Set default namespace aliases for projects - https://phabricator.wikimedia.org/T293839 [12:49:38] majavah: so, over to me i assume? 
[12:49:39] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [12:49:50] PROBLEM - ganeti-confd running on ganeti2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:50:06] anything works for me, I can also try [12:50:14] if you want :) [12:50:18] try skwiki first please [12:50:25] (it's the project i tested on) [12:50:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2011.codfw.wmnet with reason: readding to cluster after reimage [12:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2011.codfw.wmnet with reason: readding to cluster after reimage [12:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] 382 links to fix, 382 were resolvable, 0 were deleted. [12:50:54] (dry run) [12:51:07] I guess I can do it with --fix now? [12:51:09] yes [12:51:18] side note: why does it say "Oh noeees"? [12:51:26] full output please [12:51:29] there may be conflicts [12:51:40] (feel free to run it with --fix anyway, we'll fix those in second round) [12:53:05] umm, I tried that and it failed with "Error 1062: Duplicate entry '4-\xC5\xA0' for key 'page_name_title" [12:53:12] meh [12:53:21] known though, let me find the task... 
[12:53:42] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:54:05] !log mwscript namespaceDupes.php --wiki skwiki --fix # T293839 [12:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:10] T293839: Set default namespace aliases for projects - https://phabricator.wikimedia.org/T293839 [12:54:38] well...https://phabricator.wikimedia.org/T293407 is what i was looking for, but it claims to be resolved [12:55:31] 13:51 side note: why does it say "Oh noeees"? <== at the start of the output, you'll see `id=594576 ns=0 dbk=WP:2NNSZ *** dest title exists and --add-prefix not specified` [12:55:53] (03PS1) 10Jelto: gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) [12:56:02] https://sk.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:K&redirect=no is now recognized at least [12:56:36] I see that now, at the end it said "382/382 were resolvable" and I didn't think it had a first block of output + numbers above the second one [12:56:49] yup, a "bit" confusing :)) [12:57:16] (03CR) 10jerkins-bot: [V: 04-1] gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [12:57:24] majavah: mind running skwiki with --add-prefix=BROKEN (or something else), so we can finish the rest? I can't reproduce the duplicate entry error now [12:57:38] yes, was just about to ask about that [12:57:56] done [12:58:35] works: https://sk.wikipedia.org/wiki/%C5%A0peci%C3%A1lne:V%C5%A1etkyStr%C3%A1nky/Wikipedia:BROKEN [12:58:43] mind !_log'ing?
:)) [12:58:47] !log mwscript namespaceDupes.php --wiki skwiki --fix --add-prefix=BROKEN # T293839 [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] thanks [12:59:38] (03PS2) 10Jelto: gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) [12:59:55] I guess a foreachwiki dry-run next? [13:00:08] i would pick first other wikis in https://people.wikimedia.org/~urbanecm/onetime/global_aliases_grfc/wpPagesInWikipedias.html first [13:00:18] (those with non-zero pagesWithPrefix, of course) [13:00:21] (03CR) 10jerkins-bot: [V: 04-1] gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:00:59] barwiki appears to be a nice candidate (36 pages) [13:01:07] ok [13:01:20] * urbanecm is curious whether the duplicate entry thing will happen again [13:02:16] it has a few invalid pagelinks: "pagelinks from=131103 ns=0 dbk=WP: *** INVALID" [13:02:25] in theory it should delete those [13:03:00] !log $ mwscript namespaceDupes.php --wiki barwiki --fix --add-prefix=BROKEN # T293839 [13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:05] T293839: Set default namespace aliases for projects - https://phabricator.wikimedia.org/T293839 [13:04:30] appears to have worked fine, https://bar.wikipedia.org/wiki/Spezial:Alle_Seiten?from=BROKEN&to=&namespace=4 and pages like https://bar.wikipedia.org/wiki/WP:BOT work [13:04:39] great! [13:05:31] what next? [13:05:43] dry run w/o --add-prefix at all wikis? [13:05:52] sure [13:06:00] majavah: can you run it in a script session? [13:06:03] (so we have a file with log) [13:06:43] (also, i don't see you at mwmaint1002 -- where are you running those?) 
[13:07:12] ummm, I've been accidentally running them on deploy1002 [13:07:33] * majavah switches servers [13:07:36] thanks [13:07:58] (03PS4) 10Jelto: gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) [13:11:20] I'm tailing the file, so far things look good to me [13:14:06] (03PS1) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:14:08] (03PS1) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:14:10] (03PS1) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:15:12] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:27] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:18:40] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 1 days) https://phabricator.wikimedia.org/tag/toolforge/ [13:20:13] Wow, the certificate will expire... [13:20:18] <_joe_> vgutierrez: is that ncredir? 
[13:20:39] _joe_: no, that's on toolforge, I'll have a look [13:20:54] (03PS2) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:20:56] (03PS2) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:20:58] (03PS2) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:21:05] it should probably alert on #wikimedia-cloud-feed and not here [13:21:07] <_joe_> majavah: thanks [13:21:14] that should be a acme-chief issued cert [13:21:39] majavah: the alert was alerting back and forth yesterday IIRC [13:21:43] (the toolserver.org one) [13:21:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:11] urbanecm: might be the bug where it doesn't reload properly post renew [13:22:17] I think lists saw it too [13:22:36] "Not After : Feb 6 03:00:07 2022 GMT" [13:22:49] https://phabricator.wikimedia.org/T293826 [13:22:51] yeah... a reload will fix that [13:22:51] I think the bot misinterpreted the expiration, here it appears that it will expire on February 6, 2022 [13:22:53] majavah: ^ [13:23:01] Juan_90264: it's a known issue [13:23:04] <_joe_> majavah: maybe one frontend doesn't reload? 
[13:23:06] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 61 days) https://phabricator.wikimedia.org/tag/toolforge/ [13:23:28] it's not load balanced [13:23:34] `root@toolserver-proxy-01:~# systemctl restart apache2.service` did the trick [13:24:32] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:25:26] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:05] (03PS3) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:27:07] (03PS3) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:27:09] (03PS3) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:28:04] I got disconnected. I want to know if B&C is still going on or is finally finished? [13:28:44] nn1l2: urbanecm and I are still rolling out the namespace alias patch [13:29:10] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [13:29:11] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:29:22] I have one question. Can I ask it now, or should I postpone it? 
[13:29:26] urbanecm: the dry run is finally over 50% complete by the count of wikis :P [13:29:31] feel free to [13:30:30] (03PS4) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:30:32] (03PS4) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:30:34] (03PS4) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:30:43] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2016.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [13:31:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2016.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [13:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:19] The patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/743533 had no +1 before being scheduled. Is having at least one +1 (code review) mandatory before scheduling? 
[13:31:58] RECOVERY - ganeti-confd running on ganeti2011 is OK: PROCS OK: 1 process with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:32:05] It's not a hard requirement for the config changes if it's one easily understood nn1l2 [13:32:14] As long as the deployer can be confident [13:33:00] the deployer will review all patches in any case when deploying, so not mandatory but recommended for non-trivial patches [13:35:01] I usually only do easy tasks, so I assume I don't need to wait for reviewing by third parties from now on. I sometimes waited for several days to some volunteers take up the reviewing task :) [13:35:52] nn1l2: if it's config changes feel free to add me to gerrit patch and I'll review [13:35:54] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:37:07] RhinosF1, thanks! I will do so :) [13:37:48] (03PS5) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:37:50] (03PS5) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:37:52] (03PS5) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:37:54] (03CR) 10jerkins-bot: [V: 04-1] alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:39:12] urbanecm: btw, current "Looks good"/"Oh noeees" ratio is 622 to 76 [13:41:13] for "Looks good"? 
not bad [13:43:09] majavah: we should get you to name all our metrics :D [13:43:32] kormat: not mine, it comes from mediawiki/core.git/maintenance/namespaceDupes.php [13:44:05] lol, nice! [13:44:16] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:44:24] 10SRE, 10SRE Observability (FY2021/2022-Q2): DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration - https://phabricator.wikimedia.org/T292603 (10Volans) @lmata almost another month has passed and we're approaching the holiday season, is there by any chance any news from their side? [13:45:07] (03PS1) 10MSantos: tegola: update layer_country_label function parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/743988 [13:45:46] (03PS2) 10MSantos: tegola: update layer_country_label function parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/743988 [13:45:49] * urbanecm sneaks in two config patches while waiting on aliases to finish [13:46:16] urbanecm: only 100 wikis remaining for the dry run :P [13:49:05] (03PS1) 10Urbanecm: Deploy Growth features on zhwiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743992 (https://phabricator.wikimedia.org/T287884) [13:49:21] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth features on zhwiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743992 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm) [13:51:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32824/console" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:52:43] !log [urbanecm@mwmaint1002 ~]$ mwscript 
extensions/WikimediaMaintenance/createExtensionTables.php --wiki=zhwiki growthexperiments # T287884 [13:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:48] T287884: Deploy Growth features on Chinese Wikipedia - https://phabricator.wikimedia.org/T287884 [13:53:21] urbanecm: should we get worried at the amount of links on viwiki? [13:53:25] (03Merged) 10jenkins-bot: Deploy Growth features on zhwiki in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743992 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm) [13:53:50] majavah: not really. There were cases when the script took over an hour without issues. [13:54:00] if you want to be extra careful, we can run it at viwiki first [13:54:19] (03PS2) 10Urbanecm: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) [13:54:23] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth mentor dashboard to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [13:55:26] the log file is currently at 5 222 962 lines, without viwiki it is 1 158 245 lines [13:56:03] i see. 
once it completes, let's do viwiki first and proceed with rest [13:56:19] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=zhwiki --phab=T287884 [13:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:11] (03PS7) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [13:59:54] (03PS6) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:59:56] (03PS6) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:59:58] (03PS6) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [14:00:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4d8a75d5f01e8e2cf724e19db2e9bcc12fb8f5f4: Deploy Growth features on zhwiki in dark mode (T287884) (duration: 00m 56s) [14:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:43] T287884: Deploy Growth features on Chinese Wikipedia - https://phabricator.wikimedia.org/T287884 [14:03:16] update: the entire log file just hit 10M lines [14:04:07] happens :)) [14:06:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS buster [14:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:56] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster [14:07:51] apparently this is some template that links to WP:N and is linked on tons of IP 
user talk pages [14:07:57] yeah [14:14:04] is it seriously now doing the exact same thing but with another page?? [14:14:35] :D [14:14:44] this is going to take hours [14:15:07] (03CR) 10Herron: [C: 03+2] admin: add ollieshotton to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/743484 (https://phabricator.wikimedia.org/T296715) (owner: 10Herron) [14:15:51] I count 5 instances of "WP:" on https://vi.wikipedia.org/w/index.php?title=Th%E1%BA%A3o_lu%E1%BA%ADn_Th%C3%A0nh_vi%C3%AAn:171.248.21.67&action=edit [14:16:09] (03CR) 10Jgiannelos: [C: 03+1] tegola: update layer_country_label function parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/743988 (owner: 10MSantos) [14:16:09] so we'll have 3 more pages? [14:16:18] I guess yes? [14:16:28] the first one took like half an hour [14:17:31] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [14:17:53] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2016. Ready to be powered off any time. [14:18:12] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) [14:19:17] !log draining primary/secondary instances off ganeti2012 T296622 [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [14:19:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) 05Open→03Resolved a:03herron Hi @Ollie.Shotton_WMDE your account has been added to the requested ldap groups. I'll transition this to resolved... 
[14:23:56] urbanecm: in case you are curious: https://phabricator.wikimedia.org/P18028 [14:24:21] quite heavily linked stuff [14:24:32] why do you have different prompt than the default of `sql` at toolforge majavah ? [14:24:50] if just the dry run takes half an hour, I wonder how much it will take with the actual run [14:25:11] can be easily a couple of hours [14:27:17] (03PS5) 10Jelto: gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) [14:27:36] urbanecm: because I like having the details in my prompt :-) it's actually a custom script in `~/bin/sql` that makes it use a custom config file for the prompt [14:27:44] i see :) [14:31:33] I'm going to step away for a bit, it's running in a screen and it should be fine to leave running as it's a dry run [14:31:50] SGTM [14:32:33] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32826/console" [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:32:56] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) > @BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow). Did you mean _not_ a hard... 
[14:34:56] (03PS2) 10Btullis: Refactor superset caching to enable dual caches [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) [14:36:26] (03PS3) 10Elukey: varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) [14:36:33] (03PS3) 10Elukey: netflow: move kafka config to new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) [14:37:34] (03PS4) 10Elukey: netflow: move kafka config to new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) [14:37:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10Aklapper) @herron: I've edited https://phabricator.wikimedia.org/project/profile/1564/ to state "Check and follow https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access ",... 
[14:41:36] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32827/console" [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:44:04] (03CR) 10ZPapierski: [C: 03+1] rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (owner: 10Ebernhardson) [14:44:35] (03CR) 10Elukey: [C: 03+2] netflow: move kafka config to new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:44:45] (03PS6) 10Jelto: gitlab_runner: create module for runner config and enable metrics [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) [14:45:06] !log roll restart of nfacctd on netflow* nodes to pick up the new CA bundle for librdkafka [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:19] Cc: XioNoX, topranks --^ [14:45:33] thx! [14:47:22] (03CR) 10Andrew Bogott: [C: 03+2] "thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/742211 (owner: 10Majavah) [14:47:28] (03PS4) 10Andrew Bogott: hieradata: remove old project-proxies [puppet] - 10https://gerrit.wikimedia.org/r/742211 (owner: 10Majavah) [14:47:36] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32828/console" [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:49:28] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32829/console" [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:51:03] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10herron) >>! In T296654#7549793, @Aklapper wrote: > is that what you had in mind? If not, what text exactly would you like to see in the form? I was thinking of appending the specifics as checklist... [14:54:36] (03PS1) 10Jelto: profile::gitlab-runner add registration_token for protected GitLab Runners [labs/private] - 10https://gerrit.wikimedia.org/r/744015 (https://phabricator.wikimedia.org/T295481) [14:54:47] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.90% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:55:59] XioNoX: on netflow1001 everything looks ok, is there a way to check in the netflow topic if data from 1001 is being sent? 
[14:56:28] (03CR) 10Yahya: "@" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [14:57:23] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=netflow looks fine, no drops, so I guess we are ok [14:57:34] also librdkafka doesn't emit errors in the logs [14:58:00] (03CR) 10MSantos: [C: 03+2] tegola: update layer_country_label function parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/743988 (owner: 10MSantos) [14:58:39] elukey: filter for eqiad routers [14:59:06] on turnilo [14:59:35] exporter name or exporter region [15:00:01] perfect [15:00:26] kafkacat confirms that all is fine [15:01:18] (03CR) 10Jelto: [V: 03+2 C: 03+2] profile::gitlab-runner add registration_token for protected GitLab Runners [labs/private] - 10https://gerrit.wikimedia.org/r/744015 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:02:24] (03Merged) 10jenkins-bot: tegola: update layer_country_label function parameters [deployment-charts] - 10https://gerrit.wikimedia.org/r/743988 (owner: 10MSantos) [15:06:49] XioNoX: all good! [15:07:00] (03CR) 10Btullis: "Setting to WIP and holding until the New Year." 
[puppet] - 10https://gerrit.wikimedia.org/r/743914 (https://phabricator.wikimedia.org/T296982) (owner: 10Btullis) [15:10:52] (03PS4) 10Ottomata: Deploy research_poc thanos swift auth env file to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) [15:12:12] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32831/console" [puppet] - 10https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: 10Ottomata) [15:13:08] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Deploy research_poc thanos swift auth env file to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/743214 (https://phabricator.wikimedia.org/T296945) (owner: 10Ottomata) [15:16:48] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10Ollie.Shotton_WMDE) Thank you! [15:19:13] (03PS1) 10Ottomata: profile::analytics::cluster::secrets - run as hdfs user [puppet] - 10https://gerrit.wikimedia.org/r/744019 (https://phabricator.wikimedia.org/T296945) [15:21:27] (03CR) 10Ottomata: [C: 03+2] profile::analytics::cluster::secrets - run as hdfs user [puppet] - 10https://gerrit.wikimedia.org/r/744019 (https://phabricator.wikimedia.org/T296945) (owner: 10Ottomata) [15:24:37] (03PS2) 104nn1l2: bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) [15:29:31] (03CR) 104nn1l2: bnwikibooks: add autopatrolled and patroller user groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [15:29:45] (03PS1) 10Ottomata: profile::analytics::cluster::secrets - fix for research_poc swift user [puppet] - 10https://gerrit.wikimedia.org/r/744022 (https://phabricator.wikimedia.org/T296945) [15:32:47] (03CR) 10Ottomata: [C: 
03+2] profile::analytics::cluster::secrets - fix for research_poc swift user [puppet] - 10https://gerrit.wikimedia.org/r/744022 (https://phabricator.wikimedia.org/T296945) (owner: 10Ottomata) [15:33:55] would someone be kind enough to update the channel topic? I am on clinic duty this week (till Thursday). thanks! [15:36:46] (03PS1) 10Kormat: dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 [15:37:33] (03PS1) 10Ssingh: P:wikidough: remove redundant space [puppet] - 10https://gerrit.wikimedia.org/r/744030 [15:38:37] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32833/console" [puppet] - 10https://gerrit.wikimedia.org/r/744030 (owner: 10Ssingh) [15:39:49] (03CR) 10jerkins-bot: [V: 04-1] dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [15:40:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10homer, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (10ayounsi) 05In progress→03Stalled Waiting for Capirca upstream to merge PRs. [15:40:32] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: remove redundant space [puppet] - 10https://gerrit.wikimedia.org/r/744030 (owner: 10Ssingh) [15:41:18] oh great. completely unrelated (i think?) 
mypy ci failures [15:42:37] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:43:12] (03PS2) 10Filippo Giunchedi: team-sre: port node-exporter textfile stale alert [alerts] - 10https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) [15:43:14] (03PS1) 10Filippo Giunchedi: team-sre: port job unavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/744033 (https://phabricator.wikimedia.org/T288726) [15:44:08] checking ml-serve2004 [15:44:27] the indexing failures is the knative/dev known logging problem btw [15:44:29] elukey: ^ [15:44:45] T288549 that is [15:44:46] T288549: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 [15:44:57] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:45:40] yes yes [15:46:03] (the "yesyes was for icinga not Filippo :D) [15:46:26] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:46:29] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:01] mmmm tg3 0000:01:00.0 eno3: Link is down [15:47:14] (03PS1) 10Filippo Giunchedi: prometheus: remove job unavailable alert [puppet] - 10https://gerrit.wikimedia.org/r/744035 (https://phabricator.wikimedia.org/T288726) [15:47:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch 
indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:50:42] papaul: o/ around by any chance? [15:51:27] smells like a faulty cable [15:52:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10joanna_borun) [15:52:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:55:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2012.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:55:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2012.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:38] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) > Did you mean _not_ a hard requirement? Yep, my bad :) [15:56:11] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:58] elukey: yes [15:59:12] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [15:59:13] urbanecm: it's finally done with viwiki! 
:D [15:59:18] 10ops-codfw, 10Machine-Learning-Team: Possible faulty cable between asw-d-codfw and ml-serve2004 - https://phabricator.wikimedia.org/T297126 (10elukey) [15:59:22] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2012. Ready to be powered off any time. [16:01:15] papaul: o/ hiii I just opened a task, there seems to be a weird connectivity issue between asw-d-codfw and ml-serve2004, if you are in the DC could you please check when you have a moment? [16:01:45] urbanecm: and now finally done with all the wikis! [16:01:50] elukey: ok [16:01:56] <3 [16:03:01] urbanecm: what's next? start a foreachwiki run with --fix but not --add-prefix? [16:03:17] (03PS1) 10Ottomata: profile::analytics::cluster::secrets - fix chmod for swift research_poc env file [puppet] - 10https://gerrit.wikimedia.org/r/744037 (https://phabricator.wikimedia.org/T296945) [16:03:41] (03CR) 10Ottomata: [V: 03+2 C: 03+2] profile::analytics::cluster::secrets - fix chmod for swift research_poc env file [puppet] - 10https://gerrit.wikimedia.org/r/744037 (https://phabricator.wikimedia.org/T296945) (owner: 10Ottomata) [16:07:16] 10SRE-swift-storage, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (10Ottomata) @fkaelin `sudo -u analytics-research kerberos-run-command analytics-research hdfs dfs -cat /user/analytics-re... 
[16:07:20] (03PS1) 10Ladsgroup: [WIP] Re-architecture auto_schema [software] - 10https://gerrit.wikimedia.org/r/744042 (https://phabricator.wikimedia.org/T288235) [16:08:19] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 35.04 ms [16:08:31] (03PS1) 10Esanders: Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) [16:08:47] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:11] 10ops-codfw, 10Machine-Learning-Team: Possible faulty cable between asw-d-codfw and ml-serve2004 - https://phabricator.wikimedia.org/T297126 (10Papaul) @elukey Replaced the cable ` Interface Admin Link Description ge-6/0/4 up up ml-serve2004`` ` note: If ml-serve200[1-4] are in service can y... [16:11:14] 10ops-codfw, 10Machine-Learning-Team: Possible faulty cable between asw-d-codfw and ml-serve2004 - https://phabricator.wikimedia.org/T297126 (10Papaul) 05Open→03Resolved a:03Papaul [16:11:26] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:13:15] (03PS1) 10Muehlenhoff: Add current OS upgrade estimation for restbase/sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/744046 [16:14:00] 10ops-codfw, 10Machine-Learning-Team: Possible faulty cable between asw-d-codfw and ml-serve2004 - https://phabricator.wikimedia.org/T297126 (10elukey) Applied the "Active" label to all nodes, thanks! 
[16:14:03] papaul: you rock thanks [16:20:37] (03CR) 10Kormat: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [16:21:47] (03CR) 10Varac: "Hej, do you mind reviewing my pull req ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [16:21:54] majavah: as you said, --fix but w/o add-prefix (I want to go through those semi-manually to review how the pages look like) [16:23:11] urbanecm: ok, starting now [16:23:28] majavah: can you log the start and end too, please? [16:24:26] !log starting "foreachwiki namespaceDupes.php --fix | tee namespaceDupes-T293839-fix.txt" in mwmaint1002 screen session, T293839 [16:24:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [16:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] T293839: Set default namespace aliases for projects - https://phabricator.wikimedia.org/T293839 [16:24:56] thanks [16:27:50] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10herron) p:05Triage→03Medium [16:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1630). [16:31:14] (03CR) 10Hnowlan: [C: 03+1] Add current OS upgrade estimation for restbase/sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/744046 (owner: 10Muehlenhoff) [16:33:52] (03CR) 10Klausman: dbutil: Make testing easier (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [16:34:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [16:36:04] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10herron) [16:40:41] (03CR) 10Kormat: dbutil: Make testing easier (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [16:43:54] (03CR) 10Klausman: dbutil: Make testing easier (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [16:45:57] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10Security: Process for granting wmf LDAP access is vulnerable to impersonation (after creating a Wikitech account with an unconfirmed email address) - https://phabricator.wikimedia.org/T259746 (10sbassett) [16:47:19] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10Security: Process for granting wmf LDAP access is vulnerable to impersonation (after creating a Wikitech account with an unconfirmed email address) - https://phabricator.wikimedia.org/T259746 (10sbassett) 05Open→03Resolved [16:48:50] (03PS1) 10Esanders: Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) [16:50:14] (03CR) 10jerkins-bot: [V: 04-1] Enable VE on zh.wiki, but only for logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744066 (https://phabricator.wikimedia.org/T296269) (owner: 10Esanders) [16:50:54] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10Security: Process for granting wmf LDAP access is vulnerable to impersonation (after creating a Wikitech account with an unconfirmed email address) - https://phabricator.wikimedia.org/T259746 (10nshahquinn-wmf) Thanks, @sbassett! 
[16:53:05] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:55:50] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) And one more question: Once the authentication is working, and some files are inserted with a `public-read` ACL - will these files be publicly accessible from outside the WMF? If... [16:59:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ayounsi) 05Open→03Resolved Alright, closing this for now then :) [17:06:41] (03CR) 10Dzahn: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:08:08] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [17:08:15] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/32834/gitlab-runner1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/743975 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:13:55] (03CR) 10Yahya: [C: 03+1] bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [17:17:35] (03PS1) 10Ebernhardson: Move cirrus traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744070 (https://phabricator.wikimedia.org/T296897) [17:21:10] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 
(https://phabricator.wikimedia.org/T297006) [17:22:12] going to deploy mediawiki-config to shift cirrus traffic to codfw. should move back in a few hours if all goes well [17:23:01] (03CR) 10Ebernhardson: [C: 03+2] Move cirrus traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744070 (https://phabricator.wikimedia.org/T296897) (owner: 10Ebernhardson) [17:24:58] (03Merged) 10jenkins-bot: Move cirrus traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744070 (https://phabricator.wikimedia.org/T296897) (owner: 10Ebernhardson) [17:27:02] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T296897 Move cirrus traffic to codfw (duration: 00m 56s) [17:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:07] T296897: Eqiad Geosearch API queries return errors on Commons - https://phabricator.wikimedia.org/T296897 [17:33:19] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10Dzahn) a:03Dzahn [17:33:32] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) [17:33:36] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10Dzahn) [17:33:39] (03CR) 10Volans: "Looks reasonable to me, see a couple of optional suggestions inline." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [17:35:46] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10Dzahn) This is basically T297017 but I take it because I was IC and interpret this ticket as the doc part, to write the public incident report and put it on Wikitech. (in addition to existi... 
[17:36:59] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: Netbox support for svc allocation - https://phabricator.wikimedia.org/T263429 (10Volans) 05In progress→03Open [17:37:10] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) 05In progress→03Resolved a:03Dzahn This became incident T297127 for which we will shortly release a public incident report (as part of the incident ticket, but will... [17:41:14] 10SRE-tools, 10Infrastructure-Foundations: Manage DHCP of Ganeti VMs from Netbox - https://phabricator.wikimedia.org/T297133 (10Volans) [17:41:27] 10SRE-tools, 10Infrastructure-Foundations: Manage DHCP of Ganeti VMs from Netbox - https://phabricator.wikimedia.org/T297133 (10Volans) p:05Triage→03Medium [17:43:05] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Dzahn) Try if the server can talk http to apt1001.wikimedia.org / apt2001.wikimedia.org. After getting an IP from DHCP but before starting the Debian installer it need... [17:44:13] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10herron) >>! In T297127#7550481, @Dzahn wrote: > This is basically T297017 but I take it because I was IC and interpret this ticket as the doc part, to write the public incident report and p... [17:46:29] urbanecm: the --fix just finished too, logs in mwmaint1002:~taavi. 
afaics the few highly-linked viwiki pages all conflicts and were not included in this run [17:46:31] 10SRE: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10herron) [17:46:57] 10SRE, 10Infrastructure-Foundations, 10Mail: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10herron) [17:52:34] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident - 2021-12-03: mx2001 - https://phabricator.wikimedia.org/T297127 (10Dzahn) [17:55:10] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) [17:58:29] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:55] majavah: great! I'll look at the logs tomorrow, thanks for the help [18:00:03] !log "foreachwiki namespaceDupes.php --fix | tee namespaceDupes-T293839-fix.txt" FINISHED about 15 minutes ago T293839 [18:00:04] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1800). [18:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:07] T293839: Set default namespace aliases for projects - https://phabricator.wikimedia.org/T293839 [18:09:32] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Dzahn) also might want to check logs on apt1001, it should be nginx there, around this: ` [apt1001:/var/log/nginx] $ grep preseed *.log access.log:10.192.32.142 - - [... 
[18:21:10] (03CR) 10Alex Paskulin: [C: 03+1] Disable CentralNotice on API portal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [18:21:54] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10Legoktm) [18:25:28] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) @Dzahn Thanks for the pointers. There is one log entry which seems to be from the affected host requesting the URL that is being returned in DHCP ` 10.64.20.5... [18:34:58] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [18:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:09] i was referred to this channel to ask about irc.wikimedia.org [18:38:35] do connected clients no longer show up in the user list? [18:39:03] i relied upon that to check if the connection is still working [18:39:15] this is probably the wrong channel to ask. I think you want #wikimedia-ops which is for irc channel operations [18:39:31] gifti: it's possible that the actual server that irc.wikimedia.org points was changed recently (to do maintenance on one of the servers powering it) and your multiple connections have ended up on different servers [18:39:35] this channel here is for keeping Wikipedia up and such... I know the names are confusing [18:39:59] apergos: nope, irc.wm.o refers to the irc recent changes relay which is not a chanop thing [18:40:44] we don't send people there for the relay and the irc.wm.o channels? 
huh, I may be remembering wrong [18:42:16] (03PS3) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [18:43:56] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [18:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:21] apergos: -ops have nothing to do with irc.wikimedia.org [18:45:45] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [18:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:57] that's what I already said [18:46:56] it's more like -analytics ? [18:47:07] If irc.wikimedia.org was named recent-changes-via-irc.wikimedia.org it would be less confusing :) [18:47:17] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:48:03] good luck getting all of its users to use a new hostname :P [18:48:16] mutante: believe they own it [18:48:18] if you search for "irc" In https://wikitech.wikimedia.org/wiki/Server_Admin_Log you can see the reboot [18:48:35] irc2001.wikimedia.org [18:49:21] gifti: it was a maintenance reboot for kernel upgrade afaict, cant always be avoided [18:49:45] but you can ask them to announce it somewhere [18:49:50] if it affects clients [18:49:55] that don't auto reconnect [18:52:32] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [18:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:02] u-huh, a reboot and then joins, parts and channel listings don't work anymore, but the clients still receive data? [18:56:59] is there configuration that came into effect through the reboot? 
can it be switched on again? [18:57:22] eh,, that sounds strange indeed but afraid I have no idea. we should have a ticket for that and add wider audience like people involved in the reboot [18:57:47] would you mind making one on phab for this? [18:58:00] will do, any projects i should tag? [18:58:08] I still think that the client observing is just connected to a separate server than all other clients [18:58:15] start with just SRE and dont worry about it, others will edit tags [18:58:22] ok [18:58:58] oh yea, you can try irc1001 vs irc2001, gifti [18:59:08] yup [18:59:38] I just connected to #wikipedia.en and can see other users on the channel and joins/parts of my other clients [19:00:04] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T1900). [19:00:04] nn1l2: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:14] hi [19:00:16] hey [19:00:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [19:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:53] (03PS3) 10Majavah: bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [19:01:00] (03CR) 10Majavah: [C: 03+2] bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [19:01:17] just to say, I have an unstable internet connection. I may get disconnected. If so, please proceed yourself. 
[19:01:39] it seems that clients are indeed only visible on the server they're connected to [19:01:47] (03Merged) 10jenkins-bot: bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) (owner: 104nn1l2) [19:02:00] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1028.eqiad.wmnet with OS buster [19:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:05] i now see all that i didn't see but none of which i saw before [19:02:27] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [19:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:32] nn1l2: your patch is available for testing on mwdebug1001 [19:02:38] i should still make a ticket though [19:03:22] LGTM https://bn.wikibooks.org/wiki/%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%A6%E0%A6%B2%E0%A6%97%E0%A6%A4_%E0%A6%85%E0%A6%A7%E0%A6%BF%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%87%E0%A6%B0_%E0%A6%A4%E0%A6%BE%E0%A6%B2%E0%A6%BF%E0%A6%95%E0%A6%BE [19:03:57] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1028.eqiad.wmnet with OS buster [19:03:57] majavah, you can sync. [19:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:04] doing [19:04:32] 10SRE, 10Wikimedia-Developer-Portal, 10Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (10bd808) [19:04:55] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742835|bnwikibooks: add autopatrolled and patroller user groups (T296640)]] (duration: 00m 56s) [19:04:56] or should i just monitor both? 
[19:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:00] T296640: Requesting some features for bnwikibooks - https://phabricator.wikimedia.org/T296640 [19:05:24] (03CR) 10Krinkle: Disable CentralNotice on API portal (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:06:22] gifti: I'd say that it's an intentional feature that the two servers are not linked together (redundancy reasons) [19:06:52] yeah, also seems a lot of work for such small gain [19:06:53] also I'd say that the actual backends (irc1001/2001) aren't guaranteed to remain the same unlike irc.wm.o [19:07:11] *the irc.wm.o name [19:07:17] unfortunate [19:07:41] is there a place where you can look up the server names? [19:07:50] should they ever change [19:08:09] gifti: so..in an ideal world you should only monitor irc.wikimedia.org and not care about the backend names [19:08:24] and during a reboot or switch it should move with it [19:08:34] yes [19:09:08] but also your client needs to play a role in it and auto-reconnect and see the DNS change and not use hardcoded IP [19:09:14] I guess [19:10:03] it doesn't use a hardcoded ip.. it just hasn't disconnected since the name was last changes [19:10:09] changed* [19:10:29] (03PS7) 10Alex Paskulin: Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:10:31] yea, ok. so "notice it is time to reconnect in some way" [19:10:59] the names are somewhere in the mediawiki-config repo, although that's not a guaranteed stable format either [19:11:19] yes, you can get the info "what is the currently active IRC server backend" from public repos if you wanted to [19:11:56] you can do a DNS lookup of irc.wikimedia.org and grep for "alias" [19:12:02] irc.wikimedia.org is an alias for irc2001.wikimedia.org. 
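Editorial sketch of the lookup described just above ("do a DNS lookup of irc.wikimedia.org and grep for alias"). It only parses text in the format of the alias line quoted in the channel; it does not perform the DNS query itself. The function name `current_backend` and the regular expression are illustrative assumptions, not part of any Wikimedia tooling, and the sample string is simply the line quoted in the log.

```python
import re
from typing import Optional

# Matches an alias (CNAME) line in the format quoted in the channel, e.g.
# "irc.wikimedia.org is an alias for irc2001.wikimedia.org."
ALIAS_RE = re.compile(r"^(\S+) is an alias for (\S+?)\.?$", re.MULTILINE)


def current_backend(host_output: str, name: str = "irc.wikimedia.org") -> Optional[str]:
    """Return the alias target for `name`, or None if no alias line is present.

    Hypothetical helper: `host_output` is the text printed by a DNS lookup
    tool, passed in by the caller rather than fetched here.
    """
    for match in ALIAS_RE.finditer(host_output):
        if match.group(1).rstrip(".") == name:
            return match.group(2)
    return None


# Sample input copied from the line quoted in the log.
sample = "irc.wikimedia.org is an alias for irc2001.wikimedia.org.\n"
print(current_backend(sample))
```

A long-lived client could run such a check periodically and reconnect when the alias target differs from the backend it is currently connected to, which is the "notice it is time to reconnect" behaviour discussed in the channel.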
[19:12:34] (03PS8) 10Alex Paskulin: Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:13:14] ah [19:14:38] (03CR) 10Zabe: [C: 03+1] Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:15:34] i see in the log that 2001 was rebooted but i was on the other server [19:18:03] (03Abandoned) 104nn1l2: Set 'WP' namespace alias to NS_PROJECT in mnw.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742279 (https://phabricator.wikimedia.org/T296606) (owner: 104nn1l2) [19:18:47] the names have been swapped over the past two weeks because of maintenance requiring reboots [19:18:54] gifti: checked uptime. indeed irc1001 is up 257 days [19:19:11] but what legoktm said then [19:19:30] I think m.oritz was just waiting a bit longer before rebooting 1001 [19:19:35] see https://gerrit.wikimedia.org/r/c/operations/dns/+/742730 [19:20:11] we learned in the last set of reboots that most clients reconnect to the active server within a week [19:20:49] gifti: ^ see in the gerrit change above. it is part of T296721 [19:20:49] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [19:21:14] if you still wanted to leave comments on how to handle reboots or so [19:21:33] also in the theoretical future, the plan is that each IRC user will only be able to see themselves [19:22:24] that'll be annoying for debugging [19:23:36] how? [19:23:39] https://phabricator.wikimedia.org/T234234 [19:23:48] > the daemon should offer a "sandbox" to each client/bot joining, offering a "private"-like IRC channel with only rc-pmtpa writing updates. In this way running the daemon on multiple pods in kubernetes wouldn't require to share state (like the list of connected clients, etc..) [19:25:25] majavah: btw are you done deploying now? 
[19:25:38] legoktm: yes, forgot to !log it I guess [19:26:18] no worries [19:26:32] problems like https://github.com/countervandalism/CVNBot/issues/72 [19:26:36] !log installing php-yaml on all appservers [19:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:47] (03CR) 10Majavah: [C: 04-1] "Please update the commit message to match reality, currently this does not disable CN but changes its settings." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:30:20] 10SRE, 10Wikimedia-Developer-Portal, 10Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (10RhinosF1) [19:30:47] (03PS1) 10Legoktm: mediawiki: Enable php-yaml on appservers and api_appservers [puppet] - 10https://gerrit.wikimedia.org/r/744079 (https://phabricator.wikimedia.org/T296331) [19:31:09] oof [19:32:41] AntiComposite: I guess, and leaving a comment on that task would be helpful, just so it's acknowledged as a tradeoff being made. [19:32:50] 10SRE, 10Phabricator: H34 adds an archived project - https://phabricator.wikimedia.org/T297141 (10RhinosF1) [19:33:30] (03CR) 10Legoktm: [C: 03+2] mediawiki: Enable php-yaml on appservers and api_appservers [puppet] - 10https://gerrit.wikimedia.org/r/744079 (https://phabricator.wikimedia.org/T296331) (owner: 10Legoktm) [19:34:20] (03PS9) 10Alex Paskulin: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [19:35:01] legoktm: any ideas on who to ask to review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/724049? 
[19:36:07] me I suppose :p [19:36:20] I need to double check, I thought it needed a re-dump [19:41:46] view-source:https://static-codereview.wikimedia.org/MediaWiki/75446.html [19:41:51] > Follow up 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10LGoto) [19:43:57] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10Tsevener) a:05Ejegg→03Tsevener [19:44:31] (03CR) 10Legoktm: [C: 04-1] "I think this needs a re-dump first :/" [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [19:47:13] legoktm: do the dump contents just live on miscweb* hosts? I don't see any git clones or similar deployment tools in profile::microsites::static_codereview [19:47:39] yes [19:48:07] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) Ok well despite what I said earlier I can confirm the DHCP is failing. It isn't visible on the serial console, but via the virtual monitor port you can see it... [19:48:13] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:17] of course, the HTML structure has entirely changed that it no longer works :( [19:51:35] 10SRE, 10Phabricator: H34 adds an archived project - https://phabricator.wikimedia.org/T297141 (10Aklapper) 05Open→03Resolved a:03Aklapper Thanks for catching that! Done. 
[19:51:56] 10SRE, 10Phabricator: H34 adds an archived project - https://phabricator.wikimedia.org/T297141 (10Aklapper) ...and backlinking to T101712 for the records [19:58:37] !log trying new dump of Special:CodeReview on mwmaint1002 (T205361) [19:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:41] T205361: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 [19:59:10] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [20:00:45] (03PS11) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [20:06:25] (03CR) 10Legoktm: [C: 04-1] mediawiki: Redirect Special:CodeReview to static archives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [20:06:32] (03PS2) 10Cwhite: site: reprovision codfw logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) [20:12:48] (03PS2) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) [20:14:09] !log begin codfw opensearch upgrade T288612 [20:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:14] T288612: Remove outdated Wikibase settings from production config - https://phabricator.wikimedia.org/T288612 [20:14:37] !log begin codfw opensearch upgrade T288621 [20:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:40] T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621 [20:17:19] (03PS3) 
10Majavah: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) [20:18:19] (03CR) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [20:19:23] ^ how to parse URLs with regexes [20:20:33] (03CR) 10Cwhite: [C: 03+2] site: reprovision codfw logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [20:23:32] cwhite: good luck! [20:24:56] (03PS1) 10Ssingh: dnsdist: refactor the configuration template for updates to durum [puppet] - 10https://gerrit.wikimedia.org/r/744087 [20:25:24] <3 [20:25:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32835/console" [puppet] - 10https://gerrit.wikimedia.org/r/744087 (owner: 10Ssingh) [20:26:51] (03CR) 10Wugapodes: "FYI, we're still waiting on community consensus on when to deploy this per the phab task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: 10Wugapodes) [20:27:36] (03PS1) 10Cwhite: hiera: synchronize cluster name [puppet] - 10https://gerrit.wikimedia.org/r/744088 (https://phabricator.wikimedia.org/T288621) [20:28:25] (03CR) 10Cwhite: [C: 03+2] hiera: synchronize cluster name [puppet] - 10https://gerrit.wikimedia.org/r/744088 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [20:30:47] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn) [20:31:27] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - 
https://phabricator.wikimedia.org/T297144 (10Dzahn) [20:35:11] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn) [20:36:34] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Dzahn) deep link to existing Icinga check: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mx2001&service=exi... [20:44:49] (03PS1) 10Ssingh: wikimedia-dns: refactor for durum update [dns] - 10https://gerrit.wikimedia.org/r/744094 [20:45:54] (03CR) 10jerkins-bot: [V: 04-1] wikimedia-dns: refactor for durum update [dns] - 10https://gerrit.wikimedia.org/r/744094 (owner: 10Ssingh) [20:51:30] (03PS2) 10Ssingh: wikimedia-dns: refactor for durum update [dns] - 10https://gerrit.wikimedia.org/r/744094 [20:57:35] 10SRE, 10Phabricator: H34 adds an archived project - https://phabricator.wikimedia.org/T297141 (10RhinosF1) Thanks for the quick fix! [21:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T2100). Please do the needful. 
[21:00:59] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:09:58] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10herron) [21:10:02] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) [21:10:28] (03Abandoned) 10RLazarus: Merge tag '0.0.1' into debian [docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/742573 (owner: 10RLazarus) [21:13:06] (03PS1) 10Ssingh: durum: show if the user is using DoH or DoT [puppet] - 10https://gerrit.wikimedia.org/r/744095 [21:14:24] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32836/console" [puppet] - 10https://gerrit.wikimedia.org/r/744095 (owner: 10Ssingh) [21:19:07] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10herron) In addition to the overall queue totals `exiqsumm` provides a breakdown by destination domain. It would... [21:19:42] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) T275867 may be of interest here as well [21:22:41] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) We'll also want to think about the failure modes for this alert specifically, e.g. if mail is significantly impacted how w... 
[21:24:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q2), 10Sustainability (Incident Followup): Alert that should have paged did not reach VictorOps because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10herron) [21:24:15] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) [21:24:21] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10herron) [21:26:37] 10SRE, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn) [21:27:19] 10SRE, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn) [21:28:02] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn) [21:53:08] (03CR) 10BBlack: [C: 03+1] wikimedia-dns: refactor for durum update [dns] - 10https://gerrit.wikimedia.org/r/744094 (owner: 10Ssingh) [22:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211206T2200). 
[22:01:16] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:11:08] 10SRE, 10Fundraising-Backlog, 10Thank-You-Page, 10Wikimedia-Apache-configuration, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10Tsevener) Fix proposal for issue above is in https://github.com/wikimedia/wikipedia-ios/pull/4081. [22:13:04] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) @bcampbell This actually turned out to be a firewall dropping packets due to a kernel bug. I shared a doc with you if you are curious. [22:15:15] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Legoktm) My understanding per T225623#5253119 is that `@ticket.wikim... [22:15:40] Hey all - mstyles and I are deploying this sec patch right now: https://phabricator.wikimedia.org/T271037#7178772 [22:15:48] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:03] !log mstyles@deploy1002 Synchronized php-1.38.0-wmf.9/includes/content/ContentModelChange.php: Deploy security patch for T271037 (duration: 00m 56s) [22:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:31] what happened to https://sal.toolforge.org/? it's no longer available [22:21:29] hm, bd808 ^ [22:21:47] I just used that a little while ago [22:21:58] webservice go boom. 
I'm looking [22:22:13] 2021-12-06 22:21:59: (http-header-glue.c.1250) read(): Connection reset by peer 11 12 [22:22:13] 2021-12-06 22:21:59: (gw_backend.c.2149) response not received, request sent: 1276 on socket: unix:/var/run/lighttpd/php.socket.sal-1 for /index.php?, closing connection [22:22:20] maybe an hour ago I was using that as normal but got a single 500 error that was gone on next reload [22:24:10] bd808: https://phabricator.wikimedia.org/T296072 again? [22:24:11] same, I just reloaded and it seems okay [22:24:34] Same here [22:25:04] RhinosF1: yeah. the logs had the same "oops something bad happened" sort of messages [22:25:15] :( [22:25:17] I restarted it [22:25:38] I guess it might need a closer look [22:26:21] or I rewrite it in flask and get rid of fcgi and php ;) [22:26:41] That works :) [22:26:54] lighttpd+fcgi likes to melt under load [22:27:48] * RhinosF1 blames php in general [22:30:46] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10bcampbell) @Dzahn Thanks for sharing the doc, that's helpful. Are there any outstanding emails left in the queue? [22:41:24] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) @bcampbell No more mails in the queue and exim is still disabled on the server that was affected. mail is currently handled by the other server. [22:41:46] (03CR) 10Seddon: [C: 04-1] "Procedural -1 pending community consensus on timing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: 10Wugapodes) [22:42:44] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10bcampbell) @Dzahn Got it, thank you for clarifying.
[22:44:47] 10SRE, 10Wikimedia-Developer-Portal, 10Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (10bd808) [23:16:36] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:23:14] I'm resuming the php-yaml rollout on api_appservers now [23:33:10] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2025.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:39:15] (03PS2) 10Ssingh: dnsdist: refactor the configuration template for updates to durum [puppet] - 10https://gerrit.wikimedia.org/r/744087 [23:49:26] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10Dzahn) https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx [23:49:42] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) [23:49:50] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx [23:50:34] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) a:05Dzahn→03None [23:50:57] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) created the public doc and also done with the private google doc from my end [23:52:01] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2025.codfw.wmnet, logstash2030.codfw.wmnet are marked down but 
pooled https://wikitech.wikimedia.org/wiki/PyBal [23:53:31] ^ is me [23:56:47] thanks for letting us know [23:57:16] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal