[00:34:26] (CR) Krinkle: [C: +2] multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:35:09] (Merged) jenkins-bot: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:39:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:42:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:42:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:43:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:45:53] !log krinkle@deploy1002 Synchronized multiversion/MWMultiVersion.php: I9d363abd7cfef (duration: 03m 17s)
[00:48:04] (PS7) Krinkle: multiversion: Remove use of the $globals temporary JSON cache file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821)
[00:48:12] (CR) Krinkle: [C: +2] multiversion: Remove use of the $globals temporary JSON cache file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:48:56] (Merged) jenkins-bot: multiversion: Remove use of the $globals temporary JSON cache file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:49:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:53:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:56:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:56:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:57:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:00:12] !log krinkle@deploy1002 Synchronized multiversion/: Ic0dbcba9f60f20a (duration: 03m 31s)
[01:08:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:15] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:54:57] (PS1) Krinkle: multiversion: Remove unused $cacheDir and writeToStaticCache (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818645 (https://phabricator.wikimedia.org/T169821)
[01:54:59] (PS1) Krinkle: multiversion: Remove unused $cacheDir (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818646 (https://phabricator.wikimedia.org/T169821)
[01:58:57] (PS1) Samwilson: Enable RealtimePreview on Group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/818647
(https://phabricator.wikimedia.org/T314150)
[02:03:58] (PS2) Tim Starling: Switch testwiki to multi-DC active/active mode [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664)
[02:10:25] (PS1) Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821)
[02:10:28] (PS1) Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821)
[02:11:00] (CR) Tim Starling: [C: +2] Switch testwiki to multi-DC active/active mode [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[02:11:29] (PS2) Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821)
[02:11:36] (CR) CI reject: [V: -1] multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[02:15:15] (PS1) Krinkle: noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/818650
[02:16:44] (PS2) Krinkle: noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/818650
[02:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:23:02] (CR) Tim Starling: [C: +2] "root@cp2027:/etc/trafficserver# host appservers-ro.discovery.wmnet" [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[02:47:56] (PS1) Krinkle: multiversion: Move
labs-overrides responsibility to getStaticConfig() [mediawiki-config] - https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932)
[03:39:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:41:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:52:21] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:17:21] (PS3) Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852)
[04:21:31] (CR) Eigyan: [config]: Add click event logging for mobile and desktop (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) (owner: Eigyan)
[04:25:36] (PS1) Tim Starling: Discovery: codfw should be pooled for api-ro and appservers-ro [puppet] - https://gerrit.wikimedia.org/r/818652 (https://phabricator.wikimedia.org/T279664)
[04:34:13] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:45:39] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:19:57] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:24:55] SRE, SRE-Access-Requests, Patch-For-Review, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (MoritzMuehlenhoff) >>! In T313876#8117293, @Raymond_Ndibe wrote: > @jbond I copied that request of someone else and edited it to fit my o...
[05:29:38] SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (MoritzMuehlenhoff)
[05:31:50] (CR) Giuseppe Lavagetto: [C: +1] Discovery: codfw should be pooled for api-ro and appservers-ro [puppet] - https://gerrit.wikimedia.org/r/818652 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[05:43:00] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:43:23] !log installing Linux 5.10.127-2 on Gitlab runners
[05:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:39] (CR) Muehlenhoff: [C: +1] "This looks fine to me.
If we're concerned about the puppetdb growth we had seen before, there's one other option which came to my mind: We" [puppet] - https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: Jbond)
[05:55:09] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:43] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119) (owner: Jelto)
[06:03:14] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (MoritzMuehlenhoff)
[06:13:40] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=(appserver|api)-ro
[06:13:50] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=appserver-ro
[06:14:01] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=appservers-ro
[06:14:33] (PS1) Muehlenhoff: Remove zookeeper_version [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539)
[06:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:17:33] (CR) CI reject: [V: -1] Remove zookeeper_version [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: Muehlenhoff)
[06:19:09] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw
[06:27:57] (PS2) Muehlenhoff: Remove zookeeper_version [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539)
[06:28:09] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner:
Muehlenhoff)
[06:36:15] SRE, Performance-Team, Traffic, serviceops, Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[06:44:10] SRE, Performance-Team, Traffic, serviceops, Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[06:52:53] (PS1) Tim Starling: Switch test2.wikipedia.org to multi-DC local routing mode [puppet] - https://gerrit.wikimedia.org/r/818991
[06:57:35] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:21:11] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:25:37] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:14:38] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1002.eqiad.wmnet
[08:20:30] (CR) FNegri: [C: +1] Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: Majavah)
[08:22:04] (Abandoned) FNegri: Add node16 base and web images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/818474
(https://phabricator.wikimedia.org/T310821) (owner: FNegri)
[08:22:41] (CR) Vgutierrez: [C: +2] trafficserver: 9.x upgrade: add compatibility for session_sharing.match [puppet] - https://gerrit.wikimedia.org/r/818504 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[08:25:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1002.eqiad.wmnet
[08:27:17] (PS1) Vgutierrez: Revert "cache::haproxy: Switch from UNIX sockets to TCP on cp4032" [puppet] - https://gerrit.wikimedia.org/r/818561
[08:27:52] (CR) CI reject: [V: -1] Revert "cache::haproxy: Switch from UNIX sockets to TCP on cp4032" [puppet] - https://gerrit.wikimedia.org/r/818561 (owner: Vgutierrez)
[08:28:12] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
[08:28:26] (CR) Jaime Nuche: phabricator: Support scap3 deployment of configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[08:28:38] (PS1) Ladsgroup: api: Support for links migration in ApiQueryBacklinks [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/818562 (https://phabricator.wikimedia.org/T312865)
[08:30:06] (CR) Ladsgroup: [C: +2] api: Support for links migration in ApiQueryBacklinks [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/818562 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:30:13] (PS2) Vgutierrez: Revert "cache::haproxy: Switch from UNIX sockets to TCP on cp4032" [puppet] - https://gerrit.wikimedia.org/r/818561
[08:30:54] (PS3) Vgutierrez: Revert "cache::haproxy: Switch from UNIX sockets to TCP on cp4032" [puppet] - https://gerrit.wikimedia.org/r/818561
[08:35:42] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36533/console" [puppet] -
https://gerrit.wikimedia.org/r/818561 (owner: Vgutierrez)
[08:36:07] SRE, Infrastructure-Foundations, netbox, netops: Netbox: Allocation of .0 and .255 IP address from 10.65.3.0/16 and 10.65.2.0/16 network - https://phabricator.wikimedia.org/T314183 (ayounsi) Even though they look surprising, they are valid IPs. We do `ip_address = prefix.get_first_available_ip()`...
[08:36:33] (CR) Vgutierrez: [V: +1 C: +2] Revert "cache::haproxy: Switch from UNIX sockets to TCP on cp4032" [puppet] - https://gerrit.wikimedia.org/r/818561 (owner: Vgutierrez)
[08:37:41] (CR) Jelto: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 15.2 [puppet] - https://gerrit.wikimedia.org/r/818426 (https://phabricator.wikimedia.org/T314119) (owner: Jelto)
[08:38:39] (PS1) Ladsgroup: Stop writing to the old templatelinks columns in itwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/818998 (https://phabricator.wikimedia.org/T312865)
[08:39:07] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet
[08:39:30] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet
[08:41:18] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:41:39] (CR) Ladsgroup: [C: +2] Stop writing to the old templatelinks columns in itwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/818998 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:43:05] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:43:11] (Merged) jenkins-bot: Stop writing to the old templatelinks columns in itwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/818998 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:43:55] !log rolling upgrade of HAProxy to version 2.4.18
[08:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:03] SRE-swift-storage: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (fgiunchedi)
[08:45:35] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:46:40] (Merged) jenkins-bot: api: Support for links migration in ApiQueryBacklinks [core] (wmf/1.39.0-wmf.22) - https://gerrit.wikimedia.org/r/818562 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:47:19] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:818998|Stop writing to the old templatelinks columns in itwikisource (T312865)]] (duration: 03m 12s)
[08:47:21] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[08:48:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:48:30] !log thanos-be2004: copy quarantined and tmp off sdb3 and into sdb4 for analysis and to free space - T314275
[08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:33] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[08:50:06] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet
[08:50:28] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2002.codfw.wmnet
[08:50:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:50:57] !log
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:51:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:53:11] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.22/includes/api: Backport: [[gerrit:818562|api: Support for links migration in ApiQueryBacklinks (T312865 T314112)]] (duration: 03m 01s)
[08:53:14] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[08:53:15] T314112: Retrieving embeddedin pages fails with internal_api_error_DBQueryError on testwiki since yesterday - https://phabricator.wikimedia.org/T314112
[08:53:56] (CR) FNegri: [C: +2] Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: Majavah)
[08:56:12] (CR) CI reject: [V: -1] Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: Majavah)
[08:57:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:58:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:58:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:59:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:59:16] (CR) Majavah: [C: +2] "retrying after a CI failure" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: Majavah)
[09:00:52] (CR) CI reject: [V: -1] Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: Majavah)
[09:00:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2002.codfw.wmnet
[09:01:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2003.codfw.wmnet
[09:03:21] SRE, Infrastructure-Foundations, Mail, Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (Aklapper)
[09:08:57] (CR) Jaime Nuche: [C: +1] scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[09:09:08] (CR) Jaime Nuche: [C: +1] phabricator: Support scap3 deployment of configuration [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[09:10:16] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2003.codfw.wmnet
[09:10:24] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet
[09:21:15] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2004.codfw.wmnet
[09:31:13] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[09:35:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance
[09:36:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance
[09:36:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Maintenance
[09:36:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance
[09:38:09] !log
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[09:38:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[09:38:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:38:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P32120 and previous config saved to /var/cache/conftool/dbconfig/20220801-093845-ladsgroup.json
[09:38:48] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:40:24] (PS3) Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - https://gerrit.wikimedia.org/r/818114
[09:41:21] (CR) Jbond: [V: +1 C: +2] P:gerrit: Export sshkey for gerrit shared services [puppet] - https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: Jbond)
[09:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P32121 and previous config saved to /var/cache/conftool/dbconfig/20220801-094156-ladsgroup.json
[09:42:49] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:43:00] (JobUnavailable) firing: (4) Reduced availability for
job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:44:17] SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (elukey) silenced the alert in alerts.wikimedia.org for a couple of weeks :)
[09:47:09] (CR) Jbond: [V: +1 C: +2] "deployed and looks good" [puppet] - https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: Jbond)
[09:48:09] (CR) CI reject: [V: -1] reposync: don't ask for confirmation in dry run mode [software/spicerack] - https://gerrit.wikimedia.org/r/818114 (owner: Jbond)
[09:53:26] (CR) Samtar: [C: +1] Enable RealtimePreview on Group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/818647 (https://phabricator.wikimedia.org/T314150) (owner: Samwilson)
[09:56:14] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@85585b0]: (no justification provided)
[09:56:19] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@85585b0]: (no justification provided) (duration: 00m 05s)
[09:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32122 and previous config saved to /var/cache/conftool/dbconfig/20220801-095702-ladsgroup.json
[10:00:22] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@4da9195]: (no justification provided)
[10:00:39] (PS3) Muehlenhoff: Remove zookeeper_version [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539)
[10:00:41] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@4da9195]: (no justification provided) (duration: 00m 19s)
[10:04:37] (CR) Vgutierrez: [C: +2] hiera: enable ATS9 on cp6008 [puppet] -
https://gerrit.wikimedia.org/r/818456 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[10:05:28] !log test ATS 9.1.2 on cp6008 - T309651
[10:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:32] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[10:08:07] (CR) Jbond: admin: Add mraish to analytics_privatedata_users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[10:09:23] (CR) Vgutierrez: [C: +2] hiera: enable ATS9 on cp6016 [puppet] - https://gerrit.wikimedia.org/r/818458 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[10:09:37] !log test ATS 9.1.2 on cp6016 - T309651
[10:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: Muehlenhoff)
[10:11:16] (CR) Jbond: [C: +2] admin: add raymond-ndibe user and to WMCS groups [puppet] - https://gerrit.wikimedia.org/r/817843 (https://phabricator.wikimedia.org/T313876) (owner: Volans)
[10:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32123 and previous config saved to /var/cache/conftool/dbconfig/20220801-101208-ladsgroup.json
[10:15:23] SRE, SRE-Access-Requests, Patch-For-Review, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (jbond) >>! In T313876#8117293, @Raymond_Ndibe wrote: >>>! In T313876#8108700, @jbond wrote: >> @Raymond_Ndibe i noticed the following in...
[10:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:18:50] SRE, SRE-Access-Requests, Patch-For-Review, User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (jbond) In progress→Resolved
[10:24:53] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[10:25:15] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:27:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T314041)', diff saved to https://phabricator.wikimedia.org/P32124 and previous config saved to /var/cache/conftool/dbconfig/20220801-102714-ladsgroup.json
[10:27:18] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:29:50] SRE-swift-storage: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (fgiunchedi) I have freed some space on thanos-be2004 `sdb3`, though depending on how data is shuffled around the ring the free space might not last long. AFAICT this is due to the tegola containers being quite big a...
[10:42:52] (PS1) Phuedx: Revert "testwiki: Add mediawiki.web_ui.interactions stream" [mediawiki-config] - https://gerrit.wikimedia.org/r/819014 (https://phabricator.wikimedia.org/T314151)
[11:04:23] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:10] (PS1) Jcrespo: Attempt to follow Wikimedia's Design Style Guide [software/pampinus] - https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017)
[11:25:01] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:26:24] ^ I'm working on that
[11:27:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:31:17] (CR) Lucas Werkmeister (WMDE): "I haven’t tested this at all (and wouldn’t know how to test it), but I’m hoping I can get some eyes on this and get T293614 unstuck…" [puppet] - https://gerrit.wikimedia.org/r/819016 (https://phabricator.wikimedia.org/T293614) (owner: Lucas Werkmeister (WMDE))
[11:31:27] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[11:32:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:33:09] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk -
https://phabricator.wikimedia.org/T313616 (10Elitre) Sorry for the delay, all's approved. [11:33:52] (03CR) 10Ayounsi: customscripts: export 'mgmt' entries from hiera_export (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:37:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Scheduled for upcoming backport+config window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [11:43:27] !log uploaded openjdk-8 8u342-b07-1~deb9u1 for stretch-wikimedia [11:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:55] !log installing openjdk-8 security updates for stretch [11:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:57] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:54:55] (03PS1) 10Jbond: tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 [11:55:03] (03CR) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:01:05] (03PS1) 10Jbond: setup.py: prevent installing flake 8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/819042 [12:01:19] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:01:39] (03CR) 10CI reject: [V: 04-1] tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [12:03:47] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:06:44] (03PS2) 10Jbond: tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 [12:07:03] (03CR) 10VolkerE: "Thanks for starting this! 
Please compare also https://phabricator.wikimedia.org/source/wikimedia-ui-base/browse/master/wikimedia-ui-base.l" [software/pampinus] - 10https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:09:08] (03PS2) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/spicerack] - 10https://gerrit.wikimedia.org/r/819042 [12:13:52] (03CR) 10CI reject: [V: 04-1] tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [12:14:24] (03CR) 10Jbond: tox: move all flak8 config to setup.cfg (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [12:14:48] 10SRE, 10SRE-Access-Requests, 10User-Raymond_Ndibe: Requesting access to cloud-roots for Raymond Ndibe - https://phabricator.wikimedia.org/T313876 (10Aklapper) (It's always best to look up and follow documentation, and not to copy some outdated old stuff instead. Thanks.) [12:16:39] (03PS1) 10Esanders: DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) [12:16:47] (03PS3) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/spicerack] - 10https://gerrit.wikimedia.org/r/819042 [12:20:45] (03PS1) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/cumin] - 10https://gerrit.wikimedia.org/r/819047 [12:29:52] (03PS1) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819048 [12:29:54] (03PS1) 10Jbond: flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 [12:33:39] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [12:34:26] (03PS1) 10Jbond: setup.py: prevent installing flake 8 
version 5 or above [software/homer] - 10https://gerrit.wikimedia.org/r/819050 [12:34:28] (03PS1) 10Jbond: flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 [12:40:20] (03PS3) 10Urbanecm: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) [12:40:22] (03PS1) 10Urbanecm: testwiki: Growth: Switch to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) [12:40:44] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 (owner: 10Jbond) [12:45:03] (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [12:45:08] (03PS1) 10Urbanecm: Growth: Switch pilot wikis to structured mentor list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) [12:45:22] (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819054 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [12:59:19] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:59:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T1300). [13:00:05] hauskatze, koi, phuedx, Lucas_WMDE, and jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:26] o/ [13:00:30] I can deploy today! [13:00:45] o/ [13:00:57] o/ [13:01:10] jan_drewniak: your commit looks to be already deployed. is there anything else to follow up on? [13:01:40] urbanecm: yeah, I need a revert of https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/818157 [13:01:49] ack [13:01:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:02:13] jan_drewniak: can you upload the revert to gerrit please? [13:02:27] and i just saw the MW alerts [13:02:47] urbanecm: OK, I didn't click revert because I thought reverts get automatically merged, but I guess that's not the case [13:03:16] (03PS1) 10Jdrewniak: Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819066 [13:03:17] ah, makes sense jan_drewniak. nope, reverts need to be +2'ed as any other commits. 
[13:03:51] o/ [13:05:37] urbanecm: ok updated the deployments page and made the revert here https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/819066 [13:05:40] thanks [13:06:58] (03Abandoned) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/spicerack] - 10https://gerrit.wikimedia.org/r/819042 (owner: 10Jbond) [13:07:18] looks like something's happening with MW. icinga complains about fatals+latency. [13:07:36] (03Abandoned) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/cumin] - 10https://gerrit.wikimedia.org/r/819047 (owner: 10Jbond) [13:07:41] not comfortable deploying until the alerts are clarified. [13:07:58] (03Abandoned) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819048 (owner: 10Jbond) [13:08:04] (03PS2) 10Jbond: flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 [13:08:37] (03Abandoned) 10Jbond: setup.py: prevent installing flake 8 version 5 or above [software/homer] - 10https://gerrit.wikimedia.org/r/819050 (owner: 10Jbond) [13:09:11] (03PS2) 10Jbond: flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 [13:10:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:10:30] hm [13:10:31] looks like we're in a partial outage? I can't load https://en.wikipedia.org/wiki/Multiple_integral at all. [13:10:37] <_joe_> yes we are [13:10:50] sorry I'm late [13:11:02] <_joe_> did we just deploy something? [13:11:20] * urbanecm was about to start deployment, but I didn't even SSH in. 
[13:11:31] * Lucas_WMDE is also only sshed into stat1007 [13:11:34] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [13:11:43] hauskatze: no problem. the window is delayed due to an incident. [13:11:45] <_joe_> vgutierrez: the php fpm slowlog doesn't say much sadly [13:12:05] urbanecm: ok, I'll make some tea while we wait :) [13:12:09] <_joe_> batchGetMathML [13:12:16] <_joe_> yes hauskatze please wait [13:12:25] sure [13:14:01] urbanecm: if that page has lots of tags, Tech News said something about improved LaTeX being deployed this week which might be the cause? [13:14:43] [4017ce0e-f1fb-4750-906c-8f663a883cc3] 2022-08-01 13:13:56: Fatal exception of type "Wikimedia\RequestTimeout\RequestTimeoutException" <-- if it is of any help [13:15:12] <_joe_> hauskatze: that just says it's taking too long to reply to your requests [13:15:18] (03CR) 10Elukey: sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:15:20] en wiki is getting Lua errors on refs [13:15:29] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 (owner: 10Jbond) [13:15:30] <_joe_> Sario: refs to wikidata? 
All refs [13:16:29] https://usercontent.irccloud-cdn.com/file/Pdcxm49O/tesm4.png [13:16:29] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:16:43] Example screenshot [13:17:28] Another screenshot by a different user in #wikipedia-en [13:18:37] <_joe_> Sario: we're investigating [13:18:49] Thanks [13:20:17] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:53] https://phabricator.wikimedia.org/T314292 FYI channel, just from what I saw, I know it's known \o/ [13:21:13] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:21:54] <_joe_> Sario: FTR, the cause of the issue was an edit on CS1 [13:22:38] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [13:24:17] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [13:24:17] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2044.codfw.wmnet with OS bullseye [13:24:24] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2044.codfw.wmnet with OS bullseye [13:24:59] 10SRE, 10Infrastructure-Foundations, 10netops: Lumen link between cr2-eqiad and cr2-esams down - July 2022 - https://phabricator.wikimedia.org/T313783 (10ayounsi) 05Open→03Resolved a:03ayounsi > Issue on the Subsea portion, betwen Bellport and Bude. No event and unable to isolate the cause [13:30:13] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:30:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:40:53] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add to Acme-chief's Hieradata. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817883 (owner: 10Andrea Denisse) [13:42:51] (03CR) 10Jbond: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/819051 (owner: 10Jbond) [13:43:02] (03PS3) 10Jbond: flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 [13:43:29] (03PS3) 10Jbond: tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 [13:43:54] (03PS3) 10Jbond: flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 [13:44:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2044.codfw.wmnet with reason: host reimage [13:45:57] _joe_: just checking, would it be ok to do MW deployments now? [13:46:52] <_joe_> urbanecm: ask who's editing CS1 :D [13:46:56] <_joe_> jokes aside, yes [13:46:58] <_joe_> sorry [13:47:18] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/pywmflib] - 10https://gerrit.wikimedia.org/r/819049 (owner: 10Jbond) [13:47:26] (03CR) 10CI reject: [V: 04-1] flake8: move all flake8 config to setup.cfg [software/homer] - 10https://gerrit.wikimedia.org/r/819051 (owner: 10Jbond) [13:47:52] :D [13:48:01] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2044.codfw.wmnet with reason: host reimage [13:49:12] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:49:17] Well, with ~10 minutes left, we don't have time for everything. hauskatze jan_drewniak koi phuedx Lucas_WMDE anything urgent to start with? [13:49:25] mine isn’t urgent at all [13:49:31] I can do it later (beta-only anyways) [13:49:39] not urgent either [13:49:39] Ack [13:49:41] well, mine was kind of a "design" UBN... 
[13:50:15] I vote for jan_drewniak and koi’s first change [13:50:23] (srwikisource logo aspect ratio fix) [13:50:40] Lets try it. [13:50:43] Seconded. If jan_drewniak's is a design UBN then it should be prioritized. Mine can wait [13:50:46] (03CR) 10Urbanecm: [C: 03+2] Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819066 (owner: 10Jdrewniak) [13:50:50] (03CR) 10ArielGlenn: [C: 03+1] kiwix: create dest dir before rsyncing if it does not exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814707 (owner: 10David Caro) [13:50:55] Is koi still around? [13:51:02] yes, I'm around [13:51:07] Great [13:51:45] fyi: An icon change that caused an RfC on enwiki... 🤨 https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#RfC:_Should_we_use_the_longstanding_external_links_icon_or_the_new_one? [13:51:47] (03PS2) 10Urbanecm: srwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818576 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:51:51] (03CR) 10Urbanecm: [C: 03+2] srwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818576 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:52:27] we're literally changing the whole skin, but the external link icon, that's where people draw the line :P [13:52:31] ah, joy [13:52:37] (03PS1) 10Slyngshede: Bump version number to 0.2 [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/819061 (https://phabricator.wikimedia.org/T311288) [13:52:42] (03Merged) 10jenkins-bot: srwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818576 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:53:02] jan_drewniak: it's always quite hard to predict what people will complain about :D [13:53:04] 
jan_drewniak: answer is simple. everyone still uses monobook and cologneblue [13:53:41] lol, yeah it's a lottery [13:54:27] Lucas_WMDE: grr [13:54:31] Legacy Vector! [13:55:24] koi: your patch is at mwdebug1001, can you check? [13:55:31] looking [13:55:51] * hauskatze uses MonoBook and he ain't changing [13:56:31] Lucas_WMDE: jokes aside, did someone actually look into that skin question (which skins are used by who)? [13:56:42] urbanecm: LGTM [13:56:45] thanks, syncing [13:56:49] no idea, I have nothing to offer beyond the joke sorry ^^ [13:57:04] was just wondering :) [13:57:16] (03PS1) 10MVernon: hieradata: make restbase1016 a 3.11.13 canary [puppet] - 10https://gerrit.wikimedia.org/r/819062 (https://phabricator.wikimedia.org/T309896) [13:58:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:58:07] !log UTC afternoon backport window is going to overflow by a couple of minutes [13:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:22] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/819062 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [13:59:34] urbanecm: found https://w.wiki/5XU4 with a list of statistics; latest check (2018) seems to have found that Monobook is more popular with active users [13:59:52] thanks! 
[14:01:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:01:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:01:28] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: bcb7b0d4d07b454a169804d7b1011ec3f2530c00: srwikisource: Adjust width-height ratio of logo to fix display issue (T310961; 1/2) (duration: 03m 41s) [14:01:31] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [14:02:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:04:11] !log Purge https://en.wikipedia.org/static/images/project-logos/srwikisource{.png;-1.5x.png;-2x.png} (T310961) [14:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:47] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: bcb7b0d4d07b454a169804d7b1011ec3f2530c00: Adjust width-height ratio of logo to fix display issue (T310961; 2/2) (duration: 03m 17s) [14:04:53] koi: should be live! [14:04:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2044.codfw.wmnet with OS bullseye [14:05:05] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2044.codfw.wmnet with OS bullseye completed: - elastic2044 (**WAR... [14:05:07] 10SRE, 10SRE-OnFire, 10serviceops, 10serviceops-collab, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10CDanis) ping @Dzahn and also @Joe -- would love some advice on stateful services on k8s [14:05:23] indeed, thanks! 
[14:05:27] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [14:05:29] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [14:05:29] no problem [14:06:24] jan_drewniak: still waiting on CI for your patch [14:06:31] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:55] (03Merged) 10jenkins-bot: Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819066 (owner: 10Jdrewniak) [14:07:17] great :) [14:07:31] (03PS1) 10Elukey: Add fake config for ml-service drafttopic [labs/private] - 10https://gerrit.wikimedia.org/r/819086 [14:07:31] 👍 [14:08:22] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake config for ml-service drafttopic [labs/private] - 10https://gerrit.wikimedia.org/r/819086 (owner: 10Elukey) [14:08:51] jan_drewniak: pulled to mwdebug1001. can you check? [14:09:29] urbanecm: yup! looks like the old icon! good to sync [14:09:34] syncing! [14:10:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Ottomata) Approved! [14:11:53] 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Ottomata) @volans, sounds like ssh access is not needed for this request. Group membership, LDAP membership, but no ssh key needed. 
[14:12:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:12:35] (03CR) 10AOkoth: gitlab: add gitlab role to gitlab2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [14:12:39] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [14:12:40] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [14:12:41] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [14:13:01] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.22/skins/Vector/: b5007c5f1c389deb344c5bb99e950b4190436cab: Revert "styles: Unify on standard external link icon"" (duration: 03m 16s) [14:13:03] (03PS2) 10AOkoth: gitlab: add gitlab role to gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) [14:13:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:13:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:22] (03PS1) 10Elukey: profile::k8s::deployment_server: add config for revscoring-drafttopic [puppet] - 10https://gerrit.wikimedia.org/r/819087 [14:13:30] jan_drewniak: and, live [14:13:42] all the other patch owners: please reschedule for a different window -- thanks! [14:13:52] urbanecm: perfect! thank you! 
[14:13:55] hth [14:14:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:16:27] (03CR) 10Elukey: [C: 03+2] profile::k8s::deployment_server: add config for revscoring-drafttopic [puppet] - 10https://gerrit.wikimedia.org/r/819087 (owner: 10Elukey) [14:16:53] (03CR) 10Jbond: [C: 03+2] tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [14:17:02] (03CR) 10Jbond: [C: 03+2] tox: move all flak8 config to setup.cfg (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [14:17:42] (03CR) 10Majavah: "recheck" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: 10Majavah) [14:18:02] (03PS1) 10Elukey: ml-services: add base configuration for revscoring-drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/819088 [14:18:07] (03CR) 10Jelto: "gitlab2002.wikimedia.org should be added to profile::gitlab::passive_hosts, otherwise backup sync is not working." 
[puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [14:19:55] (03CR) 10FNegri: [C: 03+2] Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: 10Majavah) [14:24:28] (03Merged) 10jenkins-bot: tox: move all flak8 config to setup.cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/819041 (owner: 10Jbond) [14:24:30] (03Merged) 10jenkins-bot: Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821) (owner: 10Majavah) [14:26:30] (03CR) 10Elukey: [C: 03+2] ml-services: add base configuration for revscoring-drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/819088 (owner: 10Elukey) [14:28:20] (03PS3) 10AOkoth: gitlab: add gitlab role to gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) [14:28:47] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [14:29:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:29:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [14:29:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:29:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:29:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:29:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[14:29:38] (03CR) 10AOkoth: gitlab: add gitlab role to gitlab2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [14:29:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:30:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:30:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:30:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:34:33] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:36:51] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10Sustainability (Incident Followup): Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10EBernhardson) [14:38:22] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:39:14] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:39:21] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:39:44] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [14:42:27] !log installing openjdk-11 security updates [14:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:41] (03PS1) 10Majavah: hieradata: fix lvs config for dbproxy1018/1019 [puppet] - 
10https://gerrit.wikimedia.org/r/819090 [14:48:05] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36537/console" [puppet] - 10https://gerrit.wikimedia.org/r/819090 (owner: 10Majavah) [14:50:24] (03CR) 10Btullis: [C: 03+2] hieradata: fix lvs config for dbproxy1018/1019 [puppet] - 10https://gerrit.wikimedia.org/r/819090 (owner: 10Majavah) [14:51:38] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [14:51:56] (03PS1) 10FNegri: Add new Node16 image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819091 (https://phabricator.wikimedia.org/T310821) [14:52:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [14:52:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [14:53:02] 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF) [14:53:06] (03CR) 10CI reject: [V: 04-1] Add new Node16 image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819091 (https://phabricator.wikimedia.org/T310821) (owner: 10FNegri) [14:53:46] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:54:05] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:55:18] (03PS1) 10Majavah: tox: Pin flake8 to 4.0.x for now [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819093 [14:57:25] (03CR) 10FNegri: [C: 03+2] tox: Pin flake8 to 4.0.x for now 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819093 (owner: 10Majavah) [14:58:12] 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul) p:05Triage→03Medium [14:58:55] (03Merged) 10jenkins-bot: tox: Pin flake8 to 4.0.x for now [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819093 (owner: 10Majavah) [14:59:13] (03PS2) 10Majavah: Add new Node16 image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819091 (https://phabricator.wikimedia.org/T310821) (owner: 10FNegri) [15:00:14] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:01:15] jouncebot: now [15:01:16] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [15:01:31] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) p:05Triage→03Medium [15:01:40] then I’ll deploy that beta change I had scheduled for the window earlier, if that’s okay with everyone [15:01:49] (03PS2) 10Lucas Werkmeister (WMDE): Beta: add configuration for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [15:04:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Beta: add configuration for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [15:04:42] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:05:21] (03Merged) 10jenkins-bot: Beta: add configuration for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 
(https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [15:06:30] pulled to mwdebug1001, checking [15:07:13] looks good, will sync [15:08:16] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) ` Create Dispatch: Success You have successfully submitted request SR147890192. [15:08:18] (03PS1) 10Filippo Giunchedi: swift: add script to grow the SSD partition for container databases [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) [15:09:10] (03CR) 10CI reject: [V: 04-1] swift: add script to grow the SSD partition for container databases [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [15:09:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:10:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:10:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:11:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:818127|Beta: add configuration for redirect badges (T313896)]] (1/2, should be a no-op) (duration: 03m 15s) [15:11:10] T313896: Create configuration for redirect badges - https://phabricator.wikimedia.org/T313896 [15:11:29] (03PS2) 10Filippo Giunchedi: swift: add script to grow the SSD partition for container databases [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) [15:11:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:14:20] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:14:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized 
wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:818127|Beta: add configuration for redirect badges (T313896)]] (2/2, should be a no-op) (duration: 03m 30s) [15:18:27] (03CR) 10Eevans: [C: 03+1] hieradata: make restbase1016 a 3.11.13 canary [puppet] - 10https://gerrit.wikimedia.org/r/819062 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [15:20:16] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/819062 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [15:25:03] (03CR) 10BryanDavis: [C: 03+1] "I keep hoping for T237773 to happen, but yeah..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097) (owner: 10Krinkle) [15:25:22] (03CR) 10MVernon: [C: 03+2] hieradata: make restbase1016 a 3.11.13 canary [puppet] - 10https://gerrit.wikimedia.org/r/819062 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [15:25:42] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: Allocation of .0 and .255 IP address from 10.65.3.0/16 and 10.65.2.0/16 network - https://phabricator.wikimedia.org/T314183 (10Papaul) @ayounsi make sense if it is /16 and yes it is working on IDRAC. The only issue is we received an alert on... [15:29:17] !log mvernon@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Canary testing of 3.11.13 on Restbase T309896 - mvernon@cumin1001 [15:29:20] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [15:29:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T1530). 
[15:33:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:02] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:39:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Canary testing of 3.11.13 on Restbase T309896 - mvernon@cumin1001 [15:39:41] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [15:41:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1018.eqiad.wmnet with OS bullseye [15:42:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10MoritzMuehlenhoff) [15:42:35] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [15:45:40] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:50:32] (03CR) 10Filippo Giunchedi: sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:52:31] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10BBlack) >>! In T138093#8117992, @ori wrote: > [...] Thanks again for working on this! Sounds like a good plan overall to me! > The is... 
[15:54:43] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1018.eqiad.wmnet with reason: host reimage [15:55:20] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [15:57:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1018.eqiad.wmnet with reason: host reimage [15:59:14] (03PS3) 10Filippo Giunchedi: swift: add script to grow the SSD partition for container databases [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) [15:59:31] (03CR) 10Filippo Giunchedi: swift: add script to grow the SSD partition for container databases (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [15:59:40] (03CR) 10Majavah: [C: 03+2] Add new Node16 image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819091 (https://phabricator.wikimedia.org/T310821) (owner: 10FNegri) [16:00:34] (03Merged) 10jenkins-bot: Add new Node16 image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819091 (https://phabricator.wikimedia.org/T310821) (owner: 10FNegri) [16:01:48] (03PS1) 10Jcrespo: Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) [16:03:04] (03CR) 10CI reject: [V: 04-1] Enable gitlab backup type for wmfbackups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [16:05:01] (03CR) 10MVernon: [C: 03+1] "LGTM thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [16:08:22] (03PS2) 10Jcrespo: Enable gitlab backup type for wmfbackups 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) [16:09:22] (03PS2) 10Jcrespo: Attempt to follow Wikimedia's Design Style Guide [software/pampinus] - 10https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017) [16:10:33] !log cwhite@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kibana7,name=logstash2023.codfw.wmnet [16:10:36] (03Abandoned) 10MdsShakil: code - Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (owner: 10MdsShakil) [16:10:45] (03Restored) 10MdsShakil: code - Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 (owner: 10MdsShakil) [16:12:04] (03PS2) 10MdsShakil: code - Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 [16:13:02] (03PS3) 10MdsShakil: code (under construction) - Add bnwiki in wgImportSources to bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819071 [16:13:45] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [16:16:11] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1018.eqiad.wmnet with OS bullseye [16:17:22] PROBLEM - OpenSearch health check for shards on 9200 on logstash2023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f4af72789b0: Failed to 
establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [16:17:22] org/wiki/Search%23Administration [16:23:28] !log cwhite@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kibana7,name=logstash2023.codfw.wmnet [16:25:04] !log installing tcpdump updates from bullseye point release [16:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:39] howdy, i think we may need an NSCA stop/start since we are getting passive check failures for the frack hosts [16:29:57] see https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=fundraising for what we are seeing and https://phabricator.wikimedia.org/T196336 for the history. [16:30:58] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [16:31:54] (03CR) 10Ahmon Dancy: "This is very helpful. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [16:34:40] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team - https://phabricator.wikimedia.org/T314319 (10BTullis) [16:35:00] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team - https://phabricator.wikimedia.org/T314319 (10BTullis) [16:38:08] (03CR) 10AOkoth: [C: 03+2] gitlab: add gitlab role to gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/818505 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [16:39:21] 10SRE, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10Dzahn) 05Resolved→03Open 10:10 < elukey> mutante: o/ I found out in https://phabricator.wikimedia.org/T230178 that siege/wrk/etc.. were removed from deploy1002.... 
[16:41:34] (03CR) 10Btullis: airflow - Modify platform_eng instance to do deployment of airflow-dags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [16:42:53] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: Allocation of .0 and .255 IP address from 10.65.3.0/16 and 10.65.2.0/16 network - https://phabricator.wikimedia.org/T314183 (10ayounsi) 05Open→03Resolved a:03ayounsi Assuming the duplicate IPs issue got solved. Feel free to re-open if... [16:43:02] RECOVERY - OpenSearch health check for shards on 9200 on logstash2023 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 10, discovered_master: True, active_primary_shards: 482, active_shards: 1111, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [16:43:02] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:44:05] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10lmata) >>! In T171482#8114976, @Ottomata wrote: > GRIZZLYYYYY? https://github.com/grafana/grizzly also: https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Gri... 
[16:48:20] (03CR) 10Btullis: airflow - Modify platform_eng instance to do deployment of airflow-dags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [16:50:04] (03PS1) 10BryanDavis: striker: remove legacy settings [labs/private] - 10https://gerrit.wikimedia.org/r/819116 (https://phabricator.wikimedia.org/T306469) [16:53:11] (03CR) 10LSobanski: [C: 03+1] doc: set role_contacts [puppet] - 10https://gerrit.wikimedia.org/r/812430 (owner: 10Dzahn) [16:54:33] (03CR) 10Dzahn: [C: 03+1] rsync::quickdatacopy: Allow specifying a custom interval for auto_sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [16:54:39] (03PS4) 10Dzahn: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [16:54:43] (03CR) 10Dzahn: [C: 03+2] doc: set role_contacts [puppet] - 10https://gerrit.wikimedia.org/r/812430 (owner: 10Dzahn) [16:57:37] (03PS5) 10Dzahn: rsync::quickdatacopy: Allow specifying a custom interval for auto_sync [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [17:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T1700). 
[17:01:12] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10jhathaway) [17:01:14] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [17:01:28] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (10jhathaway) 05Stalled→03Open exim 4.96 is now in bullseye backports [17:02:13] (03CR) 10BryanDavis: [C: 04-1] striker: remove legacy settings (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/819116 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:06:08] !log alert1001 - systemctl restart nsca - pinged by fundraising tech because fundraising hosts have the "passive check is awol" issue again (T196336) [17:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:12] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [17:08:36] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [17:08:37] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [17:08:38] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [17:09:43] 10SRE, 10Icinga, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dzahn) Yes, just restarting the service 
fixes it. Passive checks are coming in and turning OK again. [17:16:43] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2025.codfw.wmnet with OS bullseye [17:16:49] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2025.codfw.wmnet with OS bullseye [17:18:30] !log T289135 T314078 Manually reimaging remaining codfw stretch hosts (`elastic[2025,2031,2054,2059-2060]`) to bullseye, one host at a time, waiting for green cluster status to return between each run. `ryankemper@cumin1001` tmux session `codfw_reimage` [17:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:34] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [17:18:34] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [17:20:35] (03PS1) 10BryanDavis: striker: remove legacy deployment [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) [17:23:18] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36547/" [puppet] - 10https://gerrit.wikimedia.org/r/715637 (owner: 10Legoktm) [17:24:28] (03PS2) 10BryanDavis: striker: remove legacy settings [labs/private] - 10https://gerrit.wikimedia.org/r/819116 (https://phabricator.wikimedia.org/T306469) [17:24:38] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [17:25:44] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Search [17:31:57] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2025.codfw.wmnet with reason: host reimage [17:32:31] ^ Two alerts for elastic2025 are just noise, the host is being reimaged. Unsure why they fired before the downtime though [17:36:30] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.5 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [17:37:18] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2025.codfw.wmnet with reason: host reimage [17:41:36] PROBLEM - Check for large files in client bucket on elastic2025 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.77: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [17:41:41] (03PS1) 10Andrea Denisse: netmon: Add the netmon1003 host as a syslog destination in homer [homer/public] - 10https://gerrit.wikimedia.org/r/819124 (https://phabricator.wikimedia.org/T309074) [17:41:54] PROBLEM - MD RAID on elastic2025 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.77: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:41:55] PROBLEM - Check size of conntrack table on elastic2025 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.77: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:42:00] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:42:06] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson [17:43:30] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [17:44:09] (03CR) 10BryanDavis: [V: 04-1 C: 04-1] striker: remove legacy deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:44:22] RECOVERY - Check for large files in client bucket on elastic2025 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [17:44:42] RECOVERY - Check size of conntrack table on elastic2025 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:44:50] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-omega-codfw.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:45] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:49:38] PROBLEM - Host elastic2025 is DOWN: PING CRITICAL - Packet loss = 100% 
[17:50:46] RECOVERY - Host elastic2025 is UP: PING OK - Packet loss = 0%, RTA = 31.93 ms [17:51:08] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:51:31] 10SRE, 10Icinga, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dwisehaupt) 05Resolved→03Open We had another instance of this today that wasn't solved by restarting the service. mutante was kind enough to restart and try to debug l... [17:52:45] RECOVERY - MD RAID on elastic2025 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:53:12] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:54:22] (03PS19) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:55:20] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:34] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2025.codfw.wmnet with OS bullseye [17:58:39] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2025.codfw.wmnet with OS bullseye completed: - elastic2025 (... 
[18:01:13] (03PS2) 10Muehlenhoff: smart: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811229 (https://phabricator.wikimedia.org/T308013) [18:03:52] (03CR) 10Muehlenhoff: [C: 03+2] smart: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811229 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:05:02] mutante: I'm merging your rsync patch along [18:05:27] moritzm: please do. reason: multiple PMs [18:11:30] (03PS2) 10BryanDavis: striker: remove legacy deployment [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) [18:12:14] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2031.codfw.wmnet with OS bullseye [18:12:19] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2031.codfw.wmnet with OS bullseye [18:17:27] (03CR) 10Andrea Denisse: [C: 03+2] Add role::netmon to the netmon1003 instance. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [18:17:44] (03CR) 10Andrea Denisse: Use diff --color instead of colordiff as colordiff is not standard (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 (owner: 10Andrea Denisse) [18:18:47] (03CR) 10Andrea Denisse: librenms: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810323 (owner: 10Muehlenhoff) [18:20:03] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [18:20:08] (03CR) 10BryanDavis: "PCC output showing lots and lots of changes at: https://puppet-compiler.wmflabs.org/pcc-worker1002/36553/" [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:20:54] 10SRE, 10Icinga, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dwisehaupt) We did see these entries in our logs, but they are not unique and have happened in the past: ` user.warning: 2022-08-01T15:57:12.813867+00:00 frdb2001 nagios_n... 
[18:21:25] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [18:27:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2031.codfw.wmnet with reason: host reimage [18:31:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:32:57] !log gitlab - created group 'data_persistence' - added Ladsgroup and upgraded from member to maintainer [18:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2031.codfw.wmnet with reason: host reimage [18:36:38] (03PS1) 10Daniel Kinzler: Parsoid REST handler: allow pagebundle input without original HTML. [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 [18:38:15] PROBLEM - Check size of conntrack table on elastic2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.156: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:38:37] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.156: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:35] (03PS2) 10Daniel Kinzler: Parsoid REST handler: allow pagebundle input without original HTML. 
[core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 [18:40:41] RECOVERY - Check size of conntrack table on elastic2031 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:44:24] !log gitlab - moved data_persistence group to new parent, under /repos/ [18:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10KFrancis) @volans I am confirming Aline's signed NDA. Please proceed with the access request. Thanks! [18:45:23] (03PS3) 10Daniel Kinzler: Parsoid REST handler: allow pagebundle input without original HTML. [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 [18:45:47] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [18:46:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) a:05odimitrijevic→03Dzahn [18:46:25] (03PS1) 10Muehlenhoff: librenms: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/819131 [18:46:27] (03PS1) 10Muehlenhoff: librenms: Install PHP packages via the meta packages [puppet] - 10https://gerrit.wikimedia.org/r/819132 [18:46:29] (03PS1) 10Muehlenhoff: librenms: Collate PHP packages used in bullseye and buster [puppet] - 10https://gerrit.wikimedia.org/r/819133 [18:46:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) Thanks @KFrancis I am taking over as this week's clinic duty. Going ahead. 
[18:47:29] (03CR) 10Muehlenhoff: librenms: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810323 (owner: 10Muehlenhoff) [18:47:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) [18:49:17] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [18:50:43] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:11] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10lmata) [18:56:26] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2031.codfw.wmnet with OS bullseye [18:56:31] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2031.codfw.wmnet with OS bullseye completed: - elastic2031 (... 
[18:57:07] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [18:57:20] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10lmata) [18:59:08] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10lmata) [19:02:16] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/819131 (owner: 10Muehlenhoff) [19:03:05] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/819132 (owner: 10Muehlenhoff) [19:04:28] (03CR) 10Andrea Denisse: [C: 03+1] "I didn't know about 'ensure_packages()', it looks and works better indeed!" [puppet] - 10https://gerrit.wikimedia.org/r/819133 (owner: 10Muehlenhoff) [19:04:34] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dzahn) So.. on alert1001.wikimedia.org in `tail -f /var/log/icinga/icinga.log` you can see it both. A lot of: ` [1659380327] SERVICE ALERT: frdev1... [19:07:22] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dzahn) a:05Volans→03None [19:10:45] (03CR) 10Subramanya Sastry: [C: 03+1] Parsoid REST handler: allow pagebundle input without original HTML. 
[core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 (owner: 10Daniel Kinzler) [19:10:49] (03PS2) 10Bartosz Dziewoński: DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) (owner: 10Esanders) [19:10:53] (03CR) 10Bartosz Dziewoński: [C: 03+1] DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) (owner: 10Esanders) [19:12:52] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2054.codfw.wmnet with OS bullseye [19:12:59] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye [19:13:01] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "sudo cumin 'C:librenms' 'lsb_release -c' shows no more stretch" [puppet] - 10https://gerrit.wikimedia.org/r/819131 (owner: 10Muehlenhoff) [19:13:17] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:15:04] (03CR) 10Andrea Denisse: [C: 03+2] librenms: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/819131 (owner: 10Muehlenhoff) [19:16:37] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:17:57] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet 
loss = 100% [19:22:39] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 33.51 ms [19:25:25] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [19:27:24] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [19:28:39] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [19:28:44] Not super clear to me why these `Elasticsearch HTTPS` checks are firing during reimages. AFAIK they should be blocked by the downtime that the reimage cookbook puts in place [19:29:25] ryankemper: "cookbook tries to set downtime but fails due to some race condition" was / is a known bug I believe [19:29:56] like when it tries to send the downtime in the short time that host does not exist in puppet db or something [19:30:02] mutante: ack, and does it fail in a "silent" way? cause the cookbook seems to think it succeeded: `Downtimed on Icinga/Alertmanager` [19:30:35] the one I had in mind was not silent [19:30:41] But looking at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=elastic2054 I do see `No` for `In Scheduled Downtime?` [19:30:45] it would show during the cookbook run [19:31:12] maybe it was downtimed but 10 seconds after it already alerted? 
[19:31:15] Here's the relevant cookbook output, this should be before it actually does the reimage/reboot etc: [19:31:17] https://www.irccloud.com/pastebin/08xYgf0P/ [19:31:51] Oh actually that `11/12` is suspicious, that reads like it was trying and then gave up [19:32:39] ryankemper: best you can do for now is run the downtime cookbook directly [19:32:46] that is otherwise used by the reimage cookbook [19:33:10] Makes sense. I'll add a manual downtime for the next host(s) [19:33:52] er, amend my reimage command to start by running the downtime cookbook, not preemptively downtime all the hosts to be clear [19:35:56] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2054.codfw.wmnet with reason: host reimage [19:36:52] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [19:39:03] (03PS1) 10Urbanecm: ServiceImageRecommendationProvider: Add extra logging when no JSON response received [extensions/GrowthExperiments] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819075 (https://phabricator.wikimedia.org/T313973) [19:41:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2054.codfw.wmnet with reason: host reimage [19:41:56] (03CR) 10Daniel Kinzler: [C: 03+2] "Merging for backport window" [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 (owner: 10Daniel Kinzler) [19:42:06] (03PS3) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 [19:45:21] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [19:49:31] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2054 is OK: SSL OK - 
Certificate search.discovery.wmnet valid until 2027-01-23 13:10:52 +0000 (expires in 1635 days) https://wikitech.wikimedia.org/wiki/Search [19:49:57] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [19:51:07] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:51:53] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10lmata) [19:52:22] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10lmata) [19:53:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logstash103[67] - https://phabricator.wikimedia.org/T313849 (10lmata) [19:53:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10lmata) [19:54:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10lmata) [19:58:53] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:59:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10lmata) [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T2000). [20:00:04] koi, duesen, subbu, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] i can deploy today! [20:00:22] o/ [20:00:25] o/ [20:00:27] o/ [20:00:33] hi [20:00:42] (03PS2) 10Urbanecm: newiki: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818614 (https://phabricator.wikimedia.org/T311700) (owner: 10Stang) [20:00:48] (03CR) 10Urbanecm: [C: 03+2] newiki: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818614 (https://phabricator.wikimedia.org/T311700) (owner: 10Stang) [20:01:08] my patch is a no-op, nothing i can test [20:01:16] ack [20:01:17] urbanecm: i can deploy mine, i need some practice. I'm still waiting for it to merge though [20:01:40] subbu can hopefully help verify it once it's up [20:01:57] yes. [20:02:24] duesen: ack, that's great! if you want, i'm also happy to leave (some?) config patches to you, and stand by in case you have any questions or something happens. [20:02:50] (03Merged) 10jenkins-bot: newiki: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818614 (https://phabricator.wikimedia.org/T311700) (owner: 10Stang) [20:03:13] eh... ok, i can do one while I wait for mine to merge i guess [20:03:24] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2054.codfw.wmnet with OS bullseye [20:03:29] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye completed: - elastic2054 (... 
[20:03:51] (03PS4) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 [20:04:13] duesen: okay! koi's first one just merged, so please go ahead with it :). [20:05:22] ok. I'm logged in and verified the diff [20:05:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:05:30] (03CR) 10Jbond: "All tests are working now" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (owner: 10Jbond) [20:05:59] (03Merged) 10jenkins-bot: Parsoid REST handler: allow pagebundle input without original HTML. [core] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819129 (owner: 10Daniel Kinzler) [20:06:08] pulling to mwdebug [20:06:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:06:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:07:13] trying to verify on the debug host, hold on [20:07:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:07:34] duesen: koi can do that too (it's their patch) [20:07:56] yeah I could check [20:08:22] ok, let me know if it looks good [20:08:47] checked and LGTM [20:09:05] out of curiosity: where should i be able to spot the difference? [20:10:03] this is about adding a new wordmark, so there's a new file at https://ne.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ne.svg [20:10:20] and using mobile view, the wordmark on the top left changed [20:10:36] one can also use https://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0?useskin=vector-2022 [20:10:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH) [20:11:04] or https://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0?useskin=timeless. 
probably used in other places too [20:11:06] ah, right, mobile [20:11:22] ok. ready to scap? [20:11:32] i think so! [20:12:03] Is it correct to use two separate sync-file calls? Can't it be done in one? [20:12:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH) [20:12:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH) [20:12:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:58] urbanecm: ? [20:12:59] duesen: two syncs are fine for this. sync-world would do it at once, but that's an overkill for such a small change. [20:13:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:27] Yes, I thought sync-file could just take two files parameters [20:13:45] unfortunately, it doesn't [20:13:50] ok, first pig is flying... [20:14:04] (it can be provided with a dir, but that's not helpful when the files don't share a common directory) [20:14:11] right, ok [20:14:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:40] Huh. Running '/usr/local/sbin/restart-php-fpm-all php7.2-fpm 9223372036854775807' on 296 host(s) [20:16:14] Why is restarting fpm? [20:16:23] a recent scap change. lemme find the phab task [20:16:36] looks like that's going to take a while... [20:16:49] unfortunately [20:17:14] !log daniel@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:818614|newiki: Update wordmark (T311700)]] (duration: 03m 32s) [20:17:19] T311700: Requesting logo change for ne.wikipedia.org (Mobile version) - https://phabricator.wikimedia.org/T311700 [20:17:36] ok done. 
three minutes is not too bad, but a progress indicator would be nice [20:18:10] I just realized i am deploying the files in the wrong order. the word mark url is a 404 now [20:18:50] " The order is random please be careful with the deployment " <--- indeed :) [20:19:05] let's just sync the static file and we should be good soon :) [20:19:14] yes, already going [20:19:31] duesen: fyi, T266055 is the task when the rolling restart was added. [20:19:31] T266055: Update Scap to perform rolling restart for all MW deploy - https://phabricator.wikimedia.org/T266055 [20:21:05] !log daniel@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-ne.svg: Config: [[gerrit:818614|newiki: Update wordmark (T311700)]] (duration: 03m 17s) [20:21:13] Bwahaha.... "When this happens, php7-opcache shits itself and causes unrecoverable corruption to the interpreted source code. " [20:21:22] haha [20:21:36] Hey Daniel. Long time no see [20:21:40] I guess we better restart fpm then ;) [20:21:58] Hey dancy! [20:22:45] urbanecm: ok, looks like it's done. koi, can you check once more? [20:23:08] great! looks like core patch merged too, so i guess you'll deploy that too [20:23:21] yep, I was just going to say that! [20:23:30] heh :) [20:23:46] * duesen loves deploy-commands [20:23:49] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Papaul) [20:23:54] it stay unchanged, I thought there's some cache issue? 
[20:24:07] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2087 - https://phabricator.wikimedia.org/T313483 (10Papaul) 05Open→03Resolved complete [20:24:21] actually I mean it is currently 404 on my side [20:24:22] ok, the diff looks clean [20:24:32] 10SRE-swift-storage, 10Observability-Alerting, 10Patch-For-Review: Port swift prometheus-based alerts from icinga to alertmanager - https://phabricator.wikimedia.org/T312765 (10lmata) p:05Triage→03Medium from patch notes alerts have been downgraded from page to critical for testing. [20:24:33] koi: lemme purge the static URL [20:24:40] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Discovery-Search (Current work): Decommission elastic2049.codfw.wmnet - https://phabricator.wikimedia.org/T313842 (10Papaul) [20:25:04] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Discovery-Search (Current work): Decommission elastic2049.codfw.wmnet - https://phabricator.wikimedia.org/T313842 (10Papaul) 05Open→03Resolved complete [20:25:12] !log Purge https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ne.svg (T311700) [20:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:14] T311700: Requesting logo change for ne.wikipedia.org (Mobile version) - https://phabricator.wikimedia.org/T311700 [20:25:23] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Jgreen) >>! In T196336#8120799, @Dzahn wrote: > Finally we have these cases where a host IS sending packets but does not exist in Icinga (yet?). I... [20:25:27] koi: what about now? [20:25:40] oh yeah, looks great now :) [20:25:47] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (10Papaul) [20:25:47] great! 
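The 404 seen above came from syncing wmf-config/InitialiseSettings.php before the static file it points to. A minimal sketch of ordering per-file syncs so that referenced assets go out first. `order_for_sync` is a hypothetical helper, not part of scap, and the rule "anything under wmf-config/ goes last" is an assumption made for illustration only:

```python
# Hypothetical helper, not part of scap: decide the order of per-file syncs.
def order_for_sync(paths):
    """Return paths with referenced assets first and config files last."""
    def is_config(path):
        # Assumption: files under wmf-config/ may reference other synced files.
        return path.startswith("wmf-config/")

    # sorted() is stable and False sorts before True, so non-config files
    # come first and keep their relative order within each group.
    return sorted(paths, key=is_config)


changed = [
    "wmf-config/InitialiseSettings.php",
    "static/images/mobile/copyright/wikipedia-wordmark-ne.svg",
]

for path in order_for_sync(changed):
    print(path)
```

With the two files from this deployment, the static SVG is listed before the config file, which is the order that would have avoided the temporary 404.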
[20:25:56] *phew* [20:26:13] subbu: I pulled our batch to mwdebug1001 [20:26:15] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2086 - https://phabricator.wikimedia.org/T313482 (10Papaul) 05Open→03Resolved complete [20:26:17] can you verify it there? [20:26:19] ok. [20:27:15] meh .. the rest api isn't enabled there it appears ... so cannot verify there. "Error: Got status code: 404; body: "{\"messageTranslations\":{\"en\":\"The requested relative path (/en.wikipedia.org/v3/page/wikitext/Hospet) did not match any known handler\"},\"httpCode\":404,\"httpReason\":\"Not Found\"}"" [20:27:19] I'm just reading your message on slack now, sorry. [20:27:51] that sucks. i mean, we can pull to prod and verify the fix after it's deployed. [20:27:59] I just really hope my fix doesn't break anything :) [20:28:18] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [20:28:20] yes ... we can revert it promptly if so. [20:28:58] duesen: It turns out that adding progress reporting to the php-fpm restart phase is more complicated than just https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/793842. The long pause in output is disconcerting and many people have complained. I hope to make it better soon. [20:29:01] hm, is it expected that the scap doesn't include test files? [20:29:50] dancy: yea, it's scary. "is it stuck? what exactly happens if i hit ctrl-c now"?... [20:30:02] duesen: wdym by the test files question? [20:30:07] dancy: but I understand that seemingly simple things often just... aren't simple. [20:30:16] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) [20:30:37] urbanecm: the patch updates a phpunit test. deploy-commands doesn't give me a sync-file command for that. [20:30:57] ah. that's fine.
tests aren't loaded by prod, so it doesn't make sense to sync them [20:31:09] yea, i figured. just making sure [20:31:12] yeah [20:31:21] scap itself syncs almost everything [20:31:33] (unless told otherwise) [20:31:38] duesen, let me know once it is everywhere and i can verify. [20:32:00] piggy is flying... [20:35:01] !log daniel@deploy1002 Synchronized php-1.39.0-wmf.22/includes/Rest/Handler: Fix: [[gerrit:819129|Parsoid REST handler: allow pagebundle input without original HTML.]] (duration: 03m 15s) [20:35:09] dancy: would it be an idea to do the fpm restarts only once all the deployments are done? E.g. periodically (not during deployment windows) detect whether changes have been deployed, and trigger the restart if so. [20:35:19] subbu: it's up [20:36:00] the rt testing script now runs properly .. on scandium and wtp1025 ... [20:36:21] subbu: \o/ [20:36:54] urbanecm: can you do the remaining config changes? it's late -_- [20:37:00] yeah, sure [20:37:21] duesen: can i start now? [20:37:45] duesen: Interesting idea.. So basically batching deployments? [20:38:12] * urbanecm doesn't think that's a good idea. sometimes, syncs depend on each other. [20:38:15] (which you can do by +2'ing multiple commits before deploying) [20:38:22] duesen, did a test VE edit on officewiki and also on enwiki. didn't save. [20:38:28] but works fine. [20:38:32] dancy: well, at least the restarts, yea. [20:38:51] dancy: or just do them every 24 hours, unconditionally? [20:39:43] “I’m not a real programmer. I throw together things until it works then I move on. The real programmers will say Yeah it works but you’re leaking memory everywhere. Perhaps we should fix that. I’ll just restart Apache every 10 requests.” — Rasmus Lerdorf, original author of PHP [20:40:02] duesen: are you done with your deploys? just ensuring i don't start with configs when you're still doing sth :) [20:40:16] urbanecm: yes, done. logging out now [20:40:22] great, thanks!
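The two restart ideas floated above (restart php-fpm only when something was deployed since the last restart, or restart unconditionally every 24 hours) can be combined in one check. This is only a sketch of that decision, not how scap works: `restart_needed`, its parameters and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


# Hypothetical decision helper, not part of scap.
def restart_needed(last_deploy, last_restart, now, max_age=timedelta(hours=24)):
    """Return True if php-fpm should be restarted.

    Restart when code was deployed after the last restart, or when the
    last restart is older than max_age (the "every 24 hours" fallback).
    """
    if last_deploy > last_restart:
        return True
    return now - last_restart >= max_age


# Made-up timestamps for illustration only.
last_restart = datetime(2022, 8, 1, 20, 0)
last_deploy = datetime(2022, 8, 1, 20, 35)
now = datetime(2022, 8, 1, 21, 0)

print(restart_needed(last_deploy, last_restart, now))
```

Note that this does not address the objection raised in the channel that syncs sometimes depend on each other; it only covers when a deferred restart would fire.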
[20:40:30] (03PS2) 10Urbanecm: viwikibooks: Change wgArticleCountMethod to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818599 (https://phabricator.wikimedia.org/T314239) (owner: 10Stang) [20:40:37] (03CR) 10Urbanecm: [C: 03+2] viwikibooks: Change wgArticleCountMethod to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818599 (https://phabricator.wikimedia.org/T314239) (owner: 10Stang) [20:41:35] let's do r818599 together with r819044. first one can't be tested, because numbers are updated via a maintenance job, not on the fly, second one is no-op per MatmaRex [20:41:42] (03PS3) 10Urbanecm: DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) (owner: 10Esanders) [20:41:47] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) (owner: 10Esanders) [20:42:02] (03Merged) 10jenkins-bot: viwikibooks: Change wgArticleCountMethod to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818599 (https://phabricator.wikimedia.org/T314239) (owner: 10Stang) [20:42:55] I just doubt if the change for viwikibooks is testable [20:43:09] (03Merged) 10jenkins-bot: DiscussionTools: Make new reply buttons available at mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819044 (https://phabricator.wikimedia.org/T314076) (owner: 10Esanders) [20:43:23] koi: yeah, it's not (well, not until the job runs) [20:43:28] i'm just syncing it [20:44:13] (03PS3) 10Urbanecm: mnwwiktionary: Create Appendix namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818569 (https://phabricator.wikimedia.org/T314023) (owner: 10Stang) [20:44:17] (03CR) 10Urbanecm: [C: 03+2] mnwwiktionary: Create Appendix namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818569 
(https://phabricator.wikimedia.org/T314023) (owner: 10Stang) [20:44:27] seems need to run updateArticleCount.php [20:45:25] yeah [20:45:43] (03Merged) 10jenkins-bot: mnwwiktionary: Create Appendix namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818569 (https://phabricator.wikimedia.org/T314023) (owner: 10Stang) [20:46:09] (or wait for initSiteStats.php to run automatically) [20:46:19] but not a problem running updateArticleCount now [20:46:31] after it syncs, ofc [20:47:15] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c19c3e36ab: DiscussionTools: Make new reply buttons available at mediawiki.org (T314076); 24db016c4: viwikibooks: Change wgArticleCountMethod to any (T314239) (duration: 03m 10s) [20:47:19] T314239: Change Vietnamese Wikibooks article count method to any - https://phabricator.wikimedia.org/T314239 [20:47:19] T314076: [Config Change] Make new Reply affordance styling available at mediawiki.org - https://phabricator.wikimedia.org/T314076 [20:47:36] koi: appendix namespace patch pulled to mwdebug1001, can you check? [20:47:40] MatmaRex: your patch is live [20:47:44] looking [20:47:46] (03PS2) 10Urbanecm: itwiki: Change robot policy on NS2 and NS3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818566 (https://phabricator.wikimedia.org/T314165) (owner: 10Stang) [20:47:49] (03CR) 10Urbanecm: [C: 03+2] itwiki: Change robot policy on NS2 and NS3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818566 (https://phabricator.wikimedia.org/T314165) (owner: 10Stang) [20:47:59] thanks urbanecm [20:48:17] np [20:48:30] dry run of mwscript updateArticleCount.php --wiki=viwikibooks says `27115`. 
bigger than number shown on https://vi.wikibooks.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Th%E1%BB%91ng_k%C3%AA [20:48:40] (03Merged) 10jenkins-bot: itwiki: Change robot policy on NS2 and NS3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818566 (https://phabricator.wikimedia.org/T314165) (owner: 10Stang) [20:48:45] !log [urbanecm@mwmaint1002 ~]$ mwscript updateArticleCount.php --wiki=viwikibooks --update # T314239 [20:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:50:20] urbanecm: the appendix ns is there, though no page inside [20:50:27] i guess that's fine [20:50:30] I thought need to run namespaceDupes.php after sync [20:50:35] might change with the namespaceDupes.php one [20:50:39] syncing [20:50:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:50:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:53:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ba8c17759b7e737a6757792ad4136ff3af00030c: mnwwiktionary: Create Appendix namespace (T314023) (duration: 03m 09s) [20:53:51] T314023: Create Appendix namespace on Mon Wiktionary (mnwwiktionary) - https://phabricator.wikimedia.org/T314023 [20:53:57] koi: namespace should be there! [20:54:12] koi: and the itwiki patch is at mwdebug1001 now [20:55:04] I don't think the robot config is testable, so maybe sync? 
[20:55:20] it should be (check the meta tag on the page) [20:55:31] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=mnwwiktionary --fix # T314023 [20:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:26] source of https://it.wikipedia.org/wiki/Discussioni_utente:Dispe looks to have [20:56:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:57:05] ok, I tested on a user page and a user talk page, the information said's Indexing by robots: Disallowed now [20:57:10] great [20:57:11] syncing [20:57:12] so I think it works well [20:57:13] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:57:51] !log phab1001 - rsyncing repo data /srv/repos/ to phab2002 (in addition to phab1004 previously) T313360 [20:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:54] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [20:57:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:57:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:58:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:59:38] mnwwiktionary's namespaceDupes finished, all done [21:00:05] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220801T2100). 
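The robot policy check described above ("check the meta tag on the page") can be scripted. A minimal sketch using only the Python standard library; the sample HTML is made up for illustration and is not the actual source of the itwiki page, and `robots_directives` is a hypothetical helper name:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.robots = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            self.robots.append(attributes.get("content") or "")


def robots_directives(html):
    """Return the lower-cased robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = []
    for content in parser.robots:
        directives.extend(part.strip().lower() for part in content.split(","))
    return directives


# Hypothetical page source, for illustration only.
sample = '<html><head><meta name="robots" content="noindex,nofollow"/></head></html>'
print(robots_directives(sample))
```

To test a config change before the sync, the HTML would need to be fetched through the debug host; how that request is routed is outside this sketch.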
[21:00:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 461e0709a8987b110f669b74afc38c706b616e5d: itwiki: Change robot policy on NS2 and NS3 (T314165) (duration: 03m 18s) [21:00:35] T314165: Changing the ns2/ns3 robot policy on itwiki - https://phabricator.wikimedia.org/T314165 [21:00:53] should be all done :) [21:01:01] !log UTC late backport window done [21:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:08] Secteam: over to you [21:02:57] !log gerrit2002 - mkdir /var/lib/gerrit2/review_site | gerrit1001 - rsyncing /var/lib/gerrit2/review_site/ to gerrit2002 T313250 T313972 [21:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:04] T313972: Add gerrit2002 as a replica of gerrit1001 - https://phabricator.wikimedia.org/T313972 [21:03:05] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250 [21:04:06] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36559/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/819132 (owner: 10Muehlenhoff) [21:05:06] (03CR) 10Dzahn: [C: 03+1] librenms: Collate PHP packages used in bullseye and buster [puppet] - 10https://gerrit.wikimedia.org/r/819133 (owner: 10Muehlenhoff) [21:06:39] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [21:08:45] !log drain ganeti2028 T309957 [21:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:47] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [21:11:29] (03PS1) 10Dzahn: admin: upgrade Aline Bruenger from ldap_only to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/819165 (https://phabricator.wikimedia.org/T314117) [21:12:20] (03CR) 10Dzahn: "This 
type of access is a new class of users, it's not shell with SSH key but it's also not "ldap_only"." [puppet] - 10https://gerrit.wikimedia.org/r/819165 (https://phabricator.wikimedia.org/T314117) (owner: 10Dzahn) [21:13:43] 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) a:05sgrabarczuk→03Dzahn [21:19:23] 10SRE, 10Discovery-Search, 10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10MPhamWMF) 05Open→03Declined Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order... [21:19:53] (03PS1) 10Dzahn: admin: upgrade Szymon Grabarczuk from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/819166 (https://phabricator.wikimedia.org/T313616) [21:24:50] (03PS1) 10Dzahn: Revert "deployment_server: remove packages wrk, siege and lua-cjson" [puppet] - 10https://gerrit.wikimedia.org/r/819076 [21:25:04] 10SRE, 10Discovery, 10Discovery-Search, 10Elasticsearch: Collect metrics on CirrusSearch usage of PoolCounter - https://phabricator.wikimedia.org/T130617 (10MPhamWMF) 05Open→03Declined Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the b... [21:26:34] 10SRE, 10Discovery, 10Discovery-Search, 10Elasticsearch: Setup backups of elasticsearch indices - https://phabricator.wikimedia.org/T91404 (10MPhamWMF) 05Open→03Declined Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of ticke... [21:28:32] 10SRE, 10LDAP-Access-Requests, 10User-Raymond_Ndibe: Grant Access to wmf for Raymond Ndibe - https://phabricator.wikimedia.org/T314222 (10Dzahn) Hello @Raymond_Ndibe you already (or meanwhile) have this access you are requesting. 
I see you are already a member of the wmf LDAP group. [21:28:58] 10SRE, 10LDAP-Access-Requests, 10User-Raymond_Ndibe: Grant Access to wmf for Raymond Ndibe - https://phabricator.wikimedia.org/T314222 (10Dzahn) 05Open→03Resolved a:03Dzahn Let us know if you run into any problems logging into the tools you listed. [21:29:32] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Dzahn) p:05Triage→03Medium [21:30:27] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: cannot use multiple accounts - https://phabricator.wikimedia.org/T314251 (10Dzahn) p:05Triage→03Medium [21:30:49] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: unclarity of the remove-all button - https://phabricator.wikimedia.org/T314250 (10Dzahn) p:05Triage→03Medium [21:31:14] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: suppress owner notification while subscribing or unsubscribing users - https://phabricator.wikimedia.org/T314248 (10Dzahn) p:05Triage→03Medium [21:31:31] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247 (10Dzahn) p:05Triage→03Medium [21:32:06] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10Dzahn) p:05Triage→03Medium [21:33:04] 10SRE, 10Wikimedia-Mailing-lists: postorius list overview should be sorted - https://phabricator.wikimedia.org/T314246 (10Dzahn) p:05Triage→03Low [21:33:35] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/819171 [21:33:37] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/819171 (owner: 10Ahmon Dancy) [21:35:20] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev 
[mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/819171 (owner: 10Ahmon Dancy) [21:37:19] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [21:47:36] (03PS1) 10DDesouza: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) [21:50:49] (03PS1) 10DDesouza: QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) [21:51:45] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.5 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [21:52:23] (03PS2) 10DDesouza: QuickSurveys(beta): Deploy research incentive survey to Bengali wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819174 (https://phabricator.wikimedia.org/T314333) [21:59:01] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [22:07:03] (03PS1) 10Andrea Denisse: netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) [22:12:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:22:07] (03PS1) 10Andrea Denisse: netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) [22:27:18] 
(03CR) 10Andrea Denisse: [C: 03+2] librenms: Collate PHP packages used in bullseye and buster [puppet] - 10https://gerrit.wikimedia.org/r/819133 (owner: 10Muehlenhoff) [22:27:48] (03CR) 10Andrea Denisse: [C: 03+2] librenms: Install PHP packages via the meta packages [puppet] - 10https://gerrit.wikimedia.org/r/819132 (owner: 10Muehlenhoff) [22:29:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:31:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:39:31] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [22:41:00] (03PS1) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [22:45:44] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36562/" [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse) [22:45:49] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add regex that matches the netmon instances to get certs from Acme Chief [puppet] - 10https://gerrit.wikimedia.org/r/818494 (https://phabricator.wikimedia.org/T314162) (owner: 10Andrea Denisse) [22:47:37] (03CR) 10Ahmon Dancy: "pcc results https://puppet-compiler.wmflabs.org/pcc-worker1003/36561/" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [22:48:07] PROBLEM - Work 
requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:59:55] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:15] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:06:07] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:13:41] (03PS2) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) [23:13:43] (03PS1) 10Dduvall: devtools: Configure keyholder for scap3 deployment of phabricator [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) [23:16:06] (03CR) 10Dduvall: "I've successfully tested this by adding the hiera values from this patch to the devtools project puppet in Horizon. 
Once this merges I'll " [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [23:18:07] (03PS2) 10Dduvall: devtools: Configure keyholder for scap3 deployment of phabricator [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) [23:18:17] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:18:32] (03CR) 10Dduvall: "Rebased to drop the unnecessary relation." [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [23:23:09] (03PS1) 10Dduvall: Add scap.cfg section for devtools environment [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/819194 (https://phabricator.wikimedia.org/T314195) [23:24:37] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:32:57] (03PS1) 10Tim Starling: Disable credits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819196 (https://phabricator.wikimedia.org/T130820) [23:35:23] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:35:30] (03CR) 10Jforrester: [C: 03+1] Disable credits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819196 (https://phabricator.wikimedia.org/T130820) (owner: 10Tim Starling) [23:44:20] (03CR) 10Krinkle: [C: 03+1] "LGTM, the rationale on the ticket seems inapplicable unless it's on a path to be used elsewhere in prod. 
testwiki isn't for integrating ra" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819196 (https://phabricator.wikimedia.org/T130820) (owner: 10Tim Starling) [23:45:20] TimStarling: I could roll it out with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/818286 now if you like [23:46:46] sure [23:46:54] (03CR) 10Krinkle: [C: 03+2] Disable BounceHandler on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097) (owner: 10Krinkle) [23:46:56] (03CR) 10Krinkle: [C: 03+2] Disable credits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819196 (https://phabricator.wikimedia.org/T130820) (owner: 10Tim Starling) [23:49:44] looks like CI isn't doing anything useful [23:49:58] https://integration.wikimedia.org/zuul/ [23:49:59] everything is pending [23:50:44] > !log drain ganeti2028 T309957 [23:50:45] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [23:51:13] maybe ganeti2028 contains something essential for CI [23:52:12] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [23:52:18] I was just about to say the test queue should be fine, but having looked... I'll shush. 
fwiw the test queue *was* okay about 30 minutes ago or so, and jobs have been processing (I recently cleared out a few stuck beta deployment jobs) [23:52:42] dduvall: might need a pair of eyes from someone who knows more about zuul/gearman etc [23:52:57] contint2001 appears generally up and reachable [23:53:41] (03Merged) 10jenkins-bot: Disable credits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819196 (https://phabricator.wikimedia.org/T130820) (owner: 10Tim Starling) [23:53:43] ok, business has resumed, thanks to whomever :) [23:54:31] (03PS2) 10Krinkle: Disable BounceHandler on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097) [23:54:37] (03CR) 10Krinkle: [C: 03+2] Disable BounceHandler on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097) (owner: 10Krinkle) [23:55:27] confirmed TimStarling 's change on testwiki/mwdebug1002 [23:55:41] (03Merged) 10jenkins-bot: Disable BounceHandler on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818286 (https://phabricator.wikimedia.org/T225097) (owner: 10Krinkle) [23:56:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:56:31] I did nothing helpful but glad it's going again :) [23:57:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:57:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:58:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:58:12] dduvall: both zuul and jenkins itself were showing no jobs executing for a good few minutes [23:59:34] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Id1ce285631f5, I194d419fbfe (duration: 03m 09s)