[00:00:31] what's the logo URL?
[00:00:53] https://jv.wiktionary.org/static/images/project-logos/jvwiktionary.png
[00:01:05] jv.wiktionary.org/static/images/project-logos/jvwiktionary-1.5x.png
[00:01:24] https://jv.wiktionary.org/static/images/project-logos/jvwiktionary-2x.png
[00:02:11] done
[00:02:58] This has not yet updated https://jv.wiktionary.org/static/images/project-logos/jvwiktionary.png
[00:03:27] (https://sal.toolforge.org/log/eckgXXwB1jz_IcWu51NG)
[00:05:40] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GrowthExperiments: Backport: [[gerrit:727498|Mentee overview: Make UncachedMenteeOverviewDataProvider::getBlocksForUsers faster (T290609)]] (duration: 00m 56s)
[00:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:47] T290609: Make mentee overview module's updateMenteeData.php scale better - https://phabricator.wikimedia.org/T290609
[00:07:07] !log deploy window over
[00:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:32] not sure what's going on with that logo. I can confirm the new logo is synced.
[00:15:50] tgr_: Can purge cache of this logo: https://sal.toolforge.org/log/eckgXXwB1jz_IcWu51NG
[00:17:30] fetching that file gives `x-cache: cp1087 hit, cp1089 hit/1 x-cache-status: hit-front`
[00:18:03] ?
[00:18:05] right after a purge it gives `x-cache: cp1087 hit, cp1089 miss x-cache-status: hit-local`
[00:18:13] so the purge does do something
[00:18:38] but it should be a miss/miss, I think?
[00:18:55] so maybe Varnish is purged but ATS not?
[00:26:11] Just to understand the situation: Did you purge the cache?
[00:28:32] (PS1) Bstorm: toolforge harbor: dockerize the config file and such [puppet] - https://gerrit.wikimedia.org/r/727638 (https://phabricator.wikimedia.org/T267616)
[00:29:54] (PS2) Bstorm: toolforge harbor: puppetize the install/compose config file and such [puppet] - https://gerrit.wikimedia.org/r/727638 (https://phabricator.wikimedia.org/T267616)
[00:31:17] legoktm: Can purge cache from https://sal.toolforge.org/log/eckgXXwB1jz_IcWu51NG?
[00:40:29] Anyone online for help?
[00:41:37] Juan_90264: I need to get offline soon, but the cache purge appears to have worked from here, I see the newer version of the logo
[00:41:48] if you're still seeing the older version, you may need to clear your browser cache or try a hard refresh
[00:43:14] rzl: Can you see "Wikisastra" here: https://jv.wiktionary.org/static/images/project-logos/jvwiktionary.png ?
[00:43:37] yes, I see "wikisastra" and not "wiktionary"
[00:43:54] Is WikimediaDebug off?
[00:44:00] yes
[00:47:03] Now see
[00:47:10] Thanks for attention
[00:47:56] no worries! I'm heading out, have a good rest of your day
[00:50:26] You too
[00:51:07] SRE, Patch-For-Review, Wikimedia-Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (Legoktm)
[01:07:56] rzl: , Juan_90264 : filed T292810 about the purge issue
[01:07:57] T292810: purgeList.php does not seem to work in Wikimedia production - https://phabricator.wikimedia.org/T292810
[01:59:01] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:02] SRE, Traffic: purgeList.php does not seem to work in Wikimedia production - https://phabricator.wikimedia.org/T292810 (Krinkle) https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Changing_files_in_/static https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging TLDR: Purge via en.wikip...
[02:02:09] tgr_: ^
[02:02:57] still cached for mw indeed, adding ?foobar to the url shows a visually different image
[02:04:20] !log krinkle@deploy1002$ echo 'https://en.wikipedia.org/static/images/project-logos/jvwiktionary.png' | mwscript purgeList.php , ref T287425, T292810
[02:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:27] T292810: purgeList.php does not seem to work in Wikimedia production - https://phabricator.wikimedia.org/T292810
[02:04:28] T287425: Change Javanese Wiktionary logo - https://phabricator.wikimedia.org/T287425
[02:05:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:13] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:30] (PS7) Juan90264: Change logo in astwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742)
[02:30:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:53] SRE, Traffic: purgeList.php does not seem to work in Wikimedia production - https://phabricator.wikimedia.org/T292810 (Tgr) Open→Invalid D'oh, thanks.
[02:43:22] (CR) Ladsgroup: jobqueue: Batch jobs that will end up in the default queue (1 comment) [core] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: Ladsgroup)
[02:45:07] (Abandoned) Ladsgroup: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: Ladsgroup)
[02:51:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:05:51] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[03:14:33] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:57] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 14 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[03:35:43] Hello, i returned again
[03:38:04] I try to claim in Phabricator a task that allows the addition and removal of users in some bureaucrats privileges (It was alleged that they are a small wiki and that they do not have bureaucrats) to the sysops. Is there any rule that I don't know that doesn't allow this, before performing the task?
[03:38:59] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[03:40:20] Someone online?
[03:48:37] PROBLEM - SSH on mw2253.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:14] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814
[03:58:15] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814
[03:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:21] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814
[03:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:03:31] (PS1) Ryan Kemper: elastic: nodes is a local variable, not attribute [cookbooks] - https://gerrit.wikimedia.org/r/727887 (https://phabricator.wikimedia.org/T280221)
[04:14:33] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.89`.
Pre-deploy tests passing on canary `wdqs1003` [04:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:44] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@8f57a56]: 0.3.89 [04:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:18] !log [WDQS Deploy] Tests passing following deploy of `0.3.89` on canary `wdqs1003`; proceeding to rest of fleet [04:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:27] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [04:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:04] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [04:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:22] (03CR) 10Ryan Kemper: [C: 03+2] elastic: nodes is a local variable, not attribute [cookbooks] - 10https://gerrit.wikimedia.org/r/727887 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:20:48] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [04:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:54] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [04:20:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [04:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:06] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@8f57a56]: 0.3.89 (duration: 08m 22s) [04:23:09] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:49] (03PS1) 10Ryan Kemper: elastic: sleep comes from time package [cookbooks] - 10https://gerrit.wikimedia.org/r/727923 (https://phabricator.wikimedia.org/T280221) [04:28:50] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [04:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:54] !log [WDQS Deploy] Restarted `wdqs-categories` across both test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [04:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:01] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [04:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:01] (03CR) 10Ryan Kemper: [C: 03+2] elastic: sleep comes from time package [cookbooks] - 10https://gerrit.wikimedia.org/r/727923 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [04:31:24] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [04:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:30] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [04:32:05] !log T292814 Beginning rolling restart of `cloudelastic`: `sudo -i cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic restart" --nodes-per-run 1 --start-datetime 2021-10-08T03:53:49 --task-id T292814` on `ryankemper@cumin1001` tmux `elastic` [04:32:10] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:28] (03PS2) 10Ryan Kemper: blazegraph: relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/725000 (owner: 10DCausse) [04:40:29] (03CR) 10Ryan Kemper: [C: 03+2] blazegraph: relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/725000 (owner: 10DCausse) [04:49:35] RECOVERY - SSH on mw2253.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:56:32] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [04:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:17] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:11:30] (03PS31) 10Ryan Kemper: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [05:11:57] (03CR) 10Ryan Kemper: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [05:35:17] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [05:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:23] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [05:43:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces 
up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:49] (03CR) 10Ayounsi: [C: 04-1] "Overall I agree it's a good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/727594 (https://phabricator.wikimedia.org/T292792) (owner: 10CDanis) [05:52:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:42] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:12:06] (03CR) 10Elukey: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) (owner: 10Elukey) [06:23:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Allow setting http2 protocol options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:25:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think that configuration really pertains to ProductionServices.php; moreover, the conditional is useful for labs as well if we're " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726660 (owner: 10Dduvall) [06:26:42] (03PS1) 10Elukey: profile::hadoop::yarn_proxy_testcluster: fix if condition [puppet] - 10https://gerrit.wikimedia.org/r/728047 [06:28:11] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [06:28:12] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [06:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:02] (03PS2) 10Elukey: profile::hadoop::yarn_proxy_testcluster: remove if 
condition [puppet] - https://gerrit.wikimedia.org/r/728047
[06:31:37] (PS3) Elukey: profile::hadoop::yarn_proxy_testcluster: fix if condition [puppet] - https://gerrit.wikimedia.org/r/728047
[06:32:15] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31565/console" [puppet] - https://gerrit.wikimedia.org/r/728047 (owner: Elukey)
[06:35:59] Someone oline?
[06:36:03] * online
[06:37:43] Juan_90264: https://nohello.net/ :-)
[06:41:11] SRE, Patch-For-Review, Wikimedia-Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (ayounsi)
[06:42:51] !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[06:42:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[06:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211008T0700) [07:04:51] (03PS1) 10Muehlenhoff: Switch ganeti2025 to buster as part of Ganeti update tests [puppet] - 10https://gerrit.wikimedia.org/r/728070 [07:07:46] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::hadoop::yarn_proxy_testcluster: fix if condition [puppet] - 10https://gerrit.wikimedia.org/r/728047 (owner: 10Elukey) [07:09:14] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to cn=nda for Majavah - https://phabricator.wikimedia.org/T292783 (10Majavah) [07:11:53] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti2025 to buster as part of Ganeti update tests [puppet] - 10https://gerrit.wikimedia.org/r/728070 (owner: 10Muehlenhoff) [07:17:26] 10SRE, 10Traffic, 10Patch-For-Review, 10User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) [07:24:46] 10SRE, 10Traffic, 10User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (10ema) [07:25:04] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q2), 10User-ema: Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (10ema) [07:25:15] 10SRE, 10Observability-Alerting, 10Traffic, 10User-ema: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10ema) [07:25:25] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-ema: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10ema) [07:25:37] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-ema: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used - https://phabricator.wikimedia.org/T292180 (10ema) [07:25:50] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-ema, 10User-fgiunchedi: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) [07:26:20] RECOVERY - Check systemd state on deploy1002 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:26] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:36] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [07:33:22] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [07:35:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Marginal question, but LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [07:36:28] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:20] !log manually resuming the data reloads on wdqs1009 and wdqs2008 [07:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:28] ryankemper: ^ [07:42:34] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:51] !log reboot ms-be2045 T290881 [07:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:57] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [07:56:17] (03CR) 10David Caro: base::environment: use only vars inside ::realm ifs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [07:56:35] (03PS32) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [07:56:54] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (033 comments) [software/spicerack] - 
https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: ZPapierski)
[08:04:33] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=frwiki --force
[08:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:27] (PS8) David Caro: base::environment: use only vars inside ::realm ifs [puppet] - https://gerrit.wikimedia.org/r/725302
[08:16:29] (PS3) David Caro: base::environment: move to profile::environment and parametrize [puppet] - https://gerrit.wikimedia.org/r/727368
[08:16:31] (CR) David Caro: base::environment: move to profile::environment and parametrize (1 comment) [puppet] - https://gerrit.wikimedia.org/r/727368 (owner: David Caro)
[08:20:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:25] SRE, Traffic: Wikipedia not accessible in Russia on 2021-10-07 16:00-17:00UTC - https://phabricator.wikimedia.org/T292776 (Zemant) its OK now!
[08:20:34] (PS4) David Caro: base::environment: move to profile::environment and parametrize [puppet] - https://gerrit.wikimedia.org/r/727368
[08:21:19] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31567/console" [puppet] - https://gerrit.wikimedia.org/r/727368 (owner: David Caro)
[08:21:47] (PS1) Majavah: acme_chief: add wildcard to openstack certs [puppet] - https://gerrit.wikimedia.org/r/728246 (https://phabricator.wikimedia.org/T267194)
[08:24:28] (PS1) Urbanecm: Revert "Mentee overview: Truncate long usernames" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/728247 (https://phabricator.wikimedia.org/T292224)
[08:25:28] SRE, SRE Observability, Traffic, User-ema: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (ema)
[08:25:35] SRE, Traffic, User-ema: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (ema)
[08:25:36] (PS9) David Caro: base::environment: use only vars inside ::realm ifs [puppet] - https://gerrit.wikimedia.org/r/725302
[08:25:38] (PS5) David Caro: base::environment: move to profile::environment and parametrize [puppet] - https://gerrit.wikimedia.org/r/727368
[08:25:39] hello, would it be possible to go ahead with deployment of ^^? It's a recently released Growth feature that has a big visual bug that this should fix. Happy to wait for Monday if need be.
[08:25:40] (03PS1) 10David Caro: P:auto_restarts: add missing value to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/728248 [08:26:17] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31569/console" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:30:38] (03Abandoned) 10David Caro: P:auto_restarts: add missing value to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/728248 (owner: 10David Caro) [08:32:03] (03PS1) 10Muehlenhoff: Fix exit code for correctly generated reports [puppet] - 10https://gerrit.wikimedia.org/r/728249 [08:33:05] (03PS10) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [08:33:07] (03PS6) 10David Caro: base::environment: move to profile::environment and parametrize [puppet] - 10https://gerrit.wikimedia.org/r/727368 [08:33:53] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:06] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31571/console" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:34:11] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31570/console" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:36:33] (03PS1) 10Ayounsi: Add transit BGP communities for anycast traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) [08:36:37] (03PS1) 10Ayounsi: Configure transit specific outbound BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) [08:37:31] (03CR) 10David Caro: [V: 03+1] 
"Implemented the changes, the only thing is missing is for me to figure out why the core dump patterns was just `core` in cloud before (acc" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:39:28] (03CR) 10Muehlenhoff: [C: 03+2] Fix exit code for correctly generated reports [puppet] - 10https://gerrit.wikimedia.org/r/728249 (owner: 10Muehlenhoff) [08:39:38] (03CR) 10Ayounsi: "Example diff for cr3-ulsfo:" [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [08:40:09] (03CR) 10Ayounsi: "Example diff for cr3-ulsfo:" [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [08:41:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "Mentee overview: Truncate long usernames" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728247 (https://phabricator.wikimedia.org/T292224) (owner: 10Urbanecm) [08:42:15] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728247 (https://phabricator.wikimedia.org/T292224) (owner: 10Urbanecm) [08:45:58] (03CR) 10Jbond: base::environment: move to profile::environment and parametrize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:47:05] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31572/console" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [08:47:34] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31573/console" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [08:49:55] (03CR) 10Jbond: [C: 03+1] "LGTM i think we can either update tools/toolsbeta with the new key or simply drop the current key (which looks like it has a bad value)" 
[puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:53:27] (03CR) 10David Caro: [V: 03+1] base::environment: move to profile::environment and parametrize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:56:00] (03CR) 10David Caro: [V: 03+1 C: 03+2] base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [08:56:06] (03CR) 10David Caro: [V: 03+1 C: 03+2] base::environment: move to profile::environment and parametrize [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:56:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/727387 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [08:56:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 on the idea." [puppet] - 10https://gerrit.wikimedia.org/r/727594 (https://phabricator.wikimedia.org/T292792) (owner: 10CDanis) [08:57:15] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: remove legacy code [cookbooks] - 10https://gerrit.wikimedia.org/r/727411 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [08:57:25] (03CR) 10David Caro: [V: 03+1 C: 03+2] base::environment: move to profile::environment and parametrize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [08:59:35] (03CR) 10Jbond: [C: 03+1] cumin: remove wmf-auto-reimage scripts [puppet] - 10https://gerrit.wikimedia.org/r/727415 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [08:59:52] (03CR) 10Jbond: [C: 03+1] sre.hosts.reimage: renamed from experimental [cookbooks] - 10https://gerrit.wikimedia.org/r/727412 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [09:02:17] 10SRE, 10Traffic, 10User-ema: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820 (10ema) [09:06:05] (03PS1) 10Majavah: openstack: haproxy: add tls termination support [puppet] 
- 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [09:06:38] (03CR) 10Jbond: [C: 03+1] admin: add taavi to ldap_only users (nda) [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) (owner: 10Dzahn) [09:07:39] (03CR) 10jerkins-bot: [V: 04-1] openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:08:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (10ayounsi) I currently assume that: * IX peers are mostly local, so no special care needs to happen to them ** If this happens to be incorrect we could inv... [09:09:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) (owner: 10Dzahn) [09:13:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (10jbond) lgtm > using the NO-EXPORT BGP community (most likely not supported by many peers) FYI i have had a good experience using no-export at IX's, i.e.... 
[09:18:51] (03PS2) 10Majavah: openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [09:20:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:24:38] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:53] (03CR) 10Jbond: "see comments inline" [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [09:28:21] (03PS3) 10Majavah: openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [09:28:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [09:29:55] (03CR) 10jerkins-bot: [V: 04-1] openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:30:36] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. One thing, using the shorthand "AS" for Asia confused me at first (I guess cos it's BGP). But no biggie. We could maybe use "AP" " [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [09:31:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (10cmooney) >> IX peers are mostly local, so no special care needs to happen to them >> >> - If this happens to be incorrect we could investigate not sendin... 
[09:32:22] (03PS1) 10Muehlenhoff: Align profile contacts [puppet] - 10https://gerrit.wikimedia.org/r/728279 [09:32:48] (03CR) 10Ayounsi: Add transit BGP communities for anycast traffic engineering (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [09:38:25] (03CR) 10Btullis: [C: 03+2] Mark Christina Macholan's account as Kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/727388 (https://phabricator.wikimedia.org/T292532) (owner: 10Btullis) [09:39:10] !log installing stress on ms-be2045 given recent h/w issues T290881 [09:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:16] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [09:39:38] (03PS1) 10Muehlenhoff: Remove old test entry [puppet] - 10https://gerrit.wikimedia.org/r/728282 [09:40:05] (03Abandoned) 10Muehlenhoff: Remove old test entry [puppet] - 10https://gerrit.wikimedia.org/r/728282 (owner: 10Muehlenhoff) [09:49:04] !log wikiadmin@10.64.16.85(wikidatawiki)> delete from wb_changes_subscription where cs_subscriber_id in ('testcommonswiki', 'mowiki'); [09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [09:57:39] (03PS35) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [09:58:10] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [09:59:59] (03PS36) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 
(https://phabricator.wikimedia.org/T289661) [10:00:29] (03CR) 10jerkins-bot: [V: 04-1] P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:01:08] (03PS1) 10Muehlenhoff: Switch to puppet-generated contacts file [puppet] - 10https://gerrit.wikimedia.org/r/728318 [10:02:08] (03PS37) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:02:36] (03CR) 10Jbond: [C: 03+1] Align profile contacts [puppet] - 10https://gerrit.wikimedia.org/r/728279 (owner: 10Muehlenhoff) [10:03:42] (03CR) 10Muehlenhoff: "I've tested the report successfully with the puppet-generated YAML, before this can be removed there are two violations of the profile/pat" [puppet] - 10https://gerrit.wikimedia.org/r/728318 (owner: 10Muehlenhoff) [10:04:40] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [10:06:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:11:31] (03CR) 10Muehlenhoff: P:base: move production specific code to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:12:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/728318 (owner: 10Muehlenhoff) [10:13:06] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:36] (03PS38) 10Jbond: P:base: move production specific code to there own profile [puppet] - 
10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:15:11] (03CR) 10Jbond: "thanks, FYI this should be stable enough for a final review now" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:18:58] (03CR) 10Muehlenhoff: [C: 03+2] Align profile contacts [puppet] - 10https://gerrit.wikimedia.org/r/728279 (owner: 10Muehlenhoff) [10:20:59] (03CR) 10Ayounsi: "Looks great!" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [10:23:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:57] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1001/31562/" [puppet] - 10https://gerrit.wikimedia.org/r/727370 (owner: 10Effie Mouzeli) [10:30:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great! Two final nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:31:14] (03PS1) 10Jcrespo: mediabackups: Start backup of dewiki files on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/728341 (https://phabricator.wikimedia.org/T262668) [10:31:44] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:00] (03PS2) 10Jcrespo: mediabackups: Start backup of dewiki files on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/728341 (https://phabricator.wikimedia.org/T262668) [10:40:54] (03PS1) 10Jbond: C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 [10:41:22] (03CR) 10jerkins-bot: [V: 04-1] C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 (owner: 10Jbond) [10:42:33] (03PS2) 10Jbond: C:bird: test migration to epp [puppet] - 
10https://gerrit.wikimedia.org/r/728347 [10:42:58] (03PS39) 10Jbond: P:base: move production specific code to their own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:43:05] (03CR) 10jerkins-bot: [V: 04-1] C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 (owner: 10Jbond) [10:43:49] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Start backup of dewiki files on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/728341 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [10:43:51] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:44:04] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:53] (03PS3) 10Jbond: C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 [10:45:23] (03CR) 10jerkins-bot: [V: 04-1] C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 (owner: 10Jbond) [10:47:51] (03PS1) 10Jbond: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) [10:51:01] (03PS4) 10Jbond: C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 [10:54:25] (03PS2) 10Jbond: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) [10:58:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:58:32] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:17] 
(03PS3) 10Jbond: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) [11:21:40] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [11:30:05] (03PS1) 10Muehlenhoff: Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/728378 [11:30:21] (03PS2) 10Muehlenhoff: Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/728378 [11:32:25] (03PS1) 10Vgutierrez: acme_chief: implement file and systemd based watchdogs [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) [11:33:20] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: implement file and systemd based watchdogs [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [11:34:04] (03CR) 10Majavah: openstack: haproxy: add tls termination support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [11:34:59] (03PS1) 10Jelto: modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) [11:35:09] (03PS1) 10Jbond: bird: filter v4/v6 prefixes [puppet] - 10https://gerrit.wikimedia.org/r/728382 [11:35:49] (03CR) 10jerkins-bot: [V: 04-1] modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:36:26] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31577/console" [puppet] - 
10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:36:56] (03PS2) 10Jelto: modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) [11:37:44] (03CR) 10jerkins-bot: [V: 04-1] modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:39:01] (03PS3) 10Jelto: modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) [11:39:20] (03PS1) 10Effie Mouzeli: mwdebug: Bump opcache max accelerated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/728384 (https://phabricator.wikimedia.org/T280497) [11:41:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31578/console" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:41:24] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) After the last tuning, the results were more promising: {F34678817} On the other hand, we seem to be hitting max accelerate... [11:44:05] (03CR) 10Jbond: [C: 04-1] "i have marked the -1's the other stuff is a bit of feature creep. as always ping me on irc if anything is unclear" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [11:44:14] (03CR) 10Jelto: [V: 03+1] "We need to explicitly create the git user with login permissions. 
In the puppet-only setup the user gets created as a gitlab-ce dependency" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:44:55] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: Bump opcache max accelerated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/728384 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [11:45:22] (03CR) 10Jelto: [V: 03+1] modules::gitlab::ssh explicitly add git user and enable login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:48:09] (03PS1) 10Arturo Borrero Gonzalez: openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) [11:48:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:49:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/728378 (owner: 10Muehlenhoff) [11:49:41] (03Merged) 10jenkins-bot: mwdebug: Bump opcache max accelerated files [deployment-charts] - 10https://gerrit.wikimedia.org/r/728384 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [11:51:27] (03PS1) 10Btullis: Correct typo in the name of a hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) [11:53:16] 10SRE, 10Traffic, 10User-ema: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820 (10ema) p:05Triage→03Medium [11:53:34] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (10ema) p:05Triage→03Medium [11:53:47] 10SRE, 10Traffic, 10User-ema: ATS should 
alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10ema) p:05Triage→03High [11:54:13] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-ema: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used - https://phabricator.wikimedia.org/T292180 (10ema) p:05Triage→03Medium [11:54:14] (03PS3) 10Muehlenhoff: Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/728378 [11:54:26] 10SRE, 10SRE Observability (FY2021/2022-Q2), 10User-ema: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10ema) p:05Triage→03Medium [11:54:59] (03PS4) 10Majavah: openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [11:55:22] (03PS4) 10Muehlenhoff: Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/728378 [11:57:17] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [11:58:26] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:58:47] (03PS2) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [11:59:22] (03CR) 10Jbond: "see comment, also cc'ed moritz in case i'm missing something else. e.g. 
we might want to reverser a uid and explicitly assign it" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:59:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/728378 (owner: 10Muehlenhoff) [11:59:29] (03PS2) 10Arturo Borrero Gonzalez: openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) [11:59:52] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:00:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:35:13] (03CR) 10Ssingh: "[trying with do_ipv6=true and the relevant hiera data]" [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [12:38:00] (03PS5) 10Ssingh: C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 (owner: 10Jbond) [12:38:02] (03PS4) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [12:38:58] (03Abandoned) 10Jbond: pki2001: move to multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674916 (owner: 10Jbond) [12:39:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/728378 (owner: 10Muehlenhoff) [12:40:21] (03CR) 10Muehlenhoff: modules::gitlab::ssh explicitly add git user and enable login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:44:13] (03CR) 10Jbond: bird: add IPv6 
support to bird and anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [12:44:42] (03PS3) 10Arturo Borrero Gonzalez: openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) [12:44:44] (03PS3) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [12:45:41] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:46:17] (03CR) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [12:51:20] (03PS2) 10Jbond: bird: filter v4/v6 prefixes [puppet] - 10https://gerrit.wikimedia.org/r/728382 [12:58:04] (03CR) 10Ssingh: "[removing the hiera data so that we don't actually enable IPv6 yet and updating commit message]." [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [12:58:36] (03PS5) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [13:01:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10LSobanski) Manuel is out, adding @Kormat. 
[13:01:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10LSobanski) @Bstorm Should this task be reopened or is there another task for follow up? [13:02:53] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [13:03:49] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:06:23] (03PS3) 10Jbond: bird: filter v4/v6 prefixes [puppet] - 10https://gerrit.wikimedia.org/r/728382 [13:06:28] (03CR) 10Jbond: [C: 03+2] C:bird: test migration to epp [puppet] - 10https://gerrit.wikimedia.org/r/728347 (owner: 10Jbond) [13:06:32] (03CR) 10Jbond: [C: 03+2] bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/728349 (https://phabricator.wikimedia.org/T292737) (owner: 10Jbond) [13:08:04] (03CR) 10Jbond: [C: 03+2] bird: filter v4/v6 prefixes [puppet] - 10https://gerrit.wikimedia.org/r/728382 (owner: 10Jbond) [13:10:38] (03CR) 10Elukey: [C: 03+1] "This is indeed right:" [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:12:17] (03PS1) 10Jbond: bird: use content not source [puppet] - 10https://gerrit.wikimedia.org/r/728435 [13:12:39] (03CR) 10Ssingh: [C: 03+1] bird: use content not source [puppet] - 10https://gerrit.wikimedia.org/r/728435 (owner: 10Jbond) [13:13:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31581/console" [puppet] - 10https://gerrit.wikimedia.org/r/728435 (owner: 10Jbond) [13:13:58] (03CR) 10Jbond: [V: 03+1 C: 03+2] bird: use content not source [puppet] - 10https://gerrit.wikimedia.org/r/728435 (owner: 10Jbond) [13:14:29] 
(03PS1) 10Elukey: profile::hadoop::master: fix alerts using kerberos_prefix [puppet] - 10https://gerrit.wikimedia.org/r/728441 [13:17:00] (03Abandoned) 10Elukey: profile::hadoop::master: fix alerts using kerberos_prefix [puppet] - 10https://gerrit.wikimedia.org/r/728441 (owner: 10Elukey) [13:17:58] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:19:47] (03CR) 10Elukey: [C: 03+1] "The full command that we run is:" [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:23:10] 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10ldelench_wmf) [13:32:13] (03CR) 10Elukey: [C: 03+1] "elukey@alert1001:~$ /usr/lib/nagios/plugins/check_nrpe --ipv4 -2 -u -H an-master1001.eqiad.wmnet -c check_check_hdfs_topology -t 10" [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:33:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] "This has already been merged, adding a +1 so that it goes "out of turn" :)" [dns] - 10https://gerrit.wikimedia.org/r/727460 (https://phabricator.wikimedia.org/T292537) (owner: 10Ssingh) [13:36:34] (03CR) 10Btullis: Correct typo in the name of a hadoop worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:37:59] (03PS2) 10Vgutierrez: acme_chief: implement file and systemd based watchdogs [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) [13:38:42] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:11] (03PS1) 10David Caro: base::sysctl::core_dumps: move core_dumps to their own class [puppet] - 10https://gerrit.wikimedia.org/r/728457 [13:41:21] (03CR) 10Elukey: [C: 03+1] "No idea if this is the correct command to run, but yes it is my reading as well.." [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:41:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [13:41:53] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:42:48] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:39] (03PS4) 10Arturo Borrero Gonzalez: openstack: cinder: refactor configuration file to its own module [puppet] - 10https://gerrit.wikimedia.org/r/728390 (https://phabricator.wikimedia.org/T292546) [13:43:41] (03PS4) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [13:44:30] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:45:36] (03CR) 10Ssingh: [V: 03+1] bird: add IPv6 support to bird and anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [13:46:06] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31584/console" [puppet] - 10https://gerrit.wikimedia.org/r/728457 (owner: 10David Caro) 
[13:48:30] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31585/console" [puppet] - 10https://gerrit.wikimedia.org/r/728457 (owner: 10David Caro) [13:49:58] (03Abandoned) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [13:50:14] (03CR) 10David Caro: [V: 03+1] "The differences on pcc are expected (a param moving from one class to the other)" [puppet] - 10https://gerrit.wikimedia.org/r/728457 (owner: 10David Caro) [13:53:08] (03PS4) 10Jelto: modules::gitlab::ssh explicitly add git user and enable login [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) [13:54:45] (03PS1) 10Alexandros Kosiaris: otrs: Remove T187985 leftover [puppet] - 10https://gerrit.wikimedia.org/r/728468 [13:56:13] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (10ssingh) The Puppet change has been merged but I am going to keep this open in case @ayounsi feels that there is something e... [13:59:50] (03CR) 10Jelto: modules::gitlab::ssh explicitly add git user and enable login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [14:01:13] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:54] (03CR) 10Majavah: [C: 03+2] kubernetes: Use Ingress v1 API [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727449 (https://phabricator.wikimedia.org/T292706) (owner: 10Majavah) [14:04:10] (03CR) 10Majavah: [C: 03+2] fix python version check [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727450 (owner: 10Majavah) [14:04:44] 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10jijiki) [14:05:01] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] (03Merged) 10jenkins-bot: kubernetes: Use Ingress v1 API [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727449 (https://phabricator.wikimedia.org/T292706) (owner: 10Majavah) [14:05:56] (03Merged) 10jenkins-bot: fix python version check [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727450 (owner: 10Majavah) [14:06:06] (03CR) 10Elukey: [C: 03+1] "Opened https://phabricator.wikimedia.org/T292846" [puppet] - 10https://gerrit.wikimedia.org/r/728391 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [14:07:46] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:38] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:14:42] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:17:29] (03PS1) 10Alexandros Kosiaris: Update package for match Znuny 6.0.37 [software/otrs] - 10https://gerrit.wikimedia.org/r/728478 [14:18:06] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:42] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) I imported the latest data to `gitlab2001` and everything except ssh looks fine. I prepared a patch ([728380](https://... [14:23:58] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:44] (03CR) 10David Caro: P:base: move production specific code to their own profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:32:00] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:24] (03PS1) 10Kormat: admin: Reinstate posix account for etonkovidova [puppet] - 10https://gerrit.wikimedia.org/r/728489 (https://phabricator.wikimedia.org/T292575) [14:37:58] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:28] (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/728378 (owner: 10Muehlenhoff) [14:38:34] (03PS1) 10Jhernandez: Add more types of QuickSurveys on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728490 
(https://phabricator.wikimedia.org/T292459) [14:38:37] (03CR) 10Kormat: [C: 03+2] admin: Reinstate posix account for etonkovidova [puppet] - 10https://gerrit.wikimedia.org/r/728489 (https://phabricator.wikimedia.org/T292575) (owner: 10Kormat) [14:39:57] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (10Kormat) 05Open→03Resolved a:03Kormat Hi @etonkovidova, your existing shell account was deactivated, so i've reinstated it now. You're alr... [14:41:13] (03PS3) 10Kormat: admin: add taavi to ldap_only users (nda) [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) (owner: 10Dzahn) [14:43:43] (03CR) 10Kormat: [C: 03+2] admin: add taavi to ldap_only users (nda) [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) (owner: 10Dzahn) [14:46:06] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to cn=nda for Majavah - https://phabricator.wikimedia.org/T292783 (10Kormat) 05Open→03Resolved Hi @Majavah, welcome to the `nda` ldap group :) You can confirm this by searching for `taavi` on https://contact.toolforge.org/ [14:47:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Kormat) [14:50:00] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:12] (03CR) 10Jdlrobson: [C: 03+1] Add more types of QuickSurveys on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728490 (https://phabricator.wikimedia.org/T292459) (owner: 10Jhernandez) [14:52:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Kormat) [14:52:36] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Kormat) [14:53:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Kormat) @Ottomata: this needs your approval, please :) [14:53:59] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10Kormat) [14:54:05] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10Kormat) p:05Triage→03Medium [14:54:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Kormat) p:05Triage→03Medium [14:55:41] (03CR) 10Eigyan: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728490 (https://phabricator.wikimedia.org/T292459) (owner: 10Jhernandez) [14:57:01] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10Kormat) [14:57:54] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10Kormat) Hey @ggellerman and @Ottomata: this needs approval from both of you. Thanks! [14:59:50] PROBLEM - Host cloudweb2001-dev.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:03:28] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:03:46] my bad ^ fixed [15:04:02] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:37] (03PS1) 10Elukey: helmfile.d: Move NamespaceDefaultLabelName to common.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 [15:05:26] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:05:44] RECOVERY - Host cloudweb2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [15:06:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) >>! In T291963#7412052, @LSobanski wrote: > @Bstorm Should this task be reopened or is there a... [15:10:49] thanks kormat! [15:11:04] majavah: 💜 [15:11:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10Ottomata) Approved. [15:12:24] PROBLEM - Host db2143 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:52] ^expected? 
[15:12:58] not by me, looking [15:13:14] RECOVERY - Host db2143 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [15:13:16] (03PS1) 10Cmjohnson: Adding new kubernetes host to site.pp, dhcpd, and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/728494 (https://phabricator.wikimedia.org/T290202) [15:13:28] either crash or network glitch [15:13:49] jynus: sorry that was me network cable [15:13:59] papaul: ahh :) [15:14:01] ah, no problem :-) [15:15:05] (03CR) 10Cmjohnson: [C: 03+2] Adding new kubernetes host to site.pp, dhcpd, and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/728494 (https://phabricator.wikimedia.org/T290202) (owner: 10Cmjohnson) [15:19:58] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:00] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10Ottomata) Approved. 
[15:24:16] (03CR) 10JMeybohm: [C: 03+1] "You might want to add "Bug: T290476"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 (owner: 10Elukey) [15:25:56] (03CR) 10Ottomata: [C: 03+1] Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) (owner: 10Elukey) [15:26:02] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:06] (03PS2) 10Elukey: ihelmfile.d: Move NamespaceDefaultLabelName to common.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 (https://phabricator.wikimedia.org/T290476) [15:26:20] (03PS3) 10Elukey: helmfile.d: Move NamespaceDefaultLabelName to common.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 (https://phabricator.wikimedia.org/T290476) [15:29:09] (03PS2) 10Elukey: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) [15:29:27] !log enable puppet on gitlab1001 again for T283076 [15:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:35] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [15:29:48] (03CR) 10Elukey: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) (owner: 10Elukey) [15:30:04] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] mediawiki: Add rsyslog sidecar (035 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:34:23] (03CR) 10Elukey: [C: 03+2] helmfile.d: Move NamespaceDefaultLabelName to common.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [15:36:04] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:16] (03CR) 10Elukey: "Of course I just realized that an extra file sneaked in this change, really sorry, going to remove it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/728493 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [15:40:58] (03PS1) 10Elukey: helmfile.d: revert GlobalNetworkPolicy change [deployment-charts] - 10https://gerrit.wikimedia.org/r/728505 [15:42:10] elukey: tsk tsk [15:43:13] kormat: yes this is a good conclusion of the week for me [15:43:26] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) Puppet on `gitlab1001` is enabled again and the puppet run was successful. Web interface works, pulling over ssh works... [15:43:56] elukey: :D [15:46:02] (03CR) 10Elukey: [C: 03+2] helmfile.d: revert GlobalNetworkPolicy change [deployment-charts] - 10https://gerrit.wikimedia.org/r/728505 (owner: 10Elukey) [15:47:50] (03CR) 10Bstorm: [C: 03+2] "Since I've got to get this running a bit better before I'm out of here, I'm going to merge it, but please recommend changes if need be whe" [puppet] - 10https://gerrit.wikimedia.org/r/727638 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [15:48:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:12] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:30] PROBLEM - Host moss-be2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:53:07] (03PS1) 10Giuseppe Lavagetto: static.php: correctly report a bad request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728552 [15:53:09] (03PS1) 10Giuseppe Lavagetto: Allow serving assets under /static/current from /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728553 (https://phabricator.wikimedia.org/T285232) [15:54:23] (03PS2) 10Giuseppe Lavagetto: Allow serving assets under /static/current from /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728553 (https://phabricator.wikimedia.org/T285232) [15:54:42] (03CR) 10jerkins-bot: [V: 04-1] Allow serving assets under /static/current from /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728553 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [15:54:48] 10SRE, 10SRE-Access-Requests, 10Campaign-Tools: Request access to private data group for Ldelench - https://phabricator.wikimedia.org/T292841 (10ggellerman) Approved. 
[15:59:50] (03CR) 10jerkins-bot: [V: 04-1] Allow serving assets under /static/current from /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728553 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [16:04:15] (03PS1) 10Elukey: Release 2020.02~wmf6 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/728557 (https://phabricator.wikimedia.org/T292699) [16:04:16] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:15] (03CR) 10Herron: [C: 03+1] "LGTM, minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:08:48] (03CR) 10Herron: "it would be good to document the param at the top of the file, and optionally add/update comments as it's used as well" [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:12:13] (03PS1) 10Bstorm: toolforge harbor: change the permissions a bit on the dir [puppet] - 10https://gerrit.wikimedia.org/r/728560 (https://phabricator.wikimedia.org/T267616) [16:19:51] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=enwiki --force # to measure performance on a large wiki [16:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:38] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:11] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: change the permissions a bit on the dir [puppet] - 10https://gerrit.wikimedia.org/r/728560 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [16:22:18] RECOVERY - Host moss-be2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [16:25:46] PROBLEM - 
Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:54] (03PS3) 10Herron: warn on idle mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [16:28:00] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:42] (03PS1) 10Elukey: hadoop: remove sudo usage in the check_hdfs_topology [puppet] - 10https://gerrit.wikimedia.org/r/728562 (https://phabricator.wikimedia.org/T292846) [16:35:15] (03CR) 10Elukey: [C: 03+2] hadoop: remove sudo usage in the check_hdfs_topology [puppet] - 10https://gerrit.wikimedia.org/r/728562 (https://phabricator.wikimedia.org/T292846) (owner: 10Elukey) [16:36:40] (03CR) 10Herron: warn on idle mtail instances (036 comments) [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [16:37:01] (03PS4) 10Herron: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [16:38:30] (03CR) 10Herron: warn on idle centrallog mtail instances (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [16:38:33] (03PS5) 10Herron: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [16:42:03] (03PS6) 10Herron: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [16:43:15] (03PS7) 10Herron: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [16:44:24] (03PS1) 10Bstorm: toolforge harbor: install and 
configure docker properly [puppet] - 10https://gerrit.wikimedia.org/r/728565 [16:49:57] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: install and configure docker properly [puppet] - 10https://gerrit.wikimedia.org/r/728565 (owner: 10Bstorm) [16:58:38] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:30] 10SRE, 10Traffic, 10Documentation, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10BBlack) 05Open→03Resolved a:03BBlack Added a section to https://wikitech.wikimedia.org/wiki/HTTPS about renewal which mentions aging out new manual... [17:02:27] 10SRE, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10BBlack) [17:02:38] 10SRE, 10Traffic, 10Goal, 10Performance-Team (Radar), 10Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10BBlack) 05Open→03Resolved a:03Vgutierrez TLSv1.3 has been working for quite some time! Any other issues should be in other tickets (and are, in some cases!). [17:03:26] (03PS1) 10Bstorm: toolforge harbor: install docker-compose with puppet [puppet] - 10https://gerrit.wikimedia.org/r/728566 (https://phabricator.wikimedia.org/T267616) [17:05:41] 10Puppet, 10SRE, 10Infrastructure-Foundations: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393 (10BBlack) #Traffic doesn't use `tlsproxy::localssl` anymore (since quite some time ago!), so we don't really have an opinion on its remaining use or m... 
[17:05:58] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: install docker-compose with puppet [puppet] - 10https://gerrit.wikimedia.org/r/728566 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [17:06:17] (03PS1) 10AntiCompositeNumber: mediawiki::packages::fonts: replace fonts-liberation with fonts-liberation2 [puppet] - 10https://gerrit.wikimedia.org/r/728568 (https://phabricator.wikimedia.org/T253600) [17:06:50] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:03] 10SRE, 10Patch-For-Review, 10User-ArielGlenn, 10User-MoritzMuehlenhoff, 10User-jbond: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10BBlack) Removing #Traffic tag here, as we've long ago stopped using nginx for our primary TLS ingress, and thus this doesn't really impact us in any... [17:13:02] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:11] (03PS1) 10Michael DiPietro: Revert "depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/728527 [17:16:56] (03CR) 10Cwhite: [C: 03+1] warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [17:17:26] 10SRE, 10Traffic, 10Patch-For-Review: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 (10BBlack) 05Open→03Resolved a:03ema @ema I'm going to assume we're done with all the easy cleanups here. There's one un-merged patch on this at https://gerrit.wikimedia.org/r... 
[17:19:05] 10SRE, 10Traffic, 10Performance-Team (Radar): User traffic sometimes gets HTTP 502 from ATS - https://phabricator.wikimedia.org/T239382 (10BBlack) 05Open→03Declined Declining this one, as whatever this was, the report is now ~2 years old and everything related has changed or been refined substantially si... [17:19:36] (03CR) 10Michael DiPietro: [C: 03+2] Revert "depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/728527 (owner: 10Michael DiPietro) [17:24:01] 10SRE, 10Commons, 10MediaWiki-File-management, 10Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10BBlack) Removing #Traffic during a ticket cleanup, as it seems like this issue lies within thumbor and not th... [17:31:02] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (10BBlack) [17:31:18] 10SRE, 10Traffic, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [17:31:48] 10SRE, 10Traffic, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [17:37:48] 10SRE, 10Infrastructure-Foundations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (10BBlack) Not even sure if this is still an issue, but if so it sounds more IF-ish than Traffic-ish :) [17:39:02] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:07] 10SRE, 10Analytics, 10Analytics-Kanban: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - 
https://phabricator.wikimedia.org/T264021 (10BBlack) Removing #Traffic for now - although it could get added back if some further investigation indicates our infra is the cau... [17:43:52] (03PS1) 10Bstorm: toolforge harbor: add small customization to prepare script here [puppet] - 10https://gerrit.wikimedia.org/r/728578 (https://phabricator.wikimedia.org/T267616) [17:45:02] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:29] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/728579 [17:48:25] 10SRE, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Krinkle) [17:48:31] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Krinkle) [17:48:36] (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/728579 (owner: 10Ahmon Dancy) [17:49:49] (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/728579 (owner: 10Ahmon Dancy) [17:55:00] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:28] (03Abandoned) 10Bstorm: toolforge harbor: add small customization to prepare script here [puppet] - 10https://gerrit.wikimedia.org/r/728578 (https://phabricator.wikimedia.org/T267616) (owner: 
10Bstorm) [17:58:54] (03PS1) 10Bstorm: toolforge harbor: clean up the certs setup a bit better [puppet] - 10https://gerrit.wikimedia.org/r/728581 (https://phabricator.wikimedia.org/T267616) [17:59:53] 10SRE, 10Infrastructure-Foundations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087 (10BBlack) [18:00:04] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897 (10BBlack) [18:00:11] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10BBlack) [18:00:20] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677 (10BBlack) [18:00:32] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10BBlack) [18:00:40] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10BBlack) [18:00:48] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (10BBlack) [18:00:52] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (10BBlack) [18:01:21] (03PS2) 10Bstorm: toolforge harbor: clean up the certs setup a bit better [puppet] - 10https://gerrit.wikimedia.org/r/728581 (https://phabricator.wikimedia.org/T267616) [18:01:26] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Goal: Increase network capacity (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T207668 (10BBlack) 
[18:01:34] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (10BBlack) [18:01:38] (03CR) 10CDanis: NEL alert is empirically high-signal & should page SRE (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727594 (https://phabricator.wikimedia.org/T292792) (owner: 10CDanis) [18:01:42] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10BBlack) [18:01:49] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10BBlack) [18:01:57] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10BBlack) [18:02:12] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 (10BBlack) [18:02:22] 10SRE, 10ops-ulsfo, 10Infrastructure-Foundations, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (10BBlack) [18:02:32] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10BBlack) [18:02:38] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (10BBlack) [18:05:33] (03PS2) 10CDanis: NEL alert is empirically high-signal & should page SRE [puppet] - 10https://gerrit.wikimedia.org/r/727594 (https://phabricator.wikimedia.org/T292792) [18:06:16] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: clean up the certs setup a bit better [puppet] - 10https://gerrit.wikimedia.org/r/728581 
(https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [18:15:04] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:47] (03PS1) 10Bstorm: toolforge harbor: fix type for cert and file params [puppet] - 10https://gerrit.wikimedia.org/r/728586 [18:25:02] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:21] 10SRE, 10DNS, 10Infrastructure-Foundations, 10netbox, 10cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10BBlack) [18:28:31] (03CR) 10Bstorm: [C: 03+2] toolforge harbor: fix type for cert and file params [puppet] - 10https://gerrit.wikimedia.org/r/728586 (owner: 10Bstorm) [18:30:21] Hello folks! [18:30:33] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10BBlack) 05Open→03Declined Feel free to reopen/link if this is useful in a future investigation! [18:30:37] Can we help Juan_90264 [18:31:06] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:00] RhinosF1: Help in what? [18:33:25] Juan_90264: you said hi, not a lot of general chat happens here [18:33:43] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Register as14907 dot net (or other similar domain) for network infra concerns - https://phabricator.wikimedia.org/T292866 (10CDanis) [18:34:48] RhinosF1: In this case I already know, I found out when Urbanecm sent me https://nohello.net [18:35:07] Did you read it? 
[18:36:05] 10SRE, 10DNS, 10Infrastructure-Foundations, 10netbox, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331 (10BBlack) [18:37:03] Why on Monday (11), there will be no deployments all day, at other times there were deployments for that day of the week? [18:37:09] * RhinosF1 looks [18:37:19] WMF holiday [18:37:20] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:25] Well, WMF (US) holiday [18:37:27] 10SRE, 10Performance-Team: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10BBlack) [18:37:43] Juan_90264: Indigenous Peoples' Day [18:37:56] You can use https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar as good guide [18:38:04] It should have all the disruption done [18:38:09] On* [18:38:53] RhinosF1: I didn't know there would be this holiday in the United States, I'm in Brazil [18:38:58] 10SRE, 10Analytics: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10BBlack) [18:39:24] Thanks for guide [18:39:27] Juan_90264: I always go by the yearly calendar [18:39:39] Neither do I know US holidays, I'm in the UK [18:41:18] 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10Traffic, and 2 others: wikireplicas last-minute infra work to discuss / resolve - https://phabricator.wikimedia.org/T273248 (10BBlack) 05Open→03Resolved a:03ayounsi [18:41:27] Juan_90264: are you looking to get something deployed? 
[18:43:32] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:35] RhinosF1: I was looking for a day to be able to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/727497, but it looks like I'll have to put it on for Tuesday [18:44:07] Juan_90264: yeah you'll have to do it then if you're free [18:45:08] RhinosF1: Glad I'm free that day for deployment [18:46:58] Juan_90264: great! I think we have at least one staff holiday a month normally [18:51:31] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) [18:54:12] 10SRE, 10Infrastructure-Foundations, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10BBlack) Removing #Traffic as I don't think this looks actionable for our team (but might still be for netops if the conversations above ar... [18:55:38] PROBLEM - DNS on moss-be2002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.193.0.151 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:58:02] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:51] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) No concerns and nice work, Jelto. 
👍 [19:01:02] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10procurement: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10BBlack) [19:01:37] (03PS1) 10Alexandros Kosiaris: Remove old cruft [software/otrs] - 10https://gerrit.wikimedia.org/r/728595 [19:01:53] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) Login to gitlab.wikimedia.org seems to be broken for 2fa users currently (recurring prompt for 2fa code after authen... [19:02:08] (03CR) 10Dzahn: "Thanks for handling it, Stevie Beth" [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) (owner: 10Dzahn) [19:04:24] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:49] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10AntiCompositeNumber) [19:08:28] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:01] 10SRE, 10Cloud-Services, 10DNS, 10Traffic: PDNS in cloud can return inconsistent answers - https://phabricator.wikimedia.org/T281700 (10BBlack) As noted in the description, DNS is inconsistent in general within reasonable TTL bounds, so I don't see resolving the inconsistency being shown here as a good rea... 
[19:11:44] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (10CDanis) p:05Triage→03Low [19:12:02] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (10CDanis) [19:12:06] 10SRE, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [19:12:36] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:14:42] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:24] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) I think this is the culprit: ` brennen@gitlab1001:~$ sudo grep session_duration /opt/gitlab/embedded/service/gitlab... [19:20:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1018.eqiad.wmnet ` The log can be found i... 
[19:20:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:21:16] 10SRE, 10MediaWiki-General, 10Pybal, 10Traffic, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10BBlack) We chose S:BP for those queries on the assumption that, by its nature, it would be a cheap page to... [19:21:20] (03PS1) 10Brennen Bearnes: gitlab: set session duration to 604800 seconds [puppet] - 10https://gerrit.wikimedia.org/r/728618 (https://phabricator.wikimedia.org/T288757) [19:24:03] 10SRE, 10serviceops, 10Datacenter-Switchover: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (10BBlack) [19:24:43] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (10CDanis) [19:25:56] 10SRE, 10Traffic: cp3059 Varnish child crash: Worker Pool Queue does not move - https://phabricator.wikimedia.org/T285953 (10BBlack) 05Open→03Resolved a:03BBlack We have a new varnish version coming soon, so stale crash reports are probably of little value now. [19:26:21] (03CR) 10Dzahn: "Oh, I see, this needs a quick deploy I assume, not just a +1.. 
let me get on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/728618 (https://phabricator.wikimedia.org/T288757) (owner: 10Brennen Bearnes) [19:27:12] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:35] (03CR) 10Dzahn: modules::gitlab::ssh explicitly add git user and enable login (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [19:30:48] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), 10Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) After reading https://en.wikipedia.org/wiki/Proxy_server#Transparent_proxy I'm not exactly sure "t... [19:31:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1019.eqiad.wmnet ` The log can be found i... [19:32:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1020.eqiad.wmnet ` The log can be found i... 
[19:33:26] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:45] (03PS1) 10Effie Mouzeli: mwdebug: bump envoy CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/728625 [19:39:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [19:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:57] (03PS1) 10Bstorm: toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) [19:40:51] (03PS1) 10Cmjohnson: Adding dhcpd and netboot.cfg for kubestage servers [puppet] - 10https://gerrit.wikimedia.org/r/728630 (https://phabricator.wikimedia.org/T290894) [19:42:16] 10SRE, 10Traffic: DNS Discovery for active/passive failover within a data centre - https://phabricator.wikimedia.org/T287584 (10BBlack) 05Open→03Declined Given your generous offer of declination, I think we'll take that route! :) In general, our DNS Discovery stuff really is meant to handle x-dc situation... 
[19:42:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: REIMAGE [19:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: REIMAGE [19:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:47] (03PS2) 10Cmjohnson: Adding site.pp, dhcpd and netboot.cfg for kubestage servers [puppet] - 10https://gerrit.wikimedia.org/r/728630 (https://phabricator.wikimedia.org/T290894) [19:45:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1018.eqiad.wmnet'] ` [19:45:04] (03CR) 10Cmjohnson: [C: 03+2] Adding site.pp, dhcpd and netboot.cfg for kubestage servers [puppet] - 10https://gerrit.wikimedia.org/r/728630 (https://phabricator.wikimedia.org/T290894) (owner: 10Cmjohnson) [19:45:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1019.eqiad.wmnet with reason: REIMAGE [19:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:24] (03CR) 10Vgutierrez: "we have a similar issue with ats and we fixed it by creating a symlink to the acmechief live directory where ats expects the tls material." 
[puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [19:46:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1020.eqiad.wmnet with reason: REIMAGE [19:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31587/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/728618 (https://phabricator.wikimedia.org/T288757) (owner: 10Brennen Bearnes) [19:49:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1018.eqiad.wmnet ` The log can be found i... [19:49:51] (03CR) 10Bstorm: toolforge harbor: update certs with acmechief (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [19:51:11] (03PS1) 10Cmjohnson: Adding site.pp entry and dhcpd entry for cloudmetric100[34] [puppet] - 10https://gerrit.wikimedia.org/r/728633 (https://phabricator.wikimedia.org/T289888) [19:51:52] (03PS2) 10Bstorm: toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) [19:52:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1019.eqiad.wmnet'] ` and were **ALL** successful. 
[19:53:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1020.eqiad.wmnet'] ` and were **ALL** successful. [19:53:50] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10BBlack) Update on the ca-certificates end of this: Debian has a patch that will corr... [19:54:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubestage1003.eqiad... [19:55:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubestage1004.eqiad... [19:55:53] 10SRE, 10Inuka-Team, 10KaiOS-Wikipedia-app, 10Traffic: Many KaiOS devices can't access WMF websites and can't use Wikipedia app - https://phabricator.wikimedia.org/T292632 (10BBlack) 05Open→03Resolved a:03Vgutierrez Closing for now as I don't think there's anything we want to do on our end here. Tha... [19:55:59] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10BBlack) [19:56:21] Nobody has seen my task created yesterday, would anyone like it? https://phabricator.wikimedia.org/T292687 [19:57:05] urbanecm: feel like creating a phab user space here ^? 
[19:57:35] I'll check later [19:57:41] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:46] Okay [19:57:50] thanks! [19:59:02] Juan_90264: no need to ask, someone will do when they have time. (Also not really an operations/SRE issue) [19:59:41] Okay then [19:59:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:... [20:00:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:... [20:00:19] (03CR) 10Dzahn: "This edited the ruby config file but a live hack had been applied to yaml generated from those files so it was still a noop in reality but" [puppet] - 10https://gerrit.wikimedia.org/r/728618 (https://phabricator.wikimedia.org/T288757) (owner: 10Brennen Bearnes) [20:02:09] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:41] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr can you be on the lookout for this disk to arrive in shipping. I am out next week, it would be great if you could do the disk swa... 
[20:03:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) Confirmed: Service Request 1072368852 was successfully submitted. for kubernetes1021 [20:03:41] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr This disk should arrive today or Monday. Please swap the failed disk, it will be on the disk array for backup1002. [20:05:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: REIMAGE [20:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [20:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:51] 10SRE, 10ops-eqiad, 10Analytics-Clusters: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @BTullis or @razzi please coordinate next week with @Jclark-ctr. @Jclark-ctr this server needs the flea power drained... [20:07:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1004.eqiad.wmnet'] ` Of which those... 
[20:07:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1003.eqiad.wmnet'] ` Of which those... [20:08:01] (03CR) 10Cmjohnson: [C: 03+2] Adding site.pp entry and dhcpd entry for cloudmetric100[34] [puppet] - 10https://gerrit.wikimedia.org/r/728633 (https://phabricator.wikimedia.org/T289888) (owner: 10Cmjohnson) [20:08:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: REIMAGE [20:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [20:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [20:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudmetrics1003.eq... [20:12:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudmetrics1004.eq... 
[20:12:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [20:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubestage1003.eqiad.wmnet'] ` and were **ALL** successful. [20:16:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubestage1004.eqiad.wmnet'] ` and were **ALL** successful. [20:17:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloud... [20:17:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudmetrics1003.eq... 
[20:17:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10Cmjohnson) [20:18:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10Cmjohnson) 05Open→03Resolved [20:18:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) [20:19:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) kubernetes1018-1020 are fully installed, once we figure out and fix the issue with 1021 we'll be able to close the task. [20:19:32] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloud... [20:20:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloud... 
[20:22:01] (03PS1) 10Dzahn: admin/otrs: create new root admin group vrts-admins, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 [20:22:52] (03PS2) 10Dzahn: admin/otrs: create new root admin group vrts-admins, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 [20:23:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1018.eqiad.wmnet'] ` and were **ALL** successful. [20:24:55] (03CR) 10Dzahn: "thank you for merging this Arturo! did everything work, systemctl status proxydb-backup is fine? If so, can we also merge https://gerrit.w" [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:25:17] (03PS2) 10Dzahn: dynamicproxy: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/726730 (https://phabricator.wikimedia.org/T273673) [20:37:00] https://graphite.wikimedia.org/ says upstream connect error or disconnect/reset before headers. reset reason: connection termination for me -- is that known? [20:42:29] urbanecm: hm, not known afaik, that's an envoy error [20:42:35] are other services working for you? [20:42:38] yes [20:42:45] (03CR) 10Daniel Kinzler: "This change is ready for review." [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [20:42:52] i can access grafana just fine [20:43:19] just refreshed graphite, looks it fixed itself [20:45:25] nothing's jumping out at me on dashboards -- shout if you have any more trouble though, we can dig a bit [20:46:16] will do, thanks rzl. Was broken for few minutes, but works now. 
[20:47:18] 👍 [20:52:06] rzl: ...and failing again [20:52:17] https://usercontent.irccloud-cdn.com/file/eTXU9LSS/image.png [20:55:07] hm, looking [20:56:25] (03CR) 10GeoffreyT2000: Revert "Introduce CommentFormatter" (031 comment) [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [20:56:29] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [20:58:18] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:49] urbanecm: okay, I think it's a graphite config issue -- can you file a phab task with what you were trying to do and what happened, and tag it with #observability for that team to look at? [21:03:01] sure, doing [21:03:02] I have some logs I'll attach [21:03:05] thanks <3 [21:03:12] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:06:06] rzl: filed as T292877 [21:06:06] T292877: Loading https://graphite.wikimedia.org/ throws an envoy error - https://phabricator.wikimedia.org/T292877 [21:10:03] got it thanks [21:10:08] (03CR) 10Brennen Bearnes: [C: 04-2] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:10:36] (03CR) 10AntiCompositeNumber: Revert "Introduce CommentFormatter" (033 comments) [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:10:42] when you see the IP address in my paste, it's not yours, just graphite1004's ;) [21:10:51] ack :) 
[21:11:54] (03CR) 10Brennen Bearnes: [C: 04-2] "Unresolved merge conflicts here." [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:11:59] brennen: working on the merge conflict. sorry about that. got confused during the rebase [21:12:44] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 0 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [21:13:06] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:12] 10SRE, 10observability: Loading https://graphite.wikimedia.org/ throws an envoy error - https://phabricator.wikimedia.org/T292877 (10RLazarus) That Envoy error, in this case from graphite1004's TLS proxy, means that Graphite hung up on Envoy before sending a response. I found this in `journalctl -u uwsgi-grap... [21:16:35] (03PS2) 10Daniel Kinzler: Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 [21:18:06] (03CR) 10Wolfgang Kandek: [C: 03+2] admin/otrs: create new root admin group vrts-admins, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:18:18] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [21:18:39] (03CR) 10Wolfgang Kandek: [V: 03+1] "Approved" [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:20:55] (03CR) 10Brennen Bearnes: Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:22:04] (03PS5) 10Legoktm: mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 [21:22:06] (03PS2) 10Legoktm: mediawiki: Remove ploticus [puppet] - 10https://gerrit.wikimedia.org/r/725099 [21:24:39] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove lilypond [puppet] - 
10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [21:24:45] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove ploticus [puppet] - 10https://gerrit.wikimedia.org/r/725099 (owner: 10Legoktm) [21:25:52] (03PS1) 10Urbanecm: updatementeedata.pp: Update script parameters [puppet] - 10https://gerrit.wikimedia.org/r/728656 (https://phabricator.wikimedia.org/T290609) [21:27:31] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:30:02] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:07] !log running puppet across C:mediawiki::packages to uninstall lilypond and ploticus: legoktm@cumin1001:~$ sudo cumin -b 4 C:mediawiki::packages 'run-puppet-agent' [21:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] !log disabling puppet on bacula - going through a restore https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key) [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:39] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10Legoktm) >>! In T292291#7413420, @BBlack wrote: > Update on the ca-certificates end... [21:37:11] (03CR) 10jerkins-bot: [V: 04-1] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:38:06] !log mwmaint2002 - disable-puppet, stop bacula-fd, recovery in progress [21:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:55] brennen: looks like I'm just too tired to manage a working revert. Sorry. 
I'm afraid I'll break something if I keep trying. [21:39:07] (03CR) 10Brennen Bearnes: [C: 04-2] Revert "Introduce CommentFormatter" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/728537 (owner: 10Daniel Kinzler) [21:39:30] duesen: yeah, fair enough, and thanks for the effort. get some sleep! [22:00:14] legoktm: btw, also mwmaint and stuff.. I see it removing lilypond and ploticus, yay [22:00:27] \o/ [22:00:34] restore for amire80 on mwmaint2001 successful, that's why I was there [22:00:46] I set a pretty low batch limit on cumin, it's only 34% of the way through 366 hosts [22:01:19] ah! yea *nod* [22:05:53] 10SRE: Restore amire80 home directory on mwmaint1002 - https://phabricator.wikimedia.org/T292573 (10Dzahn) Hey @Amire80 I restored the files on _mwmaint2002_ from 23 days ago (it was reimaged 22 days ago) to mwmaint2002. Despite the ticket title I assume these must be what you are after because that is the host... [22:06:17] 10SRE: Restore amire80 home directory on mwmaint1002 - https://phabricator.wikimedia.org/T292573 (10Dzahn) a:05Dzahn→03Amire80 [22:08:30] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:38] (03PS1) 10Dzahn: static-bugzilla: add bug 10000 through 19999 [container/miscweb] - 10https://gerrit.wikimedia.org/r/728668 [22:30:38] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:00] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:24] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:44] 
PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:22] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [23:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:28] T292814: Service restarts of cloudelastic for Java security updates (Aug 2021) - https://phabricator.wikimedia.org/T292814 [23:15:06] (03CR) 10Dzahn: [C: 03+2] "tested query on m3-slave" [puppet] - 10https://gerrit.wikimedia.org/r/726936 (https://phabricator.wikimedia.org/T292062) (owner: 10Aklapper) [23:16:37] !log sudo cumin -b 10 C:mediawiki::packages 'apt-get purge lilypond-data -y' [23:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:16] (03CR) 10Dzahn: "deployed" [puppet] - 10https://gerrit.wikimedia.org/r/726936 (https://phabricator.wikimedia.org/T292062) (owner: 10Aklapper) [23:20:41] no traces of lilypond/ploticus left :) [23:22:51] (03PS1) 10Legoktm: mediawiki: Remove absented lilypond and ploticus packages [puppet] - 10https://gerrit.wikimedia.org/r/728680 [23:22:52] :) congrats [23:22:53] (03PS1) 10Legoktm: mediawiki: Update cgroup documentation [puppet] - 10https://gerrit.wikimedia.org/r/728681 [23:23:38] legoktm: https://debmonitor.wikimedia.org/packages/lilypond [23:23:54] https://debmonitor.wikimedia.org/packages/ploticus [23:24:20] 2280 because https://phabricator.wikimedia.org/T290708 [23:24:30] (03PS1) 10Cwhite: logstash: dot_expander: better handling of field collisions [puppet] - 10https://gerrit.wikimedia.org/r/728682 (https://phabricator.wikimedia.org/T292099) [23:24:39] dunno about the docker image part [23:24:51] the docker image part is expected, since those are what Shellbox uses 
[23:25:13] ack [23:25:35] I guess I can leave the ensure => absent around a bit longer in case mw2280 comes back to life [23:26:38] (03CR) 10jerkins-bot: [V: 04-1] logstash: dot_expander: better handling of field collisions [puppet] - 10https://gerrit.wikimedia.org/r/728682 (https://phabricator.wikimedia.org/T292099) (owner: 10Cwhite) [23:26:59] meh, subscribed to the task. if it does come back I'll do the removal manually [23:26:59] legoktm: it seems there are 2 possible outcomes. either we say "we can live without mw2280" then it will be removed or we say "no, we need a new mainboard" then it will be reimaged.. so you don't have to do that I guess [23:27:11] ah [23:27:22] I missed that it would require a reimage [23:27:25] perfect :) [23:27:35] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove absented lilypond and ploticus packages [puppet] - 10https://gerrit.wikimedia.org/r/728680 (owner: 10Legoktm) [23:27:37] I think new mainboard requires it.. though... [23:27:44] (03CR) 10Legoktm: [C: 03+2] mediawiki: Update cgroup documentation [puppet] - 10https://gerrit.wikimedia.org/r/728681 (owner: 10Legoktm) [23:28:27] if not I'll make a note to manually clean it up [23:28:35] ok :) [23:30:16] (03PS1) 10Cwhite: logstash: move kubernetes_docker parsing to priority 15 [puppet] - 10https://gerrit.wikimedia.org/r/728683 (https://phabricator.wikimedia.org/T292099) [23:33:00] (03PS2) 10Cwhite: logstash: dot_expander: better handling of field collisions [puppet] - 10https://gerrit.wikimedia.org/r/728682 (https://phabricator.wikimedia.org/T292099) [23:34:06] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:40] (03PS2) 10Cwhite: logstash: move kubernetes_docker parsing to priority 15 [puppet] - 10https://gerrit.wikimedia.org/r/728683 (https://phabricator.wikimedia.org/T292099) [23:38:45] (03PS3) 10Cwhite: logstash: move kubernetes_docker parsing to priority 15
[puppet] - 10https://gerrit.wikimedia.org/r/728683 (https://phabricator.wikimedia.org/T292099) [23:41:06] (03PS9) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [23:41:08] (03PS8) 10Cwhite: opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) [23:41:10] (03PS8) 10Cwhite: icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 (https://phabricator.wikimedia.org/T288618) [23:41:12] (03PS7) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [23:41:14] (03PS8) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [23:41:16] (03PS8) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [23:41:18] (03PS4) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [23:41:24] (03PS4) 10Cwhite: logstash: kafka input: add manage_truststore parameter [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) [23:41:28] (03CR) 10Cwhite: logstash: add opensearch output config definition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [23:41:42] (03CR) 10Cwhite: logstash: kafka input: add manage_truststore parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [23:46:35] (03CR) 
10Cwhite: opensearch: fork elasticsearch module into opensearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [23:55:13] (03CR) 10Legoktm: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [23:56:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:57:44] (03CR) 10Dzahn: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris)