[00:01:12] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:09] (03PS1) 10Bstorm: cloud lvm: finish up volume group creation for ephemeral disk [puppet] - 10https://gerrit.wikimedia.org/r/721105 (https://phabricator.wikimedia.org/T277078) [00:02:36] (03PS4) 10Ryan Kemper: query_service: create common location for vars [puppet] - 10https://gerrit.wikimedia.org/r/721102 (owner: 10Ebernhardson) [00:04:28] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721102 (owner: 10Ebernhardson) [00:12:08] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The following units failed: session-196509.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:39] (03CR) 10Ryan Kemper: [C: 03+2] query_service: create common location for vars [puppet] - 10https://gerrit.wikimedia.org/r/721102 (owner: 10Ebernhardson) [00:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:20] (03PS1) 10Ryan Kemper: Revert "query_service: create common location for vars" [puppet] - 10https://gerrit.wikimedia.org/r/721071 [00:18:06] (03CR) 10Ryan Kemper: "Error: Found 1 dependency cycle:" [puppet] - 10https://gerrit.wikimedia.org/r/721071 (owner: 10Ryan Kemper) [00:18:11] (03CR) 10Ryan Kemper: [C: 03+2] Revert "query_service: create common location for vars" [puppet] - 10https://gerrit.wikimedia.org/r/721071 (owner: 10Ryan Kemper) [00:20:38] (03CR) 10Bstorm: [C: 03+2] "All this stuff never runs unless the vd volume group is missing. 
Even then, it should fail if the specific volume is not actually ephemera" [puppet] - 10https://gerrit.wikimedia.org/r/721105 (https://phabricator.wikimedia.org/T277078) (owner: 10Bstorm) [00:59:42] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:55] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The following units failed: session-79851.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:14:24] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: session-98139.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:46] (03PS1) 10Huji: Temporarily disable anonymous editing on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) [02:30:34] (03CR) 10Ottomata: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [02:42:48] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:26] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:24:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1109 load', diff saved to https://phabricator.wikimedia.org/P17273 and previous config saved to /var/cache/conftool/dbconfig/20210915-043053-marostegui.json [04:30:58] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore db1109 original load', diff saved to https://phabricator.wikimedia.org/P17274 and previous config saved to /var/cache/conftool/dbconfig/20210915-052802-marostegui.json [05:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:14] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100896.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:02:01] !log powercycle ms-be2045 - no ssh, no remote tty available [06:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31079/console" [puppet] - 10https://gerrit.wikimedia.org/r/720908 (owner: 10DCausse) [06:05:26] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [06:06:12] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:23] elukey: https://phabricator.wikimedia.org/T290881 [06:07:59] RhinosF1: thanks [06:08:13] Np elukey [06:23:37] (03PS1) 10Elukey: profile::configmaster::disc_desired_state.py: update after switchover [puppet] - 10https://gerrit.wikimedia.org/r/721244 [06:27:57] (03PS1) 10Effie Mouzeli: hiera: clean up mwdebug1001 experiments [puppet] - 10https://gerrit.wikimedia.org/r/721246 [06:31:19] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) The only real reason why we've used puppet there was to inject the statsd address easily IIRC. 
@Krinkle we do already include a "params" file, I think we can ju... [06:32:18] (03CR) 10Elukey: "Puppet is currently broken on wdqs nodes:" [puppet] - 10https://gerrit.wikimedia.org/r/721099 (owner: 10Ryan Kemper) [06:33:05] (03PS1) 10Effie Mouzeli: mwdebug: round 1 experiment, use 6 pods instead of 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721247 (https://phabricator.wikimedia.org/T280497) [06:38:45] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: clean up mwdebug1001 experiments [puppet] - 10https://gerrit.wikimedia.org/r/721246 (owner: 10Effie Mouzeli) [06:41:29] 10SRE, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Jelto) [06:41:54] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:56] (03Abandoned) 10Jelto: gitlab::backup make config backup less verbose [puppet] - 10https://gerrit.wikimedia.org/r/720316 (https://phabricator.wikimedia.org/T288324) (owner: 10Jelto) [06:42:41] (03PS1) 10Elukey: Revert "wcqs: tell puppet solver we need this vars.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/721073 [06:44:10] PROBLEM - Disk space on ms-be2045 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [06:44:18] (03CR) 10Elukey: [V: 03+1 C: 03+2] elasticsearch: Fix cirrus_settings_check [puppet] - 10https://gerrit.wikimedia.org/r/720908 (owner: 10DCausse) [06:45:21] (03CR) 10Effie Mouzeli: [C: 03+1] thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [06:49:06] RECOVERY - ElasticSearch setting check - 9400 on elastic1034 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [06:49:34] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: session-100923.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:41] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: round 1 experiment, use 6 pods instead of 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721247 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [06:53:49] (03Merged) 10jenkins-bot: mwdebug: round 1 experiment, use 6 pods instead of 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721247 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [06:57:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31080/console" [puppet] - 10https://gerrit.wikimedia.org/r/721073 (owner: 10Elukey) [06:57:54] !log shutdown ms-be2045 (again) after seeing T290881 [06:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:00] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [06:58:06] will ack the alerts [06:59:00] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [06:59:10] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Had a chat with David on IRC, since this is an easy revert I am inclined to unblock puppet now and let the Search team follow up later on." 
[puppet] - 10https://gerrit.wikimedia.org/r/721073 (owner: 10Elukey) [06:59:59] ACKNOWLEDGEMENT - SSH on ms-be2045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Elukey T290881 https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:59:59] ACKNOWLEDGEMENT - Disk space on ms-be2045 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error Elukey T290881 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [06:59:59] ACKNOWLEDGEMENT - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T290881 [07:03:58] ok icinga should slowly reduce some remaining errors as puppet goes, the rest looks ok-ish [07:04:27] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] (03PS2) 10Gergő Tisza: [beta] GrowthExperiments: set image recommendation API URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) [07:20:10] (03CR) 10Gergő Tisza: [beta] GrowthExperiments: set image recommendation API URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [07:25:24] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005239 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:26:55] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/720865 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [07:27:00] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Check $thumb->isError() before trying to use it [extensions/PageImages] 
(wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720858 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [07:33:05] RECOVERY - ElasticSearch setting check - 9400 on elastic1040 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:05] RECOVERY - ElasticSearch setting check - 9600 on elastic1050 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:05] RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:06] RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:06] RECOVERY - ElasticSearch setting check - 9600 on elastic1052 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:06] RECOVERY - ElasticSearch setting check - 9400 on elastic1038 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:06] RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:06] RECOVERY - ElasticSearch setting check - 9600 on elastic1048 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:33:13] this was me forcing the check --^ [07:33:40] cc: dcausse: --^ [07:34:01] elukey: thanks! 
:) [07:36:33] (03CR) 10DannyS712: [C: 04-1] Temporarily disable anonymous editing on fawiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [07:41:40] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:58] (03PS1) 10Muehlenhoff: Remove access for kaywong [puppet] - 10https://gerrit.wikimedia.org/r/721251 [07:46:59] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) @dancy I like your idea, even if I generally don't like using rewrite rules much. I'll try to bake a set of... [07:54:29] (03CR) 10Majavah: [C: 03+2] tool: Read name prefix from /etc/wmcs-project [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/718004 (https://phabricator.wikimedia.org/T290325) (owner: 10Majavah) [07:55:05] (03Merged) 10jenkins-bot: tool: Read name prefix from /etc/wmcs-project [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/718004 (https://phabricator.wikimedia.org/T290325) (owner: 10Majavah) [07:55:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for kaywong [puppet] - 10https://gerrit.wikimedia.org/r/721251 (owner: 10Muehlenhoff) [07:55:53] (03PS1) 10Majavah: d/changelog: prepare 0.24 release [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/721252 [07:59:31] (03CR) 10Majavah: [C: 03+2] d/changelog: prepare 0.24 release [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/721252 (owner: 10Majavah) [08:00:40] (03Merged) 10jenkins-bot: d/changelog: prepare 0.24 release [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/721252 (owner: 10Majavah) [08:29:18] (03PS1) 10Alexandros Kosiaris: Support HTTPS as well as HTTP [software/benchmw] - 10https://gerrit.wikimedia.org/r/721255 
[08:32:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nitpick but LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:42:55] (03PS45) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [08:44:13] (03CR) 10Kosta Harlan: [C: 03+1] "+1, although I wonder if we should just define this as a default value in extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [08:44:45] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31081/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [08:46:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow rewriting static assets to multiversion [deployment-charts] - 10https://gerrit.wikimedia.org/r/721258 (https://phabricator.wikimedia.org/T285232) [08:48:05] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow rewriting static assets to multiversion [deployment-charts] - 10https://gerrit.wikimedia.org/r/721258 (https://phabricator.wikimedia.org/T285232) [08:52:17] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [08:55:53] (03PS15) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [08:55:55] (03PS13) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [08:55:57] (03PS3) 10Elukey: WIP - helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 
10https://gerrit.wikimedia.org/r/720997 [08:59:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 (owner: 10Volans) [09:03:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "I got a +1 on the rewriterule from Luca via IRC, so I'll just merge this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/721258 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [09:05:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720994 (owner: 10Volans) [09:05:02] (03PS4) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [09:07:33] (03Merged) 10jenkins-bot: mediawiki: allow rewriting static assets to multiversion [deployment-charts] - 10https://gerrit.wikimedia.org/r/721258 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [09:07:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [09:14:16] PROBLEM - SSH on bast4003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:14:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720996 (owner: 10Volans) [09:16:12] RECOVERY - SSH on bast4003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:26] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31082/console" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [09:16:51] (03CR) 10Elukey: [C: 03+1] Improve the Kerberos automatic renewal service [puppet] - 
10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [09:16:55] (03PS46) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [09:16:57] (03PS8) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [09:17:16] (03CR) 10Hnowlan: cassandra: use profile::java (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [09:17:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/720993 (owner: 10Volans) [09:18:06] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:19:19] (03CR) 10Elukey: [C: 03+1] remote: remove RemoteHosts.init_system() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 (owner: 10Volans) [09:20:25] (03CR) 10Btullis: Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [09:24:43] (03PS47) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [09:25:21] (03CR) 10Lucas Werkmeister (WMDE): "The Phabricator task mentions maintenance scripts to be run; does that apply to this change or is it for later?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [09:25:48] (03PS1) 10Volans: sre.puppet.renew-cert: add installer SSH support [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 [09:26:19] (03PS2) 10Volans: remote: remove RemoteHosts.init_system() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 [09:28:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs::monitoring: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720939 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:29:27] (03PS9) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [09:30:16] (03CR) 10jerkins-bot: [V: 04-1] cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [09:30:43] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31084/console" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [09:38:26] (03CR) 10JMeybohm: [C: 04-1] helmfile: add the ability to inject labels to Namespaces (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [09:39:05] (03CR) 10Volans: [C: 03+2] remote: remove RemoteHosts.init_system() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 (owner: 10Volans) [09:41:03] (03CR) 10JMeybohm: Add revscoring-editquality as first ml-service to helmfile.d (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [09:43:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 
(https://phabricator.wikimedia.org/T285232) [09:44:31] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [09:44:35] (03Merged) 10jenkins-bot: remote: remove RemoteHosts.init_system() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/720992 (owner: 10Volans) [09:45:40] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/721031 (https://phabricator.wikimedia.org/T290984) (owner: 10Volans) [09:46:21] !log Disabling Intel X710 NIC on-board LLDP processing on relforge1003 (T290984) [09:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:28] T290984: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 [09:48:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:52:30] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) [09:54:00] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10cmooney) Change now made on relforge1003 also. During change I ran "sudo ip monitor" and netlink... 
[09:54:37] !log depooling mw1312 and mw1319 [09:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:23] !log depool wtp1026 [09:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:05] atlas exporter seems to be failing, but I cannot see any obvious change why it started failing [09:57:09] (03CR) 10Btullis: [V: 03+1 C: 03+2] Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [09:57:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31085/console" [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [09:58:01] (03CR) 10Btullis: [V: 03+1 C: 03+2] Improve the Kerberos automatic renewal service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [09:58:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:58:24] (03PS5) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [09:58:37] (03CR) 10Elukey: helmfile: add the ability to inject labels to Namespaces (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [10:01:59] (03PS8) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [10:02:04] oh, and it got fixed now, not sure if on its own or someone did something [10:03:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1010.eqiad.wmnet [10:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:34] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans) >>! In T290984#7354885, @cmooney wrote: > - Decide on a way to have this done at boot-time... [10:06:27] (03PS1) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) [10:08:15] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) Great thanks for confirming Jess :) Any problems just drop us a line. 
[10:08:32] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) 05Open→03Resolved p:05Triage→03Medium [10:09:02] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10akosiaris) [10:12:06] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10MoritzMuehlenhoff) >>! In T290984#7354885, @cmooney wrote: > - Decide on a way to have this done a... [10:14:16] (03CR) 10Jgiannelos: "@Ottomata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [10:15:14] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10MoritzMuehlenhoff) > That page mentions that at least firmware version NVM 6.01 (for the NIC) and... [10:16:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/721031 (https://phabricator.wikimedia.org/T290984) (owner: 10Volans) [10:17:31] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10akosiaris) A few questions/points: * I see a bullet point `TLS certificates need to be placed in the namespace of istio-ingressgateway` and a comment by @joe above, but it doesn... 
[10:30:48] (03CR) 10Jelto: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:32:42] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) [10:33:12] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [10:33:32] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) [10:34:02] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [10:46:07] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) [10:47:16] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [10:48:32] (03CR) 10Volans: [C: 03+2] facter: fix lldp_neighbors error on empty lldp [puppet] - 10https://gerrit.wikimedia.org/r/721031 (https://phabricator.wikimedia.org/T290984) (owner: 10Volans) [10:52:17] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) [10:52:54] (03PS1) 10Hnowlan: apt::package_from_component: add update condition for multiple packages [puppet] - 10https://gerrit.wikimedia.org/r/721275 [10:59:12] 10Puppet, 
10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans) The puppet patch has been merged, so the error showing up in facter is now gone. [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1100). [11:00:05] Lucas_WMDE and inductiveload: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] o/ [11:00:13] I can deploy! [11:00:33] * urbanecm waves too, but I'll watch instead :)) [11:01:39] (03PS3) 10Lucas Werkmeister (WMDE): Don’t check constraints on two property qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) [11:03:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Don’t check constraints on two property qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE)) [11:04:39] (03Merged) 10jenkins-bot: Don’t check constraints on two property qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE)) [11:05:42] testing on mwdebug1001… [11:08:10] (03PS10) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [11:08:55] seems to work fine, syncing [11:09:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:39] (03CR) 10jerkins-bot: [V: 04-1] cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [11:09:42] (03CR) 10Btullis: [C: 03+1] "Yep, LGTM too. I've just got up to speed on the published-sync and sync-published mechanism, but it all seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/721061 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [11:10:36] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:583407|Don’t check constraints on two property qualifiers (T235292)]] (duration: 01m 11s) [11:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:41] T235292: Add P4224 and P360 to wgWBQualityConstraintsPropertiesWithViolatingQualifiers - https://phabricator.wikimedia.org/T235292 [11:10:58] alright, that’s my change done [11:11:02] inductiveload: are you around? [11:11:08] (03PS48) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [11:11:16] hi! I'm around for https://gerrit.wikimedia.org/r/c/720983/ [11:11:22] ok! [11:11:37] I left a question on there earlier today [11:11:51] (03CR) 10Inductiveload: Enable change-tags for new edits' proofread status at mulWS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [11:12:01] (03CR) 10Tpt: Enable change-tags for new edits' proofread status at mulWS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [11:12:30] It seems we just both replied to it. [11:12:33] Hi! 
[11:12:35] (03PS4) 10Lucas Werkmeister (WMDE): Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [11:12:37] hi ^^ [11:14:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [11:15:47] (03Merged) 10jenkins-bot: Enable change-tags for new edits' proofread status at mulWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720983 (https://phabricator.wikimedia.org/T289140) (owner: 10Inductiveload) [11:16:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:55] inductiveload, Tpt: the change is on mwdebug1001, can you test it? [11:17:54] Seems to work: https://wikisource.org/w/index.php?title=Page:Baissac_-_Le_Folk-lore_de_l%E2%80%99%C3%8Ele-Maurice,_1888.djvu/33&action=history [11:18:00] the tag is here [11:18:37] cool, syncing [11:18:44] Thank you! [11:19:07] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) >>! In T287007#7354949, @akosiaris wrote: > A few questions/points: > > * I see a bullet point `TLS certificates need to be placed in the namespace of istio-ingressgat... 
[11:20:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720983|Enable change-tags for new edits' proofread status at mulWS (T289140)]] (duration: 01m 06s) [11:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:20] T289140: ProofreadPage: Enable change-tag status system on Wikisources - https://phabricator.wikimedia.org/T289140 [11:20:25] (03CR) 10Hnowlan: [C: 03+2] maps: disable OSM sync maps2009.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/721000 (owner: 10MSantos) [11:21:48] (03PS1) 10DCausse: kafka-jumbo: drop wdqs1009 from ferm [puppet] - 10https://gerrit.wikimedia.org/r/721279 [11:21:50] (03PS1) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [11:21:52] (03PS1) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [11:21:55] !log EU backport+config window done [11:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] PROBLEM - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:29:23] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service,tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:40] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service,tilerator.service Hnowlan stopped intentionally for reimport. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:40] ACKNOWLEDGEMENT - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused Hnowlan stopped intentionally for reimport. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:35:21] (03PS2) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [11:35:23] (03PS2) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [11:35:35] (03CR) 10MMandere: [C: 03+2] wmcs::monitoring: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720939 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [11:35:49] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [11:36:06] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [11:41:03] !log Install 10.4.21-2 on db1125 [11:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] 
(03PS1) 10Muehlenhoff: The php72 hook is now only needed for Buster, use explicitly [puppet] - 10https://gerrit.wikimedia.org/r/721283 [11:42:42] (03PS1) 10Marostegui: control-mariadb-*: Bump version [software] - 10https://gerrit.wikimedia.org/r/721284 (https://phabricator.wikimedia.org/T289488) [11:43:58] (03CR) 10Marostegui: [C: 03+2] control-mariadb-*: Bump version [software] - 10https://gerrit.wikimedia.org/r/721284 (https://phabricator.wikimedia.org/T289488) (owner: 10Marostegui) [11:44:31] (03CR) 10Muehlenhoff: [C: 03+2] The php72 hook is now only needed for Buster, use explicitly [puppet] - 10https://gerrit.wikimedia.org/r/721283 (owner: 10Muehlenhoff) [11:44:35] (03Merged) 10jenkins-bot: control-mariadb-*: Bump version [software] - 10https://gerrit.wikimedia.org/r/721284 (https://phabricator.wikimedia.org/T289488) (owner: 10Marostegui) [11:47:34] (03PS1) 10Jcrespo: dbbackups: Switch s1 backup generation from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) [11:48:51] (03CR) 10Jcrespo: "FYI this should be ready for deploy." [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) (owner: 10Jcrespo) [11:53:19] (03CR) 10Marostegui: "To be deployed once the DC master is done?" [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) (owner: 10Jcrespo) [11:53:47] (03PS1) 10Jcrespo: dbbackups: Switch s1 backup generation from db1139 to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/721286 (https://phabricator.wikimedia.org/T290865) [11:55:13] (03CR) 10Jcrespo: "I don't necessarily like preparing patches so far away, because later the rebase may be non-trivial, but here it is :-)." 
[puppet] - 10https://gerrit.wikimedia.org/r/721286 (https://phabricator.wikimedia.org/T290865) (owner: 10Jcrespo) [12:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1200) [12:05:51] (03CR) 10Jcrespo: dbbackups: Switch s1 backup generation from db2097 to db2141 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) (owner: 10Jcrespo) [12:06:47] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [12:07:04] (03CR) 10Marostegui: dbbackups: Switch s1 backup generation from db2097 to db2141 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) (owner: 10Jcrespo) [12:08:56] (03PS3) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [12:08:58] (03PS3) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [12:13:08] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [12:13:15] (03PS1) 10Jcrespo: dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001 [puppet] - 10https://gerrit.wikimedia.org/r/721288 (https://phabricator.wikimedia.org/T290865) [12:26:00] (03PS1) 10Muehlenhoff: Prefer mx2001 over mx1001 for internal smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/721289 (https://phabricator.wikimedia.org/T286911) [12:26:30] (03PS1) 10Kormat: debian: Upstream release 3.2.5 [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/721291 [12:26:59] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Upstream release 3.2.5 [debs/orchestrator] - 
10https://gerrit.wikimedia.org/r/721291 (owner: 10Kormat) [12:29:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721289 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [12:34:55] (03CR) 10Klausman: [C: 03+1] kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) (owner: 10Elukey) [12:36:34] (03CR) 10Klausman: helmfile: add the ability to inject labels to Namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [12:37:11] (03CR) 10Klausman: [C: 03+1] Set lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:38:02] (03CR) 10Klausman: [C: 03+1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:38:24] (03CR) 10Klausman: [C: 03+1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [12:39:16] (03PS2) 10Jcrespo: dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001 [puppet] - 10https://gerrit.wikimedia.org/r/721288 (https://phabricator.wikimedia.org/T290868) [12:47:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::sites: allow k8s-only parameters [puppet] - 10https://gerrit.wikimedia.org/r/721265 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [12:49:58] (03PS4) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [12:50:00] (03PS4) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 
10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [12:54:41] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:21] (03PS5) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [12:58:23] (03PS5) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [13:00:05] hashar and twentyafterfour: #bothumor I ♥ Unicode. All rise for MediaWiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1300). [13:01:49] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:01:55] (03PS1) 10Kormat: debian: Upstream release 3.2.6 [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/721299 [13:02:33] train for group1 will be later tonight during the americas slot (19:00 utc) [13:03:19] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum4001.ulsfo.wmnet [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:01] (03CR) 10MVernon: [C: 03+1] "LGTM" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/721299 (owner: 10Kormat) [13:12:20] (03PS7) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) [13:12:27] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Upstream release 3.2.6 [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/721299 (owner: 10Kormat) [13:14:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum4001.ulsfo.wmnet [13:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:07] (03PS1) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [13:18:11] (03CR) 10Ottomata: [C: 03+1] Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [13:18:25] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum4002.ulsfo.wmnet [13:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:45] 10SRE, 10Traffic, 10vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) Ready to create Ganeti VM durum4001.ulsfo.wmnet in the ganeti01.svc.ulsfo.wmnet cluster on row 1 with 2 vCPUs, 4GB of RAM, 15GB of disk in the private network. Ready to create Ganeti VM... 
[13:18:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: actually redirect static/current to static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/721302 [13:22:52] (03PS6) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [13:22:54] (03PS6) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [13:23:16] (03CR) 10Elukey: [C: 03+1] kafka-jumbo: drop wdqs1009 from ferm [puppet] - 10https://gerrit.wikimedia.org/r/721279 (owner: 10DCausse) [13:23:22] (03CR) 10Elukey: [C: 03+2] kafka-jumbo: drop wdqs1009 from ferm [puppet] - 10https://gerrit.wikimedia.org/r/721279 (owner: 10DCausse) [13:25:30] (03CR) 10Ottomata: [C: 03+2] statistics::rsync::published - push to analytics-web.discovery.wmnet cname [puppet] - 10https://gerrit.wikimedia.org/r/721061 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [13:26:30] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:30:51] (03CR) 10Ottomata: [C: 03+2] Point trafficserver at analytics-web cname instead of thorium hostname [puppet] - 10https://gerrit.wikimedia.org/r/721062 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [13:32:04] (03CR) 10Elukey: "David where is the profile included? I expected to be in some role, but may I am missing something. 
About the hiera settings - are those d" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:32:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum4002.ulsfo.wmnet [13:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:08] !log pointing {stats,analytics}.wikimedia.org at analytics-web.discovery.wmnet cname - T285355 [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:15] T285355: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 [13:33:16] (03CR) 10Btullis: "This has been implemented in another CR, so abandoning this one for the sake of tidyness." [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) (owner: 10Razzi) [13:33:48] (03Abandoned) 10Btullis: hadoop: Add new worker nodes to hadoop_clusters [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) (owner: 10Razzi) [13:35:00] (03PS1) 10Dzahn: DHCP: add MAC addresses for durum4001, durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/721304 (https://phabricator.wikimedia.org/T290672) [13:35:13] (03PS2) 10Dzahn: DHCP: add MAC addresses for durum4001, durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/721304 (https://phabricator.wikimedia.org/T290672) [13:35:20] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add MAC addresses for durum4001, durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/721304 (https://phabricator.wikimedia.org/T290672) (owner: 10Dzahn) [13:35:25] (03PS6) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [13:35:27] (03PS2) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 
(https://phabricator.wikimedia.org/T288829) [13:35:40] (03CR) 10Elukey: helmfile: add the ability to inject labels to Namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [13:37:04] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for durum4001, durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/721304 (https://phabricator.wikimedia.org/T290672) (owner: 10Dzahn) [13:37:27] (03CR) 10Jelto: "A proposal on how to deploy services with helm3. I started with blubberoid. If we agreed on a solution I would amend other services as wel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:39:46] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.087 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:40:27] (03PS16) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [13:41:28] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:42:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: actually redirect static/current to static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/721302 (owner: 10Giuseppe Lavagetto) [13:44:51] (03PS9) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [13:47:19] (03Merged) 10jenkins-bot: mediawiki: actually redirect static/current to static.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/721302 (owner: 10Giuseppe Lavagetto) [13:48:35] (03CR) 10DCausse: wdqs: Add a streaming updater profile (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:48:58] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) @ssingh durum4001 and durum4002 are ready now as well. [13:49:42] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) [13:50:03] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:45] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10ssingh) >>! In T290672#7355397, @Dzahn wrote: > @ssingh durum4001 and durum4002 are ready now as well. Thank you @Dzahn; very grateful for all the help here! [13:54:53] (03PS1) 10Lucas Werkmeister (WMDE): Add new WikimediaBadges config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721305 (https://phabricator.wikimedia.org/T232927) [13:55:09] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) As it stands, my new configuration would mean that we're going to issue a permanent red... [13:56:24] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:58:14] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:06:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [14:07:29] (03PS1) 10Ottomata: Point analytics-web CNAME at an-web1001 [dns] - 10https://gerrit.wikimedia.org/r/721327 (https://phabricator.wikimedia.org/T285355) [14:07:35] (03PS2) 10Volans: sre.puppet.renew-cert: add installer SSH support [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 [14:07:38] (03CR) 10Volans: "{done}" [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [14:09:34] (03PS2) 10Ssingh: Add durum hosts durum[123]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) [14:10:13] (03PS7) 10DCausse: wdqs: Add a streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [14:10:15] (03PS7) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [14:10:40] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix longstanding bug with mcrouter cross-dc encryption [deployment-charts] - 10https://gerrit.wikimedia.org/r/721328 [14:11:15] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [14:12:24] (03PS3) 10Ssingh: Add durum hosts durum[1234]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) [14:13:17] (03CR) 10Elukey: wdqs: Add a streaming updater profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [14:22:08] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:33:02] 10SRE, 10Observability-Metrics, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10lmata) [14:33:07] 10SRE, 10Observability-Metrics, 10observability: grafana access control - https://phabricator.wikimedia.org/T108546 (10lmata) [14:36:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! I'll test this with a reinstall of an unused VM tomorrow." [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [14:39:40] (03CR) 10Volans: [C: 03+2] sre.puppet.renew-cert: add installer SSH support [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [14:42:13] (03Merged) 10jenkins-bot: sre.puppet.renew-cert: add installer SSH support [cookbooks] - 10https://gerrit.wikimedia.org/r/721263 (owner: 10Volans) [14:42:34] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10wkandek) a:03wkandek Yes, we will take a look to see how the new database can be put on all appservers. [14:42:44] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10MoritzMuehlenhoff) [14:47:17] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10nettrom_WMF) >>! In T288853#7352497, @Ottomata wrote: > The -1 is not for the idea, but for a specific implement... 
[14:48:48] (03PS1) 10Giuseppe Lavagetto: Add configuration for wmerrors to php-multiversion-base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 [14:50:23] !log installing lz4 security updates on stretch [14:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:08] (03CR) 10Lucas Werkmeister (WMDE): "> This must be tested somewhere else before getting deployed to production." [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [15:03:14] (03CR) 10DCausse: wdqs: Add a streaming updater profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:06:00] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10colewhite) [15:06:57] (03PS1) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [15:08:43] (03PS2) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [15:10:27] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10elukey) @akosiaris: me and @JMeybohm drafted https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-06_Wikifeeds, that should be in a reasonable good stat... 
[15:11:19] (03PS8) 10DCausse: wdqs: Prepare streaming updater settings [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [15:11:21] (03PS8) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [15:13:55] (03PS3) 10DCausse: alertmanager: set search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) [15:14:24] (03CR) 10DCausse: alertmanager: set search-platform team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [15:19:23] (03CR) 10Ahmon Dancy: Add configuration for wmerrors to php-multiversion-base (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 (owner: 10Giuseppe Lavagetto) [15:19:59] (03PS2) 10Giuseppe Lavagetto: Add configuration for wmerrors to php-multiversion-base [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 (https://phabricator.wikimedia.org/T288851) [15:20:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10MoritzMuehlenhoff) >>! In T290766#7349755, @MRaishWMF wrote: > Hi @cmooney , thatnks for noticing that. Yes, the 'mraish' account was set up when I w... 
[15:24:19] (03CR) 10Tobias Andersson: [C: 03+1] "looks correct to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721305 (https://phabricator.wikimedia.org/T232927) (owner: 10Lucas Werkmeister (WMDE)) [15:26:52] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:27:31] (03PS3) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [15:30:03] (03PS1) 10Cathal Mooney: Grant access to anlytics-privatedata-users for Michael Raish [puppet] - 10https://gerrit.wikimedia.org/r/721339 (https://phabricator.wikimedia.org/T290766) [15:30:11] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10elukey) For the records, puppet was failing with: ` Sep 14 12:01:09 an-worker1096 puppet-agent[35073]: (/Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/g]/File... [15:31:30] (03CR) 10JMeybohm: [C: 04-1] services: deploy services with helm3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [15:32:02] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow injecting the wmerrors script [deployment-charts] - 10https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851) [15:32:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (and make sure to run "offboard-user -l mraish" on mwmaint1002 after merging)." 
[puppet] - 10https://gerrit.wikimedia.org/r/721339 (https://phabricator.wikimedia.org/T290766) (owner: 10Cathal Mooney) [15:33:35] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: inject php7-0fatal-error.php in k8s [puppet] - 10https://gerrit.wikimedia.org/r/721342 (https://phabricator.wikimedia.org/T288851) [15:33:41] (03CR) 10JMeybohm: [C: 04-1] helmfile: add the ability to inject labels to Namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [15:34:52] (03CR) 10Ahmon Dancy: Add configuration for wmerrors to php-multiversion-base (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/721333 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:35:37] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: inject php7-fatal-error.php in k8s [puppet] - 10https://gerrit.wikimedia.org/r/721342 (https://phabricator.wikimedia.org/T288851) [15:35:53] (03PS2) 10Cathal Mooney: Grant access to anlytics-privatedata-users for Michael Raish [puppet] - 10https://gerrit.wikimedia.org/r/721339 (https://phabricator.wikimedia.org/T290766) [15:39:36] (03PS9) 10DCausse: wdqs: Prepare streaming updater settings [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) [15:39:38] (03PS9) 10DCausse: wdqs: activate the streaming_updater role on wdqs2008 [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) [15:41:23] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:41:32] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) Hi @WDoranWMF, thanks for the heads up! 
That timeline should work. We don't expect mo... [15:45:35] (03CR) 10Eevans: "Looks good to me (see suggestion in-line though)." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [15:45:43] (03CR) 10Ahmon Dancy: safe-service-restart: only verify pooled services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [15:45:49] (03CR) 10Herron: [C: 03+1] Prefer mx2001 over mx1001 for internal smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/721289 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:46:22] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:24] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:47:54] (03CR) 10Vgutierrez: [C: 03+2] cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:48:14] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:51:54] (03CR) 10Mholloway: Convert $wgEventStreams to be an associative array (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [15:53:16] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/721345 
[15:56:18] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:56:31] !log Remove 2FA for User:Rho at wikitech, identity verified via a videocall [15:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:54] (03PS1) 10Herron: logstash::input::gelf: add host param [puppet] - 10https://gerrit.wikimedia.org/r/721346 [15:57:58] (03PS2) 10Hnowlan: Warn when no instance name is passed. [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) [15:59:22] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721281 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [16:02:47] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum5001.eqsin.wmnet [16:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:53] (03CR) 10JMeybohm: [C: 04-1] kubernetes: add revscoring-editquality in the services configs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [16:10:05] * elukey cries in a corner [16:10:38] * jayme hands tissue [16:11:20] it's just me trying to be tidy...you might ignore :) [16:12:09] nono I had the same doubts, I didn't want a very invasive patch but I am ok to expand the scope of it :D [16:13:16] (03CR) 10Eevans: Warn when no instance name is passed. 
(031 comment) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [16:17:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum5001.eqsin.wmnet [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:09] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum5002.eqsin.wmnet [16:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:24] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) Ready to create Ganeti VM durum5001.eqsin.wmnet in the ganeti01.svc.eqsin.wmnet cluster on row 1 with 2 vCPUs, 4GB of RAM, 15GB of disk in the private network. Read... [16:26:01] !log joal@deploy1002 Started deploy [analytics/refinery@0f7f6f3]: Regular analytics weekly train [analytics/refinery@0f7f6f3] [16:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] (03PS7) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [16:29:07] (03PS3) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) [16:29:58] (03PS1) 10Dzahn: mail::mx: remove cron that mails aliases to OIT (ITS) [puppet] - 10https://gerrit.wikimedia.org/r/721350 (https://phabricator.wikimedia.org/T122144) [16:31:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum5002.eqsin.wmnet [16:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:46] (03CR) 10Elukey: "Thanks! 
Usual Luca's leftover from previous patches :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [16:34:01] (03PS3) 10Hnowlan: Warn when no instance name is passed. [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) [16:35:52] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:36:22] (03PS1) 10Dzahn: DHCP: add MAC addresses for durum5001 and durum5002 [puppet] - 10https://gerrit.wikimedia.org/r/721353 (https://phabricator.wikimedia.org/T290672) [16:37:13] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for durum5001 and durum5002 [puppet] - 10https://gerrit.wikimedia.org/r/721353 (https://phabricator.wikimedia.org/T290672) (owner: 10Dzahn) [16:41:24] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={list,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:45:10] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:45:16] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:45:44] !log joal@deploy1002 Finished deploy [analytics/refinery@0f7f6f3]: Regular analytics weekly train [analytics/refinery@0f7f6f3] (duration: 19m 43s) [16:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:03] (03CR) 10Gergő Tisza: [beta] GrowthExperiments: set image recommendation API URL (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [16:47:20] (03CR) 10Elukey: kubernetes: add revscoring-editquality in the services configs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [16:47:20] !log joal@deploy1002 Started deploy [analytics/refinery@0f7f6f3] (thin): Regular analytics weekly train THIN [analytics/refinery@0f7f6f3] [16:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] !log joal@deploy1002 Finished deploy [analytics/refinery@0f7f6f3] (thin): Regular analytics weekly train THIN [analytics/refinery@0f7f6f3] (duration: 00m 07s) [16:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:04] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10mmodell) >>! In T285232#7355412, @Joe wrote: > at the end of the day it seems to me that we... 
[16:50:23] !log joal@deploy1002 Started deploy [analytics/refinery@0f7f6f3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0f7f6f3] [16:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721339 (https://phabricator.wikimedia.org/T290766) (owner: 10Cathal Mooney) [16:56:38] !log joal@deploy1002 Finished deploy [analytics/refinery@0f7f6f3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0f7f6f3] (duration: 06m 15s) [16:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:51] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) Adding notes in the task before I start moving things in to the incident doc itself. * Docs are at https://wikitech.wikimedia.org/wiki/Wikifeeds. That be... [17:00:45] (03CR) 10Dzahn: "The comment at https://phabricator.wikimedia.org/T288956#7350247 sounds like this is not ready to be merged. Adding Mukunda." [puppet] - 10https://gerrit.wikimedia.org/r/720811 (https://phabricator.wikimedia.org/T288956) (owner: 10Jforrester) [17:01:43] (03CR) 1020after4: "This change is overridden by config in phabricator's database, therefore it would have no effect to merge it." [puppet] - 10https://gerrit.wikimedia.org/r/720811 (https://phabricator.wikimedia.org/T288956) (owner: 10Jforrester) [17:10:54] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) All done. You can add all hosts to site.pp now. 
[17:11:06] (03CR) 10Ladsgroup: miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [17:11:36] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) 05Open→03Resolved [17:14:07] (03PS1) 10Herron: logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 [17:16:57] (03PS1) 10Dzahn: DHCP: switch mwmaint2002 from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) [17:17:01] (03CR) 10Cathal Mooney: [C: 03+2] Grant access to anlytics-privatedata-users for Michael Raish [puppet] - 10https://gerrit.wikimedia.org/r/721339 (https://phabricator.wikimedia.org/T290766) (owner: 10Cathal Mooney) [17:19:30] (03CR) 10Dzahn: "No hosts found matching `C:thumbor` unable to do anything" [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:22:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31091/" [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:22:29] (03PS3) 10Dzahn: thumbor: convert systemd-clean-tmpfiles cron to timer [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) [17:23:15] (03PS1) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [17:24:07] (03CR) 10Dzahn: "or use a cloud VPS project, spin up instance and apply the "simplelamp2" role. 
that gets you an Apache setup you can try it on, toolforge " [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [17:24:24] (03CR) 10jerkins-bot: [V: 04-1] opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:25:38] (03CR) 10Ladsgroup: "I can't comment on the non-technical aspect of the removal 😄" [puppet] - 10https://gerrit.wikimedia.org/r/721350 (https://phabricator.wikimedia.org/T122144) (owner: 10Dzahn) [17:26:16] (03CR) 10Dzahn: "It's currently an open ZenDesk ticket to get the ok to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/721350 (https://phabricator.wikimedia.org/T122144) (owner: 10Dzahn) [17:27:22] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10NRodriguez) +1 to Lauren. @WDoranWMF of note, this does not need to be escalated if there are high... [17:27:42] topranks: there is a pending change on the master [17:27:58] heh was just typing out a similar question for you. [17:28:22] Do you want to go ahead and merge? Or I can merge your one? [17:28:29] you are first in the FIFO :) yes, please merge both [17:28:38] (03PS2) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [17:29:12] Ok done :) [17:29:26] thanks! I see it on thumbor [17:29:35] converting one more cron [17:30:17] ah nice yes I see.... 
we'll get to the top of that mountain one day :) [17:31:42] we even have a graph with progress towards 0 :) [17:31:49] in a google doc [17:31:59] it's not that far anymore, heh [17:32:20] (03PS1) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/721361 [17:32:22] (03PS1) 10Dzahn: thumbor: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/721362 (https://phabricator.wikimedia.org/T273673) [17:32:47] (03Abandoned) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/721361 (owner: 10Dzahn) [17:32:55] (03PS2) 10Dzahn: thumbor: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/721362 (https://phabricator.wikimedia.org/T273673) [17:33:08] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721345 (owner: 10Herron) [17:34:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) Ok @MRaishWMF the additional access should now be set up for account 'mikeraish' if you can try and let me know if it'... 
[17:35:26] (03CR) 10Dzahn: "[thumbor1002:~] $ systemctl list-timers | grep tmpfiles" [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:37:02] (03CR) 10Dzahn: "Process: 2555 ExecStart=/bin/systemd-tmpfiles --clean --prefix=/srv/thumbor/tmp (code=exited, status=0/SUCCESS)" [puppet] - 10https://gerrit.wikimedia.org/r/719542 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:38:18] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721345 (owner: 10Herron) [17:39:08] !log thumbor - running puppet on all thumbor hosts, removed cron job systemd-thumbor-tmpfiles-clean, added thumbor_systemd_tmpfiles_clean timer job [17:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:45] (03PS6) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [17:41:08] (03CR) 10Dzahn: [C: 03+2] "ran puppet on thumbor* and checked it's gone via cumin" [puppet] - 10https://gerrit.wikimedia.org/r/721362 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:41:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10MRaishWMF) Hi @cmooney , I'm still getting the following error message when logged in and attempting to access a particular dat... [17:42:46] (03CR) 10Ottomata: "Ah great!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/717589 (https://phabricator.wikimedia.org/T277193) (owner: 10Mholloway) [17:44:33] (03PS1) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [17:45:17] (03PS2) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [17:45:33] Amir1: are you able to SSH to this by any chance? deployment-imagescaler03.deployment-prep.eqiad1.wikimedia.cloud [17:46:42] never tried [17:46:46] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [17:46:54] works for me [17:47:01] yup [17:47:04] same here, works [17:47:30] could you do me a favor and run "crontab -u thumbor -l | grep tmpfiles" [17:47:43] I think i removed the puppet code a moment too early [17:47:48] empty [17:47:53] ah, cool, thanks :) [17:48:26] (03PS2) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [17:50:05] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [17:51:48] (03CR) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:51:55] (03PS5) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) [17:52:13] (03CR) 10jerkins-bot: [V: 04-1] thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:52:19] (03PS6) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 
10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) [18:00:05] hashar and twentyafterfour: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1800). Please do the needful. [18:00:05] tgr: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] (03PS1) 10Herron: logstash: make jmx_ params optional [puppet] - 10https://gerrit.wikimedia.org/r/721370 [18:00:13] I can deploy today! [18:00:28] o/ [18:00:29] (03PS3) 10Urbanecm: [beta] GrowthExperiments: set image recommendation API URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [18:00:32] (03CR) 10Urbanecm: [C: 03+2] [beta] GrowthExperiments: set image recommendation API URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [18:00:40] just +2'ed, can't do much more for a beta patch :) [18:00:50] (03PS2) 10Herron: logstash::input::gelf: add host param [puppet] - 10https://gerrit.wikimedia.org/r/721346 [18:00:57] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10ssingh) Thank you, again, for the help and for getting this done so quickly. 
[18:01:03] (03PS2) 10Herron: logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 [18:01:23] (03PS7) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [18:01:31] (03PS3) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [18:01:33] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: set image recommendation API URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720929 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [18:01:35] a git/fetch rebase on the deploy host would be nice [18:02:03] just so the next deployer doesn't have to think about it [18:02:16] (03PS1) 10Urbanecm: Add portrattarkiv.se to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721372 (https://phabricator.wikimedia.org/T290581) [18:02:20] yeah, I'll do that too :) [18:02:29] (03CR) 10Urbanecm: [C: 03+2] Add portrattarkiv.se to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721372 (https://phabricator.wikimedia.org/T290581) (owner: 10Urbanecm) [18:02:31] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [18:03:36] (03PS1) 10Jelto: hier::common::deployment_server add environment helmfile-defaults [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) [18:03:52] (03Merged) 10jenkins-bot: Add portrattarkiv.se to wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721372 (https://phabricator.wikimedia.org/T290581) (owner: 10Urbanecm) [18:03:57] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 
(10wiki_willy) Just a quick update - the replacement part was shipped out on Monday, and should be arriving today. (might be in the loading dock already... [18:05:32] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7620084a1ed92066aa8b29fa609cf6cbb4f799ab: Add portrattarkiv.se to wgCopyUploadsDomains whitelist of Wikimedia Commons (T290581) (duration: 01m 05s) [18:05:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:40] T290581: Add portrattarkiv.se to wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T290581 [18:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:13] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31092/console" [puppet] - 10https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [18:07:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:45] (03CR) 10Eevans: [V: 03+2 C: 03+2] "👍" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/721019 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [18:09:40] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [18:12:12] (03CR) 10Jelto: services: deploy services with helm3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [18:13:40] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) My overdue status update: * MediaWiki now has a $wgLocalHTTPProxy setting, which allows setting a HTTP proxy for all req... [18:14:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:31] (03PS8) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [18:16:09] (03PS4) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [18:16:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:52] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10UploadWizard, 10Tracking-Neverending: Uploadstash errors (tracking) - https://phabricator.wikimedia.org/T85568 (10thcipriani) [18:20:24] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [18:21:07] !log Start server-side upload for 1 video file (T290707) [18:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:12] T290707: Server side upload for Xenotron - https://phabricator.wikimedia.org/T290707 [18:23:25] !log Start server-side upload for 1 video file (T290685) [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:30] T290685: Server side upload for Xenotron - https://phabricator.wikimedia.org/T290685 [18:25:01] (03PS2) 10Ebernhardson: Declare cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [18:25:48] (03PS4) 10Ssingh: Add durum hosts durum[12345]00[12] to BGP anycast [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) [18:26:03] (03PS3) 10Ebernhardson: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 [18:27:28] !log Start server-side upload for 1 video file (T290290) [18:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:32] T290290: Server side upload for ALeoncio (WMB) - https://phabricator.wikimedia.org/T290290 [18:27:51] (03CR) 10Ssingh: "This is finally ready for review. 
Sorry for the multiple edits: I wanted to make sure I get as many hosts as I can and now that list is co" [homer/public] - 10https://gerrit.wikimedia.org/r/721018 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [18:27:58] 10SRE, 10Wikimedia-Mailing-lists: Enforce a consistent policy for disabled/archived mailing lists - https://phabricator.wikimedia.org/T281778 (10Legoktm) [18:29:21] (03CR) 10RLazarus: "Do we need to repoint mwmaint.discovery.wmnet first? It's still pointing at codfw over in templates/wmnet in the DNS repo -- not sure if t" [puppet] - 10https://gerrit.wikimedia.org/r/721358 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [18:36:18] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) 05Open→03progress a:05Cmjohnson→03RobH [18:43:05] !log migrated sitereq-l@ from Google Groups to Mailman (T290908) [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:09] T290908: Create sitereq-l@lists.wikimedia.org - https://phabricator.wikimedia.org/T290908 [18:44:40] !log Start server-side upload for 3 large PDF files (T290722) [18:44:46] 10SRE, 10Wikimedia-Mailing-lists: Create sitereq-l@lists.wikimedia.org - https://phabricator.wikimedia.org/T290908 (10Legoktm) 05Open→03Resolved Done! All subscribers and archives have been migrated. 
[18:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:47] T290722: Server side upload for Nikola_Smolenski - https://phabricator.wikimedia.org/T290722 [18:45:44] (03PS3) 10Ssingh: acme_chief: update authorized_regexes for durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) [18:50:24] !log Start server-side upload for 1 video file (T289781) [18:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:29] T289781: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T289781 [18:52:54] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) Attempting to flash the raid firmware results in failure to verify package contents on the firmware file. The firmware file is fine (re-downlo... [18:52:58] !log Start server-side upload for 1 video file (T289949) [18:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:03] T289949: server-side upload (User:Gnom, Mount Kenya video) - https://phabricator.wikimedia.org/T289949 [18:55:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) Ok thanks for the quick feedback. Not 100% why that might be I will double check with those more experienced than I and try to get it sorted. [18:56:50] o/ [18:59:07] sigh, I ran the CI on the wrong CR. does anyone have any tips handy on how to kill a long-running CI? I thought it will time out but no! [18:59:41] link? you should be able to log into integration.wikimedia.org/ci/ with your LDAP account and stop it [19:00:04] hashar and twentyafterfour: How many deployers does it take to do MediaWiki train - European+American Version (secondary timeslot) deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1900). [19:00:36] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31093/ [19:01:13] it's possible I am missing something but I don't see an option to kill the job anywhere [19:02:11] are you logged into jenkins? [19:02:30] there's like a little "X" next to the progress bar [19:02:36] I killed it though [19:02:50] weird [19:02:55] ok thanks, at least [19:03:07] it's very small and hardly visible :) [19:03:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10MRaishWMF) Hi @cmooney , actually I just checked again (80 minutes later) and I actually do have the access I need now. Maybe it took a while for eve... [19:03:45] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.23 [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:59] looks like this, sukhe https://usercontent.irccloud-cdn.com/file/CIewZXZh/image.png [19:04:14] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31094/console" [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [19:04:41] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.23 (duration: 00m 55s) [19:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:02] urbanecm: ha yes [19:05:33] (03CR) 10Ssingh: [V: 03+1 C: 03+2] acme_chief: update authorized_regexes for durum hosts [puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [19:06:07] (03CR) 10Ssingh: [V: 03+1 C: 03+2] "Changes since last review: expanded regex to include ulsfo and esams hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/721022 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [19:06:29] !log Start server-side upload for 1 video file (T287686) [19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:36] T287686: Server side upload for Zoozaz1 - https://phabricator.wikimedia.org/T287686 [19:07:12] !log Re-start server-side upload for 1 video file, likely temporary swift failure (T289781) [19:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:16] T289781: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T289781 [19:07:48] (03PS1) 10Hashar: all wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721377 [19:07:50] (03PS1) 10Hashar: Revert "all wikis to 1.37.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721378 [19:07:50] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Rollback all wikis to 1.37.0-wmf.23 [19:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:17] (03CR) 10Hashar: [C: 03+2] "That change did not make it to Gerrit due to a glitch in deploy-promote script." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721377 (owner: 10Hashar) [19:09:37] (03CR) 10Hashar: [C: 03+2] Revert "all wikis to 1.37.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721378 (owner: 10Hashar) [19:09:50] * urbanecm is confused [19:09:57] i thought it's wednesday? 
[19:09:59] slight screw up in the deployment sorry folks [19:10:03] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721377 (owner: 10Hashar) [19:10:05] I ran deploy-promote all first [19:10:07] cancel it [19:10:12] but the commit already got crafted [19:10:21] (03Merged) 10jenkins-bot: Revert "all wikis to 1.37.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721378 (owner: 10Hashar) [19:10:24] I then ran deploy-promote group1 which did not create the expected commit [19:10:30] since all group1 wiki already got "promoted" [19:10:39] and it synced the previously made commit that promoted everything :/ [19:10:47] :( [19:10:49] I will fill a bug report about it later on [19:11:08] anyway we got some concerning log errors from AbuseFilter so I would have rolled back group1 anyway [19:11:23] mukunda and I are filing the blockers [19:11:26] I'm sure Daimona will look quickly 🙂 [19:11:38] T291123 [19:11:39] T291123: TypeError: Argument 5 passed to MediaWiki\Extension\AbuseFilter\Parser\ParserStatus::__construct() must be of the type integer, null given, called in /srv/mediawiki/php-1.37.0-wmf.23/extensions/AbuseFilter/includes/Parser/ParserStatus.php on line 107 - https://phabricator.wikimedia.org/T291123 [19:12:15] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) attempted flash of bios and raid bios, failure to due unable to verify package - upload attempted twice from the same set of downloaded files r... [19:13:10] (03Abandoned) 10Jforrester: phabricator: Add 'In Progress' task status [puppet] - 10https://gerrit.wikimedia.org/r/720811 (https://phabricator.wikimedia.org/T288956) (owner: 10Jforrester) [19:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:06] PHP Notice: Undefined index: format when unserializing joyce [19:15:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:42] https://phabricator.wikimedia.org/T291124 [19:18:59] (03PS1) 10Daimona Eaytoy: Bump EditStashCache version [extensions/AbuseFilter] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721312 (https://phabricator.wikimedia.org/T291123) [19:19:31] (03PS2) 1020after4: Bump EditStashCache version [extensions/AbuseFilter] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721312 (https://phabricator.wikimedia.org/T291123) (owner: 10Daimona Eaytoy) [19:20:05] (03CR) 1020after4: [C: 03+2] Bump EditStashCache version [extensions/AbuseFilter] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/721312 (https://phabricator.wikimedia.org/T291123) (owner: 10Daimona Eaytoy) [19:22:57] (03PS1) 10Ebernhardson: query_service: Move /etc/wdqs to /etc/query_service [puppet] - 10https://gerrit.wikimedia.org/r/721380 [19:22:59] (03PS1) 10Ebernhardson: query_service: Add symlink from old config dir to new [puppet] - 10https://gerrit.wikimedia.org/r/721381 [19:23:01] (03PS1) 10Ebernhardson: query_service: Require configuration to exist prior to running scap [puppet] - 10https://gerrit.wikimedia.org/r/721382 [19:23:52] (03PS1) 10Cathal Mooney: Add rhuang-ctr user to puppet data.yaml file. 
[puppet] - 10https://gerrit.wikimedia.org/r/721383 (https://phabricator.wikimedia.org/T290991) [19:25:56] (03PS2) 10Ebernhardson: query_service: Move /etc/wdqs to /etc/query_service [puppet] - 10https://gerrit.wikimedia.org/r/721380 [19:25:58] (03PS2) 10Ebernhardson: query_service: Add symlink from old config dir to new [puppet] - 10https://gerrit.wikimedia.org/r/721381 [19:26:00] (03PS2) 10Ebernhardson: query_service: Require configuration to exist prior to running scap [puppet] - 10https://gerrit.wikimedia.org/r/721382 [19:26:09] Sorry for the AF blocker. Did I miss the train triage? [19:26:17] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721380 (owner: 10Ebernhardson) [19:26:29] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721382 (owner: 10Ebernhardson) [19:29:31] Daimona: yeah triage is over though we are still working on it [19:31:33] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Joe) Using envoy as a proper proxy instead than a transparent one makes me uneasy a bit; it's not how we've used it, and I don't... [19:36:30] Table 'arbcom_dewiki.echo_push_subscription' doesn't exist [19:36:31] neat [19:43:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:47] this seems like a fun week [19:44:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:45] hashar: that sounds like someone pushed code not ready for prod [19:46:51] Or forgot a config flag [19:46:59] Or to create the table [19:49:15] for the erroneous promote I filed https://phabricator.wikimedia.org/T291130 [19:50:28] !log twentyafterfour@deploy1002 Synchronized php-1.37.0-wmf.23/extensions/AbuseFilter/: sync backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/721312 (duration: 01m 06s) [19:50:28] hashar: do you want to backport the Title::getBacklinkCache patches? [19:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:18] zabe: it might not be needed after all [19:51:29] the log is filtered out and the patch will be included next week [19:51:35] guess I should state that on the task [19:52:09] I did tell you https://phabricator.wikimedia.org/T281164#7350925 ;) [19:53:31] I have closed https://phabricator.wikimedia.org/T290909 again ;) [19:53:41] ok [19:53:44] and yes that message has definitely been helpful! 
[19:53:47] thank you [19:54:37] and there is T291128 Table 'arbcom_dewiki.echo_push_subscription' doesn't exist [19:54:37] T291128: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'arbcom_dewiki.echo_push_subscription' doesn't exist (db1175)Function: EchoPush\SubscriptionManager::getSubscriptionsForUserQuery: SELECT * FROM `echo_push_subscription` INNER JOIN `echo_push_provider` ON ((eps_provider = epp_id)) LEFT JOIN `echo_push_topic` ON ((eps_topic = ept_id)) WHERE eps_user = 91 - https://phabricator.wikimedia.org/T291128 [19:55:06] (03PS1) 10Daimona Eaytoy: Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) [19:56:11] hashar: FYI I've tagged this weeks blockers as 1.37 blockers too that weren't [19:56:16] Also for next week [19:56:31] Because the code has hit REL1_37 in some cases [19:56:39] thanks RhinosF1 [19:57:18] twentyafterfour: np, I want to avoid last time where a few train blockers didn't get backported then caught us out [19:58:59] so one cache incompatibility related issue [19:59:04] and a database table being blocked [19:59:07] * twentyafterfour writes the train blocked email [19:59:36] daimona, twentyafterfour and I were in the video call for the last hour. Mukunda follows up [19:59:57] I am going off ! Thank you everyone for the support feedback messages etc! [20:00:05] hashar and twentyafterfour: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T1900). [20:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T2000). Please do the needful. [20:00:06] Bye hashar! 
[20:01:51] Thank you for driving code through every week [20:07:55] (03PS4) 10Ebernhardson: query_service: Enable installation to new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/721382 [20:08:02] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [20:08:24] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721382 (owner: 10Ebernhardson) [20:10:09] (03PS11) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [20:13:11] PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [20:17:02] (03PS3) 10Herron: logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 [20:17:13] (03PS12) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [20:17:18] (03PS6) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [20:18:14] (03PS5) 10Ebernhardson: query_service: Enable installation to new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/721382 [20:18:43] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, 10serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [20:19:42] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, 10serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [20:20:14] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721382 (owner: 10Ebernhardson) 
[20:21:14] (03PS6) 10Ebernhardson: query_service: Enable installation to new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/721382 [20:22:35] (03PS4) 10Herron: logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 [20:22:48] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [20:23:54] (03CR) 10jerkins-bot: [V: 04-1] Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [20:27:13] looking at T291128 ... how can a database table disappear? or is that a new wiki that hasn't had all schema updates applied? [20:27:14] T291128: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'arbcom_dewiki.echo_push_subscription' doesn't exist (db1175)Function: EchoPush\SubscriptionManager::getSubscriptionsForUserQuery: SELECT * FROM `echo_push_subscription` INNER JOIN `echo_push_provider` ON ((eps_provider = epp_id)) LEFT JOIN `echo_push_topic` ON ((eps_topic = ept_id)) WHERE eps_user = 91 - https://phabricator.wikimedia.org/T291128 [20:27:54] subbu: I'm not sure, I assumed it was a new table [20:28:08] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721382 (owner: 10Ebernhardson) [20:28:10] it is not. [20:28:11] code referencing something that hasn't been created in prod yet [20:28:19] arbcom_dewiki is not new either [20:28:19] hmmm [20:28:23] https://github.com/wikimedia/mediawiki-extensions-Echo/blob/master/db_patches/echo_push_subscription.sql ... 2020. [20:28:35] well then I do not know what's up with it [20:29:50] is that repeatable? or was that a transient error? [20:30:45] I have no idea, it showed up when we rolled forward, I don't think it's happening now that we rolled back [20:30:52] nemo-yiannis pointed me to that sql commit from 2020. [20:31:57] I am not really familiar with extensions in production. 
Are the db patches applied automatically or needs manual intervention? [20:32:07] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Enable installation to new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/721382 (owner: 10Ebernhardson) [20:32:30] I think they are done manually [20:33:34] twentyafterfour, the error stack indicates the error is from 1.37.0-wmf.21 though [20:33:59] (03PS13) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [20:33:59] RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [20:34:04] (03PS7) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [20:34:46] so i wonder if this was something transient ... if not, if we need some dba to poke at what happened to that table. [20:36:13] One other explanation is that we never created them in the first place but never had events related to push notifications so no error got triggered [20:37:16] (03PS2) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [20:37:25] (03CR) 10Herron: [C: 04-2] "still reviewing the approach in the parent changes, self -2ing for now" [puppet] - 10https://gerrit.wikimedia.org/r/721364 (owner: 10Herron) [20:37:37] sukhe: i can't find that table in x1 (where most echo_* tables are) or in the main DB section. https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L142 is where the initial table creation happens (where someone creates a wiki). 
It appears to load https://github.com/wikimedia/mediawiki-extensions-Echo/blob/master/echo.sql, and...echo_push_subscription is not there [20:37:42] so nemo-yiannis might well be right [20:39:31] (03CR) 10Herron: [C: 03+1] alerts: copy metadata for alert rules on deploy [puppet] - 10https://gerrit.wikimedia.org/r/720243 (owner: 10Filippo Giunchedi) [20:43:35] (03CR) 10Herron: "fwiw the effective changes can be see in the full diff of the topmost change, for example https://puppet-compiler.wmflabs.org/compiler1001" [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [20:44:02] urbanecm: wrong nick? :) [20:44:09] oh, yes [20:44:12] meant to ping subbu [20:44:14] sorry :) [20:44:19] echo_push_subscription is a shared table (in wikishared), not sure why it's looking in the local DB, that's the problem [20:44:28] i'll look into why that's happening now [20:44:42] urbanecm, ok :) but looks like michael has an idea here. [20:44:47] (03PS1) 10Cwhite: profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) [20:46:21] mholloway: because `$wgEchoSharedTrackingCluster` is false in arbcom_dewiki [20:46:34] that would do it [20:47:09] mholloway: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L3186 only sets it to x1/wikishared if the wiki uses centralauth [20:47:20] which is not the case for private wikis, like arbcom_dewiki [20:47:56] hmm, but push shouldn't be enabled on arbcom_dewiki to begin with, only on the wikipedias [20:48:07] arbcom_dewiki //is// considered a wikipedia [20:48:11] (it's in wikipedia.dblist) [20:48:28] you need to set 'arbcom_dewiki' => false if you want to exclude arbcom wikis [20:49:11] note it's not the only private wiki that's in wikipedia.dblist [20:49:16] those wikis are also affected https://www.irccloud.com/pastebin/R3qVbypt/ [20:49:27] 
(`grep -f dblists/private.dblist dblists/wikipedia.dblist` is what i used) [20:50:57] (03PS1) 10Ebernhardson: query_service: Provide generic path to query_service logs [puppet] - 10https://gerrit.wikimedia.org/r/721394 [20:51:26] (03PS2) 10Ebernhardson: query_service: Provide generic path to query_service logs [puppet] - 10https://gerrit.wikimedia.org/r/721394 [20:52:04] (03PS1) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [20:52:08] (03CR) 10jerkins-bot: [V: 04-1] query_service: Provide generic path to query_service logs [puppet] - 10https://gerrit.wikimedia.org/r/721394 (owner: 10Ebernhardson) [20:52:11] OK... I suppose to fix the UBN we can exclude those wikis specifically. (Is it true that wiki-specific settings override group settings, e.g., 'arbcom_dewiki' => false takes precedence over 'wikipedia' => true ?) [20:52:17] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721394 (owner: 10Ebernhardson) [20:52:23] it is true [20:52:54] Then I think a follow-up ticket is in order to evaluate next steps. I'll file that. [20:53:05] First, the config update [20:53:11] there's also T183549, which is closely related [20:53:11] T183549: Arbcom wikis are in both wikipedia.dblist and special.dblist - https://phabricator.wikimedia.org/T183549 [20:53:31] once this is done, i am curious to know why this got triggered today given that these are not new wikis. 
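[editor's note] The exchange above leans on two ideas: finding wikis that appear in both private.dblist and wikipedia.dblist, and a wiki-specific key ('arbcom_dewiki' => false) winning over a group key ('wikipedia' => true). The following is a minimal sketch of both, not MediaWiki's actual configuration code. The function names, the 'default' value, and every wiki name other than arbcom_dewiki are assumptions made up for illustration, and the list comparison uses exact names, which only roughly mirrors what `grep -f` does.

```python
# Simplified illustration only; this is NOT MediaWiki's real settings resolver.
# resolve_setting and wikis_in_both are hypothetical helper names.

def resolve_setting(values, wiki, tags, fallback=None):
    """Pick a value for `wiki`: an exact wiki key wins, then the first
    matching tag (dblist name) the wiki belongs to, then 'default'."""
    if wiki in values:
        return values[wiki]
    for tag in tags:
        if tag in values:
            return values[tag]
    return values.get("default", fallback)


def wikis_in_both(list_a, list_b):
    """Names present in both lists, in list_b order (exact-name matching)."""
    wanted = set(list_a)
    return [name for name in list_b if name in wanted]


# Only arbcom_dewiki comes from the log; the other names are placeholders.
private_dblist = ["arbcom_dewiki", "exampleprivatewiki"]
wikipedia_dblist = ["dewiki", "arbcom_dewiki", "enwiki"]

# Assumed shape of the setting after the fix discussed in the log.
wmgEchoEnablePush = {
    "default": False,        # assumption: disabled unless a group enables it
    "wikipedia": True,       # group-level value
    "arbcom_dewiki": False,  # wiki-level override
}

affected = wikis_in_both(private_dblist, wikipedia_dblist)
print(affected)
print(resolve_setting(wmgEchoEnablePush, "arbcom_dewiki", ["wikipedia"]))
print(resolve_setting(wmgEchoEnablePush, "dewiki", ["wikipedia"]))
```

Under these assumptions the wiki-level key is checked before any group key, which is why an explicit per-wiki false is enough to exclude a private wiki that is also listed as a wikipedia.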
[20:54:20] (03PS3) 10Ebernhardson: query_service: Provide generic path to query_service logs [puppet] - 10https://gerrit.wikimedia.org/r/721394 [20:54:26] i have a theory [20:55:05] (03PS1) 10Urbanecm: Set wmgEchoEnablePush to false explicitly on arbcom_* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721396 (https://phabricator.wikimedia.org/T291128) [20:55:14] mholloway: i think this should be it ^^ [20:55:56] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) a:05RobH→03Bstorm I incremented the idrac from 2.52 to 2.81, bypassing the known bad version 2.61. Seems 2.81 also has an https issue (dif... [20:56:09] (03PS2) 10Urbanecm: Set wmgEchoEnablePush to false explicitly on arbcom_* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721396 (https://phabricator.wikimedia.org/T291128) [20:56:11] (03CR) 10Mholloway: [C: 03+1] Set wmgEchoEnablePush to false explicitly on arbcom_* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721396 (https://phabricator.wikimedia.org/T291128) (owner: 10Urbanecm) [20:56:21] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721394 (owner: 10Ebernhardson) [20:56:30] urbanecm: yes, that should do it. [20:56:30] (03PS1) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) [20:56:46] mholloway: great :). Want to deploy it? Or should I? [20:57:56] if you don't mind, that would be great. 
(i only have prod access from my work laptop, which i'm not using at the moment) [20:58:04] (03CR) 10Urbanecm: [C: 03+2] Set wmgEchoEnablePush to false explicitly on arbcom_* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721396 (https://phabricator.wikimedia.org/T291128) (owner: 10Urbanecm) [20:58:07] sure, let's do it [20:58:15] urbanecm: If you're deploying now can you pull https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721066 (test-only, doesn't need syncing)? (Or I can.) [20:58:29] will do James_F [20:58:43] my theory is that this change is the underlying cause: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/713708 [20:58:45] (03CR) 10Urbanecm: [C: 03+2] tests: suppress API prefix uniqueness check for 'pi' [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721066 (https://phabricator.wikimedia.org/T290585) (owner: 10Jforrester) [20:58:48] (03Merged) 10jenkins-bot: Set wmgEchoEnablePush to false explicitly on arbcom_* wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721396 (https://phabricator.wikimedia.org/T291128) (owner: 10Urbanecm) [20:59:08] Thanks! [20:59:10] push used to default to disabled for all notification types, but now defaults to enabled for the same notification types as web [20:59:47] (03CR) 10Jforrester: [C: 03+1] "Yeah, this makes sense." 
[core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [21:00:18] (this could also have been triggered if someone on one of the arbcom wikis got curious and enabled push for one or more notification types in notification settings, but i don't think it's a coincidence that this happened shortly after the change i linked) [21:00:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 60e7e515d7034a9f839d78851f1dcc2be3df7f3b: Set wmgEchoEnablePush to false explicitly on arbcom_* wikis (T291128) (duration: 01m 06s) [21:00:40] mholloway: should be live [21:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:42] T291128: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'arbcom_dewiki.echo_push_subscription' doesn't exist (db1175)Function: EchoPush\SubscriptionManager::getSubscriptionsForUserQuery: SELECT * FROM `echo_push_subscription` INNER JOIN `echo_push_provider` ON ((eps_provider = epp_id)) LEFT JOIN `echo_push_topic` ON ((eps_topic = ept_id)) WHERE eps_user = 91 - https://phabricator.wikimedia.org/T291128 [21:02:13] mholloway, i see. ok. thx. similar to nemo-yiannis was saying but now we know what exactly might have caused it. good to have mysteries resolved. :) [21:02:48] ah, will have to catch up on backscroll [21:03:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:17] mholloway, -> "One other explanation is that we never created them in the first place but never had events related to push notifications so no error got triggered" [21:03:26] twentyafterfour: hashar: fyi ^^, T291128 should be hopefully fixed now :) [21:04:47] thx mholloway and urbanecm. 
[21:04:52] np :) [21:04:55] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31095/console" [puppet] - 10https://gerrit.wikimedia.org/r/721394 (owner: 10Ebernhardson) [21:04:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:17] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Provide generic path to query_service logs [puppet] - 10https://gerrit.wikimedia.org/r/721394 (owner: 10Ebernhardson) [21:08:19] thanks mholloway [21:08:29] and urbanecm [21:08:37] any time [21:08:43] yes, thanks urbanecm and all! [21:08:44] urbanecm: thank you! [21:09:44] (03PS1) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) [21:09:46] (03PS3) 10Urbanecm: Message: Remove deprecated format property [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) [21:11:47] (03CR) 10jerkins-bot: [V: 04-1] role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:17:42] (03PS2) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618) [21:20:48] (03Merged) 10jenkins-bot: tests: suppress API prefix uniqueness check for 'pi' [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721066 (https://phabricator.wikimedia.org/T290585) (owner: 10Jforrester) [21:24:53] (03CR) 10Subramanya Sastry: Message: Remove deprecated format property (031 comment) [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/721313 (https://phabricator.wikimedia.org/T146416) (owner: 10Daimona Eaytoy) 
[21:26:03] James_F: fyi,that patch got merged & fetched [21:27:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:24] (03PS1) 10Ebernhardson: query_service: Require paths prior to running wdqs promotion [puppet] - 10https://gerrit.wikimedia.org/r/721405 [21:34:04] (03CR) 10jerkins-bot: [V: 04-1] query_service: Require paths prior to running wdqs promotion [puppet] - 10https://gerrit.wikimedia.org/r/721405 (owner: 10Ebernhardson) [21:35:29] (03PS2) 10Ebernhardson: query_service: Require paths prior to running wdqs promotion [puppet] - 10https://gerrit.wikimedia.org/r/721405 [21:40:23] 10SRE, 10ops-codfw, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability): decommission maps2001.codfw.wmnet, maps2002.codfw.wmnet, maps2003.codfw.wmnet, maps2004.codfw.wmnet - https://phabricator.wikimedia.org/T290588 (10wiki_willy) a:05hnowlan→03Papaul [21:40:48] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@f3473d9]: Reference files deployed by puppet through query_service paths instead of wdqs [21:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:52] (03PS2) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [21:42:11] (03PS3) 10Ryan Kemper: query_service: Require paths prior to running wdqs promotion [puppet] - 10https://gerrit.wikimedia.org/r/721405 (owner: 10Ebernhardson) [21:42:55] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: Require paths prior to running wdqs promotion [puppet] - 
10https://gerrit.wikimedia.org/r/721405 (owner: 10Ebernhardson) [21:42:55] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@f3473d9]: Reference files deployed by puppet through query_service paths instead of wdqs (duration: 02m 07s) [21:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:35] (03CR) 10Ryan Kemper: [C: 03+1] "Looks good; naturally we don't want to deploy this until it's cutover time" [dns] - 10https://gerrit.wikimedia.org/r/721327 (https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [21:50:09] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Urbanecm) [21:50:22] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Urbanecm) [21:52:37] urbanecm: Thanks! [21:54:11] np [21:55:17] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.85`. 
Pre-deploy tests passing on canary `wdqs1003` [21:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:33] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@902529b]: 0.3.85 [21:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:17] !log [WDQS Deploy] Tests passing following deploy of `0.3.85` on canary `wdqs1003`; proceeding to rest of fleet [21:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:33] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@902529b]: 0.3.85 (duration: 06m 59s) [22:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:22] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [22:03:24] !log [WDQS Deploy] Restarted `wdqs-categories` across both test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [22:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:36] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [22:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:37] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: Confirmation message when changing email subscription is broken - https://phabricator.wikimedia.org/T291134 (10Legoktm) This bug was introduced in https://gitlab.com/mailman/postorius/-/commit/1882dd6f934e1350f78ab813448211885e55a5bb, where the message format was ac... 
[22:19:54] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Legoktm) [22:20:00] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: Confirmation message when changing email subscription is broken - https://phabricator.wikimedia.org/T291134 (10Legoktm) [22:26:02] thanks legoktm :) [22:31:21] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:31:25] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1001 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:38:12] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [22:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:17] (03PS5) 10Dave Pifke: statsv: add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [22:51:47] !log uploaded new mailmanclient/postorius packages to apt1001 [22:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210915T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS. 
[23:02:54] !log upgrading lists1001 to use postorius 1.3.5 [23:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:52] blah, I broke CSS/JS [23:06:13] FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/python3/dist-packages/postorius/static/postorius/libs/html5shiv/html5shiv.js' [23:06:13] legoktm@lists1001:~$ file /usr/lib/python3/dist-packages/postorius/static/postorius/libs/html5shiv/html5shiv.js [23:06:13] /usr/lib/python3/dist-packages/postorius/static/postorius/libs/html5shiv/html5shiv.js: broken symbolic link to ../../../../../nodejs/html5shiv/dist/html5shiv.js [23:09:09] should be fixed [23:11:04] SRE, Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (Legoktm) >>! In T286217#7336644, @Legoktm wrote: > postorius 1.3.5 was released, in addition to the unsubscribe security fix we already have: https://docs.m... [23:37:19] SRE, MW-on-K8s, serviceops, MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) We could teach MediaWiki how to use a transparent proxy instead, I'll poke at that. [23:45:23] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1698139 MB (21% inode=77%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [23:50:24] (CR) RLazarus: "Do you think there are going to be a lot of cases where we give different values to print_output and print_progress_bars? 
I'd be tempted t" [software/spicerack] - https://gerrit.wikimedia.org/r/720993 (owner: Volans) [23:51:03] (PS1) SDineshKumar: Switched from cron to systemd timer for elasticsearch modules [puppet] - https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) [23:51:05] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/721413 (https://phabricator.wikimedia.org/T273673) (owner: SDineshKumar) [23:55:21] (CR) RLazarus: remote: add support to enable/disable Cumin output (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/720993 (owner: Volans)