[00:00:07] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T0000). [00:01:02] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Set up airflow-research instance on an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/708583 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [00:05:31] (03CR) 10Ahmon Dancy: [C: 03+1] "I can't say it makes things more understandable but it seems like a reasonable change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708582 (owner: 10Dduvall) [00:20:17] (03PS1) 10Ottomata: Set up airflow@platform_eng instance on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) [00:23:07] (03PS2) 10Ottomata: Set up airflow@platform_eng instance on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) [00:23:55] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30402/console" [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [00:25:32] (03PS3) 10Ottomata: Set up airflow@platform_eng instance on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) [00:26:13] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30403/console" [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [00:26:44] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Set up airflow@platform_eng instance on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/708609 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [01:01:13] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 
10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10tstarling) >>! In T257066#7233536, @Legoktm wrote: > https://test.wikipedia.org/wiki/Score/plwikisource/3 is... [03:48:54] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) |Host| Host iface| switch iface|switch name| change notes|iface on new switch |lvs2007|ens2f0np0|xe-2/0/45|asw-a2-codfw|no change| no change |lvs200... [03:53:37] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) >>! In T286881#7242985, @Vgutierrez wrote: > |Host |Row |Host iface |switch iface|switch name| > |lvs2007|**A**|ens2f0np0|xe-2/0/45|A2| > |lvs2008|A... [03:55:32] 10SRE, 10Wikimedia-Mailing-lists: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Legoktm) @joan the current list admin is juancho2291 at hotmail dot com - are they no longer active? If not I can promote you. [03:56:47] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-akhatun-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:09] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:27:25] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Marostegui) Thanks John! All looking good: ` root@db1129:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prima... 
[04:47:12] (03PS1) 10Marostegui: wmnet: Failover m2-master to dbproxy1015 [dns] - 10https://gerrit.wikimedia.org/r/708646 (https://phabricator.wikimedia.org/T286032) [04:47:59] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master to dbproxy1015 [dns] - 10https://gerrit.wikimedia.org/r/708646 (https://phabricator.wikimedia.org/T286032) (owner: 10Marostegui) [04:48:50] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need to revert this. [04:49:18] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [05:07:47] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Bstorm) [05:08:34] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Bstorm) [05:38:33] 10SRE, 10Wikimedia-Mailing-lists: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10JOAN) >>! In T287554#7245392, @Legoktm wrote: > @joan the current list admin is juancho2291 at hotmail dot com - are they no longer active? If not I can promote you. Hi. I met him, but he i... [05:44:14] !log adding "comunicaciones AT wikimediacolombia.org" as owner of wikimedia-co mailing list [05:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:03] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Sort out who is admin for wikimedia-co@ - https://phabricator.wikimedia.org/T287554 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I added `comunicaciones@wikimediacolombia.org` as an owner of the mailing list. 
Create an account with this email (or add... [06:08:35] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) > ...given there could be several dozens of such very small services I [[https://logstash.wikimedia.org/goto/9f46bba4ed0d64bf14926cdb13d53561|searched t... [06:10:36] (03PS1) 10Sharvaniharan: Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [06:11:43] (03CR) 10jerkins-bot: [V: 04-1] Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [06:16:31] (03PS2) 10Sharvaniharan: Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [06:17:39] (03CR) 10jerkins-bot: [V: 04-1] Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [06:20:37] (03PS3) 10Sharvaniharan: Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [06:21:48] (03CR) 10jerkins-bot: [V: 04-1] Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [06:27:32] (03PS4) 10Sharvaniharan: Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [06:28:42] (03CR) 10jerkins-bot: [V: 04-1] Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [06:31:42] (03PS5) 10Sharvaniharan: Stream config for android_image_recommendation_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [06:41:45] !log push pfw policies - 
T287203 [06:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:53] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10ayounsi) [06:52:26] (03CR) 10Martaannaj: Added the PropertySuggester event logging to InitialiseSettings-labs.php (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (owner: 10Michaelcochez) [06:56:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:01:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:02:04] jelto: FYI ^ rails flapping [07:02:15] or maybe not? not sure if expected [07:05:26] (03PS1) 10Marostegui: Revert "db1122: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/708630 [07:06:10] (03CR) 10Marostegui: [C: 03+2] Revert "db1122: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/708630 (owner: 10Marostegui) [07:07:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/708534 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [07:10:50] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245477, @Legoktm wrote: >> ...given there could be several dozens of such very small services > > I [[https://logstash.wikimedia.org/goto/9f... 
[07:18:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [07:20:18] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Aklapper) +1 to Legoktm's last comment. "Add a comment to this task" makes this a neverending open ticket, though tickets... [07:21:47] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245525, @Joe wrote: > I looked at the query you linked, and I don't think you should exclude the `scripts/` directory, or am I missing so... [07:22:03] (03PS1) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [07:22:18] (03CR) 10Marostegui: [C: 04-2] "Needs discussion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:23:12] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:24:02] (03CR) 10Majavah: [C: 04-1] wmf-config: Wikitech migration from s10 to s6 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:25:35] (03PS2) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [07:26:46] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:27:14] (03CR) 
10Marostegui: [C: 04-2] wmf-config: Wikitech migration from s10 to s6 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:28:17] (03CR) 10Elukey: [C: 03+2] "Going to test the cookbook in dry-run mode :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [07:28:34] (03PS3) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [07:29:46] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:31:54] (03PS1) 10Filippo Giunchedi: pontoon: add kafkamon to observability [puppet] - 10https://gerrit.wikimedia.org/r/708717 [07:31:56] (03PS4) 10Marostegui: wmf-config: Wikitech migration from s10 to s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) [07:32:02] (03CR) 10Marostegui: [C: 04-2] "Question: should we leave s10.dblist just empty or delete it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:33:20] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add kafkamon to observability [puppet] - 10https://gerrit.wikimedia.org/r/708717 (owner: 10Filippo Giunchedi) [07:34:08] godog: thanks! I already noticed that rails service on gitlab has reduced availability (<98%). I will try to find out why this is happening and if this is a real issue [07:35:13] jelto: sure np! 
good luck, LMK if you need help with the prometheus side [07:35:24] (03CR) 10Ladsgroup: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:35:33] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245532, @Legoktm wrote: >>>! In T261277#7245525, @Joe wrote: >> I looked at the query you linked, and I don't think you should exclude the `s... [07:37:21] (03CR) 10Majavah: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:39:07] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245543, @Joe wrote: >>>! In T261277#7245532, @Legoktm wrote: >> Only Score shells out to paths with scripts/ AFAIS. Here's the query with... 
[07:41:05] PROBLEM - tilerator on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [07:41:25] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:35] RECOVERY - tilerator on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [07:42:55] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:15] (03CR) 10RhinosF1: [C: 03+1] "> Patch Set 4:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708716 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:48:59] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) I love "shellboxes out". Thanks I remember we discussed managing scripts explicitly when we introduced shellbox, so I was wondering why they were executed lo... [07:49:37] (03PS1) 10Elukey: Remove Grafana alerts for ORES [puppet] - 10https://gerrit.wikimedia.org/r/708719 (https://phabricator.wikimedia.org/T281359) [07:52:34] !log restarting Tomcat on idp-test [07:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:09] (03CR) 10Elukey: "I tried the cookbook in debug mode and it works with a pooled and a depooled node, but I see a lot of entries from confctl (for example, a" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [07:53:31] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:09] !log roll restart uwsgi + celery on ores[12]* nodes to pick up aspell upgrades [07:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:04] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [07:57:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you !" [puppet] - 10https://gerrit.wikimedia.org/r/708719 (https://phabricator.wikimedia.org/T281359) (owner: 10Elukey) [07:59:02] (03PS2) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) [07:59:04] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [07:59:44] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: 10Jdlrobson) [08:00:06] (03CR) 10RhinosF1: [C: 04-1] "Not until MediaWiki side is done" [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [08:00:18] (03PS3) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) [08:00:29] (03CR) 10Elukey: [C: 03+2] Remove Grafana alerts for ORES [puppet] - 10https://gerrit.wikimedia.org/r/708719 (https://phabricator.wikimedia.org/T281359) (owner: 10Elukey) [08:08:48] (03PS4) 10Michaelcochez: Added the PropertySuggester event logging to 
InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) [08:09:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet [08:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:22] (03CR) 10Michaelcochez: "Stream names are now changed to be according to how they were in the logging code. Now added them to the +wikidatawiki section instead of " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [08:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet [08:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:22] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/705901 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:26:34] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10elukey) [08:33:23] !log purging obsolete kernels from moscovium (disk space alerts for /) [08:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:50] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10jcrespo) [08:40:49] (03PS1) 10ArielGlenn: Swap dumpsdata rols back now that network maintenance is complete [puppet] - 10https://gerrit.wikimedia.org/r/708724 (https://phabricator.wikimedia.org/T286065) [08:45:18] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) >>! 
In T265435#7227212, @Papaul wrote: > @fgiunchedi the sensors for the Raritan PDU are in place Thank you @papaul ! I can confirm temperature/hum... [08:48:29] (03PS2) 10ArielGlenn: Swap dumpsdata rols back now that network maintenance is complete [puppet] - 10https://gerrit.wikimedia.org/r/708724 (https://phabricator.wikimedia.org/T286065) [08:51:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:39] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) a:05Papaul→03fgiunchedi [08:53:08] (03PS2) 10Elukey: knative-serving: override KUBERNETES_SERVICE_HOST [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) [08:53:10] (03PS10) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [08:53:12] (03PS1) 10Elukey: Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/708725 [08:54:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:57] (03PS3) 10ArielGlenn: Swap dumpsdata rols back now that network maintenance is complete [puppet] - 10https://gerrit.wikimedia.org/r/708724 (https://phabricator.wikimedia.org/T286065) [08:59:20] (03CR) 10ArielGlenn: [C: 03+2] Swap dumpsdata rols back now that network maintenance is complete [puppet] - 10https://gerrit.wikimedia.org/r/708724 (https://phabricator.wikimedia.org/T286065) (owner: 10ArielGlenn) [09:02:53] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:09] (03PS1) 10Giuseppe Lavagetto: Centralize the 
kubernetes docker password [labs/private] - 10https://gerrit.wikimedia.org/r/708727 [09:04:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [09:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:32] (03PS3) 10Elukey: knative-serving: override KUBERNETES_SERVICE_HOST and update images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) [09:05:34] (03PS2) 10Elukey: Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/708725 [09:05:36] (03PS11) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [09:06:04] (03CR) 10jerkins-bot: [V: 04-1] knative-serving: override KUBERNETES_SERVICE_HOST and update images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:06:09] (03CR) 10jerkins-bot: [V: 04-1] Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/708725 (owner: 10Elukey) [09:06:13] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:07:44] (03CR) 10RhinosF1: "This change is ready for review." 
[software/conftool] - 10https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [09:08:47] (03PS11) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [09:09:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30407/console" [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [09:10:40] (03CR) 10RhinosF1: [C: 04-1] "not until mediawiki side is done" [software/conftool] - 10https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [09:13:24] (03PS4) 10Elukey: knative-serving: override KUBERNETES_SERVICE_HOST and update images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) [09:13:26] (03PS3) 10Elukey: Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - 10https://gerrit.wikimedia.org/r/708725 [09:13:28] (03PS12) 10Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [09:14:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [09:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:02] PROBLEM - Check systemd state on ganeti2025 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:50] (03CR) 10Elukey: "Janis: helm template seems to work fine, lemme know if I am missing something or not!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:17:13] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Centralize the kubernetes docker password [labs/private] - 10https://gerrit.wikimedia.org/r/708727 (owner: 10Giuseppe Lavagetto) [09:19:02] (03PS12) 10Btullis: Enable kerberos ticket auto-renewal for a test client [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) [09:19:15] (03CR) 10Btullis: "> Patch Set 10: -Verified" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [09:19:33] (03PS1) 10Marostegui: install_server: Reimage db2104 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/708730 (https://phabricator.wikimedia.org/T287230) [09:20:32] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2104 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/708730 (https://phabricator.wikimedia.org/T287230) (owner: 10Marostegui) [09:40:47] !log uncordon kubestage1002.eqiad.wmnet as rsyslog was restarted and log shipping to logstash works again [09:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:34] !log installing Mariadb 10.3.29 updates from Buster point release (as packaged in Debian, not the WMF DB packages) [09:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:29] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@6960d32]: Increase mirrored traffic to tegola [09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:52] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@6960d32]: Increase mirrored traffic to tegola (duration: 00m 22s) [09:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1000). Please do the needful. [10:17:07] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:18:51] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2104 T287230', diff saved to https://phabricator.wikimedia.org/P16925 and previous config saved to /var/cache/conftool/dbconfig/20210729-102753-marostegui.json [10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:03] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 [10:28:32] (03PS1) 10Marostegui: db2104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/708733 (https://phabricator.wikimedia.org/T287230) [10:29:29] (03CR) 10Marostegui: [C: 03+2] db2104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/708733 (https://phabricator.wikimedia.org/T287230) (owner: 10Marostegui) [10:29:31] (03PS1) 10Muehlenhoff: Remove ganeti01 SVC IPs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/708735 [10:40:35] (03PS9) 10Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) [10:40:37] (03PS1) 10Jcrespo: dbbackups: Reimage dbprov2002 to buster [puppet] - 10https://gerrit.wikimedia.org/r/708736 (https://phabricator.wikimedia.org/T287230) [10:40:41] (03PS1) 10Jcrespo: dbbackups: Reorganize backups after dbprov2002 reimage [puppet] - 10https://gerrit.wikimedia.org/r/708737 (https://phabricator.wikimedia.org/T287230) 
[10:42:14] (03PS10) 10Jcrespo: backup: Move backup-related hosts from misc to new backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) [10:47:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2104.codfw.wmnet with reason: REIMAGE [10:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2104.codfw.wmnet with reason: REIMAGE [10:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:14] (03PS1) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) [10:59:34] (03Restored) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [10:59:43] (03PS5) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) [10:59:48] (03CR) 10Jcrespo: "It can be seen it is a "technical noop" for most servers, except for es202X hosts, which were classified on the wrong cluster for icinga a" [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: How many deployers does it take to do EU Backport and Config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1100). 
[11:00:11] here [11:00:18] no one signed up for training this week [11:00:41] there are no patches listed in the window either [11:01:02] (03CR) 10Btullis: "Restoring this change, since we have explored the DNS discovery method and decided not to use it for the time being." [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:01:03] duesen: this would be a safe week for you to show up :-P [11:01:15] o/ [11:01:34] no patches, no trainees, no problem :-P [11:01:52] (03CR) 10Marostegui: [C: 03+1] backup: Move backup-related hosts from misc to new backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:03:26] (03CR) 10Btullis: "See https://gerrit.wikimedia.org/r/c/operations/puppet/+/706661 for the related puppet change." [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:05:21] (03CR) 10Jcrespo: [C: 03+2] backup: Move backup-related hosts from misc to new backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/708473 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:08:20] (03PS2) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) [11:10:56] Lucas_WMDE: or anyone else who can do this, it turns out that duesen can't really ever make this time, so we might as well remove him from the list. [11:11:28] (03Abandoned) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:12:44] I’m not sure how these windows get created :/ [11:12:52] I think there’s a page somewhere that gets copied? 
[11:12:59] I'll ask Tyler
[11:13:28] aha, apparently it’s automated even https://wikitech.wikimedia.org/wiki/Special:Contributions/DeploymentCalendarTool
[11:13:39] or is that only for archiving
[11:13:41] nevermind
[11:13:43] Tyler will know :)
[11:14:05] (PS1) Jbond: R:apt::package_from_component: add more flexible packages param [puppet] - https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238)
[11:14:24] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/release/+/refs/heads/master/make-deployment-calendar/deployments-calendar.json#54
[11:14:34] (CR) jerkins-bot: [V: -1] R:apt::package_from_component: add more flexible packages param [puppet] - https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: Jbond)
[11:14:35] yup!
[11:14:56] given no patches etc, I'm going to wander off....
[11:15:02] have a stress-free Thursday!
[11:15:08] see you!
[11:15:28] (CR) jerkins-bot: [V: -1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/708747 (owner: L10n-bot)
[11:16:54] SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (MoritzMuehlenhoff)
[11:17:29] SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete
[11:18:15] (CR) Btullis: "I accidentally replaced 706661 with this change."
[puppet] - 10https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:20:06] (03CR) 10Btullis: "> Patch Set 4:" [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [11:20:51] (03PS11) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [11:22:00] (03PS2) 10Jbond: R:apt::package_from_component: add more flexible packages param [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) [11:22:40] (03CR) 10jerkins-bot: [V: 04-1] R:apt::package_from_component: add more flexible packages param [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: 10Jbond) [11:34:08] (03PS1) 10Urbanecm: Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 [11:35:17] (03CR) 10jerkins-bot: [V: 04-1] Hide warning notice from backport window announcements [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708756 (owner: 10Urbanecm) [11:35:19] (03PS1) 10Urbanecm: Update default network name to Libera [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708757 [11:44:05] (03PS3) 10Jbond: R:apt::package_from_component: add more flexible packages param [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) [11:46:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30413/console" [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: 10Jbond) [11:46:38] (03CR) 10Jbond: [V: 03+1] "Not sure how to discourage this new form other then review and the comment. 
ideas welcome" [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: 10Jbond) [12:03:40] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] disable backup cronjobs for gitlab2001 [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/708275 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [12:08:51] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:12:01] (03CR) 10Ssingh: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/705707 (owner: 10Ssingh) [12:12:16] (03Abandoned) 10Ssingh: rsyslog: send auditd/audispd logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/705707 (owner: 10Ssingh) [12:13:44] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:14:54] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [12:18:43] 10SRE, 10Znuny, 10Security: Upgrade OTRS to 5.0.24 - https://phabricator.wikimedia.org/T181127 (10Zabe) [12:21:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [12:21:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2026.codfw.wmnet [12:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10daniel) [12:27:54] RECOVERY - Check systemd state on ganeti2025 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I think the warning is fine, and this will also get flagged in reviews etc." [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: 10Jbond) [12:34:23] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Papaul that's a mistake on my side, thanks for spotting it, the second NIC `ens2f1np1` is actually connected to `B7` [12:35:28] (03PS3) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [12:36:52] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [12:38:52] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [12:39:07] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) So there was an issue yesterday when the autovacum process kicked in and caused a [[ https://grafana.wikimedia.org/d/000000469/postgres?viewPanel=1&orgId... [12:39:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/708746 (https://phabricator.wikimedia.org/T287238) (owner: 10Jbond) [12:39:44] elukey: fyi merged ^^^ [12:40:24] jbond: ah nice thanks!! 
[12:40:42] np [12:42:36] PROBLEM - Check systemd state on ganeti2026 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:07] (03CR) 10Jbond: First attempt to create puppet class for statograph service. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [12:46:00] (03PS3) 10Jelto: icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) [12:51:44] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30414/console" [puppet] - 10https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:52:56] (03PS4) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [12:54:04] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [12:54:22] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service. 
[puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [12:54:55] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [12:55:49] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@1e31cc6]: Increase mirrored traffic to tegola [12:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:10] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@1e31cc6]: Increase mirrored traffic to tegola (duration: 00m 21s) [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:31] (03CR) 10Jbond: First attempt to create puppet class for statograph service. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [13:00:04] twentyafterfour and hashar: How many deployers does it take to do MediaWiki train - American+European Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1300). 
[13:02:15] 10Puppet, 10Infrastructure-Foundations: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) [13:03:02] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) [13:03:14] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) p:05Triage→03High [13:03:37] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) [13:04:14] (03CR) 10Jelto: [V: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [13:06:22] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the faces table - https://phabricator.wikimedia.org/T287673 (10jbond) [13:06:34] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the faces table - https://phabricator.wikimedia.org/T287673 (10jbond) p:05Triage→03Medium [13:08:40] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: filter large factsets - https://phabricator.wikimedia.org/T287674 (10jbond) [13:08:54] (03PS5) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [13:08:57] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb: filter large factsets - https://phabricator.wikimedia.org/T287674 (10jbond) p:05Triage→03Medium [13:10:19] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service. 
[puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [13:17:50] (03PS1) 10Muehlenhoff: ganeti: Add ganeti test cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206) [13:18:36] (03CR) 10Jbond: First attempt to create puppet class for statograph service. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [13:19:11] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Vgutierrez thank you. What about lvs2007 ens3f1np1? Actually it is connected to d7 and you want it to be moved to C7 or lvs2007 ens3f0np0 is alread... [13:19:30] 10Puppet, 10Infrastructure-Foundations, 10PostgreSQL, 10User-jbond: puppetdb: tune postgress instance - https://phabricator.wikimedia.org/T287672 (10jbond) [13:21:10] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Papaul same thing.. lvs2007 ens3f1np1 is connected to D7, the only desired changes are the new links against A4, B4, C4 and D4 [13:22:08] (03CR) 10Addshore: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:23:27] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Add ganeti test cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [13:24:02] (03CR) 10Addshore: Added the PropertySuggester event logging to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:24:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:27:06] (03PS5) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) [13:27:08] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:28:55] (03CR) 10Michaelcochez: "@Addshore you are right, I had changed this in the section above, but missed it here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:29:07] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10MoritzMuehlenhoff) [13:33:02] (03PS6) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [13:34:11] (03CR) 10Muehlenhoff: [C: 03+2] conf: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702117 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:34:31] (03CR) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:34:41] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service. 
[puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [13:34:59] (03CR) 10Michaelcochez: Added the PropertySuggester event logging to InitialiseSettings-labs.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:38:03] (03PS7) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [13:39:11] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the faces table - https://phabricator.wikimedia.org/T287673 (10jbond) I wonder if the increasing space is some how related to failing or suboptimal vacuuming and perhaps we should schedule a full vacum. from... [13:41:12] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 97 connections established with conf1004.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal [13:41:44] hmmm [13:42:10] (03CR) 10Ottomata: [C: 03+1] Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:42:59] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 25 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [13:43:02] (03CR) 10Ottomata: [C: 03+2] Added the PropertySuggester event logging to InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: 10Michaelcochez) [13:43:19] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 57 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [13:44:02] (03PS8) 10Cathal Mooney: 
First attempt to create puppet class for statograph service. [puppet] - https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569)
[13:44:14] (CR) Ottomata: "Merged and rebased on deploy1002. I did not sync to prod (no need, since -labs.php, right?)" [mediawiki-config] - https://gerrit.wikimedia.org/r/708592 (https://phabricator.wikimedia.org/T285098) (owner: Michaelcochez)
[13:44:29] hmm something is messing with conf1004 apparently
[13:44:52] (PS1) Jelto: hiera::role::common::acme_chief add gitlab-replica SNI [puppet] - https://gerrit.wikimedia.org/r/708767 (https://phabricator.wikimedia.org/T285867)
[13:45:36] can it be the most recent puppet commit? it touched conf1004, conf2004
[13:46:12] RECOVERY - Check systemd state on ganeti2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:24] sukhe: yep.. that's it, nginx terminates TLS for etcd there
[13:47:24] (PS1) Jgiannelos: tegola-vector-tiles: Increase staging replicas [deployment-charts] - https://gerrit.wikimedia.org/r/708768
[13:51:56] (CR) MSantos: [C: +1] tegola-vector-tiles: Increase staging replicas [deployment-charts] - https://gerrit.wikimedia.org/r/708768 (owner: Jgiannelos)
[13:52:57] <_joe_> !log restarting pybal on lvs1016
[13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:13] <_joe_> moritzm: let's see if this fixes it
[13:53:15] yup, I was about to do that
[13:53:23] <_joe_> but I'm 99% sure it will
[13:53:28] <_joe_> ok bgp is back btw
[13:53:46] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 115 connections established with conf1004.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal
[13:54:11] <_joe_> heh
[13:54:18] weird though
[13:54:31] <_joe_> vgutierrez: we stopped/started nginx on conf1004/5
[13:54:34] anyways..
I'm restarting pybal on lvs1014 and lvs1015
[13:54:37] <_joe_> so kind-of expected
[13:54:48] yeah, I saw the commit and the pybal log
[13:54:57] <_joe_> we know that when pybal loses connectivity it sometimes farts strangely
[13:55:35] !log restart pybal on lvs1015
[13:55:36] <_joe_> we fixed most of these issues, at this point we don't even know what is causing the issue... I am ready to blame twisted's eventloop
[13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:56:29] that's a nasty coincidence
[13:56:30] so, the nginx change will also go out on conf1006, conf2004/2005/2006 do we need to sync pybal restarts in a specific sequence
[13:56:34] pybal is already reporting BGP up again
[13:56:46] or just whack-a-mole as they come up?
[13:56:52] <_joe_> moritzm: the latter
[13:57:20] <_joe_> but beware of the fact you need to wait for the bgp sessions to be properly established before restarting the next pybal
[13:57:46] (CR) Dzahn: [C: +1] hiera::role::common::acme_chief add gitlab-replica SNI [puppet] - https://gerrit.wikimedia.org/r/708767 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[13:58:12] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal
[13:58:28] ok, I'll wait for 1016 to recover and then I'll proceed with conf2004, ok?
[13:58:40] lvs1016 is ok already, lvs1015 too
[13:58:46] going for 1014
[13:58:52] ah, yes
[13:58:57] (CR) Ottomata: [C: +1] Stream config for android_image_recommendation_interaction schema [mediawiki-config] - https://gerrit.wikimedia.org/r/708653 (owner: Sharvaniharan)
[13:59:18] (PS9) Cathal Mooney: First attempt to create puppet class for statograph service.
[puppet] - https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569)
[13:59:30] (CR) Dzahn: [C: +1] icinga::monitor::gitlab add alerts for https and ssh for gitlab [puppet] - https://gerrit.wikimedia.org/r/708530 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[13:59:32] (CR) Ottomata: [C: +1] Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[13:59:33] please note that we won't get the recovery on the BGP alert cause cr1-eqiad has BGP issues with another AS
[13:59:55] !log restart pybal on lvs1014
[13:59:57] ok
[14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:44] RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal
[14:00:50] moritzm: all good here :)
[14:00:57] vgutierrez: I'll run puppet on conf2004, then
[14:01:10] thx, I'll be ready to jump to codfw
[14:02:21] (CR) Ottomata: [C: +1] Update TLS configuration for analytics-test-presto [puppet] - https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[14:02:59] puppet run complete on 2004, let's see
[14:03:10] yeah..
etcd connections were reset already
[14:03:20] (CR) Btullis: [C: +2] Enable kerberos ticket auto-renewal for a test client [puppet] - https://gerrit.wikimedia.org/r/705356 (https://phabricator.wikimedia.org/T268985) (owner: Btullis)
[14:04:52] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 10 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:05:10] (PS1) Btullis: Update hive to log4j version2 configuration files [puppet] - https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304)
[14:05:13] !log restart pybal on lvs2007
[14:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:30] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:07:00] (PS10) Jbond: First attempt to create puppet class for statograph service.
[puppet] - https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: Cathal Mooney)
[14:07:01] lvs2008 is going to alert soon
[14:07:07] ok
[14:07:34] let's be proactive about that :)
[14:07:34] PROBLEM - PyBal connections to etcd on lvs2008 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[14:07:38] !log restart pybal on lvs2008
[14:07:39] (too late)
[14:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:03] (PS2) Btullis: Update hive to log4j version2 configuration files [puppet] - https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304)
[14:08:19] (PS3) Btullis: Update hive to log4j version2 configuration files [puppet] - https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304)
[14:08:50] RECOVERY - PyBal connections to etcd on lvs2008 is OK: OK: 8 connections established with conf2004.codfw.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[14:09:07] vgutierrez: I'll proceed with conf2005?
[14:09:10] (CR) Btullis: [C: +2] Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[14:09:23] I'm gonna hit lvs2009 and lvs2010 as well
[14:09:27] as they're close to alert
[14:09:28] SRE, DC-Ops, Traffic, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Papaul) @Vgutierrez thank you I have all the information needed. I will do my site audit and get back with you next week to setup a day and time to start m...
[14:09:28] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 64 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[14:09:32] ok
[14:09:35] (PS11) Jbond: O:alerting_host: create puppet class for statograph service. [puppet] - https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: Cathal Mooney)
[14:09:47] !log restart pybal on lvs2010
[14:09:48] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 49 connections established with conf2004.codfw.wmnet:4001 (min=59) https://wikitech.wikimedia.org/wiki/PyBal
[14:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:47] (PS1) Giuseppe Lavagetto: profile::kubernetes::deployment_server: add automation for mw on k8s [puppet] - https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570)
[14:10:49] (PS1) Giuseppe Lavagetto: trafficserver::text: also allow www.mediawiki.org for XWD [puppet] - https://gerrit.wikimedia.org/r/708772
[14:11:00] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[14:11:26] !log restart pybal on lvs2009
[14:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:36] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 59 connections established with conf2004.codfw.wmnet:4001 (min=59) https://wikitech.wikimedia.org/wiki/PyBal
[14:13:08] moritzm: all good :)
[14:13:28] proceeding with conf2005
[14:14:14] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Increase staging replicas [deployment-charts] - https://gerrit.wikimedia.org/r/708768 (owner: Jgiannelos)
[14:14:28] puppet run complete
[14:17:06] (Merged) jenkins-bot: tegola-vector-tiles: Increase staging replicas [deployment-charts] - https://gerrit.wikimedia.org/r/708768 (owner:
Jgiannelos)
[14:19:07] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[14:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:26] vgutierrez: no fireworks this time, did you bounce them proactively?
[14:24:52] moritzm: nope, lvs maintains active connections with 1004 and 2004
[14:26:02] ah, ok. proceeding with conf2006, then
[14:27:42] (CR) Btullis: "This does not restart hive services when the configuration files are updated, so i suppose that will have to be done manually after deploy" [puppet] - https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: Btullis)
[14:28:12] (CR) Elukey: "Ben I saw the code change passing on IRC and took a look, added some comments, I hope it was ok. If the CR is still WIP feel free to ignor" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: Btullis)
[14:29:20] and conf1006
[14:30:29] (PS1) Ottomata: admin README - convert to markdown and clarify system user/group docs [puppet] - https://gerrit.wikimedia.org/r/708777
[14:31:35] and done, the conf servers were the final nail in the coffin for https://phabricator.wikimedia.org/T164456
[14:31:37] (CR) Ottomata: "Uh, this was a rename for me in my local diff, dunno why gerrit shows it as a delete and add."
[puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [14:34:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=etcdmirror site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:34:26] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:35:46] !log depool cp107[5-8].eqiad.wmnet - T286032 [14:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:53] T286032: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 [14:36:21] (03PS12) 10Cathal Mooney: O:alerting_host: create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [14:38:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cp[1075-1078].eqiad.wmnet with reason: Eqiad row A maintenance [14:38:17] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp[1075-1078].eqiad.wmnet with reason: Eqiad row A maintenance [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:35] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw144[3-6].eqiad.wmnet [14:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:43] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row A maintenance ` cp[1075-... 
[14:38:44] (03CR) 10Jforrester: [C: 03+2] Update default network name to Libera [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708757 (owner: 10Urbanecm) [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:03] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw143[4-6].eqiad.wmnet [14:39:07] (03Merged) 10jenkins-bot: Update default network name to Libera [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/708757 (owner: 10Urbanecm) [14:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] !log depool dns1001 - T286032 [14:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:26] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:59] (03CR) 10Btullis: Update hive to log4j version2 configuration files (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:43:21] (03CR) 10Ottomata: "Nice approach, I think this would work." [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:43:47] PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: connect to address 10.192.32.52 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Etcd [14:43:58] moritzm: ^^? :) [14:44:05] <_joe_> checking [14:44:29] thx [14:44:31] <_joe_> Jul 29 14:31:10 conf2005 etcdmirror-conftool-eqiad-wmnet[2020]: [etcd-mirror] CRITICAL: Generic error: Connection to etcd failed due to ProtocolError('Connection broken: IncompleteRead(0 bytes read)', Incomplete [14:44:33] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) >>! 
In T279309#7242742, @Dzahn wrote: > @wiki_willy Here it would be great for us if next someone could finish the setup of mw1447 through mw1450 and take a... [14:44:37] (03CR) 10Elukey: Update hive to log4j version2 configuration files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:44:40] <_joe_> yes, it died when nginx restarted [14:45:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:12] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dns1001.wikimedia.org with reason: Eqiad row A maintenance [14:45:12] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:45:13] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dns1001.wikimedia.org with reason: Eqiad row A maintenance [14:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:18] <_joe_> the issue is that etcdmirror is designed to stop and page whenever it finds an issue [14:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:31] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance ` dns1001.... 
[14:45:31] RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Etcd [14:45:43] ack, thanks [14:46:30] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:36] !log depool lvs1013 - T286032 [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:43] T286032: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 [14:46:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:47:33] here, all good? [14:47:36] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:40] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Vgutierrez) [14:47:58] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:48:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs1013.eqiad.wmnet with reason: Eqiad row A maintenance [14:48:42] rzl: yep, fixed [14:48:42] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs1013.eqiad.wmnet with reason: Eqiad row A maintenance [14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - 
https://phabricator.wikimedia.org/T286032 (10ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance ` lvs1013.... [14:49:03] thanks :) sorry I'm late [14:49:33] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10jbond) [14:49:41] (03CR) 10Ottomata: "Nice approach, I think this would work." [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:50:11] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Vgutierrez) [14:50:17] 10Puppet, 10Infrastructure-Foundations, 10observability, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10colewhite) [14:50:19] (03CR) 10Ottomata: "Oops, ignore that last comment. I changed my mind and deleted it, but then accidentally submitted this in an open browser window draft." 
[puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:50:23] (03PS1) 10Jdlrobson: Display: Use HTML "dir" attribute for ltr/rtl [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708640 (https://phabricator.wikimedia.org/T287649) [14:50:47] (03CR) 10Btullis: "> Patch Set 3:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [14:53:38] (03PS4) 10Btullis: Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) [14:59:57] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use default value for max postgres connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/708779 [15:00:23] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) a:05fgiunchedi→03Papaul Back to @papaul since work is on hold AIUI [15:02:04] (03CR) 10Btullis: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:02:28] (03PS5) 10Btullis: Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) [15:04:28] (03CR) 10Jelto: [C: 03+2] hiera::role::common::acme_chief add gitlab-replica SNI [puppet] - 10https://gerrit.wikimedia.org/r/708767 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:05:22] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) [15:05:24] (03PS1) 10Marostegui: Revert "wmnet: Failover m2-master to dbproxy1015" [dns] - 10https://gerrit.wikimedia.org/r/708641 [15:06:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30418/console" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:06:12] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Failover m2-master to dbproxy1015" [dns] - 10https://gerrit.wikimedia.org/r/708641 (owner: 10Marostegui) [15:06:18] (03PS2) 10Marostegui: Revert "wmnet: Failover m2-master to dbproxy1015" [dns] - 10https://gerrit.wikimedia.org/r/708641 [15:07:01] (03CR) 10Marostegui: [V: 03+2 C: 03+2] Revert "wmnet: Failover m2-master to dbproxy1015" [dns] - 10https://gerrit.wikimedia.org/r/708641 (owner: 10Marostegui) [15:07:15] !log pool cp107[5-8].eqiad.wmnet - T286032 [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:25] T286032: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 [15:07:28] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Marostegui) >>! In T286032#7245427, @Marostegui wrote: > m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need t... 
[15:07:41] (03PS2) 10Jgiannelos: tegola-vector-tiles: Use default value for max postgres connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/708779 [15:09:04] (03CR) 10Btullis: [V: 03+1] "> Patch Set 5: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:09:13] !log pool dns1001.wikimedia.org - T286032 [15:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:59] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:09:59] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Use default value for max postgres connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/708779 (owner: 10Jgiannelos) [15:10:49] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:57] !log pool lvs1013.eqiad.wmnet - T286032 [15:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:39] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Use default value for max postgres connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/708779 (owner: 10Jgiannelos) [15:16:18] (03PS6) 10Btullis: Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) [15:16:24] (03PS1) 10Ottomata: Remove no longer used camus module and references to camus [puppet] - 10https://gerrit.wikimedia.org/r/708782 (https://phabricator.wikimedia.org/T271232) [15:16:45] (03CR) 10Ottomata: [C: 03+1] "Nope, fine with me, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:17:52] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30419/console" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:18:02] (03CR) 10Ahmon Dancy: profile::kubernetes::deployment_server: add automation for mw on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: 10Giuseppe Lavagetto) [15:18:23] (03Merged) 10jenkins-bot: tegola-vector-tiles: Use default value for max postgres connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/708779 (owner: 10Jgiannelos) [15:20:22] (03CR) 10Ottomata: [C: 03+2] Remove no longer used camus module and references to camus [puppet] - 10https://gerrit.wikimedia.org/r/708782 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:20:46] jbond: ok if i merge your puppet private change about statograph? [15:20:56] oh sorry [15:20:57] labs private. [15:21:03] i see it is a dummy [15:21:04] merging [15:22:35] (03PS1) 10Elukey: Import Kubeflow Kfserving 0.6.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/708783 (https://phabricator.wikimedia.org/T272919) [15:23:17] (03PS1) 10Cathal Mooney: Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config added manually under T286061 and T286032 [homer/public] - 10https://gerrit.wikimedia.org/r/708784 (https://phabricator.wikimedia.org/T284592) [15:23:19] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main'.
[15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:15] (03PS7) 10Btullis: Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) [15:27:45] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [15:28:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30421/console" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:28:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [15:30:05] (03PS8) 10Btullis: Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) [15:31:47] (03CR) 10Btullis: "> Patch Set 7: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:32:44] (03PS1) 10Ottomata: refinery/job - remove already absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/708785 (https://phabricator.wikimedia.org/T271232) [15:33:00] (03PS2) 10Ottomata: refinery/job - remove already absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/708785 (https://phabricator.wikimedia.org/T271232) [15:33:02] (03CR) 10jerkins-bot: [V: 04-1] refinery/job - remove already absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/708785 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:34:09] (03CR) 10Btullis: "> Patch Set 3:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:35:17] (03CR) 10Ottomata: [C: 03+2] 
refinery/job - remove already absented jobs [puppet] - 10https://gerrit.wikimedia.org/r/708785 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:41:00] (03CR) 10Btullis: [C: 03+2] Update hive to log4j version2 configuration files [puppet] - 10https://gerrit.wikimedia.org/r/708770 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [15:42:57] (03PS3) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/708739 (https://phabricator.wikimedia.org/T273642) [15:55:06] (03PS2) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: add automation for mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) [15:55:11] (03CR) 10Giuseppe Lavagetto: profile::kubernetes::deployment_server: add automation for mw on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: 10Giuseppe Lavagetto) [15:57:02] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Built locally and it worked, it seems very easy to I'll go ahead and build the images. Please let me know if I missed something or if ther" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/708783 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:59:00] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, 10Patch-For-Review: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) I uploaded a very simplistic script that could be used as a systemd timer, or invoked by... [16:00:05] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: 10Giuseppe Lavagetto) [16:00:05] jbond42 and rzl: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1600). [16:00:36] 10SRE, 10MW-on-K8s, 10serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe) [16:03:38] (03PS13) 10Jbond: O:alerting_host: create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:04:04] 10SRE, 10Analytics-Clusters, 10Analytics-Kanban: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata) 05Open→03Resolved [16:07:05] (03PS1) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [16:07:28] (03PS1) 10Jbond: P:diffscan: drop contacts from role [puppet] - 10https://gerrit.wikimedia.org/r/708790 [16:08:20] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [16:08:56] (03PS2) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [16:09:28] (03CR) 10Jbond: [C: 03+2] P:diffscan: drop contacts from role [puppet] - 10https://gerrit.wikimedia.org/r/708790 (owner: 10Jbond) [16:10:11] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [16:11:19] (03PS3) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [16:12:06] (03PS14) 10Jbond: O:alerting_host: create puppet class for 
statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:17:01] (03PS15) 10Jbond: O:alerting_host: create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:20:55] (03PS16) 10Jbond: O:alerting_host: create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:21:05] (03PS4) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [16:22:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30430/console" [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [16:22:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30429/console" [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:22:55] 10SRE, 10User-jbond: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10Majavah) >>! In T216088#7119005, @ayounsi wrote: > Q. : Are there cases where a profile owner isn't in data.yaml? Yes. I've written and maintain a bunch of profiles for WMCS use cases but don't have producti... [16:23:01] 10SRE, 10Traffic, 10cloud-services-team (Kanban): Puppet broken on diffscan.traffic.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T287612 (10Andrew) 05Open→03Resolved looks fixed! 
[16:27:39] !log adding uid=mdipietro,ou=people,dc=wikimedia,dc=org to cn=ops,ou=groups,dc=wikimedia,dc=org in ldap [16:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:10] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) I should also note that currently we dont know when the next vacume process will occuer and it is quite possible that when it dose it wil... [16:31:17] (03CR) 10Ayounsi: [C: 03+1] Adding flag for asw2-a-eqiad and asw2-b-eqiad to configure class-of-service shared buffer config. This will keep it in line with config add [homer/public] - 10https://gerrit.wikimedia.org/r/708784 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [17:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1700). [17:04:40] (03PS1) 10Phuedx: test: Add electcomm group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 [17:09:46] Is there anyone around that can review https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708794/? 
It adds a group on testwiki of which the members can create a poll [17:10:22] (03PS1) 10Andrew Bogott: Don't install smart drive tools on cloud VMs [puppet] - 10https://gerrit.wikimedia.org/r/708796 (https://phabricator.wikimedia.org/T287309) [17:13:48] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: Install arc in exec environ [puppet] - 10https://gerrit.wikimedia.org/r/708129 (https://phabricator.wikimedia.org/T287390) (owner: 10Urbanecm) [17:17:19] (03PS1) 10Andrew Bogott: Added phab tasks explainint why arcanist is installed [puppet] - 10https://gerrit.wikimedia.org/r/708797 (https://phabricator.wikimedia.org/T287390) [17:17:48] (03PS1) 10Mholloway: [push-notifications] Hygiene: Remove invalid TODO [deployment-charts] - 10https://gerrit.wikimedia.org/r/708798 [17:18:50] (03PS2) 10Andrew Bogott: Toolforge: Added phab tasks explainint why arcanist is installed [puppet] - 10https://gerrit.wikimedia.org/r/708797 (https://phabricator.wikimedia.org/T287390) [17:20:00] (03PS3) 10Andrew Bogott: Toolforge: Added phab tasks explaining why arcanist is installed [puppet] - 10https://gerrit.wikimedia.org/r/708797 (https://phabricator.wikimedia.org/T287390) [17:23:51] (03PS2) 10Andrew Bogott: Don't run smart drive drive check [puppet] - 10https://gerrit.wikimedia.org/r/708796 (https://phabricator.wikimedia.org/T287309) [17:24:00] (03CR) 10Andrew Bogott: [C: 03+2] Toolforge: Added phab tasks explaining why arcanist is installed [puppet] - 10https://gerrit.wikimedia.org/r/708797 (https://phabricator.wikimedia.org/T287390) (owner: 10Andrew Bogott) [17:26:17] (03CR) 10Andrew Bogott: [C: 03+2] Don't run smart drive drive check [puppet] - 10https://gerrit.wikimedia.org/r/708796 (https://phabricator.wikimedia.org/T287309) (owner: 10Andrew Bogott) [17:29:27] thanks andrewbogott ! [17:42:41] (03CR) 10Legoktm: "Did you test calling registry.get_tags_for_image from deploy1002? 
I think we'll need to add deployment servers to the registry allowlist f" [puppet] - 10https://gerrit.wikimedia.org/r/708771 (https://phabricator.wikimedia.org/T287570) (owner: 10Giuseppe Lavagetto) [17:43:25] (03PS1) 10Phuedx: Generate STV tally output page [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708801 (https://phabricator.wikimedia.org/T284585) [17:50:39] (03CR) 10Mholloway: Stream config for android_image_recommendation_interaction schema (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q: How do functions break up? A: They stop calling each other. Rise for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1800). [18:00:05] Tran and jdlrobson: A patch you scheduled for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] I can deploy today! [18:00:44] Jdlrobson: Tran: hi, are you around? [18:00:50] urbanecm: yep! [18:01:25] urbanecm: Tran is just working on a bug.
I can be around in place for them [18:01:32] ack :) [18:01:51] I just pinged them, as they were in the calendar [18:01:56] (03CR) 10Urbanecm: [C: 03+2] Display: Use HTML "dir" attribute for ltr/rtl [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708640 (https://phabricator.wikimedia.org/T287649) (owner: 10Jdlrobson) [18:02:11] (03CR) 10Urbanecm: [C: 03+2] Implement STV algorithm [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708413 (https://phabricator.wikimedia.org/T283728) (owner: 10Phuedx) [18:02:13] (03CR) 10Urbanecm: [C: 03+2] Generate STV tally output page [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708801 (https://phabricator.wikimedia.org/T284585) (owner: 10Phuedx) [18:02:41] I'll ping you when it's ready for testing :) [18:03:06] urbanecm: AIUI 708801 depends on 708413. Should I have squashed them to make deployment simpler? [18:04:12] phuedx: i see that those two patches have dependencies set in gerrit. As long as gerrit has the right dependency information, deployment should go fine :) [18:04:40] urbanecm: great [18:04:54] urbanecm: Thanks for checking. I'd also note that 708801 has i18n changes [18:05:03] (03PS6) 10Sharvaniharan: Stream config for android_notification_interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 [18:05:50] thanks for noting that phuedx, i missed that part somehow... [18:05:57] is that patch urgent in some way? [18:06:07] (03CR) 10Sharvaniharan: "oops! thank you @Michael Holloway. corrected it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (owner: 10Sharvaniharan) [18:07:08] backporting i18n changes is possible, but it requires full scap (takes ~40 minutes due to i18n cache rebuild) [18:07:23] (03Merged) 10jenkins-bot: Display: Use HTML "dir" attribute for ltr/rtl [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708640 (https://phabricator.wikimedia.org/T287649) (owner: 10Jdlrobson) [18:07:25] (03Merged) 10jenkins-bot: Implement STV algorithm [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708413 (https://phabricator.wikimedia.org/T283728) (owner: 10Phuedx) [18:07:27] (03Merged) 10jenkins-bot: Generate STV tally output page [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708801 (https://phabricator.wikimedia.org/T284585) (owner: 10Phuedx) [18:07:29] (03PS1) 10Andrew Bogott: Cloud instances: further attempt to clear the failed state of smartd [puppet] - 10https://gerrit.wikimedia.org/r/708807 (https://phabricator.wikimedia.org/T287309) [18:08:01] (03CR) 10jerkins-bot: [V: 04-1] Cloud instances: further attempt to clear the failed state of smartd [puppet] - 10https://gerrit.wikimedia.org/r/708807 (https://phabricator.wikimedia.org/T287309) (owner: 10Andrew Bogott) [18:08:14] phuedx: see my q above [18:08:46] I would wager it's urgent since we need to start testing this week [18:08:58] urbanecm: Wow. It is important. The tally output page will allow AHT QA and T&S to fully test the STV implementation. The team had hoped to land the change before the train started rolling. If it's at all possible, it should be deployed. [18:09:19] (03PS2) 10Andrew Bogott: Cloud instances: further attempt to clear the failed state of smartd [puppet] - 10https://gerrit.wikimedia.org/r/708807 (https://phabricator.wikimedia.org/T287309) [18:09:22] okay then, but don't do this regularly please 🙂 [18:10:12] Thank you! 
🙇‍♂️ [18:10:23] phuedx: Tran: Jdlrobson: your patches are at mwdebug2001, please test [18:10:43] sweet [18:10:44] looking [18:10:50] urbanecm: Noted and thank you [18:11:04] (03CR) 10Andrew Bogott: [C: 03+2] Cloud instances: further attempt to clear the failed state of smartd [puppet] - 10https://gerrit.wikimedia.org/r/708807 (https://phabricator.wikimedia.org/T287309) (owner: 10Andrew Bogott) [18:11:14] (for phuedx / Tran, note i18n messages will appear as missing, and displayed as -- that's not a bug, it will be fixed as full scap rebuilds i18n cache) [18:11:55] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - razzi@cumin1001 [18:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:16] urbanecm: that's working as expected, please sync [18:12:24] Jdlrobson: thanks, will do [18:15:13] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/GlobalWatchlist/modules/SiteDisplay.js: 9a2383d7ecfe1874c08f38a08d174364a12ad247: Display: Use HTML "dir" attribute for ltr/rtl (T287649) (duration: 01m 25s) [18:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] Jdlrobson: should be live! [18:15:21] T287649: Regression wmf.16: New vector no longer aligns sites correctly - https://phabricator.wikimedia.org/T287649 [18:15:41] thanks urbanecm [18:15:45] any time :) [18:16:02] phuedx: how does your patch look? :) [18:17:00] urbanecm: just waiting on the 5min ResourceLoader delay to confirm the new JavaScript is loading [18:17:03] Works on debug=true though [18:17:10] ack [18:17:30] urbanecm: Alright. Special:SecurePoll is functioning on testwiki as well as I can test it. I can view all elections, list votes [18:17:45] so, let's sync it? [18:18:30] urbanecm: Yes [18:18:38] doing :). will take a while. 
[18:19:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - razzi@cumin1001 [18:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] !log urbanecm@deploy1002 Started scap: 796fe8e: 927763c: SecurePoll backports (T283728, T284585) [18:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:00] T283728: Implement STV tallying in STVTallier::finishTally [XL] - https://phabricator.wikimedia.org/T283728 [18:20:00] T284585: Produce the Tally view output for an STV election [M] - https://phabricator.wikimedia.org/T284585 [18:21:34] The new algo that's to be tested is only enabled on testwiki (see https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/8aa4c53a7463a151811a403f5b860168928e9a11/wmf-config/InitialiseSettings.php#23251). If there's a bug in it, there is very limited impact [18:21:58] good to know [18:22:17] phuedx: Are we also getting the config change in? [18:22:29] urbanecm: 👋 [18:22:43] hi Niharika ! [18:22:49] phuedx: fyi, it'll take 20-40 minutes for the scap to complete [18:22:59] Tran spotted a bug in it. I'm just updating it now. That said, that depends on how long it takes for the sync-world to complete [18:23:02] urbanecm: working now! thanks for your help this morning! [18:23:04] Thanks for backporting our changes urbanecm! [18:23:10] any time Niharika :) [18:23:33] (03PS1) 10Andrew Bogott: cloud instances: /qualify/path to systemd [puppet] - 10https://gerrit.wikimedia.org/r/708810 (https://phabricator.wikimedia.org/T287309) [18:25:05] 10Puppet, 10Infrastructure-Foundations, 10observability, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10jbond) i have done some digging on the on disk size vs the database size which i think shows how much data we could potentiall... 
[18:26:58] (03CR) 10Majavah: [C: 04-1] cloud instances: /qualify/path to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708810 (https://phabricator.wikimedia.org/T287309) (owner: 10Andrew Bogott) [18:28:01] (03PS2) 10Andrew Bogott: cloud instances: /qualify/path to systemd [puppet] - 10https://gerrit.wikimedia.org/r/708810 (https://phabricator.wikimedia.org/T287309) [18:30:51] (03CR) 10Andrew Bogott: [C: 03+2] cloud instances: /qualify/path to systemd [puppet] - 10https://gerrit.wikimedia.org/r/708810 (https://phabricator.wikimedia.org/T287309) (owner: 10Andrew Bogott) [18:34:11] (03PS2) 10Phuedx: test: Add electcomm and electionadmin groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 [18:35:18] (03CR) 10jerkins-bot: [V: 04-1] test: Add electcomm and electionadmin groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 (owner: 10Phuedx) [18:36:58] !log urbanecm@deploy1002 Finished scap: 796fe8e: 927763c: SecurePoll backports (T283728, T284585) (duration: 17m 06s) [18:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:06] T283728: Implement STV tallying in STVTallier::finishTally [XL] - https://phabricator.wikimedia.org/T283728 [18:37:07] T284585: Produce the Tally view output for an STV election [M] - https://phabricator.wikimedia.org/T284585 [18:37:12] (03PS3) 10Phuedx: test: Add electcomm and electionadmin groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 [18:37:32] phuedx: should be live [18:38:26] (03CR) 10jerkins-bot: [V: 04-1] test: Add electcomm and electionadmin groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 (owner: 10Phuedx) [18:38:55] (03PS4) 10Phuedx: test: Add electcomm and electionadmin groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708794 [18:39:03] Time to fix my editor config... [18:39:32] urbanecm: Thanks :) [18:39:54] Np :) [18:40:00] (03CR) 10Mholloway: "Thanks. 
One last thing: If there's a Phabricator task for this work, both here and on the schema patch, could you please add a 'Bug: Tnnnn" [mediawiki-config] - https://gerrit.wikimedia.org/r/708653 (owner: Sharvaniharan) [18:40:09] (PS1) Eigyan: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) [18:40:11] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan) [18:40:24] Do you know someone who'd be able to check over https://gerrit.wikimedia.org/r/708794 (adding groups to testwiki)? [18:41:20] (CR) jerkins-bot: [V: -1] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan) [18:41:42] I think that config change resolves https://phabricator.wikimedia.org/T277353 so should it have the Bug line? 
[18:41:52] SRE-Access-Requests: Requesting access to GLOBAL ROOT for Michael DiPietro - https://phabricator.wikimedia.org/T287697 (nskaggs) +1 from me [18:42:17] (PS5) Phuedx: test: Add electcomm and electionadmin groups [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) [18:46:50] (CR) Zabe: test: Add electcomm and electionadmin groups (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) (owner: Phuedx) [18:46:52] (PS7) Sharvaniharan: Stream config for android_notification_interaction schema Bug: T287652 [mediawiki-config] - https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) [18:47:34] (CR) Sharvaniharan: "> Patch Set 6:" [mediawiki-config] - https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: Sharvaniharan) [18:51:57] (CR) Mepps: "Yay for first commit! Two quick notes: 1. The failure is because of white space. It's easily fixed with running "composer fix" locally or " [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan) [18:55:13] (PS6) Phuedx: test: Add electcomm and electionadmin groups [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) [18:55:44] (CR) Phuedx: test: Add electcomm and electionadmin groups (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) (owner: Phuedx) [19:00:05] twentyafterfour and hashar: Time to snap out of that daydream and deploy MediaWiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T1900). 
[19:06:16] (PS1) Andrew Bogott: cloud instance vendordata: exit 0 at the end of the custom boot script [puppet] - https://gerrit.wikimedia.org/r/708826 (https://phabricator.wikimedia.org/T287309) [19:07:34] (CR) Andrew Bogott: [C: +2] cloud instance vendordata: exit 0 at the end of the custom boot script [puppet] - https://gerrit.wikimedia.org/r/708826 (https://phabricator.wikimedia.org/T287309) (owner: Andrew Bogott) [19:16:21] No blockers for the train [19:16:58] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Michael DiPietro - https://phabricator.wikimedia.org/T287697 (mdipietro) I am not using a yubikey, though I could put in a request for one. I'll have a look at the git [19:17:54] (PS1) 20after4: all wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708827 [19:17:56] (CR) 20after4: [C: +2] all wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708827 (owner: 20after4) [19:18:39] (Merged) jenkins-bot: all wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708827 (owner: 20after4) [19:19:50] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.16 refs T281157 [19:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:58] T281157: 1.37.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T281157 [19:23:02] ugh a bunch of errors now from luasandboxcallback [19:23:16] https://phabricator.wikimedia.org/T287704 [19:24:51] Rolling back to wmf.15 [19:25:56] (PS1) 20after4: group2 wikis to 1.37.0-wmf.15 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708828 [19:25:58] (CR) 20after4: [C: +2] group2 wikis to 1.37.0-wmf.15 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708828 (owner: 20after4) [19:26:43] (Merged) jenkins-bot: group2 wikis to 
1.37.0-wmf.15 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708828 (owner: 20after4) [19:27:51] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.37.0-wmf.15 refs T281157 [19:27:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 985 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:58] T281157: 1.37.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T281157 [19:28:04] Huh, I'd have expected that kind of error to show up at group1 first. [19:28:50] James_F: apparently the specific lua call is only in use on group2 wikis? [19:28:59] this is a good argument for getting more wikis into group1 [19:29:09] Surprising, but possible. [19:29:15] Commons uses Wikidata an awful lot though. [19:29:37] all the errors seem to be on svwiki and zh_yuewiki [19:31:30] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:37:08] Ah, maybe they're doing something odd? [19:37:15] Could we roll group2 except those two wikis forward? [19:38:17] (CR) Eigyan: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan) [19:39:34] James_F: we could I suppose but I thought we had a sort of consensus to not have incomplete groups deployed [19:40:07] twentyafterfour: Yeah, I'm just worried about us getting a fix for this and rushing the train out only to find a second show-stopper six hours too late. 
[19:40:09] I could revert the offending change in scribuntu [19:40:28] It's already rather late for WMDEers to be around. :-( [19:40:28] scribunto [19:41:06] it looks like the patch only added type hints [19:41:52] But the stack on top of it did a bunch of other stuff, right? [19:42:14] does it cleanly revert? [19:42:32] If it does, that's great. [19:42:41] haven't tried yet [19:43:04] it does not [19:44:36] how could this NOT cause all kinds of fallout. Adding types where there were none to code that's used in wiki templates all over the place [19:45:24] * James_F sighs. [19:46:35] well in theory the Lua code in the extension would guard against this [19:46:38] I sometimes question the value of strict type checking in a language as loose as php [19:47:00] that's a nice theory ;) [19:47:14] Indeed. [19:48:46] I mean, it's totally sane to make the PHP part strictly typed, just that it also needs to be implemented on the Lua side, otherwise we get exceptions [19:49:16] Given how often adding strict typing to the PHP bits of MW have become train-blockers, I'm not sure I'd agree. [19:49:34] It's a good area to think about but be careful in. [19:49:59] Puppet, Infrastructure-Foundations: Gendered pronouns in README - https://phabricator.wikimedia.org/T287705 (mdipietro) [19:53:07] (PS1) Michael DiPietro: adding new hire mdipietro to ops [puppet] - https://gerrit.wikimedia.org/r/708830 [19:53:09] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/708830 (owner: Michael DiPietro) [19:57:51] (PS1) Michael DiPietro: Gender neutral pronouns for README [puppet] - https://gerrit.wikimedia.org/r/708831 [19:58:45] thing is, the lua side is user-maintained, right? it's not part of the codebase it's part of the content [19:58:55] or am I wrong about that? 
[19:59:06] (PS1) Eigyan: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708832 (https://phabricator.wikimedia.org/T287511) [20:00:19] twentyafterfour: You're right, yeah. It's possibly just a single script that's copied between svwiki and yuewiki. [20:00:51] twentyafterfour: Which "normally" would get fixed as it was edited if it caused a fatal like this, but here it's of course being applied by us not the editors. [20:03:25] twentyafterfour: no, it's in the repo [20:03:34] https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/2e0c364d1812e0348c32ea52c513afad1fd43dce/client/includes/DataAccess/Scribunto/mw.wikibase.entity.lua#101 [20:03:50] really? why only happening on two wikis hmmm [20:04:19] on-wiki templates call mw.wikibase.whatever -> function in Wikibase's Lua thing -> calls PHP implementation in Wikibase [20:05:29] all wikis use Wikidata differently and at different levels [20:05:40] probably other wikis aren't calling Lua+Wikibase in this way [20:06:14] (CR) Majavah: [C: -1] "Looks good otherwise, however please see https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines#Phabricator about linking this pa" [puppet] - https://gerrit.wikimedia.org/r/708830 (owner: Michael DiPietro) [20:08:46] (PS2) Michael DiPietro: Gender neutral pronouns for README [puppet] - https://gerrit.wikimedia.org/r/708831 (https://phabricator.wikimedia.org/T287705) [20:09:24] (CR) jerkins-bot: [V: -1] Gender neutral pronouns for README [puppet] - https://gerrit.wikimedia.org/r/708831 (https://phabricator.wikimedia.org/T287705) (owner: Michael DiPietro) [20:10:54] mdipietro: hey, thanks for 708831! 
happy to stamp it, let me know if you need any help sorting out that jenkins complaint [20:12:34] rzl: I was going to suggest they do it themselves once we merge the access request patch :) [20:12:54] legoktm: oh definitely, I wouldn't dream of *merging* it, was just gonna offer a quick +1 :) [20:13:03] :D [20:14:29] (CR) RLazarus: [C: +1] "Apart from Jenkins's nitpick about the commit message, looks good, thanks! Ping me on IRC (rzl) if you'd like any help with tidying that u" [puppet] - https://gerrit.wikimedia.org/r/708831 (https://phabricator.wikimedia.org/T287705) (owner: Michael DiPietro) [20:15:29] (PS2) Michael DiPietro: adding new hire mdipietro to ops [puppet] - https://gerrit.wikimedia.org/r/708830 (https://phabricator.wikimedia.org/T287697) [20:16:19] (PS3) Michael DiPietro: Gender neutral pronouns for README [puppet] - https://gerrit.wikimedia.org/r/708831 (https://phabricator.wikimedia.org/T287705) [20:17:46] (CR) Michael DiPietro: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/708830 (https://phabricator.wikimedia.org/T287697) (owner: Michael DiPietro) [20:19:02] (CR) Majavah: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/708830 (https://phabricator.wikimedia.org/T287697) (owner: Michael DiPietro) [20:19:49] legoktm: are you on it? I can make a patch to fix that particular error [20:20:21] Puppet, Infrastructure-Foundations, Patch-For-Review, Voice & Tone: Gendered pronouns in README - https://phabricator.wikimedia.org/T287705 (Mahir256) [20:20:38] rzl: Jenkins was just upset about the new line between Bug: and Change-Id right? Looks happy now [20:20:44] yep 👍 [20:20:50] awesome [20:21:15] Amir1: I'm not, please go ahead [20:21:36] I was just looking to see if it was a clean revert and it wasn't [20:24:27] twentyafterfour: did we have to revert? 
[20:25:26] the train is rolled back, yes [20:25:40] (to group1, not completely) [20:25:43] It's a bit hard to find someone to review the patch at this time in wmde so if anyone feels like reviewing it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/708834 [20:27:19] Amir1: how/when did it break? [20:27:49] I'd be fine reviewing it if it's bringing it back to how it was before a recent change. [20:27:54] Hard to assess otherwise [20:28:04] Krinkle: this broke it https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/704998 [20:28:31] basically adding strict typehinting where the doc was wrong [20:28:38] Amir1: yes [20:29:24] my only worry is that this might not be enough and there might be some other mistakes too [20:29:26] Krinkle: yeah this is new breakage in due to new change in wmf.16 I believe [20:29:36] yeah there could be other stuff for sure [20:29:53] But, that was the only error that showed up with high frequency. [20:30:20] sure, then let's get this reviewed and backported and roll the train forward [20:30:29] I'm around for more fixes, they should be trivial [20:31:12] 👍 [20:31:43] Amir1: ack, updated msg, LGT? [20:31:47] Y? 
[20:32:22] Krinkle: Ja [20:32:37] it won't cherry pick cleanly to wmf.16 seems [20:33:09] hmmm [20:33:15] :((((( [20:33:20] I check [20:33:55] (PS1) Ladsgroup: Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary [extensions/Wikibase] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708644 (https://phabricator.wikimedia.org/T287704) [20:34:15] looks clean ^ [20:34:44] (CR) Ladsgroup: [C: +2] Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary [extensions/Wikibase] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708644 (https://phabricator.wikimedia.org/T287704) (owner: Ladsgroup) [20:36:52] (CR) 20after4: [C: +1] Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary [extensions/Wikibase] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708644 (https://phabricator.wikimedia.org/T287704) (owner: Ladsgroup) [20:37:03] thanks Amir1! [20:37:23] I get it deployed and then let's roll [20:37:39] Awesome [20:41:51] (CR) Michael DiPietro: [C: +2] Gender neutral pronouns for README [puppet] - https://gerrit.wikimedia.org/r/708831 (https://phabricator.wikimedia.org/T287705) (owner: Michael DiPietro) [20:42:49] stepping away for a few minutes while the tests are running. 
I'll be back to roll the train in a few minutes [20:43:11] Puppet, Infrastructure-Foundations, Patch-For-Review, Voice & Tone: Gendered pronouns in README - https://phabricator.wikimedia.org/T287705 (mdipietro) Open→Resolved [20:44:27] (CR) Michael DiPietro: [C: +2] adding new hire mdipietro to ops [puppet] - https://gerrit.wikimedia.org/r/708830 (https://phabricator.wikimedia.org/T287697) (owner: Michael DiPietro) [20:50:53] Welcome mdipietro [20:51:01] (CR) Addshore: [C: +1] Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary [extensions/Wikibase] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708644 (https://phabricator.wikimedia.org/T287704) (owner: Ladsgroup) [20:51:09] thank you RhinosF1 [20:51:31] :) [20:57:25] (Merged) jenkins-bot: Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary [extensions/Wikibase] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708644 (https://phabricator.wikimedia.org/T287704) (owner: Ladsgroup) [20:59:37] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/Wikibase/client: Backport: [[gerrit:708644|Let language parameter accept null in Scribunto_LuaWikibaseEntityLibrary (T287704)]] (duration: 01m 09s) [20:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:46] T287704: TypeError: Argument 2 passed to Wikibase\Client\DataAccess\Scribunto\Scribunto_LuaWikibaseEntityLibrary::addLabelUsage() must be of the type string, null given, called in /srv/mediawiki/php-1.37.0-wmf.16/extensions/Scribunto/includes/engines/LuaSandbox/LuaSandboxCallback.php on line 26 - https://phabricator.wikimedia.org/T287704 [21:00:09] twentyafterfour: done ^ [21:00:27] let's roll the train and let me know if more errors show up. 
I stay around for a bit [21:07:09] Amir1: thanks, rolling [21:08:48] choo choo [21:09:41] (PS1) 20after4: group2 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708839 [21:09:43] (CR) 20after4: [C: +2] group2 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708839 (owner: 20after4) [21:10:27] (Merged) jenkins-bot: group2 wikis to 1.37.0-wmf.16 refs T281157 [mediawiki-config] - https://gerrit.wikimedia.org/r/708839 (owner: 20after4) [21:11:48] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.37.0-wmf.16 refs T281157 [21:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:56] T281157: 1.37.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T281157 [21:12:20] yay! no exception explosion this time. Amir1, it may be too soon to call it but looks good [21:12:29] wohoo [21:13:05] I clean the house, if anything is problematic, ping me. [21:13:35] sure thing, thanks! [21:18:39] Hello I have a silly question. I am trying to access deployment servers after a while and ssh deployment.codfw.wmnet is giving me an error: `ssh: Could not resolve hostname bast1002.wikimedia.org: nodename nor servname provided, or not known` which is weird because I updated my SSH fingerprints to point to bast1003 instead of bast1002 so I am not sure where that's coming from. [21:18:45] Pointers please? [21:19:25] somewhere in your .ssh/config it is still referencing bast1002 instead of bast1003 [21:20:24] also note that the hostname is *deployment.eqiad.wmnet*, despite the eqiad in the name, it'll correctly point you to the correct deployment server, regardless of datacenter [21:20:32] which is still deploy1002 btw [21:21:50] Oh, thanks legoktm. Lemme see... [21:23:26] legoktm: That worked! I only changed bast1002->bast1003 in `known_hosts` and forgot all about the config file. Thanks! 
[21:26:12] :D [21:26:22] note that bast1003 should have a different fingerprint in your known_hosts than bast1002 [21:27:26] Right. I copied the latest fingerprint from https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1003.wikimedia.org [21:34:24] SRE, ops-codfw, DC-Ops, netops, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Dear Juniper Networks Customer, Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper... [21:36:30] calling it a day [21:38:25] SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (dancy) @Joe Please try docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2021-07-29-2046... [22:01:49] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Michael DiPietro - https://phabricator.wikimedia.org/T287697 (Legoktm) Open→Resolved [22:09:50] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (Papaul) a:RobH→Papaul [22:09:54] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (Papaul) @jijiki can you try to get the system back again in service? The system has 2 CPU's so it will be difficult to tell which CPU is bad if we do not have any logs telling which CPU is having issues. I will like to ha... 
[22:30:57] (PS1) Andrew Bogott: cloud-vps cloud-init: don't mask the puppet service [puppet] - https://gerrit.wikimedia.org/r/708868 (https://phabricator.wikimedia.org/T287309) [22:31:44] (CR) Andrew Bogott: [C: +2] cloud-vps cloud-init: don't mask the puppet service [puppet] - https://gerrit.wikimedia.org/r/708868 (https://phabricator.wikimedia.org/T287309) (owner: Andrew Bogott) [22:51:13] (PS1) Andrew Bogott: cloud-vps cloud-init: more tweaks to try to get a perfectly clean run [puppet] - https://gerrit.wikimedia.org/r/708869 (https://phabricator.wikimedia.org/T287309) [22:52:16] (CR) Andrew Bogott: [C: +2] cloud-vps cloud-init: more tweaks to try to get a perfectly clean run [puppet] - https://gerrit.wikimedia.org/r/708869 (https://phabricator.wikimedia.org/T287309) (owner: Andrew Bogott) [22:58:20] PROBLEM - mediawiki-installation DSH group on mw2383 is CRITICAL: Host mw2383 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [23:00:05] brennen: How many deployers does it take to do US Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210729T2300). [23:00:05] Tran: A patch you scheduled for US Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:43] hello! 👋 [23:02:45] * thcipriani waves [23:02:52] o/ [23:03:41] * Niharika waves [23:04:25] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [23:04:53] thcipriani: Are you deploying today? 
[23:05:12] Niharika: xSavitar is our deployer today [23:05:13] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jclark-ctr) @Dzahn Thanks! putting the rest in A will speed up racking currently i was waiting on rack C [23:05:32] Ooh a new deployer?! Hello xSavitar! [23:07:03] (CR) D3r1ck01: [C: +2] test: Add electcomm and electionadmin groups [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) (owner: Phuedx) [23:07:15] Niharika: o/ [23:07:51] (Merged) jenkins-bot: test: Add electcomm and electionadmin groups [mediawiki-config] - https://gerrit.wikimedia.org/r/708794 (https://phabricator.wikimedia.org/T277353) (owner: Phuedx) [23:10:32] (CR) Jdlrobson: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/708369 (https://phabricator.wikimedia.org/T281359) (owner: Jdlrobson) [23:11:00] Niharika, Tran: patch is now on mwdebug2002.codfw.wmnet [23:11:18] xSavitar: Thanks. Please give us a moment to test. [23:11:26] Niharika: perfect! [23:11:44] Testing now thank you! [23:11:56] Niharika could you confirm there are electcomm/admin groups now and add me to electcomm? [23:12:06] Tran: Yep - on it. [23:13:38] Tran: Gonna send you a screenshot on Slack [23:18:41] xSavitar: Tran and I tested the patch and looks like our config overrides another config so we need to undo this. [23:18:50] xSavitar: Tran is putting up another patch right now. [23:18:59] Thanks and sorry for the bother! [23:19:07] Niharika: Okay, waiting... [23:23:58] (PS1) STran: Merge new configs with existing testwiki definition [mediawiki-config] - https://gerrit.wikimedia.org/r/708870 [23:24:54] 👆 Could someone look that over? 
I think that should fix the problem [23:25:06] * thcipriani reads [23:25:43] * xSavitar reads [23:26:59] (CR) Thcipriani: [C: +2] Merge new configs with existing testwiki definition [mediawiki-config] - https://gerrit.wikimedia.org/r/708870 (owner: STran) [23:27:48] (Merged) jenkins-bot: Merge new configs with existing testwiki definition [mediawiki-config] - https://gerrit.wikimedia.org/r/708870 (owner: STran) [23:31:26] Niharika, Tran: It's now on mwdebug2002 [23:31:47] testing now [23:32:05] awesome :) [23:38:59] xSavitar: Looks like it's good to deploy now [23:39:34] cool, going live now :) [23:40:19] syncing now... [23:41:04] !log derick@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708870|Merge new configs with existing testwiki definition]] (duration: 00m 57s) [23:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:29] Niharika, Tran, it's live now :) [23:41:36] thank you so much! 🙇‍♂️ [23:41:39] Woohoo! [23:41:44] You're welcome :) [23:41:45] Thanks folks. :D [23:50:49] I have a super late backport [23:51:36] (PS1) Gergő Tisza: Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708848 (https://phabricator.wikimedia.org/T287636) [23:53:51] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.04191 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:57:41] (CR) Gergő Tisza: [C: +2] Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708848 (https://phabricator.wikimedia.org/T287636) (owner: Gergő Tisza) [23:58:44] puppet issue looks like another case of https://phabricator.wikimedia.org/T263578 [23:59:02] cwhite: