[00:00:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:30] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable ArticlePlaceholder for kswiki (T294632) (duration: 00m 55s)
[00:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:33] T294632: Enable article placeholder on ksWiki - https://phabricator.wikimedia.org/T294632
[00:00:42] (Merged) jenkins-bot: Add event stream config for discussiontools [mediawiki-config] - https://gerrit.wikimedia.org/r/731854 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[00:02:17] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Add event stream config for discussiontools (T286076) (duration: 00m 55s)
[00:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:20] T286076: Implement topic subscription instrumentation - https://phabricator.wikimedia.org/T286076
[00:02:26] Kemayo: ^^
[00:03:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:22] legoktm: thanks!
[00:04:31] yw
[00:07:26] !log scandium - installing package upgrades, incl. apache, php7.2- packages
[00:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:21] Thanks Legoktm for deploying
[00:15:37] :)
[00:17:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:50] SRE, serviceops: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Legoktm)
[00:24:01] !log parsoid-canary (scandium, wtp1025, wtp1026, parse2001, parse2002) - upgrading php-fpm and php-* packages
[00:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:59] SRE, serviceops: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) a:Dzahn
[00:31:47] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[00:33:39] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:34:51] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[00:36:53] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[00:37:53] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:37:57] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 242.13 ms
[00:45:00] !log upgraded php-fpm on cloudweb2001-dev - https://labtestwikitech.wikimedia.org/wiki/Main_Page
[00:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:10] (PS2) Gergő Tisza: Use url-downloader proxy for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949)
[01:07:28] legoktm: I'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/735094 tomorrow if there are no objections.
[01:38:15] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 218 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:40:17] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 31 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T0200)
[02:01:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[02:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[02:06:26] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) I create Order Number - 1-213500699180 to ask Equinix to check the PDU and the power to that PDU and let us know if those PDU's belong to us or not.
[02:06:57] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.7 [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/736102
[02:06:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.7 [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/736102 (owner: TrainBranchBot)
[02:07:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:55] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.7 [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/736102 (owner: TrainBranchBot)
[02:30:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[03:40:21] SRE, Performance-Team, serviceops, MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (aaron) Open→Declined
[03:40:26] SRE, Performance-Team, serviceops, Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (aaron)
[03:48:13] SRE, ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (Papaul) Below is the reply from Equinix ` Site engineers have determined both power supplies are online from the original source and the PDUs belong to Wikimedia. `
[04:02:26] SRE, Traffic, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Krinkle)
[04:03:23] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 197 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:05:23] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:07:13] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:01:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[06:05:18] (CR) Marostegui: "Thanks for working on this Daniel - much appreciated. However, I am not sure if we want to page for a Sanitarium master, it is not very cr" [puppet] - https://gerrit.wikimedia.org/r/735689 (https://phabricator.wikimedia.org/T233684) (owner: Dzahn)
[06:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[06:25:27] alerts.wikimedia.org is *stunning*! How have I not seen this before!?
[06:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[06:26:25] (PS1) Marostegui: mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964)
[06:27:01] (PS1) Marostegui: wmnet: Update s1-master CNAME [dns] - https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964)
[06:27:09] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[06:27:43] (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964) (owner: Marostegui)
[06:36:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[06:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[06:42:03] SRE, Traffic, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Joe) If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous reque...
[06:45:01] !log Rename oauth2_access_tokens oauth_accepted_consumer oauth_registered_consumer tables on db1123 T294595
[06:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:04] T294595: Drop OAuth-related tables from foundationwiki - https://phabricator.wikimedia.org/T294595
[06:46:31] (CR) Marostegui: [C: +1] "+1, this needs views recreation on clouddb* hosts." [puppet] - https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481) (owner: Zabe)
[06:51:52] (PS1) Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805)
[06:51:54] (PS1) Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805)
[06:51:56] (PS1) Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805)
[06:51:58] (PS1) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[06:54:12] (CR) Ryan Kemper: [C: +2] relforge: Disable http2 in tlsproxy [puppet] - https://gerrit.wikimedia.org/r/735973 (https://phabricator.wikimedia.org/T275752) (owner: Alexandros Kosiaris)
[06:54:24] (CR) Ryan Kemper: [C: +1] relforge: Disable http2 in tlsproxy [puppet] - https://gerrit.wikimedia.org/r/735973 (https://phabricator.wikimedia.org/T275752) (owner: Alexandros Kosiaris)
[06:59:03] (PS2) Ryan Kemper: elasticsearch: cleanup absented cron resources [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673)
[06:59:17] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper)
[07:00:04] (Abandoned) Ryan Kemper: elasticsearch: remove from systemd unit [puppet] - https://gerrit.wikimedia.org/r/497503 (https://phabricator.wikimedia.org/T218315) (owner: Mathew.onipe)
[07:02:37] (CR) Ryan Kemper: [C: +2] elasticsearch: cleanup absented cron resources [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper)
[07:05:05] (PS1) Marostegui: switchover-tmpl.sh: Add Amir to the calendar template [software] - https://gerrit.wikimedia.org/r/736122
[07:13:23] !log `apt-get purge dkms` (rc state) on stat100[5,8]
[07:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17650 and previous config saved to /var/cache/conftool/dbconfig/20211102-072320-marostegui.json
[07:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:24] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[07:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:31:30] (PS2) Elukey: role::ml_k8s::master: add node-role.kubernetes.io/master labels [puppet] - https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834)
[07:32:36] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32037/console" [puppet] - https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[07:34:31] (PS3) Elukey: role::ml_k8s::master: add node-role.kubernetes.io/master labels [puppet] - https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834)
[07:36:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:03:00] (CR) Thiemo Kreuz (WMDE): [C: +1] Remove unused `global` statement (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/735769 (owner: Awight)
[08:05:12] (CR) Muehlenhoff: [C: +2] Switch Brooke to volunteer NDA status [puppet] - https://gerrit.wikimedia.org/r/736030 (owner: Muehlenhoff)
[08:06:04] (CR) ArielGlenn: "I am sorry to do this, but can we avoid using the word "job" here? The term "dump job" has a specific meaning for the SQL/XML dumps, and t" [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[08:09:13] (PS1) Muehlenhoff: Update email address with new @wikimedia.org addresss [puppet] - https://gerrit.wikimedia.org/r/736179
[08:14:32] (CR) Gehel: [C: +1] "I'm wondering if there isn't a way to get rid of the inline_template..." [puppet] - https://gerrit.wikimedia.org/r/734988 (https://phabricator.wikimedia.org/T294435) (owner: Jbond)
[08:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:20:59] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[08:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:21:24] (CR) Gehel: [C: -1] "We're missing the row/rack definition in hieradata/regex.yaml" [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[08:23:11] (CR) Gehel: "LGTM, minor comment inline" [puppet] - https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[08:25:00] (CR) Gehel: "LGTM, minor comment inline" [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[08:26:13] (CR) Gehel: [C: -1] elasticsearch: hiera for new eqiad nodes (step 1) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[08:29:16] !log installing sdl2 security updates
[08:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:31:03] (PS6) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[08:37:04] (PS1) Muehlenhoff: Remove LDAP access for Toby Negrin [puppet] - https://gerrit.wikimedia.org/r/736182
[08:38:17] (CR) Muehlenhoff: [C: +2] Remove LDAP access for Toby Negrin [puppet] - https://gerrit.wikimedia.org/r/736182 (owner: Muehlenhoff)
[08:40:04] (PS1) David Caro: Added ceph auth dummy keydata [labs/private] - https://gerrit.wikimedia.org/r/736183
[08:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:44:22] (CR) David Caro: [V: +2 C: +2] Added ceph auth dummy keydata [labs/private] - https://gerrit.wikimedia.org/r/736183 (owner: David Caro)
[08:45:16] (PS7) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[08:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:48:17] (CR) Giuseppe Lavagetto: [C: -1] "I think the MW_DEBUG_LOCAL variable should only be added to the mediawiki-specific child image (the -multiversion ones); otherwise LGTM." [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 (owner: Ahmon Dancy)
[08:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:54:24] (CR) Daniel Kinzler: [C: +1] Remove hook set for incident reponse in 2020 [mediawiki-config] - https://gerrit.wikimedia.org/r/736032 (owner: Ppchelko)
[08:54:26] (CR) JMeybohm: [C: +1] role::ml_k8s::master: add node-role.kubernetes.io/master labels (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[08:54:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org
[08:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[08:56:14] (PS8) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[08:59:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org
[08:59:56] (PS9) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[09:03:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17651 and previous config saved to /var/cache/conftool/dbconfig/20211102-090306-marostegui.json
[09:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:10] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[09:06:23] (PS1) David Caro: Moved ceph auth config under profile [labs/private] - https://gerrit.wikimedia.org/r/736191
[09:06:39] (CR) Giuseppe Lavagetto: [V: +2 C: +2] "Thanks for doing this <3" [docker-images/production-images] - https://gerrit.wikimedia.org/r/732739 (owner: Cwhite)
[09:08:08] (PS2) David Caro: Moved ceph auth config under profile [labs/private] - https://gerrit.wikimedia.org/r/736191
[09:08:32] (CR) Giuseppe Lavagetto: "nice catch, but I'd rather remove the 4.5 value which was an experiment we never properly cleaned up from. I'll amend the patch." [puppet] - https://gerrit.wikimedia.org/r/732829 (owner: Cwhite)
[09:08:48] (CR) David Caro: [V: +2 C: +2] Moved ceph auth config under profile [labs/private] - https://gerrit.wikimedia.org/r/736191 (owner: David Caro)
[09:09:24] (CR) Ladsgroup: [C: +1] "I don't have my production access yet to puppet-merge it in puppetmaster, otherwise I would have done it already. Feel free to merge it." [puppet] - https://gerrit.wikimedia.org/r/736179 (owner: Muehlenhoff)
[09:10:54] (CR) Volans: [C: +2] ipmi: allow to hide parts of the command [software/spicerack] - https://gerrit.wikimedia.org/r/735421 (owner: Volans)
[09:11:04] (CR) Ladsgroup: [C: +1] "Can I just merge this? Would it need some packaging later?" [software] - https://gerrit.wikimedia.org/r/736122 (owner: Marostegui)
[09:11:29] (CR) Marostegui: "you can simply merge it, yes. Nothing else is required" [software] - https://gerrit.wikimedia.org/r/736122 (owner: Marostegui)
[09:12:47] (CR) Elukey: [C: +2] role::ml_k8s::master: add node-role.kubernetes.io/master labels [puppet] - https://gerrit.wikimedia.org/r/735577 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[09:13:31] dcaro: o/ merged your changes for labs private
[09:13:58] (CR) Ladsgroup: [C: +2] "fancy" [software] - https://gerrit.wikimedia.org/r/736122 (owner: Marostegui)
[09:14:40] (PS2) Giuseppe Lavagetto: hiera: remove duplicate fpm_workers_multipier key [puppet] - https://gerrit.wikimedia.org/r/732829 (owner: Cwhite)
[09:17:07] (CR) Giuseppe Lavagetto: [C: +2] hiera: remove duplicate fpm_workers_multipier key [puppet] - https://gerrit.wikimedia.org/r/732829 (owner: Cwhite)
[09:17:11] (Merged) jenkins-bot: ipmi: allow to hide parts of the command [software/spicerack] - https://gerrit.wikimedia.org/r/735421 (owner: Volans)
[09:17:14] (Merged) jenkins-bot: switchover-tmpl.sh: Add Amir to the calendar template [software] - https://gerrit.wikimedia.org/r/736122 (owner: Marostegui)
[09:18:43] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:45] elukey: thanks!
[09:20:44] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32042/console" [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:23:03] PROBLEM - Check systemd state on ml-serve-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:33] (CR) Giuseppe Lavagetto: "LGTM, minus the missing use of the image_tag helper. Given I'm building other images today, I'll add that and merge the change." [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: Ahmon Dancy)
[09:24:10] (PS5) Giuseppe Lavagetto: First rev of WMF docker-gc image [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: Ahmon Dancy)
[09:24:29] (CR) Giuseppe Lavagetto: [V: +2 C: +2] First rev of WMF docker-gc image [docker-images/production-images] - https://gerrit.wikimedia.org/r/732722 (https://phabricator.wikimedia.org/T294034) (owner: Ahmon Dancy)
[09:24:59] (CR) Vgutierrez: [V: +1 C: +2] acme_chief: Page on acme-chief unit failure [puppet] - https://gerrit.wikimedia.org/r/735297 (https://phabricator.wikimedia.org/T292619) (owner: Vgutierrez)
[09:26:32] (PS10) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:28:44] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32044/console" [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:30:43] PROBLEM - Check systemd state on ml-serve-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:56] (CR) Btullis: [C: +1] Remove unused bigtop hive and oozie database creation code [puppet] - https://gerrit.wikimedia.org/r/736034 (https://phabricator.wikimedia.org/T284150) (owner: Ottomata)
[09:31:07] PROBLEM - Check systemd state on ml-serve-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:12] (CR) David Caro: [V: +1] "This looks way better 😊" [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:31:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:32:01] Eh?
[09:33:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:34:27] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32045/console" [puppet] - https://gerrit.wikimedia.org/r/736034 (https://phabricator.wikimedia.org/T284150) (owner: Ottomata)
[09:34:49] jouncebot: nowandnext
[09:34:50] No deployments scheduled for the next 1 hour(s) and 25 minute(s)
[09:34:50] In 1 hour(s) and 25 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1100)
[09:34:57] (PS2) Urbanecm: QuickSurveys: Show Growth IP editors survey to 0.1% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/736043 (https://phabricator.wikimedia.org/T294568)
[09:35:22] (CR) Urbanecm: [C: +2] QuickSurveys: Show Growth IP editors survey to 0.1% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/736043 (https://phabricator.wikimedia.org/T294568) (owner: Urbanecm)
[09:35:35] (CR) Elukey: [V: +1 C: +1] Remove unused bigtop hive and oozie database creation code [puppet] - https://gerrit.wikimedia.org/r/736034 (https://phabricator.wikimedia.org/T284150) (owner: Ottomata)
[09:36:13] (Merged) jenkins-bot: QuickSurveys: Show Growth IP editors survey to 0.1% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/736043 (https://phabricator.wikimedia.org/T294568) (owner: Urbanecm)
[09:38:33] (PS11) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:39:15] looking at lists1001 now
[09:39:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b2594347041ae61ef88661bc0d5aa57fc501540d: QuickSurveys: Show Growth IP editors survey to 0.1% of users (T294568) (duration: 00m 57s)
[09:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:41] T294568: deploy quicksurvey for editors on eswiki and arwiki (for Growth IP editors research) - https://phabricator.wikimedia.org/T294568
[09:39:47] * urbanecm done
[09:40:17] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:40:25] !log restarted apache2 on lists1001
[09:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] (CR) jerkins-bot: [V: -1] ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:41:31] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:41:33] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2021-12-27 09:00:28 +0000 (expires in 54 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:41:57] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:42:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:47] (PS12) David Caro: ceph: introduce auth load abstraction [puppet] - https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:46:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:55:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:56:00] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:56:55] legoktm: lists server doesn't like you it seems :/ [10:01:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [10:02:00] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1118 with weight 0 T293964', diff saved to https://phabricator.wikimedia.org/P17652 and previous config saved to /var/cache/conftool/dbconfig/20211102-100348-root.json [10:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:52] T293964: Switchover s1 from db1163 to db1118 - https://phabricator.wikimedia.org/T293964 [10:06:33] <_joe_> ok let me take a look at what's wrong with mailman rn [10:06:47] <_joe_> please no one restart apache [10:07:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:07:51] <_joe_> ok the server seems to be freezing [10:16:36] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:17:24] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2021-12-27 09:00:28 +0000 (expires in 54 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:17:24] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:21] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/734988 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [10:27:25] (03PS1) 10Urbanecm: dewiki: Set wgGEHomepageDefaultVariant to control [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736198 (https://phabricator.wikimedia.org/T294712) [10:27:27] jouncebot: nowandnext [10:27:28] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [10:27:28] In 0 hour(s) and 32 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1100) [10:27:40] (03CR) 10Urbanecm: [C: 03+2] dewiki: Set wgGEHomepageDefaultVariant to control [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736198 (https://phabricator.wikimedia.org/T294712) (owner: 10Urbanecm) [10:28:35] (03Merged) 10jenkins-bot: dewiki: Set wgGEHomepageDefaultVariant to control [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736198 (https://phabricator.wikimedia.org/T294712) (owner: 10Urbanecm) [10:29:03] (03CR) 10Hnowlan: R:cassandra::instance::monitoring: make sure cassandra is loaded (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735012 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [10:30:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: dbff998f40e438556345185408e495f429440a1b: dewiki: Set wgGEHomepageDefaultVariant to control (T294712) (duration: 00m 55s) [10:30:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:14] T294712: Completely disable the linkrecommendation task type in the Growth module in the German Wikipedia - https://phabricator.wikimedia.org/T294712 [10:31:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:30] (03PS2) 10Muehlenhoff: Update email address with new @wikimedia.org addresss [puppet] - 10https://gerrit.wikimedia.org/r/736179 [10:35:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:24] (03PS13) 10David Caro: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:36:26] (03PS1) 10David Caro: ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 [10:36:40] (03CR) 10Jbond: [C: 04-1] "-1 is just for the location of the custom types the rest are optional nits or clarifying questions" [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:36:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [10:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:29] (03CR) 10jerkins-bot: [V: 04-1] ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 (owner: 10David Caro) [10:37:34] heads up, I'm about to reboot cloud network components, some network flapping is to be expected, specially on IRC bots [10:38:59] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32047/console" [puppet] - 10https://gerrit.wikimedia.org/r/736201 (owner: 10David Caro) [10:40:23] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet [10:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:25] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudgw2001-dev.codfw.wmnet [10:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:11] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet [10:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:47] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:45:13] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:46:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1005.eqiad.wmnet [10:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735972 (owner: 10Muehlenhoff) [10:46:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [10:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] p:ceph::osd: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/735660 (owner: 10David Caro) [10:47:24] (03CR) 10Jbond: [C: 03+1] base_packages: install netcat-openbsd by default [puppet] - 10https://gerrit.wikimedia.org/r/735413 (owner: 10Herron) [10:48:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet [10:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:25] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet [10:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:53] (03PS3) 10Jbond: Add another yubikey for my SSH access [puppet] - 10https://gerrit.wikimedia.org/r/736068 (owner: 10Aaron Schulz) [10:50:12] (03CR) 10Jbond: [C: 03+2] "lgtm (matches key on bast4003)" [puppet] - 10https://gerrit.wikimedia.org/r/736068 (owner: 10Aaron Schulz) [10:53:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet [10:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1004.eqiad.wmnet [10:54:05] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:09] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1100). [11:00:05] No Gerrit patches in the queue for this window AFAICS. [11:00:23] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1004.eqiad.wmnet [11:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:28] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:14:52] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:19:59] !log jbond@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=puppetboard [13:21:30] PROBLEM - Host cloudgw1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:06] RECOVERY - Host cloudgw1002 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [13:24:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti-test2001.codfw.wmnet [13:26:57] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10jbond) I have now built puppetboard[12]002 with bullseye pypuppetdb 2.4 and 
puppetboard to 3.1. i will leave the old systems around for ~1week before starting the decommissioning process [13:30:49] (03CR) 10jerkins-bot: [V: 04-1] wmnet: Update s1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964) (owner: 10Marostegui) [13:32:06] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4035.ulsfo.wmnet with OS buster [13:32:13] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster [13:34:48] (03CR) 10Volans: [C: 03+1] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/736210 (owner: 10Vgutierrez) [13:39:21] (03CR) 10Marostegui: [C: 04-2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736114 (https://phabricator.wikimedia.org/T293964) (owner: 10Marostegui) [13:39:35] (03CR) 10Marostegui: [C: 04-2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/736115 (https://phabricator.wikimedia.org/T293964) (owner: 10Marostegui) [13:39:51] (03CR) 10Jgiannelos: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736214 (owner: 10Jgiannelos) [13:45:21] !log pool cp4033.ulsfo.wmnet - T290694 [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:28] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [13:46:23] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:48:53] (03PS4) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 
(https://phabricator.wikimedia.org/T290966) [13:49:35] (03PS2) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitely set [puppet] - 10https://gerrit.wikimedia.org/r/735410 [13:50:47] (03CR) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitely set (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [13:51:29] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736059 (owner: 10Elukey) [13:53:02] (03CR) 10Majavah: Bird: peer with router IP (gateway) if nothing explicitely set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [13:53:48] (03PS3) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitely set [puppet] - 10https://gerrit.wikimedia.org/r/735410 [13:54:03] (03CR) 10Ayounsi: Bird: peer with router IP (gateway) if nothing explicitely set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [13:54:20] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:56:55] (03CR) 10Elukey: [C: 03+2] Revert "role::ml_k8s::master: add node-role.kubernetes.io/master labels" [puppet] - 10https://gerrit.wikimedia.org/r/736059 (owner: 10Elukey) [13:57:33] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [14:02:46] (03CR) 10JMeybohm: [C: 04-1] "Apart from the nit about the commit message and comment, this change LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:03:01] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 92.71% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:04:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [14:04:52] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [14:05:40] !log hashar@deploy1002 Started deploy [integration/docroot@4e4d14a]: Add landing page for code metrics [14:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:49] !log hashar@deploy1002 Finished deploy [integration/docroot@4e4d14a]: Add landing page for code metrics (duration: 00m 09s) [14:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [14:06:54] (03CR) 10Elukey: "Hugh the following diff is weird:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [14:08:29] (03CR) 10Ssingh: [C: 03+1] "We can do IPv6 support in a later commit :)" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [14:13:28] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/736214 (owner: 10Jgiannelos) [14:13:59] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! Merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/735945 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [14:14:30] (03PS4) 10Alexandros Kosiaris: maps: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735945 (https://phabricator.wikimedia.org/T275752) [14:14:33] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] maps: Disable http2 in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/735945 (https://phabricator.wikimedia.org/T275752) (owner: 10Alexandros Kosiaris) [14:15:12] (03PS1) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) [14:15:20] !log debdeploying wikidiff2-1.13.0-1 to A:mw-app-canary and A:mw-api-canary for T285857 [14:15:21] 10SRE, 10Release-Engineering-Team: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10hashar) >>! In T292753#7410513, @Dzahn wrote: > @hashar That being said, this ticket should not have been needed because we recently already did a more global solution: > > T289746 s... 
[14:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:26] T285857: Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 [14:16:58] (03CR) 10jerkins-bot: [V: 04-1] P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [14:18:51] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/736214 (owner: 10Jgiannelos) [14:18:57] RECOVERY - Check systemd state on ml-serve-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:12] (03PS2) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) [14:19:59] !log roll-restart restart-php7.2-fpm on A:mw-app-canary and A:mw-api-canary [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:22] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4035.ulsfo.wmnet with OS buster [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:29] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed: - cp4035 (**WARN**... 
[14:25:44] RECOVERY - Check systemd state on ml-serve-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:11] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [14:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [14:34:37] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppet 7: create puppet 7 environment in WMCS to test code - https://phabricator.wikimedia.org/T294841 (10jbond) 05Open→03In progress p:05Triage→03Medium [14:34:44] !log pool cp4035.ulsfo.wmnet - T290694 [14:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [14:36:08] (03CR) 10JMeybohm: [C: 03+1] "Code changes and diff look good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/736227 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [14:37:20] (03PS1) 10Volans: sre.hosts.reimage: adapt confctl message [cookbooks] - 10https://gerrit.wikimedia.org/r/736239 [14:41:56] 10SRE, 10Community-Tech, 10serviceops, 10wikidiff2, 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) I've deployed wikidiff2-1.13.0-1 to the canaries and will deploy to the rest of production tomorrow. For refer... 
[14:51:11] RECOVERY - Check systemd state on ml-serve-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:09] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10JoKalliauer) [15:07:08] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:06] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:21] (03CR) 10Hnowlan: "LGTM, CI hiccup notwithstanding. I am happy to merge this and supervise it to make sure it doesn't break anything" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [15:12:25] (03CR) 10Hnowlan: [C: 03+1] api-gateway: improve configuration naming for discovery_endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [15:12:27] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [15:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:13] (03PS1) 10Urbanecm: LinkCache: Try invalidating cache before throwing [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736065 (https://phabricator.wikimedia.org/T205349) [15:14:35] (03PS1) 10Urbanecm: LinkCache: Try invalidating cache before throwing [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736246 (https://phabricator.wikimedia.org/T205349) [15:22:45] (03CR) 10Elukey: "Ready to merge for me if you have time! 
I can also try with your supervision if you want :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [15:27:54] (03Abandoned) 10Ppchelko: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734437 (owner: 10PipelineBot) [15:28:06] (03Abandoned) 10Ppchelko: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734436 (owner: 10PipelineBot) [15:28:10] (03Abandoned) 10Ppchelko: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734439 (owner: 10PipelineBot) [15:28:13] (03Abandoned) 10Ppchelko: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734440 (owner: 10PipelineBot) [15:28:17] (03Abandoned) 10Ppchelko: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734441 (owner: 10PipelineBot) [15:29:11] (03Abandoned) 10Ppchelko: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734435 (owner: 10PipelineBot) [15:32:01] (03CR) 10David Caro: [C: 03+2] p:ceph::osd: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/735660 (owner: 10David Caro) [15:32:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4034.ulsfo.wmnet with OS buster [15:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:14] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed: - cp4034 (**WARN**... 
[15:35:09] (03PS2) 10Jelto: services: add support to deploy all services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) [15:38:48] (03PS4) 10Ssingh: Bird: peer with router IP (gateway) if nothing explicitely set [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [15:38:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4036.ulsfo.wmnet with OS buster [15:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:00] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster [15:39:34] (03CR) 10Jelto: services: add support to deploy all services with helm3 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/735979 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [15:39:36] (03PS1) 10David Caro: (dcaro) Use a hash for ceph keydata instead of specific key [labs/private] - 10https://gerrit.wikimedia.org/r/736269 [15:41:31] (03CR) 10Ssingh: "The most recent commit adds IPv6 support (I thought let's do it right now instead of a separate commit). 
It should work because in bird, w" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [15:41:47] !log pool cp4034.ulsfo.wmnet - T290694 [15:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:50] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [15:42:07] (03CR) 10David Caro: [V: 03+2 C: 03+2] (dcaro) Use a hash for ceph keydata instead of specific key [labs/private] - 10https://gerrit.wikimedia.org/r/736269 (owner: 10David Caro) [15:46:22] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/32054/doh1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [15:53:40] (03CR) 10Vgutierrez: [C: 03+1] "Thanks Jbond!" [puppet] - 10https://gerrit.wikimedia.org/r/734973 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [15:55:00] (03PS14) 10David Caro: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:55:02] (03CR) 10David Caro: ceph: introduce auth load abstraction (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:55:04] (03PS3) 10David Caro: ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 [15:55:54] (03CR) 10jerkins-bot: [V: 04-1] ceph::auth: add deploy profile and classes [puppet] - 10https://gerrit.wikimedia.org/r/736201 (owner: 10David Caro) [15:58:51] (03PS8) 10Vgutierrez: cache: Provide a HAproxy upload role [puppet] - 10https://gerrit.wikimedia.org/r/734209 (https://phabricator.wikimedia.org/T290005) [15:58:53] (03PS8) 10Vgutierrez: cache: Expose prometheus metrics for HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/734223 (https://phabricator.wikimedia.org/T290005) [16:00:04] jbond and rzl: My dear minions, 
it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/736239 (owner: 10Volans) [16:01:10] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: adapt confctl message [cookbooks] - 10https://gerrit.wikimedia.org/r/736239 (owner: 10Volans) [16:04:04] (03Merged) 10jenkins-bot: sre.hosts.reimage: adapt confctl message [cookbooks] - 10https://gerrit.wikimedia.org/r/736239 (owner: 10Volans) [16:12:29] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [16:13:31] (03CR) 10Hnowlan: [C: 03+2] api-gateway: improve configuration naming for discovery_endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:17:20] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:43] (03Merged) 10jenkins-bot: api-gateway: improve configuration naming for discovery_endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/736217 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [16:21:51] contint1001 is accessible via ssh. Not sure what's up with the management interface. 
[16:22:28] (03PS1) 10SBassett: SECURITY: Avoid double-escaping html tag contents [extensions/ConfirmEdit] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736247 (https://phabricator.wikimedia.org/T293818) [16:24:14] (03PS1) 10Giuseppe Lavagetto: Add apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) [16:27:30] (03CR) 10Jbond: ceph: introduce auth load abstraction (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:28:29] (03PS15) 10David Caro: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:29:47] (03CR) 10Jbond: [C: 03+2] R:acme_chief::cert: drop deprecated paramters [puppet] - 10https://gerrit.wikimedia.org/r/734973 (https://phabricator.wikimedia.org/T294435) (owner: 10Jbond) [16:30:23] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4036.ulsfo.wmnet with OS buster [16:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed: - cp4036 (**WARN**... 
[16:31:07] (03CR) 10Inductiveload: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736215 (https://phabricator.wikimedia.org/T294824) (owner: 10Inductiveload) [16:32:41] (03PS1) 10Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - 10https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) [16:33:29] (03PS5) 10Ssingh: Bird: peer with router IP (gateway) if nothing explicitely set [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [16:34:40] (03CR) 10Ssingh: Bird: peer with router IP (gateway) if nothing explicitely set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [16:35:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32057/console" [puppet] - 10https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) (owner: 10Giuseppe Lavagetto) [16:36:25] (03CR) 10Jbond: [C: 03+1] "LGTM assuming pcc agrees" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [16:37:05] (03PS1) 10Vgutierrez: prometheus::ops: Add haproxy-tls cluster config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) [16:38:39] (03PS2) 10Vgutierrez: prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) [16:38:43] !log pool cp4036.ulsfo.wmnet - T290694 [16:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:46] T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 [16:38:58] mmandere: nice! [16:40:26] vgutierrez: thank you... 
we're now done reimaging [16:42:17] (03PS1) 10Btullis: Add checks for druid datasources to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) [16:42:20] yeah, and ulsfo now has the same number of cache nodes as the other DCs :) [16:42:31] no more 6 VS 8 nodes per cluster PoPs [16:46:26] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/32058/" [puppet] - 10https://gerrit.wikimedia.org/r/735410 (owner: 10Ayounsi) [16:50:01] (03PS1) 10Btullis: Remove prometheus based Druid checks [puppet] - 10https://gerrit.wikimedia.org/r/736280 (https://phabricator.wikimedia.org/T293399) [16:51:20] jouncebot: nowandnext [16:51:20] For the next 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1600) [16:51:20] In 0 hour(s) and 8 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1700) [16:51:37] (03CR) 10Urbanecm: [C: 03+2] LinkCache: Try invalidating cache before throwing [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736246 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:51:39] (03CR) 10Urbanecm: [C: 03+2] LinkCache: Try invalidating cache before throwing [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736065 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [16:52:27] (03PS1) 10Urbanecm: Revert "Revert "foundationwiki: Enable Translate extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736248 (https://phabricator.wikimedia.org/T205349) [16:52:36] (03PS2) 10Urbanecm: Revert "Revert "foundationwiki: Enable Translate extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736248 (https://phabricator.wikimedia.org/T205349) [16:54:42] (03PS1) 10Hnowlan: api-gateway: remove pathing_map default toy values [deployment-charts] - 10https://gerrit.wikimedia.org/r/736281 [16:55:07] (03CR) 10David Caro: 
[V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32059/console" [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:55:11] (03PS1) 10Jbond: C:puppetmaster::gitclone: add switch to link to environments directory [puppet] - 10https://gerrit.wikimedia.org/r/736282 (https://phabricator.wikimedia.org/T294841) [16:55:42] (03PS2) 10Jbond: C:puppetmaster::gitclone: add switch to link to environments directory [puppet] - 10https://gerrit.wikimedia.org/r/736282 (https://phabricator.wikimedia.org/T294841) [16:56:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32060/console" [puppet] - 10https://gerrit.wikimedia.org/r/736282 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [16:58:49] (03PS2) 10Hnowlan: api-gateway: remove pathing_map default toy values [deployment-charts] - 10https://gerrit.wikimedia.org/r/736281 [17:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1700). 
[17:00:15] (03CR) 10Elukey: [C: 03+1] api-gateway: remove pathing_map default toy values [deployment-charts] - 10https://gerrit.wikimedia.org/r/736281 (owner: 10Hnowlan) [17:01:30] PROBLEM - SSH on wcqs1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:02:38] (03PS16) 10David Caro: ceph: introduce auth load abstraction [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:02:40] (03CR) 10David Caro: ceph: introduce auth load abstraction (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:03:38] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:04:15] (03CR) 10Hnowlan: [C: 03+2] api-gateway: remove pathing_map default toy values [deployment-charts] - 10https://gerrit.wikimedia.org/r/736281 (owner: 10Hnowlan) [17:04:54] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32061/console" [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:06:41] (03CR) 10Jbond: [C: 03+2] C:puppetmaster::gitclone: add switch to link to environments directory [puppet] - 10https://gerrit.wikimedia.org/r/736282 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [17:07:50] (03PS2) 10Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - 10https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) [17:08:44] (03Merged) 10jenkins-bot: api-gateway: remove pathing_map default toy values [deployment-charts] - 10https://gerrit.wikimedia.org/r/736281 (owner: 10Hnowlan) [17:15:12] (03Merged) 10jenkins-bot: LinkCache: Try invalidating cache before throwing [core] 
(wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736246 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [17:18:05] (03Merged) 10jenkins-bot: LinkCache: Try invalidating cache before throwing [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736065 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [17:18:18] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:18:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] (03PS1) 10Jbond: O:puppetmaster: fix environments path [puppet] - 10https://gerrit.wikimedia.org/r/736283 [17:19:44] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "foundationwiki: Enable Translate extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736248 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [17:20:33] (03Merged) 10jenkins-bot: Revert "Revert "foundationwiki: Enable Translate extension"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736248 (https://phabricator.wikimedia.org/T205349) (owner: 10Urbanecm) [17:20:49] (03PS3) 10Muehlenhoff: Update email address with new @wikimedia.org addresss [puppet] - 10https://gerrit.wikimedia.org/r/736179 [17:21:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/735615 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:21:49] (03CR) 10Jbond: [C: 03+2] O:puppetmaster: fix environments path [puppet] - 10https://gerrit.wikimedia.org/r/736283 (owner: 10Jbond) [17:22:06] (03PS1) 10Reedy: Pass ->restrict( Shell::RESTRICT_NONE ) to GPG Shell Command [extensions/SecurePoll] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736249 (https://phabricator.wikimedia.org/T294489) [17:22:07] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:12] (03CR) 10Reedy: [C: 03+2] Pass ->restrict( Shell::RESTRICT_NONE ) to GPG Shell Command [extensions/SecurePoll] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736249 (https://phabricator.wikimedia.org/T294489) (owner: 10Reedy) [17:22:19] Reedy: warning, I'm deploying :) [17:22:30] (just two scap sync-files left, I'll ping when done) [17:22:48] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.6/includes/cache/LinkCache.php: 1e78aeabd682537d8c284559e1356d15c62da810: LinkCache: Try invalidating cache before throwing (T205349) (duration: 00m 56s) [17:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:50] T205349: Enable Translate extension on Governance wiki - https://phabricator.wikimedia.org/T205349 [17:24:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e3227703a662ecda744bb159f39b128ed289c76d: Revert "Revert "foundationwiki: Enable Translate extension"" (T205349) (duration: 00m 55s) [17:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:23] Reedy: I'm done now, over to you. 
[17:24:41] urbanecm: .7 isn't checked out on the host yet [17:24:47] So nothing to do :P [17:24:50] oh, it's a .7 backport [17:24:54] (03CR) 10Muehlenhoff: [C: 03+2] Update email address with new @wikimedia.org addresss [puppet] - 10https://gerrit.wikimedia.org/r/736179 (owner: 10Muehlenhoff) [17:24:57] then nevermind at all :) [17:25:00] heh [17:25:28] (03Merged) 10jenkins-bot: Pass ->restrict( Shell::RESTRICT_NONE ) to GPG Shell Command [extensions/SecurePoll] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736249 (https://phabricator.wikimedia.org/T294489) (owner: 10Reedy) [17:25:33] that was quick [17:25:41] faster than my core one indeed :) [17:28:56] (03CR) 10SBassett: [C: 03+2] SECURITY: Avoid double-escaping html tag contents [extensions/ConfirmEdit] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736247 (https://phabricator.wikimedia.org/T293818) (owner: 10SBassett) [17:29:22] urbanecm: not sure if you've noticed / are aware, but foundationwiki seems to be constantly killing my local session, and puts me in the "you are centrally logged in, please refresh" state every few hours [17:29:41] majavah: I did notice that, but i didn't think about how to fix that [17:29:56] i think it's relevant to the cookies option of CA false [17:30:16] `$wgCentralAuthCookies = false;` [17:30:33] i vaguely recall a list of allowed domains for CA, too...but i can't find it anymore [17:31:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [17:31:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:42] !log installing opencv security updates [17:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:28] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for opencv [puppet] - 10https://gerrit.wikimedia.org/r/736005 (owner: 10Muehlenhoff) [17:34:48] majavah: suggestions will be appreciated [17:35:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:58] urbanecm: $wgCentralAuthCookies is documented as "If true, global session and token cookies will be set alongside the per-wiki session and login tokens when users log in with a global account.", so that seems to likely be it [17:39:02] PROBLEM - SSH on wcqs1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:39:14] majavah: so, let's try that i guess :) [17:39:32] I don't think setting it to true has any disadvantages on a wiki that has local-only accs, right? 
[17:39:48] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:48] (03PS1) 10Urbanecm: foundationwiki: Set wgCentralAuthCookies to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736293 (https://phabricator.wikimedia.org/T205347) [17:41:07] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Set wgCentralAuthCookies to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736293 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [17:41:10] probably correct, but I haven't looked very deep into CA session code [17:41:54] I'm going to just try it, and revert it if something breaks :) [17:42:44] (03Merged) 10jenkins-bot: foundationwiki: Set wgCentralAuthCookies to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736293 (https://phabricator.wikimedia.org/T205347) (owner: 10Urbanecm) [17:44:07] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 339be07a35de1fa3846b845376695d68a9d743fd: foundationwiki: Set wgCentralAuthCookies to true (T205347) (duration: 00m 54s) [17:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:12] T205347: Enable SUL accounts on Governance wiki - https://phabricator.wikimedia.org/T205347 [17:45:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:20] majavah: okay, deployed, let me know what happens :) [17:45:20] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:45:34] (03PS1) 10Bartosz Dziewoński: UsernameCompletion: Filter out users with indefinite sitewide blocks from API results [extensions/DiscussionTools] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736250 (https://phabricator.wikimedia.org/T294783) [17:45:56] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.39 ms [17:47:11] Reedy: urbanecm: could i convince one of you to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/736250 ? i wanted to put it in the backport window, but there isn't one now :( [17:47:24] sure [17:47:28] (03CR) 10Urbanecm: [C: 03+2] UsernameCompletion: Filter out users with indefinite sitewide blocks from API results [extensions/DiscussionTools] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736250 (https://phabricator.wikimedia.org/T294783) (owner: 10Bartosz Dziewoński) [17:47:29] it's a mitigation for some offensive usernames, see https://phabricator.wikimedia.org/T25310#7475475 for context [17:47:46] thanks! [17:47:50] np [17:48:24] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2386.62 ms [17:48:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:54] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:49:30] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:28] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 285.38 ms [17:54:41] Ty MatmaRex [17:55:00] (03Merged) 10jenkins-bot: SECURITY: Avoid double-escaping html tag contents [extensions/ConfirmEdit] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/736247 (https://phabricator.wikimedia.org/T293818) (owner: 10SBassett) [17:55:04] (03Merged) 10jenkins-bot: UsernameCompletion: Filter out users with indefinite sitewide blocks from API results [extensions/DiscussionTools] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/736250 (https://phabricator.wikimedia.org/T294783) (owner: 10Bartosz Dziewoński) [17:55:34] MatmaRex: mwdebug1001 has the patch, please test! [17:56:32] looking [17:57:12] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-db1001.eqiad.wmnet [17:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:57:21] urbanecm: yeah, seems good [17:57:32] MatmaRex: syncing then [17:58:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:42] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) 05Open→03In progress irc update chatted with @elukey and these do indeed need to shift to analtyics vlan. [17:59:32] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.6/extensions/DiscussionTools/modules/dt-ve/dt.ui.UsernameCompletionAction.js: 494af124b95e2eabff94fde79aed6b6f6f81feab: UsernameCompletion: Filter out users with indefinite sitewide blocks from API results (T294783) (duration: 00m 55s) [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:34] T294783: Exclude permanently blocked users from DiscussionTools' username suggestion list - https://phabricator.wikimedia.org/T294783 [17:59:40] MatmaRex: should be live [17:59:42] anything else? [17:59:57] thanks [18:00:04] that's all [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1800) [18:00:20] and i see we're at time now anyway :) [18:00:23] np, glad i could help [18:01:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:13] * thcipriani take walk before disembarking on the train [18:02:25] urbanecm: can you read the linked task and see my comments about OS [18:02:52] RhinosF1: T294783 has no comment from RhinosF1 :) [18:03:33] urbanecm: https://phabricator.wikimedia.org/T294713#7475336 [18:03:48] I meant the private half [18:04:33] MatmaRex: if it does cleanly, can I strongly suggest that patch is backported to rel branches too [18:04:42] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:05:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:06] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:06:16] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [18:07:02] RhinosF1: i'd rather not backport it, i think it's the wrong solution, the proper solution is to hide those usernames while blocking [18:07:11] (and in the WMF context, also to fix https://phabricator.wikimedia.org/T25310) [18:07:50] also, we don't really maintain release branches for DiscussionTools [18:08:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:08:46] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:09:42] MatmaRex: compat policy implies otherwise [18:10:03] And indef blocked users still can't respond [18:10:53] Steward + OS aren't going to always catch every case too [18:11:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:11:18] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:11:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:38] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:10] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:13:19] RhinosF1: compat policy only says "Snapshots releases along with MediaWiki", it doesn't say anything about maintenance. indef blocked users still can't respond, but they can still receive notifications just fine. and Steward + OS actually should catch every case (or we should report cases to them), otherwise those usernames still show up in every other interface, e.g. Special:Contributions. [18:14:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-db1001.eqiad.wmnet [18:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:04] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `an-db1001.eqiad.wmnet` - an-db1... [18:16:37] MatmaRex: security + bug fixes are still expected to be backported where appropriate. It's simply impossible for every case to be caught. There's hundreds. I know there's other interfaces but it's less expected on DT than Contributions. 
[18:18:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:18:31] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] it's 100% possible. even the case that started this task has already been hidden (but bug https://phabricator.wikimedia.org/T25310 causes it to reappear). i think the DiscussionTools patch should be reverted after fixing T25310, and not backported. [18:19:03] T25310: Global suppression does not work properly when the target has already been locally blocked - https://phabricator.wikimedia.org/T25310 [18:19:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:19:48] PROBLEM - SSH on wcqs1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:20:26] MatmaRex: T25310 didn't cause this case [18:20:49] how so? 
i checked and the usernames are hidden globally [18:20:56] Not originally it wasn't [18:21:10] Until Nick raised it, it wasn't suppressed anywhere [18:22:40] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:57] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:23:05] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:51] RhinosF1: i think you're trying to tell me that we're somehow able to block them, but not hide them, and that does not make sense to me [18:24:24] in my opinion this was just a mistake by the admin(s) who applied the blocks [18:24:38] MatmaRex: yes it's a mistake to have not ticked the hide box [18:24:49] But you can easily find countless other examples [18:25:04] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:25:05] i will request them to be oversighted if i see them [18:25:10] And on other sites like Miraheze Oversight can take like a week because stewards slower [18:25:58] the "hide user" checkbox is only available for oversighters, fwiw [18:26:01] sounds like they need more stewards [18:26:09] or more abuse filters, or something [18:26:57] MatmaRex: yes but finding more volunteers is hard [18:27:54] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:21] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:32:21] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1001.eqiad.wmnet with OS buster [18:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:29] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with... 
[18:33:45] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-db1002.eqiad.wmnet [18:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:34] PROBLEM - SSH on wcqs1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:38:23] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:01] !log starting to stage train for 1.38.0-wmf.7 (T293948) [18:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:05] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [18:50:48] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:25] (03PS1) 10Jbond: O:puppetmaster:standalone [puppet] - 10https://gerrit.wikimedia.org/r/736295 [18:57:56] (03PS2) 10Jbond: O:puppetmaster:standalone [puppet] - 10https://gerrit.wikimedia.org/r/736295 [19:00:02] (03CR) 10Jbond: [C: 03+2] O:puppetmaster:standalone [puppet] - 10https://gerrit.wikimedia.org/r/736295 (owner: 10Jbond) [19:00:05] thcipriani: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T1900). [19:00:26] oh jouncebot, I wish I knew. 
[19:01:18] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:09] (03PS1) 10Thcipriani: testwikis wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736296 [19:02:11] (03CR) 10Thcipriani: [C: 03+2] testwikis wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736296 (owner: 10Thcipriani) [19:02:56] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736296 (owner: 10Thcipriani) [19:02:58] !log thcipriani@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.7 refs T293948 [19:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:01] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:05:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:38] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-db1001.eqiad.wmnet with OS buster [19:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:48] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS b... 
[19:09:02] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms [19:09:24] (03PS3) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) [19:13:37] (03CR) 10Jbond: P:rsyslog: ship puppetmaster logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736233 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [19:22:26] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:30] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:51] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-db1002.eqiad.wmnet [19:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:59] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `an-db1002.eqiad.wmnet` - an-db1... [19:37:18] 10SRE, 10SRE-Access-Requests: Create "maryana@wikipedia.org" email handle for annual fundraising email test (replying to donate@) - https://phabricator.wikimedia.org/T294758 (10spatton) Your turnaround time is amazing, thank you, Sukhbir! 
[19:38:32] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:20] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:24] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:47:57] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:16] !log imported ganeti 2.16.0-1~bpo9+1+wmf1 to component/ganeti216 for stretch-wikimedia (with additional cherrypicked patches for compat with KVM 3.1) T284811 [19:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:20] T284811: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 [19:50:38] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:11] !log thcipriani@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.7 refs T293948 (duration: 50m 13s) [19:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:15] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [19:57:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:58:41] !log thcipriani@deploy1002 Pruned MediaWiki: 1.38.0-wmf.5 (duration: 04m 08s) [19:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:38] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:19] !log 1.38.0-wmf.7 on testwikis, leaving it there for today for US holiday (T293948) [20:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:22] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [20:02:28] PROBLEM - SSH on wcqs1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:03:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:21:56] 10SRE, 10Wikidata-Query-Service: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10Gehel) [20:28:27] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1001.eqiad.wmnet with OS buster [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:35] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet with... [20:44:38] RECOVERY - SSH on wcqs1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:52:40] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1001.eqiad.wmnet with OS buster [20:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:48] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet with OS b... 
[20:55:49] 10SRE, 10Analytics-Radar, 10Traffic, 10WMF-General-or-Unknown, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) [21:03:59] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1002.eqiad.wmnet with OS buster [21:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:07] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with... [21:05:53] 10SRE, 10Analytics-Radar, 10Traffic, 10WMF-General-or-Unknown, and 2 others: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (10Krinkle) The URLs that use w/extensions, w/skins, and w/resources are also used... [21:09:29] urbanecm: which WMD version do you have (regarding T294335) and which options enabled? I can't seem to reproduce it on 2.4.4 using Firefox and with "on", "off", and/or with "XHGui". 
[21:09:29] T294335: XWikimediaDebug Chrome extension: Error handling response: TypeError: Cannot read properties of undefined (reading 'state') - https://phabricator.wikimedia.org/T294335 [21:09:49] (and the footer link does show for me, using current/legacy Vector skin) [21:32:31] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-db1002.eqiad.wmnet with OS buster [21:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:38] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS b... [21:50:37] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-db1002.eqiad.wmnet with OS buster [21:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:45] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-db1002.eqiad.wmnet with... 
[21:56:44] (03PS1) 10Jbond: O:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [21:57:35] (03PS2) 10Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [21:58:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32064/console" [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [22:06:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [22:14:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-db1002.eqiad.wmnet with OS buster [22:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:18] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-db1002.eqiad.wmnet with OS b... [22:14:25] 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10RobH) [22:15:42] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) 05In progress→03Resolved >>! In T289632#7474421, @elukey wrote: > Hi everybody, I noticed that the two hosts are... 
[22:15:57] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10RobH) a:05Cmjohnson→03RobH [22:16:31] 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10RobH) [22:17:17] (03PS3) 10Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [22:18:00] (03CR) 10jerkins-bot: [V: 04-1] C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [22:19:21] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [22:27:31] (03PS4) 10Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [22:28:18] (03CR) 10jerkins-bot: [V: 04-1] C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [22:28:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32065/console" [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [22:34:34] (03PS5) 10Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [23:00:04] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211102T2300). [23:00:04] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:06] I'll deploy the patch soon [23:01:19] I realized it needs a small change [23:01:41] (03PS6) 10Jbond: C:puppetmaster::gitclone: add support for r10k environments [puppet] - 10https://gerrit.wikimedia.org/r/736305 [23:02:44] out of curiosity, what is "Soft-depends on" compared to a normal "Depends on"? [23:02:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32066/console" [puppet] - 10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [23:10:19] perryprog, a changeset with Depends-On means that if the other changeset is not merged first, things will break [23:10:35] Right, so what about a soft-depends on? [23:11:21] a soft dependency means that things won't break entirely, but for full functionality both changesets need to be merged [23:11:35] ahh [23:11:37] cheers [23:12:49] in this case, setting the config variable in CommonSettings.php won't do anything unless the other changeset is merged, because the config variable didn't previously exist [23:12:54] (03PS1) 10Ladsgroup: admin: Add my new ssh key [puppet] - 10https://gerrit.wikimedia.org/r/736311 [23:20:48] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1280.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:20:55] (03PS3) 10Gergő Tisza: Use url-downloader proxy for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949) [23:20:57] (03PS1) 10Gergő Tisza: Use page id for GrowthExperiments image recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736314 (https://phabricator.wikimedia.org/T290949) [23:22:16] ended up adding the change as another patch [23:22:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32067/console" [puppet] - 
10https://gerrit.wikimedia.org/r/736305 (owner: 10Jbond) [23:26:01] perryprog: if there is a Depends-On footer, the software handling automated tests and pushing to git will refuse to merge the patch unless the dependency is merged; and will include the change that's being dependent upon when running tests. Soft-depends is for humans only, just to give extra context. [23:26:28] Ooo, I didn't know there was CI integration with that keyword [23:26:36] (03CR) 10Gergő Tisza: [C: 03+2] Use url-downloader proxy for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:27:20] (03Merged) 10jenkins-bot: Use url-downloader proxy for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735094 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:27:26] So if the Depends-on patch isn't merged, will the tests run be based on both Depends-on and the patch being tested? [23:29:18] that's the idea, yes. [23:29:37] there are some docs at https://www.mediawiki.org/wiki/Gerrit/Cross-repo_dependencies [23:29:51] Ah! Thank you. [23:30:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:14] (03CR) 10Gergő Tisza: [C: 03+2] Use page id for GrowthExperiments image recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736314 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:34:10] (03Merged) 10jenkins-bot: Use page id for GrowthExperiments image recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736314 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:34:18] !log tgr@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:735094|Use url-downloader proxy for GrowthExperiments (T290949)]] (duration: 01m 14s) [23:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:21] T290949: Add an image: Enable on test wikis - https://phabricator.wikimedia.org/T290949 [23:34:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:04] (03PS1) 10Gergő Tisza: Use title for GrowthExperiments image recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736317 (https://phabricator.wikimedia.org/T290949) [23:40:52] (03CR) 10Gergő Tisza: [C: 03+2] Use title for GrowthExperiments image recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736317 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:41:42] (03Merged) 10jenkins-bot: Use title for GrowthExperiments image recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/736317 (https://phabricator.wikimedia.org/T290949) (owner: 10Gergő Tisza) [23:44:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:50] !log tgr@deploy1002 Synchronized wmf-config: Config: Use page id for GrowthExperiments image recommendations, except for testwiki ([[gerrit:736314|736314]] [[gerrit:736317|736317]] (T290949 T292154) (duration: 01m 03s) [23:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:55] T290949: Add an image: Enable on test wikis - https://phabricator.wikimedia.org/T290949 [23:45:55] T292154: Image Suggestions API: Support querying by title - https://phabricator.wikimedia.org/T292154 [23:46:48] !log UTC late deploys done [23:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log