[00:00:05] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T0000).
[00:06:37] (PS7) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673)
[00:08:14] (CR) jerkins-bot: [V: -1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[00:11:57] (PS1) Arlolra: Disable legacy media dom on a few more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097)
[00:16:34] (PS8) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673)
[00:18:51] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:57] PROBLEM - very high load average likely xfs on ms-be2035 is CRITICAL: CRITICAL - load average: 110.69, 103.17, 100.40 https://wikitech.wikimedia.org/wiki/Swift
[00:46:57] ACKNOWLEDGEMENT - very high load average likely xfs on ms-be2035 is CRITICAL: CRITICAL - load average: 119.53, 112.57, 108.85 daniel_zahn rsync is running, probably after https://phabricator.wikimedia.org/T291896 https://wikitech.wikimedia.org/wiki/Swift
[01:55:43] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:56:39] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-260.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:35] RECOVERY - Check systemd state on ms-be2035 is
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:07] (PS2) RLazarus: Minimal version of the image catalog [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130)
[04:12:15] RECOVERY - very high load average likely xfs on ms-be2035 is OK: OK - load average: 62.65, 68.63, 79.68 https://wikitech.wikimedia.org/wiki/Swift
[04:27:05] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-321.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:07] (CR) Giuseppe Lavagetto: [C: +2] admin: add gehel to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/724592 (https://phabricator.wikimedia.org/T292040) (owner: Giuseppe Lavagetto)
[04:44:40] (PS2) Giuseppe Lavagetto: admin: add gehel to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/724592 (https://phabricator.wikimedia.org/T292040)
[04:44:51] (CR) Giuseppe Lavagetto: [V: +2 C: +2] admin: add gehel to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/724592 (https://phabricator.wikimedia.org/T292040) (owner: Giuseppe Lavagetto)
[04:47:08] SRE, SRE-Access-Requests, Product-Analytics, Patch-For-Review: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (Joe) Open→Resolved @Gehel in about 30 minutes you should be able to access superset. If that doesn't happen, please reopen the task :)
[04:52:47] SRE, SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (Joe) p:Triage→Medium As usual we need approval from both analytics (@Ottomata or @odimitrijevic) and direct management (@DannyH) before proceeding. @ifried you should also pleas...
[04:58:59] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:05:51] PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:05:59] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:07] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:30:05] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:57] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:32:34] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (Joe) p:Triage→Medium a:Joe Hi @SWakiyama and welcome! From what I can see your wikitech account is registered with a different email provider, and indeed y...
[05:42:29] SRE, LDAP-Access-Requests: Grant Access to wmf for ksiebert - https://phabricator.wikimedia.org/T292053 (Joe) >>! In T292053#7388970, @Dzahn wrote: > Hi and welcome @KSiebert, > > you indicate you already do have shell access but while I can see your user KSiebert in LDAP I cannot see it in the shell ac...
[05:45:03] !log Deploy schema change on s2 codfw (lag will show up) T270620
[05:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:10] T270620: Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620
[05:45:35] !log Deploy schema change on s4 codfw (lag will show up) T270620
[05:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:41] !log Deploy schema change on s5 codfw (lag will show up) T270620
[05:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:35] (PS1) Giuseppe Lavagetto: admin: add ksiebert as ldap user [puppet] - https://gerrit.wikimedia.org/r/724865 (https://phabricator.wikimedia.org/T292053)
[05:50:50] (CR) Giuseppe Lavagetto: [C: +2] admin: add ksiebert as ldap user [puppet] - https://gerrit.wikimedia.org/r/724865 (https://phabricator.wikimedia.org/T292053) (owner: Giuseppe Lavagetto)
[05:52:09] !log Deploy schema change on s7 codfw (lag will show up) T270620
[05:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:15] T270620: Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620
[05:53:01] !log Deploy schema change on s3 codfw (lag will show up) T270620
[05:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:42] SRE, LDAP-Access-Requests, WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (Joe) I would expect this to be automatic, yes. I think the proposal makes sense. We will need to edit our instructi...
[05:56:35] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for ksiebert - https://phabricator.wikimedia.org/T292053 (Joe) Open→Resolved a:Joe Hi @KSiebert and welcome!
I've added you to the "wmf" group in LDAP , which should give you access to turnilo, but for access to superset you...
[06:01:31] !log Deploy schema change on s1 codfw (lag will show up) T270620
[06:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:38] T270620: Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620
[06:04:04] SRE, Wikimedia-Mailing-lists: Create wikimediacz-talk@lists.wikimedia.org - https://phabricator.wikimedia.org/T292134 (Joe) p:Triage→Medium a:Joe
[06:11:01] SRE, Wikimedia-Mailing-lists: Create wikimediacz-talk@lists.wikimedia.org - https://phabricator.wikimedia.org/T292134 (Joe) Open→Resolved Hi @Urbanecm the mailing list has been created. I'm resolving the task, let me know if something isn't right.
[06:20:59] (PS1) Giuseppe Lavagetto: admin: add erayfield to ldap users [puppet] - https://gerrit.wikimedia.org/r/724867 (https://phabricator.wikimedia.org/T291126)
[06:23:04] (CR) Giuseppe Lavagetto: [C: +2] admin: add erayfield to ldap users [puppet] - https://gerrit.wikimedia.org/r/724867 (https://phabricator.wikimedia.org/T291126) (owner: Giuseppe Lavagetto)
[06:26:28] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (Joe) Open→Resolved a:Joe Ok I see the source of confusion - we call the "wikimedia developer account" the account on wikitech, usually. I know all these conven...
[06:44:32] (PS1) Legoktm: Throw more resources at shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/724869 (https://phabricator.wikimedia.org/T289227)
[06:48:31] !log Deploy schema change on s8 codfw (lag will show up) T270620
[06:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:40] T270620: Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620
[06:49:30] (PS2) Legoktm: Throw more resources at shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/724869 (https://phabricator.wikimedia.org/T289227)
[06:50:26] (CR) Giuseppe Lavagetto: [C: +1] Throw more resources at shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/724869 (https://phabricator.wikimedia.org/T289227) (owner: Legoktm)
[06:50:50] (CR) Legoktm: [C: +2] Throw more resources at shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/724869 (https://phabricator.wikimedia.org/T289227) (owner: Legoktm)
[06:55:01] (Merged) jenkins-bot: Throw more resources at shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/724869 (https://phabricator.wikimedia.org/T289227) (owner: Legoktm)
[06:56:36] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[06:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:04] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[06:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:51] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:03:43] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[07:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:34] (PS1) Marostegui: db2081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/724932
[07:16:21] (CR) Marostegui: [C: +2] db2081: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/724932 (owner: Marostegui)
[07:20:09] (PS17) ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469)
[07:20:42] (PS1) Elukey: network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918)
[07:24:26] SRE, Wikimedia-Mailing-lists: Create wikimediacz-talk@lists.wikimedia.org - https://phabricator.wikimedia.org/T292134 (Urbanecm) Thanks, that was quick! I see the list in mailman, so it should work, hopefully :).
[07:27:44] (CR) jerkins-bot: [V: -1] Added spicerack.kafka with offset transfer function [software/spicerack] - https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: ZPapierski)
[07:30:18] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 14 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:31:04] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
[07:31:08] (CR) Muehlenhoff: [C: +2] Prefer mx2001 for mail in ulsfo/eqsin [puppet] - https://gerrit.wikimedia.org/r/724338 (owner: Muehlenhoff)
[07:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:11] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 06s)
[07:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:18] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[07:36:35] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:38:35] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:41:24] looking ^
[07:41:42] this alert is very new and probably needs some tuning
[07:43:01] (CR) Gehel: "minor comments inline."
[software/spicerack] - https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: ZPapierski)
[07:45:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org
[07:46:43] (CR) Filippo Giunchedi: [C: +1] "LGTM (modulo what Cole said), please consider including tests too" [alerts] - https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: Herron)
[07:49:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2003.codfw.wmnet
[07:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 Marostegui https://phabricator.wikimedia.org/T291961 https://wikitech.wikimedia.org/wiki/HAProxy
[07:57:14] SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (SWakiyama) Hi Joe, Let's please modify my email address to swakiyama@wikimedia.org. Thanks, Shari
[07:57:47] dcausse: neat re: alert!
[08:00:06] godog: yes, nice to see one for real, even if it's a false positive :P (I need to tune this one a bit)
[08:01:16] heheh indeed
[08:01:19] nice to see
[08:05:35] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31365/console" [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:07:35] SRE, ops-eqiad, DC-Ops, Data-Services, cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (Marostegui) dbprox1019 was alerting on haproxy failover I have ack'ed the alert ` [09:54:18] <+icinga...
[08:10:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2003.codfw.wmnet
[08:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:14] (PS2) Elukey: network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918)
[08:12:08] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31366/console" [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:13:31] (PS1) Muehlenhoff: Add DHCP settings for testvm2003 [puppet] - https://gerrit.wikimedia.org/r/724937
[08:14:29] (PS1) Muehlenhoff: Remove Stretch DHCP setting for serpens/seaborgium [puppet] - https://gerrit.wikimedia.org/r/724938
[08:15:00] (CR) Elukey: [V: +1] network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:15:53] (CR) Muehlenhoff: [C: +2] Add DHCP settings for testvm2003 [puppet] -
https://gerrit.wikimedia.org/r/724937 (owner: Muehlenhoff)
[08:16:20] (PS2) Muehlenhoff: Remove Stretch DHCP setting for serpens/seaborgium [puppet] - https://gerrit.wikimedia.org/r/724938
[08:17:19] (CR) Muehlenhoff: [C: +2] Remove Stretch DHCP setting for serpens/seaborgium [puppet] - https://gerrit.wikimedia.org/r/724938 (owner: Muehlenhoff)
[08:21:34] !log installing nettle security updates on stretch
[08:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:26] jynus: ah, that's a nice idea. it might be useful indeed
[08:22:39] the memory alert from dbs to ganeti that is
[08:24:04] SRE, Infrastructure-Foundations, netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (aborrero) The `cloud-gw-transport-eqiad` range is actually `185.15.56.236/30` not 185.15.56.**238**/30. It is registered on netbox: https://netbox.wikimedia.org/ipam/prefixe...
[08:25:11] (CR) Alexandros Kosiaris: [C: -1] "Minor blocker, gimme 10m" [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:45:55] (PS1) Alexandros Kosiaris: network: Split kubepods networks per cluster [puppet] - https://gerrit.wikimedia.org/r/724940
[08:45:57] (PS1) Alexandros Kosiaris: Replace $KUBEPODS_NETWORKS ferm macro with cluster aware ones [puppet] - https://gerrit.wikimedia.org/r/724941
[08:46:48] (CR) jerkins-bot: [V: -1] Replace $KUBEPODS_NETWORKS ferm macro with cluster aware ones [puppet] - https://gerrit.wikimedia.org/r/724941 (owner: Alexandros Kosiaris)
[08:47:31] (CR) David Caro: [C: +1] P:base: make notifications_enabled a boolean [puppet] - https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[08:47:58] (PS18) Jbond: P:base: make notifications_enabled a boolean [puppet] - https://gerrit.wikimedia.org/r/723509
(https://phabricator.wikimedia.org/T289661)
[08:48:33] (CR) Jbond: [C: +2] P:base: make notifications_enabled a boolean [puppet] - https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[08:52:16] (PS5) Jelto: profile::gitlab start using gitlab module [puppet] - https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076)
[08:53:16] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31367/console" [puppet] - https://gerrit.wikimedia.org/r/724940 (owner: Alexandros Kosiaris)
[08:53:25] (PS2) Alexandros Kosiaris: Replace $KUBEPODS_NETWORKS ferm macro with cluster aware ones [puppet] - https://gerrit.wikimedia.org/r/724941
[08:56:03] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31368/console" [puppet] - https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: Jelto)
[08:56:33] (PS2) Alexandros Kosiaris: network: Split kubepods networks per cluster [puppet] - https://gerrit.wikimedia.org/r/724940
[08:56:35] (PS3) Alexandros Kosiaris: Replace $KUBEPODS_NETWORKS ferm macro with cluster aware ones [puppet] - https://gerrit.wikimedia.org/r/724941
[08:57:45] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31369/console" [puppet] - https://gerrit.wikimedia.org/r/724940 (owner: Alexandros Kosiaris)
[08:59:38] (CR) Alexandros Kosiaris: [V: +1 C: +2] "PCC is happy, merging!"
[puppet] - https://gerrit.wikimedia.org/r/724940 (owner: Alexandros Kosiaris)
[09:02:58] (PS11) Jbond: P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544
[09:03:25] (CR) jerkins-bot: [V: -1] P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544 (owner: Jbond)
[09:03:27] (PS12) Jbond: P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544
[09:03:56] (CR) jerkins-bot: [V: -1] P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544 (owner: Jbond)
[09:05:46] (CR) David Caro: P:base: make notifications_enabled a boolean (1 comment) [puppet] - https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:08:20] SRE, Infrastructure-Foundations, netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (cmooney) @aborrero Yeah the range is there alright, I just mean the second IP in the linknet, 185.15.56.238, is not associated with anything. Netbox always uses the CIDR ma...
[09:08:39] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:31] (PS1) Effie Mouzeli: Revert "network: Split kubepods networks per cluster" [puppet] - https://gerrit.wikimedia.org/r/724804
[09:10:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:01] (Abandoned) Effie Mouzeli: Revert "network: Split kubepods networks per cluster" [puppet] - https://gerrit.wikimedia.org/r/724804 (owner: Effie Mouzeli)
[09:11:17] (CR) Jelto: [V: +1] profile::gitlab start using gitlab module (7 comments) [puppet] - https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: Jelto)
[09:11:21] PROBLEM - Check systemd state on kubernetes1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:21] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:58] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31371/console" [puppet] - https://gerrit.wikimedia.org/r/724941 (owner: Alexandros Kosiaris)
[09:15:07] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:15:57] SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in
case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) With orchestrator we can sort of do that (note clouddb1...
[09:16:31] (PS3) Elukey: network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918)
[09:17:47] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:22] akosiaris: jumbo1003's ferm says no such variable: $EQIAD_PRIVATE_PRIVATE1_KUBESTAGEPODS_EQIAD
[09:18:59] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:20] ah yes the follow up is needed
[09:20:01] (PS1) Jcrespo: mariadb: Replace deprecated wmflib require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/724943
[09:20:40] yup
[09:20:42] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31372/console" [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[09:20:50] I have my patch ready as well :)
[09:20:59] PROBLEM - Check systemd state on kafka-jumbo1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:21:14] RECOVERY - Check systemd state on kubernetes1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:27] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:30] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31373/console" [puppet] - https://gerrit.wikimedia.org/r/724941 (owner: Alexandros Kosiaris)
[09:23:19] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:42] (CR) Elukey: [C: +1] Replace $KUBEPODS_NETWORKS ferm macro with cluster aware ones [puppet] - https://gerrit.wikimedia.org/r/724941 (owner: Alexandros Kosiaris)
[09:25:27] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01082 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:26:07] (CR) Alexandros Kosiaris: [V: +1 C: +2] "Merging to fix maps and kafka jumbo." [puppet] - https://gerrit.wikimedia.org/r/724941 (owner: Alexandros Kosiaris)
[09:26:21] PROBLEM - Check systemd state on kafka-jumbo1009 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:49] (PS1) Filippo Giunchedi: Validate deploy-tag [alerts] - https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662)
[09:29:55] (CR) Alexandros Kosiaris: [C: +1] network: add k8s pod+svc ipv{4,6} subnets for the ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[09:31:35] (PS13) Jbond: P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544
[09:32:06] (CR) jerkins-bot: [V: -1] P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544 (owner: Jbond)
[09:32:28] akosiaris: thanks :)
[09:32:36] (CR) Elukey: [V: +1 C: +2] network: add k8s pod+svc ipv{4,6}
subnets for the ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/724933 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[09:33:23] (PS1) Giuseppe Lavagetto: admin: add Shari Wakiyama [puppet] - https://gerrit.wikimedia.org/r/724945 (https://phabricator.wikimedia.org/T292069)
[09:33:25] (PS1) Giuseppe Lavagetto: admin: add Shari to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/724946 (https://phabricator.wikimedia.org/T292069)
[09:34:15] RECOVERY - Check systemd state on kafka-jumbo1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:15] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:47] RECOVERY - Check systemd state on kafka-jumbo1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:19] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01025 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:35:48] (CR) Hnowlan: [C: +2] secrets: Clean up restbase stub certificates [labs/private] - https://gerrit.wikimedia.org/r/724717 (owner: Hnowlan)
[09:40:59] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:02] (PS1) Filippo Giunchedi: o11y: restore thanos sidecar upload failure [alerts] - https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662)
[09:43:14] RECOVERY -
Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005125 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:43:29] RECOVERY - Check systemd state on kafka-jumbo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:39] RECOVERY - Check systemd state on kafka-jumbo1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:50] (PS1) Filippo Giunchedi: Revert "prometheus: add ThanosSidecarUploadFailure to prometheus/ops" [puppet] - https://gerrit.wikimedia.org/r/724949 (https://phabricator.wikimedia.org/T289662)
[09:45:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:45:26] (PS14) Jbond: P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544
[09:45:59] (PS1) Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/724950
[09:46:17] (CR) jerkins-bot: [V: -1] P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544 (owner: Jbond)
[09:46:47] (CR) jerkins-bot: [V: -1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/724950 (owner: Jcrespo)
[09:46:55] (PS15) Jbond: P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544
[09:47:37] (CR) jerkins-bot: [V: -1] P:base: move base::monitoring::host to its own profile [puppet] - https://gerrit.wikimedia.org/r/723544 (owner: Jbond)
[09:48:05] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:59] RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31374/console" [puppet] - 10https://gerrit.wikimedia.org/r/724945 (https://phabricator.wikimedia.org/T292069) (owner: 10Giuseppe Lavagetto) [09:51:37] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] admin: add Shari Wakiyama [puppet] - 10https://gerrit.wikimedia.org/r/724945 (https://phabricator.wikimedia.org/T292069) (owner: 10Giuseppe Lavagetto) [09:52:59] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:53:08] (03PS6) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [09:53:10] (03PS16) 10Jbond: P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 [09:53:15] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [09:54:17] (03PS7) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [09:54:24] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [09:57:19] PROBLEM - Check systemd state on kubernetes1016 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:55] 
(LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [09:58:43] I'll check what's up [09:59:03] (03PS2) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1000). [10:00:06] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) Hi Shari, I've modified the email address and added your user to the "wmf" group. Can you confirm you read https://wikitech.wikimedia.org/... [10:00:23] mmhh looks like logstash has trouble consuming from kafka [10:00:55] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) [10:02:22] (03CR) 10Jcrespo: "This is a low-prio, deprecation-related change, done because implementation of memory checks elsewhere, but should be a noop: https://pupp" [puppet] - 10https://gerrit.wikimedia.org/r/724943 (owner: 10Jcrespo) [10:02:35] no I stand corrected, it is the writes to es [10:02:38] (03PS3) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:02:47] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:02:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid 
[10:03:41] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:04:16] (03CR) 10Marostegui: [C: 03+1] "Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/724943 (owner: 10Jcrespo) [10:05:29] (03CR) 10Nikerabbit: [C: 03+1] Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724458 (https://phabricator.wikimedia.org/T290175) (owner: 10KartikMistry) [10:06:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:19] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Joe) As usual pinging @Ottomata / @odimitrijevic for signoff. [10:06:27] !log test bounce logstash on logstash1023 [10:06:29] (03CR) 10Jbond: [C: 03+2] P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 (owner: 10Jbond) [10:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:52] (03PS4) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:06:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:07:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Hi! Thanks for this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/724816 (owner: 10Jgleeson) [10:10:01] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:13:30] ok it seems to be recovering, though I'm not sure what has happened, logstash was saying "can't talk to elasticsearch on 127.0.0.1:9200" [10:13:57] !log upgrade znuny to 6.0.37 [10:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:22] recovered by itself that is [10:15:01] bizarre [10:17:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [10:18:05] (03CR) 10Jcrespo: "What do you think? Not sure if the percentages are correct or they should be lower or higher, to prevent too much verbosity." [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:18:49] (03PS5) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:18:58] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:19:22] (03PS6) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:19:31] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:20:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Oh, this is nice!!! Thanks. one typo, but otherwise LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:22:23] (03CR) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:23:27] (03PS7) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:23:36] (03CR) 10jerkins-bot: [V: 04-1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:25:03] RECOVERY - Check systemd state on kubernetes1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:59] (03PS8) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:30:38] (03PS18) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [10:30:44] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [10:30:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:31:06] (03PS19) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [10:31:12] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [10:31:44] I think there is something weird with puppet and jenkins - I am pushing a patch on top of HEAD, 
but it fails to be rebased by jenkins [10:32:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:34:38] (03PS1) 10Jbond: cloud.yaml: add defults to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/724954 [10:34:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:34:43] jynus: want to send a link and ill take a look [10:35:01] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/31378/" [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:35:19] (03CR) 10Jbond: [C: 03+2] cloud.yaml: add defults to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/724954 (owner: 10Jbond) [10:35:30] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724950 [10:35:39] * jbond looking [10:36:22] (03PS9) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [10:37:02] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (10aborrero) >>! In T292097#7390420, @cmooney wrote: > @aborrero Yeah the range is there alright, I just mean the second IP in the linknet, 185.15.56.238, is not associated wit... [10:37:10] jynus, jbond: that sounds like the same issue I asked about in -releng, I was about to open a task [10:37:16] appears to affect many repositories [10:37:25] ah, so it is not only puppet! 
[10:37:29] nope [10:37:47] then maybe not repository-dependent, but just the jenkins bot [10:38:07] (which was what worried me) [10:38:21] mmm, it worked now [10:38:35] weird [10:38:42] yes strange [10:38:49] https://phabricator.wikimedia.org/T292167, I’ll add examples in a second [10:40:04] (feel free to rephrase it of course) [10:40:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 from me, deferring to infra foundations for the merge." [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:43:28] (03PS1) 10Alexandros Kosiaris: Remove KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/724955 [10:45:40] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [10:47:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [10:47:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/724955 (owner: 10Alexandros Kosiaris) [10:47:55] (03CR) 10Jbond: [C: 03+1] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [10:49:50] (03PS20) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [10:54:01] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10jcrespo) For backups (but I think DBAs may have an equivalent need)... 
[10:56:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [10:57:01] (03PS1) 10Elukey: helmfile.d: lower the min cpu limit for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724956 (https://phabricator.wikimedia.org/T286791) [10:58:29] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [11:00:05] Amir1, Lucas_WMDE, and apergos: That opportune time is upon us again. Time for a EU Backport and Config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1100). [11:00:05] kart_ and MatmaRex: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] mmhh still troubles with backlog in logstash, I'll take a look [11:00:19] * kart_ is here.. [11:00:23] I’m around but in a meeting, hoping someone else can deploy [11:00:30] hi [11:00:34] there are no trainees signed up for the backport window today. I looked briefly at the four patches, as a non-front-end person I have no idea about the last three but hope they are ok :-D the first seems fine to me. are people self-deployers or do they need an assist? [11:00:53] although let's wait for the logstash issue to be resolved first, regardless [11:01:14] I can self deploy apergos, let me know once logstash is calm. 
[11:01:25] 👍 [11:01:35] (03PS2) 10KartikMistry: Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724458 (https://phabricator.wikimedia.org/T290175) [11:01:42] (03CR) 10Elukey: [C: 03+2] helmfile.d: lower the min cpu limit for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724956 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [11:01:43] (03PS1) 10Alexandros Kosiaris: Calico: Increase replicaCount for typha [deployment-charts] - 10https://gerrit.wikimedia.org/r/724957 (https://phabricator.wikimedia.org/T292077) [11:01:54] yeah I don't know how long that'll take, earlier I've seen it recover itself [11:02:45] if elasticsearch/logstash aficionados are around though I'm happy to bounce ideas [11:03:29] right now the symptom is "read time out" from logstash when talking to elasticsearch on localhost [11:04:57] :/ [11:05:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:45] i need someone to deploy for me [11:05:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:31] I'll let deployers make the call whether mwlog would be enough to check logs [11:07:06] kart_: ? [11:07:24] godog apergos I've simple patch, so that should be OK. [11:07:32] great [11:08:00] OK. Let me go ahead with that. 
[11:08:33] SGTM [11:08:59] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724458 (https://phabricator.wikimedia.org/T290175) (owner: 10KartikMistry) [11:09:45] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:51] kart_: are you able to deploy my patches as well? [11:09:53] (03Merged) 10jenkins-bot: Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724458 (https://phabricator.wikimedia.org/T290175) (owner: 10KartikMistry) [11:10:38] MatmaRex: Let me check. Are you OK with only mwlog output for log checking? [11:10:56] yeah, i don't expect anything intersting in logs [11:11:04] i can test the patches with thw mwdebug servers [11:11:08] the* [11:12:13] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/724965 (owner: 10L10n-bot) [11:12:22] logstash backlog has almost recovered, I don't understand why yet though [11:12:47] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) Orchestrator has tags, which can be useful - I filed th... [11:13:07] well it will be convenient if it does fully :-D [11:14:05] Deploying my patch.. 
[11:14:43] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724458|Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias (T290175)]] (duration: 01m 08s) [11:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:50] T290175: Enable Section Translation for Igbo, Hausa, Yoruba and Thai Wikipedias - https://phabricator.wikimedia.org/T290175 [11:14:54] OK. That's done :) [11:15:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:10] \o/ [11:16:28] MatmaRex: What's order of your patches? As listed in the calendar? [11:16:36] yes [11:16:54] (they don't strictly depend on each other though) [11:16:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [11:17:11] OK. Let me +2 first two first together. [11:17:53] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:17:57] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:48] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) ` root@dborch1001:~# /usr/bin/orchestrator-client -c... 
[11:18:58] (03CR) 10KartikMistry: [C: 03+2] Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724788 (https://phabricator.wikimedia.org/T291002) (owner: 10Bartosz Dziewoński) [11:19:04] (03CR) 10KartikMistry: [C: 03+2] Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724789 (https://phabricator.wikimedia.org/T291002) (owner: 10Bartosz Dziewoński) [11:19:12] (03PS2) 10KartikMistry: Make reply tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724732 (https://phabricator.wikimedia.org/T288485) (owner: 10Bartosz Dziewoński) [11:21:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] MatmaRex: I'll deploy in mwdebug1002 and ping you once patch(es) are merged. [11:21:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:23:41] kart_: yup, thanks [11:23:44] (03PS21) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [11:24:27] (03Merged) 10jenkins-bot: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724788 (https://phabricator.wikimedia.org/T291002) (owner: 10Bartosz Dziewoński) [11:25:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:36] (03Merged) 10jenkins-bot: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724789 (https://phabricator.wikimedia.org/T291002) (owner: 10Bartosz Dziewoński) [11:25:47] also looking at the indexing failures [11:27:27] MatmaRex: first patch on mwdebug1002. Please test. [11:27:51] looking [11:28:35] kart_: looks good, i see the link on enwiki [11:28:44] cool. Deploying. [11:29:09] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [11:30:22] !log kartik@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools: Backport: [[gerrit:724788|Add a link to preferences within the Reply and New Discussion Tools (T291002)]] (duration: 01m 09s) [11:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] T291002: Add a link to preferences within the Reply and New Discussion Tools - https://phabricator.wikimedia.org/T291002 [11:30:33] Now on 2nd patch. 
[11:30:37] (03PS1) 10Ladsgroup: mediawiki: Absent testwikidata dispatching and clean up systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/724990 (https://phabricator.wikimedia.org/T291610) [11:31:47] (03PS22) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [11:32:33] (03PS2) 10Ladsgroup: mediawiki: Absent testwikidata dispatching and clean up systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/724990 (https://phabricator.wikimedia.org/T291610) [11:33:32] MatmaRex: 2nd patch (ie wmf.2 backport) on mwdebug1002 [11:33:51] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:34:06] kart_: thanks, looks good on mw.org [11:34:16] Cool. Deploying. [11:34:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:34:25] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724990 (https://phabricator.wikimedia.org/T291610) (owner: 10Ladsgroup) [11:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:46] ok indexing failures are tracked at T292174 [11:34:47] T292174: logstash indexing failures reported for "knative-serving" - https://phabricator.wikimedia.org/T292174 [11:34:51] going to lunch [11:35:51] !log kartik@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools: Backport: [[gerrit:724789|Add a link to preferences within the Reply and New Discussion Tools (T291002)]] (duration: 01m 08s) [11:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:58] T291002: Add a link to preferences within the Reply and New Discussion Tools - https://phabricator.wikimedia.org/T291002 [11:36:01] MatmaRex: merged. Now on Config patch.. [11:36:47] (03CR) 10KartikMistry: [C: 03+2] Make reply tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724732 (https://phabricator.wikimedia.org/T288485) (owner: 10Bartosz Dziewoński) [11:37:33] (03Merged) 10jenkins-bot: Make reply tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724732 (https://phabricator.wikimedia.org/T288485) (owner: 10Bartosz Dziewoński) [11:37:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:06] (03CR) 10Ladsgroup: "PCC https://puppet-compiler.wmflabs.org/compiler1003/994/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/724990 (https://phabricator.wikimedia.org/T291610) (owner: 10Ladsgroup) [11:38:09] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [11:41:21] MatmaRex: config patch on mwdebug1002. Please test. [11:42:02] kart_: thanks, behaves as expected on a couple wikis! [11:42:13] Excellent! [11:43:40] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724732|Make reply tool available as opt-out almost everywhere (phase 3) (T288485)]] (duration: 01m 07s) [11:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:46] T288485: Deploy config to make Reply Tool available as opt-out at phase 3 wikis - https://phabricator.wikimedia.org/T288485 [11:43:49] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:49] MatmaRex: All done. [11:44:08] Thanks for deploying with Rel^Lang team :) [11:44:12] thanks a lot! 
[11:44:36] !log downgrading scap to 3.17.1-1 on maps* hosts - T291990 [11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:43] T291990: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 [11:45:42] huh, I guess that's it for the window [11:45:50] thanks for handling your own/each other's deploys :-D [11:46:40] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [11:46:41] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 01s) [11:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:57] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:46:59] I'm going to wander off now to get food, if someone tries to shove in a patch at this late hour it's just too late :-P [11:47:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:31] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [11:47:32] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 01s) [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:42] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [11:47:45] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 03s) [11:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:57] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:51:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:51:06] 10SRE, 10SRE Observability: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10ema) [11:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:08] (03PS1) 10KartikMistry: Remove deprecated SectionTranslationTargetLanguage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724992 (https://phabricator.wikimedia.org/T290302) [11:53:27] 10SRE, 10SRE Observability: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10ema) [11:56:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099 (s1 and s8) for upgrade', diff saved to https://phabricator.wikimedia.org/P17351 and previous config saved to /var/cache/conftool/dbconfig/20210930-115631-marostegui.json [11:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:37] (03PS2) 10Muehlenhoff: Configure remaining domains with equal weights for mx1001/mx2001 [dns] - 10https://gerrit.wikimedia.org/r/723482 (https://phabricator.wikimedia.org/T286911) [11:58:49] !log imported wikidiff2_1.13.0-1/php-wikidiff2_1.13.0-1_amd64.deb to buster-wikimedia component/php72 [11:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:00:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17352 and previous config saved to /var/cache/conftool/dbconfig/20210930-120054-root.json [12:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17353 and previous config saved to /var/cache/conftool/dbconfig/20210930-120102-root.json [12:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:16] 10SRE, 10SRE Observability: rsyslog error: queue directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used - https://phabricator.wikimedia.org/T292180 (10ema) [12:02:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:03:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:21] (03CR) 10Muehlenhoff: [C: 03+2] Configure remaining domains with equal weights for mx1001/mx2001 [dns] - 10https://gerrit.wikimedia.org/r/723482 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [12:04:51] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:05:29] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:52] (03PS23) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [12:06:11] 10SRE, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) [12:09:14] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) wikidiff 1.13.0 is now installed on the beta cluster. 
[12:09:26] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) [12:10:39] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:10:40] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 01s) [12:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:54] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:05] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 10s) [12:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:25] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [12:12:49] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:13:24] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:39] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove 
mirror_threshold variable because of parsing errors (duration: 00m 15s) [12:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:49] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:13:51] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, that's great! One nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [12:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:04] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 15s) [12:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17354 and previous config saved to /var/cache/conftool/dbconfig/20210930-121558-root.json [12:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17355 and previous config saved to /var/cache/conftool/dbconfig/20210930-121605-root.json [12:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:18] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Joe) @hnowlan please remember to also rebuild the corresponding docker image when rolling out to p... 
[12:16:54] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:10] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 16s) [12:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:20] !log adapted MX records to point to both mx1001.wikimedia.org and mx2001.wikimedia.org with equal weights T286911 [12:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:26] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [12:17:52] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) [12:18:22] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [12:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:31] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:18:40] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 17s) [12:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:33] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:20:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:22:03] 10SRE, 
10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff mx1001/mx2001 have been reimaged to Bullseye (reusing the VM/IP for potential IP reputation issues... [12:26:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:28:48] (03CR) 10Jgleeson: ssh: Include custom sshd_config files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724816 (owner: 10Jgleeson) [12:30:59] (03PS2) 10Jgleeson: ssh: Include custom sshd_config files. [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) [12:31:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17356 and previous config saved to /var/cache/conftool/dbconfig/20210930-123101-root.json [12:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17357 and previous config saved to /var/cache/conftool/dbconfig/20210930-123109-root.json [12:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:44] !log downloading files for T290900 in screen on mwmaint1002 [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:50] T290900: Server side upload to enwikisource (multiple DJVU files ~200MB each) - https://phabricator.wikimedia.org/T290900 [12:32:34] (03PS1) 10Jgiannelos: tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724996 [12:32:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Citoid [12:34:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:39:06] (03CR) 10Jcrespo: "Let me know what you think of my "why"." [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [12:42:18] (03CR) 10Jcrespo: "Not sure if my comment is understandable- I thinking this will prevent in the short future "run-time" errors and produce "compile time" er" [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [12:46:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17358 and previous config saved to /var/cache/conftool/dbconfig/20210930-124606-root.json [12:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17359 and previous config saved to /var/cache/conftool/dbconfig/20210930-124612-root.json [12:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:56] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) Thanks @hnowlan. @dom_walden, @imaigwilo you should now be able to test the related ticke... 
[12:48:04] (03PS1) 10DCausse: rdf-streaming-updater: Deploy only to k8s [alerts] - 10https://gerrit.wikimedia.org/r/724999 (https://phabricator.wikimedia.org/T276467) [12:48:09] (03PS1) 10DCausse: blazegraph: relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/725000 [12:56:10] (03PS1) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [12:57:21] (03PS2) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [12:58:36] (03PS3) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [12:59:44] (03PS4) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [12:59:52] (03CR) 10Muehlenhoff: [C: 04-1] ssh: Include custom sshd_config files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [13:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17360 and previous config saved to /var/cache/conftool/dbconfig/20210930-130109-root.json [13:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17361 and previous config saved to /var/cache/conftool/dbconfig/20210930-130116-root.json [13:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:46] !log Start server-side upload for 2 video files (T292096, T291492) [13:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:54] T292096: Server side upload for AKA MBG - 
https://phabricator.wikimedia.org/T292096 [13:02:54] T291492: Server side upload for Xenotron - https://phabricator.wikimedia.org/T291492 [13:07:04] 10SRE, 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10Ottomata) Approved. [13:07:12] (03PS8) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [13:08:25] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Ottomata) Approved. [13:08:37] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:14:23] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:39] (03CR) 10Hashar: "recheck CI infra had an issue" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:21:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] Calico: Increase replicaCount for typha [deployment-charts] - 10https://gerrit.wikimedia.org/r/724957 (https://phabricator.wikimedia.org/T292077) (owner: 10Alexandros Kosiaris) [13:23:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Calico: Increase replicaCount for typha [deployment-charts] - 10https://gerrit.wikimedia.org/r/724957 (https://phabricator.wikimedia.org/T292077) (owner: 10Alexandros Kosiaris) [13:24:02] (03PS3) 10Jgleeson: ssh: Puppetize GatewayPorts config option for sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) [13:24:54] (03PS2) 10Alexandros Kosiaris: Calico: Increase replicaCount for typha [deployment-charts] - 
10https://gerrit.wikimedia.org/r/724957 (https://phabricator.wikimedia.org/T292077) [13:24:55] (03PS1) 10Alexandros Kosiaris: Rename main cluster to services [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 [13:26:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [13:26:31] !log Upgrade db1133 [13:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 (s1) for upgrade', diff saved to https://phabricator.wikimedia.org/P17362 and previous config saved to /var/cache/conftool/dbconfig/20210930-132700-marostegui.json [13:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:30] !log Upgrade db1134 [13:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:28] (03CR) 10jerkins-bot: [V: 04-1] Rename main cluster to services [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [13:28:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 for upgrade', diff saved to https://phabricator.wikimedia.org/P17363 and previous config saved to /var/cache/conftool/dbconfig/20210930-132831-marostegui.json [13:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:22] !log Upgrade db1111 [13:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:26] (03CR) 10Jgleeson: ssh: Puppetize GatewayPorts config option for sshd_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [13:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17364 and previous config saved to 
/var/cache/conftool/dbconfig/20210930-133029-root.json [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:23] (03CR) 10MSantos: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724996 (owner: 10Jgiannelos) [13:33:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17365 and previous config saved to /var/cache/conftool/dbconfig/20210930-133311-root.json [13:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:36] (03PS1) 10Giuseppe Lavagetto: Add rsyslog image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/725005 (https://phabricator.wikimedia.org/T288851) [13:34:54] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724996 (owner: 10Jgiannelos) [13:36:56] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] (03CR) 1020after4: [C: 03+1] Phabricator: add override for the browser time zone conflict message [puppet] - 10https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: 10DannyS712) [13:37:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:34] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:38] (03Merged) 10jenkins-bot: tegola-vector-tiles: Increase codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724996 (owner: 10Jgiannelos) [13:40:23] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [13:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:18] (03CR) 10David Caro: ssh: Puppetize GatewayPorts config option for sshd_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [13:43:22] (03PS1) 10MSantos: maps: disable tile generation cron [puppet] - 10https://gerrit.wikimedia.org/r/725008 [13:44:14] (03CR) 10Jgiannelos: [C: 03+1] maps: disable tile generation cron [puppet] - 10https://gerrit.wikimedia.org/r/725008 (owner: 10MSantos) [13:45:33] (03CR) 10Hnowlan: [C: 03+2] maps: disable tile generation cron [puppet] - 10https://gerrit.wikimedia.org/r/725008 (owner: 10MSantos) [13:45:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17366 and previous config saved to /var/cache/conftool/dbconfig/20210930-134533-root.json [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:53] (03PS1) 10Muehlenhoff: Update MAC for testvm2001 [puppet] - 10https://gerrit.wikimedia.org/r/725010 [13:48:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17367 and previous config saved to /var/cache/conftool/dbconfig/20210930-134815-root.json [13:48:19] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [13:49:56] (03PS2) 10Muehlenhoff: Update MAC for testvm2001 [puppet] - 10https://gerrit.wikimedia.org/r/725010 [13:58:50] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (10cmooney) Cool thanks for that. Regarding 185.15.56.238, I didn't realise it was already in DNS. That makes it easy, I've gone ahead and added an object for it to Netbox:... [13:59:35] (03CR) 10Muehlenhoff: [C: 03+2] Update MAC for testvm2001 [puppet] - 10https://gerrit.wikimedia.org/r/725010 (owner: 10Muehlenhoff) [14:00:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17368 and previous config saved to /var/cache/conftool/dbconfig/20210930-140037-root.json [14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] "This is indeed much better! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [14:03:18] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) After much deliberation, @akosiaris and I decided we'll go the following way: * Install an rsyslogd sidecar that will be used by mediawiki... 
[14:03:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17369 and previous config saved to /var/cache/conftool/dbconfig/20210930-140318-root.json [14:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:31] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10Gehel) @Joe Thanks! I can confirm that it works. [14:07:23] (03CR) 10Muehlenhoff: [C: 03+1] ganeti: Implement a memory monitoring check for ganeti nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [14:08:20] (03PS1) 10Hashar: gitlab: enable Content-Security-Policy reporting [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) [14:12:43] (03CR) 10Hashar: "For security team:" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [14:12:56] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [14:15:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17370 and previous config saved to /var/cache/conftool/dbconfig/20210930-141540-root.json [14:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:52] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:16:17] (03PS24) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [14:18:22] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1111 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17372 and previous config saved to /var/cache/conftool/dbconfig/20210930-141822-root.json [14:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:00] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [14:23:23] (03PS3) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [14:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17373 and previous config saved to /var/cache/conftool/dbconfig/20210930-143044-root.json [14:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:10] (03PS5) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [14:32:32] (03PS2) 10Alexandros Kosiaris: Rename main cluster to services [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 [14:33:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17374 and previous config saved to /var/cache/conftool/dbconfig/20210930-143325-root.json [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:33] (03CR) 10Elukey: "We call this cluster "main" in puppet (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/723419/27/hieradata/common/profile/kuberne" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [14:36:38] !log drop /etc/helmfile-defaults/private/backup_old_paths from deploy1002 (old data not needed anymore) [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:56] (03CR) 10Alexandros Kosiaris: 
Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [14:39:01] (03Abandoned) 10MSantos: WIP: collect metrics about OSM DB disk space [puppet] - 10https://gerrit.wikimedia.org/r/586372 (https://phabricator.wikimedia.org/T248858) (owner: 10MSantos) [14:47:32] (03PS25) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [14:47:46] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:48:59] (03PS1) 10Jelto: hiera:kubernetes:deployment_server add deploy users for helm3 [puppet] - 10https://gerrit.wikimedia.org/r/725014 (https://phabricator.wikimedia.org/T251305) [14:55:41] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:56:52] (03PS1) 10Bartosz Dziewoński: Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) [14:57:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:59:12] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [15:01:19] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) 
The preparation of GitLab puppet code is mostly done. I would like to deploy https://gerrit.wikimedia.org/r/724430 to... [15:06:54] (03CR) 10SBassett: gitlab: enable Content-Security-Policy reporting (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [15:07:32] (03CR) 10SBassett: gitlab: enable Content-Security-Policy reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [15:12:01] (03CR) 10Ejegg: [C: 03+1] "OK, now I understand that the flexibility of the initial approach was a negative rather than a positive. This way looks good too!" [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [15:20:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Sorry, this fell through the cracks again. I've left a last round of comments, but this looks pretty close to a merge." [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:21:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/725005 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:22:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:25:12] (03CR) 10Michael Große: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/725019 (owner: 10Michael Große) [15:34:37] (03PS10) 10Jcrespo: ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 [15:36:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] kubeflow-kfserving-inference: create sa to read s3 secrets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 (owner: 10Elukey) [15:36:30] (03CR) 10Jcrespo: [C: 03+2] ganeti: Implement a memory monitoring check for ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [15:39:49] (03CR) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 (owner: 10Elukey) [15:42:58] I got an error while running puppet on alert1001: "Server Error: Evaluation Error: Error while evaluating a Resource Statement, Monitoring::Openapi_service[check_mobileapps_cluster_eqiad]: parameter 'notifications_enabled' expects a String value, got Undef" [15:44:50] I didn't change anything on that lvs monitoring, so my guess would be a race condition or something related to 706a56c CC jbond [15:46:36] (03PS6) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 [15:48:07] jynus: yes it is looking now [15:48:45] (03PS26) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [15:49:50] (03CR) 10Elukey: kubeflow-kfserving-inference: create sa to read s3 secrets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 (owner: 10Elukey) [15:50:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 (owner: 10Elukey) [15:52:16] (03CR) 10Elukey: [C: 03+2] 
kubeflow-kfserving-inference: create sa to read s3 secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/725001 (owner: 10Elukey) [15:53:53] (03PS1) 10Jbond: monitoring::openapi_service: update notifications_enabled parameter [puppet] - 10https://gerrit.wikimedia.org/r/725029 [15:54:52] (03PS2) 10Jbond: monitoring::openapi_service: update notifications_enabled parameter [puppet] - 10https://gerrit.wikimedia.org/r/725029 [15:55:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [15:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31381/console" [puppet] - 10https://gerrit.wikimedia.org/r/725029 (owner: 10Jbond) [15:57:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] monitoring::openapi_service: update notifications_enabled parameter [puppet] - 10https://gerrit.wikimedia.org/r/725029 (owner: 10Jbond) [15:58:47] (03CR) 10DCausse: Validate deploy-tag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [15:59:50] jynus: fixed now, sorry about that [16:00:02] (03CR) 10Jcrespo: "Checking some change that seemed weird." [puppet] - 10https://gerrit.wikimedia.org/r/723544 (owner: 10Jbond) [16:00:05] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:09] ^can you check that too [16:00:31] maybe it is intended, but could be a typo or something [16:00:45] * jbond looking [16:00:53] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] secrets: Clean up restbase stub certificates [labs/private] - 10https://gerrit.wikimedia.org/r/724717 (owner: 10Hnowlan) [16:00:55] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [16:01:39] (03CR) 10Hashar: gitlab: enable Content-Security-Policy reporting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [16:01:55] jynus: i think that may be an issue introduced by a bad rebase [16:02:07] if i look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/723544/12/hieradata/hosts/db2103.yaml [16:02:34] i see that notifications were disabled but i'm guessing that they must have been re-enabled between rebases and this got missed [16:03:21] yes was reverted earlier today b8cce697ba0eba0ff801fe0e3d37ebb060a5b6e6.
[16:03:25] will add back now [16:03:28] (03PS2) 10Filippo Giunchedi: Validate deploy-tag [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) [16:03:30] (03PS2) 10Filippo Giunchedi: o11y: restore thanos sidecar upload failure [alerts] - 10https://gerrit.wikimedia.org/r/724948 (https://phabricator.wikimedia.org/T289662) [16:03:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) [16:03:43] (03PS9) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [16:03:51] (03CR) 10Filippo Giunchedi: Validate deploy-tag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/724944 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [16:04:13] 10SRE, 10Infrastructure-Foundations, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10cmooney) [16:04:27] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) 05Open→03Resolved And we are live :) ` From: RIPE Atlas [mailto:atlas@ripe.net] Sent: Thursday, September 30, 2021, 2:31 PM To: Ca... 
[16:04:34] jouncebot: nowandnext [16:04:35] For the next 0 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1600) [16:04:35] In 0 hour(s) and 55 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1700) [16:04:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) [16:04:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) 05Open→03Resolved complete [16:05:01] no patches there, stealing the window [16:05:15] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725019 (owner: 10Michael Große) [16:05:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) [16:05:31] (03CR) 10Filippo Giunchedi: [C: 03+1] rdf-streaming-updater: Deploy only to k8s [alerts] - 10https://gerrit.wikimedia.org/r/724999 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [16:05:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) 05Open→03Resolved complete [16:06:21] (03PS1) 10Jbond: db2103: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/725030 [16:06:24] (03Merged) 10jenkins-bot: Enable dispatching via job to 10 prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725019 (owner: 10Michael Große) [16:06:41] (03CR) 10Jbond: [C: 03+2] db2103: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/725030 (owner: 10Jbond) [16:08:05] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31383/console" [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [16:08:16] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725019|Enable dispatching via job to 10 prod wikis]] (duration: 01m 09s) [16:08:19] (03CR) 10SBassett: gitlab: enable Content-Security-Policy reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [16:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:51] (03PS1) 10Ladsgroup: Disable jQuery migrate in metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725032 (https://phabricator.wikimedia.org/T280944) [16:09:29] (03CR) 10Ladsgroup: [C: 03+2] Disable jQuery migrate in metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725032 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [16:10:07] (03CR) 10Subramanya Sastry: "This needs a TechNews announcement beforehand." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:10:11] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) @Joe this server is out of warranty and it has a main board problem. Do you think we can decom this server or do we have to buy a new main board to keep the server in production? 
Thanks [16:10:16] (03Merged) 10jenkins-bot: Disable jQuery migrate in metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725032 (https://phabricator.wikimedia.org/T280944) (owner: 10Ladsgroup) [16:10:31] (03CR) 10Arlolra: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:11:57] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725032|Disable jQuery migrate in metawiki (T280944)]] (duration: 01m 09s) [16:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:06] T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944 [16:15:09] (03CR) 10Subramanya Sastry: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:16:22] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [16:17:36] (03CR) 10Jcrespo: "The change has been deployed, it says, for example now:" [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [16:18:07] (03CR) 10Arlolra: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:19:35] (03CR) 10Subramanya Sastry: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:20:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:58] (03CR) 10Subramanya Sastry: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:23:56] (03CR) 10Subramanya Sastry: Disable legacy media dom on a few more wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:24:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:21] (03CR) 10Subramanya Sastry: [C: 03+1] Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [16:30:32] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% [16:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% (duration: 00m 16s) [16:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:07] !log Ran `GRANT pg_monitor TO prometheus` for maps in eqiad and codfw to fix empty prometheus connection metrics [16:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:34] (03PS1) 10Ssingh: wikidough: switch to LE's alternative chain [puppet] - 10https://gerrit.wikimedia.org/r/725036 [16:33:02] (03CR) 10Hnowlan: [V: 03+1] "> Patch Set 9: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [16:33:15] (03PS2) 10Ssingh: wikidough: switch to LE's alternative chain [puppet] 
- 10https://gerrit.wikimedia.org/r/725036 (https://phabricator.wikimedia.org/T252132) [16:33:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:08] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31386/console" [puppet] - 10https://gerrit.wikimedia.org/r/725036 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:37:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:43] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% [16:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:23] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% (duration: 00m 40s) [16:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:31] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@67a4d22]: Increase mirrored traffic to 10% [16:40:34] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: switch to LE's alternative chain [puppet] - 10https://gerrit.wikimedia.org/r/725036 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:40] (03CR) 10BBlack: [C: 03+1] wikidough: switch to LE's alternative chain [puppet] - 10https://gerrit.wikimedia.org/r/725036 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:43:04] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@67a4d22]: Increase mirrored traffic to 10% (duration: 02m 33s) [16:43:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:02] !log restart dnsdist.service on doh[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Absent testwikidata dispatching and clean up systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/724990 (https://phabricator.wikimedia.org/T291610) (owner: 10Ladsgroup) [16:53:11] (03PS1) 10Elukey: kubeflow-kfserving-inference: improve predictor_config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/725038 [16:54:23] (03PS1) 10Majavah: aprepo: Import kubeadm to stretch too [puppet] - 10https://gerrit.wikimedia.org/r/725041 (https://phabricator.wikimedia.org/T292131) [16:58:47] (03PS1) 10Inductiveload: Add wikisource-bot.toolforge.org to Commons/Wikisource copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) [16:59:10] (03PS2) 10Inductiveload: Add wikisource-bot.toolforge.org to Commons/Wikisource copy upload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) [16:59:12] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving-inference: improve predictor_config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/725038 (owner: 10Elukey) [17:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1700). 
[17:00:37] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad [17:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:47] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 11s) [17:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:59] (03CR) 10Bstorm: [C: 03+1] "Looks good, but I need to go review how to deploy before I merge it 😆" [puppet] - 10https://gerrit.wikimedia.org/r/725041 (https://phabricator.wikimedia.org/T292131) (owner: 10Majavah) [17:01:06] (03CR) 10Inductiveload: "Ping since we talked about this this today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725042 (https://phabricator.wikimedia.org/T292213) (owner: 10Inductiveload) [17:02:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[17:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:56] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad [17:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:04] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 08s) [17:03:07] 10SRE-Access-Requests: Add Majavah to #mediawiki_security - https://phabricator.wikimedia.org/T292214 (10Majavah) [17:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:43] (03PS5) 10Inductiveload: Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [17:04:01] (03PS6) 10Inductiveload: Add IA-Upload tool domains to Commons/Wikisource wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [17:08:34] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad [17:08:36] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:29] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 55s) [17:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:56] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The following units failed: session-204181.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:21] .win 7 [17:17:26] nope :) [17:19:51] (03PS1) 
10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:20:40] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:23:04] (03PS1) 10Elukey: kubeflow-kfserving: update storage initializer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/725047 [17:28:00] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving: update storage initializer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/725047 (owner: 10Elukey) [17:28:12] (03PS1) 10Bstorm: toolforge harbor: add external postgres db [puppet] - 10https://gerrit.wikimedia.org/r/725048 (https://phabricator.wikimedia.org/T267616) [17:32:38] (03CR) 10Bstorm: "This is mostly based on the puppet we are using for puppetdb. The stuff for OSMDB we have now isn't very flexible for failover, which is m" [puppet] - 10https://gerrit.wikimedia.org/r/725048 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [17:34:32] (03CR) 10Jcrespo: "ganeti1009 gave a warning (93%). I now know how to fix it-thanks to the documentation-, but because I have never done a rebalancing, I wil" [puppet] - 10https://gerrit.wikimedia.org/r/724950 (owner: 10Jcrespo) [17:34:54] (03CR) 10Bstorm: "An external postresql is also required if you ever end up with cinder vols in k8s (which will allow using helm to deploy harbor), so this " [puppet] - 10https://gerrit.wikimedia.org/r/725048 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [17:35:44] (03CR) 10Bstorm: "For testing, I used the local container version of postgresql, btw. 
This is about making it a more resilient service before deploying to t" [puppet] - 10https://gerrit.wikimedia.org/r/725048 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [17:36:07] (03CR) 10Bstorm: [C: 03+2] aprepo: Import kubeadm to stretch too [puppet] - 10https://gerrit.wikimedia.org/r/725041 (https://phabricator.wikimedia.org/T292131) (owner: 10Majavah) [17:36:49] (03PS1) 10Ebernhardson: Move query_service secrets to profile-specific file [labs/private] - 10https://gerrit.wikimedia.org/r/725049 (https://phabricator.wikimedia.org/T280006) [17:37:10] (03CR) 10MSantos: [eswiki] Disable static mapframes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [17:37:31] (03PS2) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:38:05] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:41:11] (03CR) 10MSantos: [C: 03+1] [eswiki] Disable static mapframes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [17:41:53] (03PS3) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:42:19] !log updating packages for thirdparty/kubeadm-k8s-1-20 and thirdparty/kubeadm-k8s-1-19 in stretch-wikimedia on apt1001 T292131 [17:42:22] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:25] T292131: Something is up with the kubeadm component on stretch VMs - https://phabricator.wikimedia.org/T292131 [17:43:40] (03CR) 10MarcoAurelio: [eswiki] Disable static mapframes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 
(https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [17:44:28] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:27] (03PS4) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:46:11] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:47:39] (03CR) 10Ottomata: [C: 03+2] Set thorium to role spare::system and remove references to thorium [puppet] - 10https://gerrit.wikimedia.org/r/724756 (https://phabricator.wikimedia.org/T292075) (owner: 10Ottomata) [17:47:58] (03PS5) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:48:32] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:49:24] !log otto@cumin1001 START - Cookbook sre.hosts.decommission for hosts thorium.eqiad.wmnet [17:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:49:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:22] (03PS6) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:52:04] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:52:46] (03CR) 10MSantos: [C: 03+1] [eswiki] Disable static mapframes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [17:53:00] (03PS1) 10Ottomata: thorium decom - Remove absented rsync module [puppet] - 10https://gerrit.wikimedia.org/r/725059 (https://phabricator.wikimedia.org/T292075) [17:53:14] (03PS2) 10Ottomata: thorium decom - Remove absented rsync module [puppet] - 10https://gerrit.wikimedia.org/r/725059 (https://phabricator.wikimedia.org/T292075) [17:54:28] (03PS1) 10BryanDavis: toolhub: Do not force cronjob envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/725060 (https://phabricator.wikimedia.org/T291447) [17:54:30] (03PS7) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:55:02] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [17:58:44] (03PS8) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [17:59:19] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:00:04] RoanKattouw, Niharika, Urbanecm, and thcipriani: I, the Bot under the Fountain, call upon thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1800). 
[18:00:04] ottomata and arlolra: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] o/ [18:00:18] o/ [18:00:21] (03CR) 10BryanDavis: [C: 03+2] toolhub: Do not force cronjob envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/725060 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [18:00:33] give me one sec to wrap up my meeting and I can backport [18:00:48] (if no one beats me to it) [18:00:55] \o [18:01:06] hi arlolra [18:01:40] hello [18:01:49] Not sure where my patches went.. [18:02:31] They were here: https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=1927398 [18:03:39] looks like they got removed in https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=cur&oldid=1927428 by MarcoAurelio so I'll re-insert them [18:04:04] thanks [18:04:54] (03Merged) 10jenkins-bot: toolhub: Do not force cronjob envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/725060 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [18:05:21] (03PS1) 10Jdlrobson: Fix search within pages alignment [extensions/MobileFrontend] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724979 (https://phabricator.wikimedia.org/T292107) [18:05:50] alright...let's +2 some backports [18:05:57] (03CR) 10Thcipriani: [C: 03+2] Restore original more menu padding in legacy Vector [skins/Vector] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724798 (https://phabricator.wikimedia.org/T289163) (owner: 10Jdlrobson) [18:07:16] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [18:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:02] Jdlrobson: is this the right link for your 2nd backport? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/724979 [18:08:24] (03CR) 10Thcipriani: [C: 03+2] "Backport" [extensions/EventBus] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724480 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:08:32] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [extensions/EventBus] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724481 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:09:18] thcipriani: yep [18:09:39] it's MobileFrontend though so might want to wait on Jenkins [18:09:40] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [extensions/MobileFrontend] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724979 (https://phabricator.wikimedia.org/T292107) (owner: 10Jdlrobson) [18:09:58] heh, too late :) [18:10:17] thcipriani: it should be fine [18:10:20] i just tried it locally [18:10:41] arlolra: let's get your config patch done first so you don't have to wait on jenkins [18:10:42] https://gerrit.wikimedia.org/r/c/724514/ is beta cluster only [18:10:53] ok [18:11:12] (in theory :)) [18:11:36] I'm out of date: mwdebug1002's hostkey changed? 
[18:12:10] matches wikitech, continuing [18:12:14] it was reimaged during the switchover to buster [18:13:00] (03PS2) 10Thcipriani: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:13:16] (03CR) 10Thcipriani: [C: 03+2] "CONFIG DEPLOY" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:13:30] majavah: thanks for confirming [18:16:50] (03Merged) 10jenkins-bot: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724861 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:17:59] arlolra: ^ is live on mwdebug1002, check please (if there's anything to check) [18:18:57] it's fine to continue [18:19:04] thanks, going live [18:20:53] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724861|Disable legacy media dom on a few more wikis (T51097)]] (duration: 01m 08s) [18:20:57] ^ arlolra live now [18:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:00] T51097: Use figure and figcaption HTML5 elements when possible - https://phabricator.wikimedia.org/T51097 [18:21:21] (03PS2) 10Thcipriani: Enable sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724514 (https://phabricator.wikimedia.org/T289721) (owner: 10Jdlrobson) [18:21:43] thcipriani: indeed it is, thanks [18:21:49] thank you! [18:21:57] (for checking :)) [18:22:05] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724514 (https://phabricator.wikimedia.org/T289721) (owner: 10Jdlrobson) [18:22:23] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[18:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:05] busy time for ci [18:23:42] (03PS9) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [18:25:29] (03Merged) 10jenkins-bot: Enable sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724514 (https://phabricator.wikimedia.org/T289721) (owner: 10Jdlrobson) [18:25:35] phew [18:25:50] (03PS10) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [18:26:04] Jdlrobson: anything to check for 724514 [18:26:27] (if so it's on mwdebug1002) [18:26:32] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:27:07] 10SRE, 10DBA, 10Traffic, 10Patch-For-Review, and 2 others: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10Reedy) [18:27:13] (03Merged) 10jenkins-bot: Restore original more menu padding in legacy Vector [skins/Vector] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724798 (https://phabricator.wikimedia.org/T289163) (owner: 10Jdlrobson) [18:27:14] !log otto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thorium.eqiad.wmnet [18:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:49] judging from the title and your comment, no, but just want to confirm you're around if the world falls down around me :D [18:28:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:14] thcipriani: can check just to be sure [18:29:22] <3 [18:29:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:46] thcipriani: yeh nothing has changed on production so this looks like it's behaving like it should [18:29:58] (03PS13) 10Jdlrobson: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [18:29:59] perfect, going live [18:30:03] (03PS4) 10Muehlenhoff: Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) [18:30:13] (03PS13) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [18:30:29] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [18:31:02] !log thcipriani@deploy1002 Synchronized wmf-config: Config: [[gerrit:724514|Enable sticky header on beta cluster (T289721)]] (duration: 01m 08s) [18:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:08] T289721: [Goal] Sticky header is enabled for logged in users on the beta cluster - https://phabricator.wikimedia.org/T289721 [18:31:43] ^ deployed (noop here), looks like zuul says 5 min until it's in beta [18:31:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:04] (03Merged) 10jenkins-bot: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [18:32:19] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Move query_service secrets to profile-specific file [labs/private] - 10https://gerrit.wikimedia.org/r/725049 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [18:32:39] (03CR) 10Muehlenhoff: [C: 03+1] "Jack, Elliott: I'll merge this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [18:32:49] Jdlrobson: 704167 live on mwdebug1002, check please [18:32:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:58] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [18:33:16] thcipriani: looking [18:33:30] (03PS14) 10Thcipriani: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [18:33:37] (03CR) 10Thcipriani: [C: 03+2] "BACKPORT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [18:33:44] thcipriani: looks great [18:33:47] sync away [18:33:51] * thcipriani does [18:34:14] https://usercontent.irccloud-cdn.com/file/WQH1uLaI/Screen%20Shot%202021-09-30%20at%2011.34.09%20AM.png [18:34:33] (03Merged) 10jenkins-bot: Guard against undefined index notice when setting x-client-ip [extensions/EventBus] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724480 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:34:35] (03Merged) 10jenkins-bot: Guard against undefined index notice when setting x-client-ip 
[extensions/EventBus] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724481 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:34:37] (03Merged) 10jenkins-bot: Fix search within pages alignment [extensions/MobileFrontend] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724979 (https://phabricator.wikimedia.org/T292107) (owner: 10Jdlrobson) [18:35:06] (03Merged) 10jenkins-bot: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [18:35:07] yay jenkins [18:35:58] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/wikimania.svg: Config: [[gerrit:704167|Use Wikimania's logo in a new vector (T286405)]] part I (duration: 01m 07s) [18:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:05] T286405: Wikimania logo is missing in Vector 2021 in Wikimania website - https://phabricator.wikimedia.org/T286405 [18:37:30] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/wikimania-wordmark.svg: Config: [[gerrit:704167|Use Wikimania's logo in a new vector (T286405)]] Part II (duration: 01m 07s) [18:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:45] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:38:52] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704167|Use Wikimania's logo in a new vector (T286405)]] Part III (duration: 01m 07s) [18:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:59] ^ Jdlrobson should be live [18:39:08] thcipriani: great! 
[18:39:19] let's finish config changes and then move on to backports [18:39:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:40:09] Jdlrobson: unset logo config on mwdebug1002, check please [18:40:15] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:40:27] (03CR) 10Jgleeson: ssh: Puppetize GatewayPorts config option for sshd_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724816 (https://phabricator.wikimedia.org/T290098) (owner: 10Jgleeson) [18:41:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:11] (03PS11) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [18:41:25] thcipriani: looking [18:41:50] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:41:55] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:42:38] thcipriani: looks ready to sync to me [18:42:46] !log imported gitlab 14.2.5 to thirdparty/gitlab T292219 [18:42:51] cool, going live [18:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:33] !log thcipriani@deploy1002 Scap failed!: Call to mwscript eval.php stderr: not empty [18:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:53] well that's interesting [18:43:58] What happened? [18:44:07] > Notice: Undefined variable: wmgSiteLogoVariants in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 1013 [18:44:15] wah [18:44:34] scap call mwscript eval as a check, if it's not empty it explodes [18:44:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:44] -wmgSiteLogoVariants is perhaps the problem here? [18:45:18] does the minus unset it? [18:46:26] good question, seems like a reasonable assumption, I don't see where the "-" is looking at the patch [18:46:35] $wmgSiteLogoVariants ?: null, [18:46:48] But shouldn't that work with an undefined variable? [18:47:01] just not sure why its undefined [18:47:03] (03PS1) 10Ryan Kemper: query_service: keep oauth secret in both paths [labs/private] - 10https://gerrit.wikimedia.org/r/725084 (https://phabricator.wikimedia.org/T280006) [18:47:33] evidently becomes a notice somewhere along the line. Anyway, I'm going to revert for now. Needs more investigation. [18:47:49] ergg tbh if we revert I'm going to give up on these changes [18:47:57] I'm out of my depth with how the logos are setup. 
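Note on the scap failure above: as described in the channel, scap calls an mwscript eval as a check and aborts when that command's stderr is not empty, so even a PHP notice blocks the sync. What follows is a minimal Python sketch of that kind of check, not scap's actual code; the function name, the exception class, and the two stand-in commands are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


class CheckFailed(Exception):
    """Raised when the checked command fails or writes anything to stderr."""


def run_and_require_clean_stderr(cmd):
    """Run cmd; fail on a non-zero exit code or on any stderr output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise CheckFailed(f"exit code {result.returncode}")
    if result.stderr.strip():
        # A notice or warning counts as a failure even though the exit code is 0.
        raise CheckFailed(f"stderr: not empty: {result.stderr.strip()}")
    return result.stdout


# Hypothetical stand-ins for the real command: one quiet, one that emits a notice.
quiet = [sys.executable, "-c", "print('ok')"]
noisy = [
    sys.executable,
    "-c",
    "import sys; print('Notice: Undefined variable', file=sys.stderr)",
]

if __name__ == "__main__":
    print(run_and_require_clean_stderr(quiet).strip())
    try:
        run_and_require_clean_stderr(noisy)
    except CheckFailed as err:
        print(f"check failed: {err}")
```

A check this strict means a change that merely triggers a notice is rejected before it reaches production, which matches what happened to the logo config change here.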
[18:48:10] I can't replicate these issues locally [18:48:15] :( [18:48:21] scap won't let me deploy this change [18:48:56] (03PS2) 10Ryan Kemper: query_service: keep oauth secret in both paths [labs/private] - 10https://gerrit.wikimedia.org/r/725084 (https://phabricator.wikimedia.org/T280006) [18:49:23] (03PS12) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [18:49:24] yeh reverting makes sense, I'm just throwing in the towel with this particular problem. I have larger issues to worry about. [18:50:32] (03PS1) 10Thcipriani: Revert "Unset logo config rather than set to false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725088 [18:50:45] if you have an undefined variable, you need to use ??, not ?: [18:50:47] (03CR) 10Thcipriani: [C: 03+2] Revert "Unset logo config rather than set to false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725088 (owner: 10Thcipriani) [18:50:55] (03CR) 10Ebernhardson: [C: 03+1] query_service: keep oauth secret in both paths [labs/private] - 10https://gerrit.wikimedia.org/r/725084 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [18:51:08] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:51:15] Jdlrobson: still proceeding with other backports, yeah? [18:51:16] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: keep oauth secret in both paths [labs/private] - 10https://gerrit.wikimedia.org/r/725084 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [18:51:49] see https://3v4l.org/5sKWg [18:51:51] @thcipriani So. I think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/724979 is all that's remaining. 
[18:52:06] (03PS13) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [18:52:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [18:52:30] Jdlrobson: and this one? https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/724798 [18:52:48] thcipriani: yep sorry for some reason i thought we'd synced that one :) [18:52:55] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:52:58] cool, no worries :) [18:53:11] (03Merged) 10jenkins-bot: Revert "Unset logo config rather than set to false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725088 (owner: 10Thcipriani) [18:53:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31405/console" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [18:53:37] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:54:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:32] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox info missing on some WMCS elements - https://phabricator.wikimedia.org/T292097 (10cmooney) Just an update, I removed the DNS entry / IP object for the .238 address just to be safe. When the sre.dns.netbox cookbook ran the change in Netbox made it want to... [18:54:37] Thanks legoktm but my understanding is $wmgSiteLogoVariants ?: null should work when $wmgSiteLogoVariants is false? This should never be set null which is why I'm confused. 
It should either be false or an array from my read of the code. I suspect there's some magic somewhere doing that. [18:54:42] or maybe a typo [18:55:10] yes, that works if it's false, but if it's undefined then it's a warning [18:55:34] Jdlrobson: vector change for wmf.2 is on mwdebug1002, check please [18:55:42] checking [18:55:43] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:55:48] if you need to handle undefined and false, then you probably need two conditions in an if() statement rather than a ternary [18:56:34] thcipriani: Vector change looks good [18:56:40] thanks, going live [18:57:32] ottomata: thanks for hanging on, I haven't forgotten you! [18:58:02] !log thcipriani@deploy1002 Synchronized php-1.38.0-wmf.2/skins/Vector/resources/skins.vector.styles.legacy/components/MenuDropdown.less: Backport: [[gerrit:724798|Restore original more menu padding in legacy Vector (T289163)]] (duration: 01m 08s) [18:58:04] i'm here! about to go afk for a bit tho...but that change should be safe to go, it's just a guard, and it isn't even active without a config change [18:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:08] T289163: JavaScript addPortletLink method differs from PHP equivalent leading to gadget inconsistencies in modern Vector - https://phabricator.wikimedia.org/T289163 [18:58:33] Okay I think I see the issue with that last one. 'wmgSiteLogoVariants' doesn't have a default value [18:58:34] it should. [18:58:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:21] (03CR) 10Jdlrobson: "This patch failed scap. 
Presumably because wmgSiteLogoVariants does not appear to have a default value when it should." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [18:59:57] thcipriani: ottomata sorry these patches took so long. [19:00:04] jeena and dduvall: #bothumor I � Unicode. All rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T1900). [19:00:05] ottomata: if there's anything to check, it's live on mwdebug1002 [19:00:27] Jdlrobson: it happens ¯\_(ツ)_/¯ I blame jenkins [19:00:47] jeena: just one sec, backport window running a little bit long, I'll ping you when I'm clear [19:00:54] okay [19:00:57] <3 [19:01:03] thanks :) [19:01:32] thcipriani: should be good i gthink [19:01:43] ottomata: cool, syncing both [19:02:13] Jdlrobson: mobilefrontend change live on mwdebug1002, check please [19:02:22] (03PS14) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:02:38] thcipriani: checking [19:02:59] thcipriani: looking great [19:03:00] please sync [19:03:07] alright, will do [19:03:21] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:03:39] and thanks for all the help with these patches. It's a weight off my back. [19:04:00] !log thcipriani@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/EventBus/includes/EventBus.php: Backport: [[gerrit:724480|Guard against undefined index notice when setting x-client-ip (T288853)]] (duration: 01m 09s) [19:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:06] T288853: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 [19:04:41] ottomata: ^ there's one, other syncing now [19:04:45] Jdlrobson: happy to help! 
[19:04:51] (03PS15) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:05:17] * thcipriani finds perfect moment to hype: https://wikitech.wikimedia.org/wiki/Deployments/Training [19:05:30] never wait on a deploy again! [19:05:40] !log thcipriani@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/EventBus/includes/EventBus.php: Backport: [[gerrit:724481|Guard against undefined index notice when setting x-client-ip (T288853)]] (duration: 01m 09s) [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:52] ^ ottomata and that's wmf.1, all live [19:06:30] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:06:49] mobilefrontend going now [19:07:00] thank you [19:07:50] you're welcome :) [19:07:51] !log thcipriani@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/MobileFrontend: Backport: [[gerrit:724979|Fix search within pages alignment (T292107)]] (duration: 01m 09s) [19:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:57] T292107: Regression: Search within pages flushed left - https://phabricator.wikimedia.org/T292107 [19:08:06] ^ Jdlrobson and that's mobilefrontend [19:08:15] (03PS16) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:08:29] jeena: all yours! [19:08:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:39] alrighty [19:08:43] sorry for the delay :( [19:08:50] no problem [19:08:59] (03PS6) 10Ryan Kemper: query_service: Split oauth secret from settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:09:42] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:10:08] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31411/console" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:10:31] (03PS1) 10Jeena Huneidi: all wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725093 [19:10:33] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725093 (owner: 10Jeena Huneidi) [19:11:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:29] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725093 (owner: 10Jeena Huneidi) [19:13:58] (03Abandoned) 10Ryan Kemper: query_service: Remove non-secret values from secrets repo [labs/private] - 10https://gerrit.wikimedia.org/r/724832 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:14:06] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.2 refs T281166 [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:12] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Split oauth secret from settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:14:13] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [19:14:14] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for ksiebert - https://phabricator.wikimedia.org/T292053 (10Dzahn) >>! In T292053#7390124, @Joe wrote: > I actually see all the data for shell access in @KSiebert's ldap account, I think the confusion is between production shell access and Cloud shell access.... 
[19:17:22] (03PS17) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:18:33] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:20:29] (03CR) 10Dzahn: [C: 03+1] profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [19:20:53] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:21:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:10] (03PS18) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:22:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:22:52] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:23:01] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:24:48] (03PS19) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:25:35] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:26:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:29] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) Backport window didn't go yesterday, so I got the bugfixes out today. I'd rather wait until Monda... [19:29:19] (03PS20) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:29:53] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:30:26] (03PS1) 10Legoktm: Throw more resources at shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/725097 (https://phabricator.wikimedia.org/T289226) [19:30:50] (03CR) 10Legoktm: [C: 03+2] Throw more resources at shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/725097 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [19:31:13] (03PS2) 10Legoktm: Throw more resources at shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/725097 (https://phabricator.wikimedia.org/T289226) [19:31:15] (03CR) 10Legoktm: [C: 03+2] Throw more resources at shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/725097 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [19:31:57] legoktm: is ^ related to the couple of new 
shellbox/timeline errors in logstash? [19:32:10] uh, which errors? [19:32:38] sorry, there are two 503 errors that just popped up post group2 deploy [19:32:51] yeah just a couple [19:33:05] found it [19:33:10] (03PS21) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:33:11] legoktm: RX0mOHwBfkHq3kAM9raQ e.g. [19:33:13] yeah, this should take care of that [19:33:46] * dduvall nods [19:34:05] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:36:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:38] (03Merged) 10jenkins-bot: Throw more resources at shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/725097 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [19:37:49] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [19:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:57] (03PS22) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:38:39] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:38:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:38:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:22] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [19:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] (03PS5) 10Ebernhardson: query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) [19:42:21] (03PS23) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:42:32] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:43:12] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:43:15] (03PS1) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [19:43:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31418/console" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:43:46] (03CR) 10AOkoth: [C: 03+1] Revert "gitlab: test edit" [puppet] - 10https://gerrit.wikimedia.org/r/724111 (owner: 10Dzahn) [19:43:58] (03CR) 10Dzahn: [C: 03+1] Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [19:44:15] (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: test edit" [puppet] - 
10https://gerrit.wikimedia.org/r/724111 (owner: 10Dzahn) [19:44:53] (03PS24) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:45:55] (03PS6) 10Ryan Kemper: query_service: Parameterize oauth redirect url [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:46:04] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:46:25] (03CR) 10Dzahn: [C: 03+1] "It will need one more rebase due to my merge, but +1" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [19:47:39] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:48:19] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31420/console" [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:49:09] (03Abandoned) 10Dzahn: puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [19:51:12] (03CR) 10Dzahn: [C: 03+1] "Or would you like me to keep this a clean revert and add the typo fix in a second patch?" 
[puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [19:51:23] (03PS4) 10Legoktm: mediawiki: Remove lilypond [puppet] - 10https://gerrit.wikimedia.org/r/721618 [19:51:25] (03PS1) 10Legoktm: mediawiki: Remove ploticus [puppet] - 10https://gerrit.wikimedia.org/r/725099 [19:51:45] !log Updating routinator on rpki1001 T291543 [19:51:48] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Parameterize oauth redirect url [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:27] (03PS25) 10Jbond: (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [19:52:57] (03CR) 10Krinkle: [C: 03+1] Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [19:52:59] (03CR) 10Legoktm: "I'm going to wait until Monday when we're confident 1.38.0-wmf.2 won't get rolled back and then deploy this and the ploticus removal at th" [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [19:53:42] (03CR) 10Dzahn: [C: 03+1] "https://debmonitor.wikimedia.org/packages/lilypond" [puppet] - 10https://gerrit.wikimedia.org/r/721618 (owner: 10Legoktm) [19:54:25] 10Puppet, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10brennen) [19:54:33] 10Puppet, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10brennen) [19:55:29] (03CR) 10Muehlenhoff: [C: 04-1] "This should use one in the 9xx range (which are reserved for this purpose), < 500 are for system users created by Debian packages (as in D" [puppet] - 
10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [19:55:55] (03CR) 10jerkins-bot: [V: 04-1] (WIP) monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [19:56:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:57:18] (03PS2) 10Ottomata: Standardize the stats system user uid [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) [19:57:21] (03CR) 10Ottomata: "Thanks, was just trying to use something that was already assigned, but that ran into issues because another user (thumbor) already had 49" [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [19:58:53] (03PS1) 10Legoktm: Set $wgMaxImageArea = false; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) [19:59:41] (03CR) 10Legoktm: "This should wait until there's no chance of a 1.38.0-wmf.2 rollback" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) (owner: 10Legoktm) [20:01:01] PROBLEM - Routinator process on rpki1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [20:01:19] ^^^ this is due to my "upgrade" [20:01:51] rpki2001 still working fine so no impact. 
[20:01:51] (03CR) 10Muehlenhoff: Standardize the stats system user uid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725098 (https://phabricator.wikimedia.org/T291384) (owner: 10Ottomata) [20:02:04] thanks for letting us know [20:02:09] PROBLEM - RPKI Validator RTR port on rpki1001 is CRITICAL: connect to address 10.64.32.19 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [20:03:17] (03PS1) 10Ryan Kemper: query_service: default oauth_settings to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) [20:04:05] (03PS2) 10Ryan Kemper: query_service: default oauth_settings to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) [20:04:13] ACKNOWLEDGEMENT - RPKI Validator RTR port on rpki1001 is CRITICAL: connect to address 10.64.32.19 and port 3323: Connection refused Cathal Mooney Result of my less than perfect upgrade to version 10. - The acknowledgement expires at: 2021-10-01 11:00:00. https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [20:04:39] ACKNOWLEDGEMENT - Routinator process on rpki1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator Cathal Mooney Result of my less than perfect upgrade to version 0.10 - The acknowledgement expires at: 2021-10-01 11:00:00. 
https://wikitech.wikimedia.org/wiki/RPKI%23Process [20:05:18] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:06:10] (03PS9) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [20:06:31] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:06:39] 10SRE, 10serviceops: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Legoktm) 05Stalled→03Open This is unblocked now that Special:VipsTest has been disabled. [20:06:50] (03CR) 10jerkins-bot: [V: 04-1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:07:25] (03CR) 10Ebernhardson: query_service: default oauth_settings to {} (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:08:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:08:18] (03PS1) 10Cwhite: logstash: clean up unused cache clear script [puppet] - 10https://gerrit.wikimedia.org/r/725105 (https://phabricator.wikimedia.org/T144396) [20:09:21] (03PS3) 10Ryan Kemper: query_service: default oauth_settings to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) [20:10:53] (03PS4) 10Ryan Kemper: query_service: default oauth_settings to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) [20:11:20] 
(03CR) 10Ryan Kemper: query_service: default oauth_settings to {} (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:11:43] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:36] (03CR) 10Ebernhardson: [C: 03+1] query_service: default oauth_settings to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:13:39] RECOVERY - Routinator process on rpki1001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [20:14:45] RECOVERY - RPKI Validator RTR port on rpki1001 is OK: TCP OK - 0.000 second response time on 10.64.32.19 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [20:15:12] (03PS8) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [20:15:34] (03PS5) 10Ryan Kemper: query_service: default oauth_settings in gui to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) [20:16:05] PROBLEM - Disk space on rpki1001 is CRITICAL: DISK CRITICAL - free space: / 1421 MB (16% inode=1%): /tmp 1421 MB (16% inode=1%): /var/tmp 1421 MB (16% inode=1%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rpki1001&var-datasource=eqiad+prometheus/ops [20:17:25] (03PS10) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [20:17:49] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled Ryan Kemper 
looking into wcqs1003 https://wikitech.wikimedia.org/wiki/PyBal [20:17:49] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled Ryan Kemper looking into wcqs1003 https://wikitech.wikimedia.org/wiki/PyBal [20:18:32] (03CR) 10jerkins-bot: [V: 04-1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:18:44] !log [WCQS] `ryankemper@wcqs1003:~$ sudo depool` (not sure why pybal can't depool it, the other 2 servers are pooled) [20:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:51] (03CR) 10Dzahn: "This also includes https://gerrit.wikimedia.org/r/c/operations/puppet/+/721595 for just the new files now" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:19:57] PROBLEM - Routinator process on rpki1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [20:20:47] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31422/console" [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:20:51] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:21:05] PROBLEM - RPKI Validator RTR port on rpki1001 is CRITICAL: connect to address 10.64.32.19 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [20:22:03] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01026 ge 0.01 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:22:05] RECOVERY - Routinator process on rpki1001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [20:23:13] RECOVERY - RPKI Validator RTR port on rpki1001 is OK: TCP OK - 0.000 second response time on 10.64.32.19 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [20:24:05] PROBLEM - LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4 on wcqs.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.67 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:24:15] (03CR) 10Ryan Kemper: [V: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:24:35] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:25:37] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:27:43] (03PS11) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [20:28:39] (03CR) 10jerkins-bot: [V: 04-1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:30:41] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Krinkle) >>! In T240685#7392652, @gerritbot wrote: > Change 721626 **merged** by jenkins-bot: > %%%[mediawiki/core@master] Metrics: Implement statsd-exporter... 
[20:32:02] !log gitlab2001, gitlab1001: downtime for upgrades to 14.2.5 [20:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:28] (03PS12) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [20:36:17] (03PS1) 10Ryan Kemper: Revert "query_service: Parameterize oauth redirect url" [puppet] - 10https://gerrit.wikimedia.org/r/725126 [20:37:13] RECOVERY - Disk space on rpki1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rpki1001&var-datasource=eqiad+prometheus/ops [20:37:51] Looks like rpki1001 is doing what it should after restarting it with flag to force-delete and refresh all ROAs. [20:37:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "query_service: Parameterize oauth redirect url" [puppet] - 10https://gerrit.wikimedia.org/r/725126 (owner: 10Ryan Kemper) [20:38:01] Will keep working on it but seems ok. 
[20:39:31] (03PS2) 10Ryan Kemper: Revert "query_service: Parameterize oauth redirect url" [puppet] - 10https://gerrit.wikimedia.org/r/725126 [20:41:04] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: default oauth_settings in gui to {} [puppet] - 10https://gerrit.wikimedia.org/r/725104 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:46:02] (03PS1) 10Ryan Kemper: wcqs: disable oauth while fixing readiness probe [puppet] - 10https://gerrit.wikimedia.org/r/725110 (https://phabricator.wikimedia.org/T280006) [20:47:20] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31425/console" [puppet] - 10https://gerrit.wikimedia.org/r/725110 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:47:44] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31424/puppetmaster1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:48:21] 10SRE, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q2): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10lmata) [20:49:08] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] wcqs: disable oauth while fixing readiness probe [puppet] - 10https://gerrit.wikimedia.org/r/725110 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:49:14] (03Abandoned) 10Ryan Kemper: Revert "query_service: Parameterize oauth redirect url" [puppet] - 10https://gerrit.wikimedia.org/r/725126 (owner: 10Ryan Kemper) [20:49:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:49:52] !log gitlab1001: upgrade to 14.2.5 complete [20:49:55] (03CR) 10Dzahn: [V: 03+1] "found a way that does NOT involve touching all puppetmasters including cloud" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:55] (03CR) 10Juan90264: [C: 03+1] "I reviewed the Wordmark and its size, and I wait for the Code-Review to merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: 10Juan90264) [20:54:41] !log Routinator on rpki1001 upgraded to 0.10.0 and working again after force refresh. [20:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:57] !log [WCQS] `ryankemper@wcqs1003:~$ sudo pool` (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/725110 to unbreak readiness probe) [20:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:07] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:55:11] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:55:43] RECOVERY - LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4 on wcqs.svc.eqiad.wmnet is OK: OK - Certificate wcqs.discovery.wmnet will expire on Sat 29 Aug 2026 05:26:12 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:55:49] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004558 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:55:51] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:55:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:55:54] (03PS9) 10Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) [20:57:48] 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [20:59:28] (03CR) 10Dzahn: [V: 03+1] "no change on a cloud puppetmaster: https://puppet-compiler.wmflabs.org/compiler1003/31426/" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:02:13] (03PS13) 10Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [21:09:29] (03CR) 10Dzahn: "if you could confirm this just creates a new directory File[/var/lib/puppet/volatile/GeoIPInfo which is NOT the existing ./volatile/GeoIP w" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:09:51] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31424/puppetmaster2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:15:44] (03PS15) 10Juan90264: Adding and use square wordmark for trwikiquote 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [21:15:57] (03PS16) 10Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [21:17:29] (03CR) 10jerkins-bot: [V: 04-1] Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [21:21:51] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:29:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rui Huang - https://phabricator.wikimedia.org/T292258 (10rhuang) [21:50:39] (03CR) 10Tacsipacsi: Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [21:53:27] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[21:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:53] (03PS1) 10Ebernhardson: query_service: Exempt health check url from oauth [puppet] - 10https://gerrit.wikimedia.org/r/725120 [21:54:28] (03PS1) 10Legoktm: Scale up shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/725121 (https://phabricator.wikimedia.org/T289228) [21:55:34] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724163 (owner: 10PipelineBot) [21:55:36] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724162 (owner: 10PipelineBot) [21:55:40] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724165 (owner: 10PipelineBot) [21:55:47] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723654 (owner: 10PipelineBot) [21:55:49] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723653 (owner: 10PipelineBot) [22:00:33] (03CR) 10Legoktm: [C: 03+2] Scale up shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/725121 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:05:22] (03Merged) 10jenkins-bot: Scale up shellbox-media [deployment-charts] - 10https://gerrit.wikimedia.org/r/725121 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:06:29] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [22:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:54] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . 
[22:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:01] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [22:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:55] (03PS2) 10Ryan Kemper: query_service: Exempt health check url from oauth [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:13:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:17:25] (03CR) 10Bartosz Dziewoński: Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [22:19:41] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31429/console" [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:22:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) [22:22:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) @cmooney These host have come in and racked unless something has changed and these racks are correct please assign to... 
[22:24:11] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [22:25:35] (03PS3) 10Ryan Kemper: query_service: Exempt health check url from oauth [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:25:50] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:36:13] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [22:37:09] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31430/console" [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:37:40] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: Exempt health check url from oauth [puppet] - 10https://gerrit.wikimedia.org/r/725120 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [22:51:27] (03CR) 10Tacsipacsi: Change wgExtraSignatureNamespaces to not include NS_MAIN on most wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725015 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [23:00:05] brennen: That opportune time is upon us again. Time for a US Backport and Config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T2300). 
[23:00:32] o/ [23:05:32] o/ [23:22:10] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [23:35:22] (03PS15) 10Dave Pifke: webperf: connect to Kafka using TLS [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [23:39:13] !log dpifke@deploy1002 Started deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) T290131 [23:39:18] !log dpifke@deploy1002 Finished deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) T290131 (duration: 00m 05s) [23:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:19] T290131: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 [23:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:22] !log dpifke@deploy1002 Started deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) T290131 [23:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:29] !log dpifke@deploy1002 Finished deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) T290131 (duration: 01m 07s) [23:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:05] (03PS1) 10Reedy: Put a https protocol into $wgRightsUrl values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725157 [23:47:14] jouncebot: now [23:47:14] For the next 0 hour(s) and 12 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210930T2300) [23:47:17] jouncebot: next [23:47:17] In 7 hour(s) and 12 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211001T0700) [23:48:01] !log dpifke@deploy1002 Started deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) T290131 [23:48:06] !log dpifke@deploy1002 Finished deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) T290131 (duration: 00m 05s) [23:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:08] T290131: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 [23:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:21] (03CR) 10Reedy: [C: 03+2] Put a https protocol into $wgRightsUrl values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725157 (owner: 10Reedy) [23:49:05] (03Merged) 10jenkins-bot: Put a https protocol into $wgRightsUrl values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725157 (owner: 10Reedy) [23:50:37] (03PS16) 10Dave Pifke: webperf: connect to Kafka using TLS [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [23:51:06] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Put a https protocol into values (duration: 01m 00s) [23:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:34] (03PS17) 10Dave Pifke: webperf: connect to Kafka using TLS [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [23:57:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:26] (03PS18) 10Dave Pifke: webperf: connect to Kafka using TLS [puppet] - 10https://gerrit.wikimedia.org/r/721047 (https://phabricator.wikimedia.org/T290131) [23:58:51] (03PS1) 10Nray: Add new 'mediawiki.skin_diff' event logging stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622)