[00:13:03] (CR) Ssingh: [C: +1] httpbb: Update Special:FundraiserRedirector tests for new behavior [puppet] - https://gerrit.wikimedia.org/r/842013 (owner: RLazarus)
[00:15:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4045.ulsfo.wmnet with OS buster
[00:26:56] (PS2) Tim Starling: Migrate to PHP 7.4 case mapping, but retain Georgian and Eszett overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/842019 (https://phabricator.wikimedia.org/T292552)
[00:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:34:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[00:38:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4045.ulsfo.wmnet with reason: host reimage
[00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:42:03] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4045.ulsfo.wmnet with OS buster
[00:59:33] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:02:01] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:21] PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:51] RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:33:16]
(CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:46:51] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:58:22] (Abandoned) Tim Starling: Migrate to PHP 7.4 case mapping, but retain Georgian and Eszett overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/842019 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[04:59:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:18:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:21:13] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (ayounsi) p:Triage→High See https://librenms.wikimedia.org/graphs/to=1665638100/id=22639/type=port_errors/from=1665551700/ @Cmjohnson @Jclark-ctr please sync up with @cmooney or myself to proceed with the optic
replaceme...
[05:23:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:33:10] (PS1) Tim Starling: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552)
[05:33:12] (PS1) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552)
[05:40:50] (PS1) Ayounsi: Add ntc-netbox-plugin-metrics-ext plugin [software/netbox-deploy] (3-2-2) - https://gerrit.wikimedia.org/r/842244 (https://phabricator.wikimedia.org/T311052)
[05:47:59] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:04] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T0600).
[06:00:56] (PS2) Tim Starling: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552)
[06:00:58] (PS2) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552)
[06:19:10] (CR) Ayounsi: prometheus: probe mgmt network from netmon host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[06:34:33] SRE, ops-eqiad, Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey) Open→Resolved Indeed it works now, thanks!
[06:49:49] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T0700). Please do the needful.
[07:00:05] matthiasmullie: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] morning!
[07:00:31] there are no trainees signed up for the window, and we have one patch scheduled for deployment.
[07:00:55] matthiasmullie: are you a self-deployer or would you like assistance? i can't remember...
[07:22:05] matthiasmullie: you here? there's still time to get this done
[07:23:08] o/
[07:23:14] sorry!
[07:23:21] I didn't get the first ping
[07:23:29] I can self-deploy
[07:23:59] oh boy, I had a reminder in my calendar, I got pinged here, and still didn't notice :D
[07:24:02] I'll start!
[07:24:08] okay!
[07:24:14] (thanks for pinging me again!)
[07:24:21] (sure!)
[07:24:44] note that we have the new simplified deployment process now, the one command version
[07:25:34] I had my first experience with it yesterday - smooth as butter!
[07:26:02] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) (owner: Matthias Mullie)
[07:26:08] Oh, I did have a question about it
[07:26:25] I might have an answer, or I might not
[07:26:29] :-D
[07:26:35] (PS4) Matthias Mullie: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510)
[07:27:17] Since the script now self-merges patches, how does it handle already-merged or pending merge (gate-and-submit) patches?
[07:27:40] if something is already merged it will push it out
[07:27:52] if it's not already out, I mean
[07:27:55] asking for repos with slow CI, where I've routinely +2'ed half an hour ahead of time to not waste precious deployment window time
[07:28:06] yeah it will pick those up
[07:28:08] ok perfect, it just "ought to work"
[07:28:12] yup
[07:28:14] (CR) TrainBranchBot: "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) (owner: Matthias Mullie)
[07:28:25] sweet, thanks
[07:28:28] folks were very aware of that workflow, heh
[07:28:58] (Merged) jenkins-bot: Enable NS_MAIN thumbnails only on wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/841893 (https://phabricator.wikimedia.org/T320510) (owner: Matthias Mullie)
[07:29:29] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:841893|Enable NS_MAIN thumbnails only on wikipedias (T320510)]]
[07:29:34] T320510: Thumbnails are shown on Special:Search only on Wikipedias -
https://phabricator.wikimedia.org/T320510
[07:29:44] what is a thumbnail namespace, anyways? since you're here
[07:29:56] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:841893|Enable NS_MAIN thumbnails only on wikipedias (T320510)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:31:08] apergos: Special:Search will now also show thumbnails for NS_MAIN results (via Extension:PageImages) - e.g. https://en.wikipedia.org/w/index.php?search=cat&title=Special:Search&profile=advanced&fulltext=1&ns0=1
[07:31:36] oic
[07:31:40] spiffy!
[07:32:45] the "namespace" part is to ensure consistency - not all pages that could have a thumb have one, those get a placeholder (so they continue to be aligned and look similar to other results); but PageImages are only generated for some namespaces (usually only NS_MAIN - wouldn't really make sense elsewhere anyway), so we also didn't want all-thumbnails for other namespaces...
[07:34:32] gotcha, makes sense.
[07:37:54] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:841893|Enable NS_MAIN thumbnails only on wikipedias (T320510)]] (duration: 08m 24s)
[07:37:59] T320510: Thumbnails are shown on Special:Search only on Wikipedias - https://phabricator.wikimedia.org/T320510
[07:46:13] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:48:54] matthiasmullie: how's it looking?
[07:49:34] it looks like I forgot to log the end of the deployment window!
[07:49:42] lol
[07:49:45] No one else was waiting, right?
[07:49:53] well I was gonna, which is why I asked, but go ahead then
[07:50:00] nope, you were the one and only patch owner
[07:50:33] !log UTC morning backports done
[07:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:43] ty, see you next time!
[07:51:20] glad I was the only one in this window - I padded it well enough at the start & end
[07:51:24] apologies, and thanks!
[07:59:40] :-)
[08:02:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from cluster for eventual decom
[08:02:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1007.eqiad.wmnet with reason: Remove from cluster for eventual decom
[08:04:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1017.eqiad.wmnet with reason: Remove from cluster for reimage
[08:05:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1017.eqiad.wmnet with reason: Remove from cluster for reimage
[08:07:28] (CR) Elukey: [C: +2] ml-services: update Docker images after code refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/841947 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[08:07:56] (PS2) Muehlenhoff: Remove ganeti role from ganeti1007 [puppet] - https://gerrit.wikimedia.org/r/841914 (https://phabricator.wikimedia.org/T320419)
[08:11:19] !log oblivian@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:11:39] (PS10) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[08:11:50] (CR) CI reject: [V: -1] charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[08:12:02] (CR) Muehlenhoff: [C: +2] Remove ganeti role from ganeti1007 [puppet] - https://gerrit.wikimedia.org/r/841914 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff)
[08:12:12] !log oblivian@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:13:13] !log oblivian@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:13:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS buster
[08:13:45] wait, oblivian?
[08:14:18] (PS1) Majavah: P:terraform: few minor bugfixes [puppet] - https://gerrit.wikimedia.org/r/842348
[08:14:33] (CR) Filippo Giunchedi: prometheus: probe mgmt network from netmon host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:14:49] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:19] (CR) Gmodena: charts:eventgate bump common_templates and standardize labels (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[08:16:25] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudnet: merge host hiera overrides back into the profile [puppet] - https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) (owner: Arturo Borrero Gonzalez)
[08:16:40] (CR) CI reject: [V: -1] P:terraform: few minor bugfixes [puppet] - https://gerrit.wikimedia.org/r/842348 (owner: Majavah)
[08:16:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1017.eqiad.wmnet with OS bullseye
[08:16:46] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1017.eqiad.wmnet with OS bullseye
[08:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:18:06] (PS2) Majavah: P:terraform: few minor bugfixes [puppet] - https://gerrit.wikimedia.org/r/842348
[08:19:27] (PS1)
Filippo Giunchedi: mr: allow icmp from prometheus_group [homer/public] - https://gerrit.wikimedia.org/r/842350 (https://phabricator.wikimedia.org/T169860)
[08:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:24:02] (PS11) Gmodena: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[08:27:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:28:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:28:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:30:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1017.eqiad.wmnet with reason: host reimage
[08:30:45] (PS1) Majavah: toolforge: k8s: build images with explicit git version tags [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842351 (https://phabricator.wikimedia.org/T320476)
[08:31:19] (CR) Ayounsi: prometheus: probe mgmt network from netmon host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:31:50] (CR) Ayounsi: [C: +1] mr: allow icmp from prometheus_group [homer/public] - https://gerrit.wikimedia.org/r/842350 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:33:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1017.eqiad.wmnet with reason: host reimage
[08:35:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:07] (PS3) David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/841089
[08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:39:15] (PS1) Vgutierrez: trafficserver: Partition cache in one server per DC
and cluster #7 [puppet] - https://gerrit.wikimedia.org/r/842352 (https://phabricator.wikimedia.org/T317748)
[08:40:09] (CR) David Caro: [C: +2] "LGTM" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842351 (https://phabricator.wikimedia.org/T320476) (owner: Majavah)
[08:40:22] (CR) David Caro: wmcs.create_instance_with_prefix: Add a sec group default (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/841089 (owner: David Caro)
[08:42:42] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 11 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37524/console" [puppet] - https://gerrit.wikimedia.org/r/842352 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:43:53] (Merged) jenkins-bot: toolforge: k8s: build images with explicit git version tags [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842351 (https://phabricator.wikimedia.org/T320476) (owner: Majavah)
[08:44:37] (PS2) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #7 [puppet] - https://gerrit.wikimedia.org/r/842352 (https://phabricator.wikimedia.org/T317748)
[08:46:35] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:46:50] (CR) Vgutierrez: [C: +2] trafficserver: Partition cache in one server per DC and cluster #7 [puppet] - https://gerrit.wikimedia.org/r/842352 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:47:41] (CR) Filippo Giunchedi: [C: +2] mr: allow icmp from prometheus_group [homer/public] - https://gerrit.wikimedia.org/r/842350 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:48:27] !log partitioning the ATS cache in cp[2029-2030], cp[6001,6009], cp[1077-1078], cp[5002,5008], cp[3052-3053], cp4022 - T317748
[08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:33] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[08:49:14] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) @Andrew I think it's wise to proceed cautiously alright. And I've no objection to us keeping the Ceph host "public" and "cluster" NICs separate...
[08:49:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1017.eqiad.wmnet with OS bullseye
[08:49:59] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1017.eqiad.wmnet with OS bullseye completed: - ganeti1017 (**PASS**) - Downtimed on...
[08:54:39] (CR) Volans: [C: +1] "LGTM, nit inline (no need to re-review)" [software/netbox-deploy] (3-2-2) - https://gerrit.wikimedia.org/r/842244 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:56:18] (CR) Filippo Giunchedi: prometheus: probe mgmt network from netmon host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:59:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:00:52] I've acked the wikifunctions beta cert alert
[09:03:00] (PS3) Filippo Giunchedi: prometheus: probe mgmt network from netmon host [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860)
[09:03:36] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:17] (PS2) Ayounsi: Add ntc-netbox-plugin-metrics-ext plugin [software/netbox-deploy] (3-2-2) - https://gerrit.wikimedia.org/r/842244 (https://phabricator.wikimedia.org/T311052)
[09:10:35] (CR) Ayounsi: [C: +2] Add ntc-netbox-plugin-metrics-ext plugin (1 comment) [software/netbox-deploy] (3-2-2) - https://gerrit.wikimedia.org/r/842244 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[09:10:40] (CR) Ayounsi: [V: +2 C: +2] Add ntc-netbox-plugin-metrics-ext plugin [software/netbox-deploy] (3-2-2) - https://gerrit.wikimedia.org/r/842244 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[09:11:27] (PS1) Sergio Gimeno: Enable the Vue version of the mentee overview in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/842356
[09:11:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:13:05] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext
[09:13:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:13:56] (PS4) Filippo Giunchedi: prometheus: probe a sample of hosts in mgmt network [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860)
[09:14:41] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:14:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:15:09] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext (duration: 02m 04s)
[09:15:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:15:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:16:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:16:41] (CR) Filippo Giunchedi: [C: +2] prometheus: probe a sample of hosts in mgmt network [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:16:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:18:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:19:15] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[09:19:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet
[09:21:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:23:49] (PS1) Filippo Giunchedi: prometheus: temp disable mgmt checks until hiera export script is fixed [puppet] - https://gerrit.wikimedia.org/r/842357
(https://phabricator.wikimedia.org/T169860)
[09:23:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:24:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:25:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:26:42] (CR) Filippo Giunchedi: [C: +2] prometheus: temp disable mgmt checks until hiera export script is fixed [puppet] - https://gerrit.wikimedia.org/r/842357 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:26:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:27:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:28:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:28:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:28:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet
[09:28:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:32:10] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:33:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1017.eqiad.wmnet to cluster eqiad and group B
[09:34:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1017.eqiad.wmnet to cluster eqiad and group B
[09:36:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:37:10] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:38:40] (PS1) Filippo Giunchedi: customscripts: exclude decommissioning hosts from mgmt data [software/netbox-extras] -
10https://gerrit.wikimedia.org/r/842359 (https://phabricator.wikimedia.org/T310266) [09:48:46] (03PS3) 10Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) [09:49:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [09:52:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >, @MoritzMuehlenhoff wrote: > Currently it doesn't run fully non-interactive yet, there's a dialogue being prompted: //No kernel modu... [09:53:01] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one device still showing ok and no alarms for FPC errors. We can re-open if problem happens again. ` cmooney@re0.cr2-esams>... [09:56:09] (03CR) 10Kosta Harlan: [C: 04-1] Enable the Vue version of the mentee overview in all wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (owner: 10Sergio Gimeno) [09:56:45] (03CR) 10Kosta Harlan: [C: 04-1] Enable the Vue version of the mentee overview in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (owner: 10Sergio Gimeno) [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1000).
[10:01:12] (03PS1) 10Matthias Mullie: Commons files can have thumbnails too [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842373 [10:01:51] (03CR) 10Muehlenhoff: [C: 03+2] Use profile::base::use_linux510_on_buster for cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/841538 (https://phabricator.wikimedia.org/T297814) (owner: 10Muehlenhoff) [10:05:56] (03PS1) 10Matthias Mullie: Commons files can have thumbnails too [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/842374 [10:08:14] (03CR) 10Ladsgroup: [C: 03+2] Add rename_flaggedrevs_indexes_T318950.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841899 (https://phabricator.wikimedia.org/T318950) (owner: 10Ladsgroup) [10:08:48] (03Merged) 10jenkins-bot: Add rename_flaggedrevs_indexes_T318950.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841899 (https://phabricator.wikimedia.org/T318950) (owner: 10Ladsgroup) [10:11:16] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841865 (owner: 10Jgiannelos) [10:13:58] (03PS2) 10Muehlenhoff: pki: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840142 (https://phabricator.wikimedia.org/T308013) [10:15:11] (03Merged) 10jenkins-bot: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841865 (owner: 10Jgiannelos) [10:16:09] !log draining ganeti1008 T320419 [10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:14] T320419: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 [10:20:02] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10TheDJ) [10:20:37] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10TheDJ) [10:20:39] PROBLEM - Check systemd state on thanos-fe1001 is 
CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:50] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10TheDJ) [10:23:45] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) [10:25:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10dcaro) > Perhaps something like cluster re-syncing when nodes are added/remove or in failure etc? Yep, that is when we hit the throughput limit yes, if... [10:26:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:27:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:27:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:27:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:27:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [10:27:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [10:27:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T318950)', diff saved to https://phabricator.wikimedia.org/P35452 and previous config saved to /var/cache/conftool/dbconfig/20221013-102757-ladsgroup.json [10:28:02] T318950: Fix renamed indexes of flaggedrevs table in production - 
https://phabricator.wikimedia.org/T318950 [10:30:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318950)', diff saved to https://phabricator.wikimedia.org/P35453 and previous config saved to /var/cache/conftool/dbconfig/20221013-103015-ladsgroup.json [10:32:57] (03CR) 10Sergio Gimeno: GrowthExperiments: enable the Vue version of the mentee overview in all wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [10:33:56] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10Arian_Bozorg) Hi @KFrancis, I just sent over my email address. let me know if you need anything else. [10:35:50] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [10:35:59] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: enable the Vue version of the mentee overview in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [10:36:21] (03CR) 10DCausse: "commented on elasticsearch jvm.options but same comments might apply to opensearch as well, unsure about logstash ones tho" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [10:36:37] (03CR) 10Kosta Harlan: "so, did this end up getting synced? 
I guess if we don't see a bunch of production errors by now then we don't need it anyway" [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327) (owner: 10Urbanecm) [10:39:09] (03CR) 10Muehlenhoff: [C: 03+2] pki: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840142 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:39:39] (03PS3) 10Muehlenhoff: logstash: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) [10:45:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P35454 and previous config saved to /var/cache/conftool/dbconfig/20221013-104521-ladsgroup.json [10:45:59] (03CR) 10Muehlenhoff: [C: 03+2] logstash: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840145 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:50:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10Volans) Just as an additional datapoint, if you connect to the console and answer the question while the cookbook is running, it will happily continue onc...
[11:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P35455 and previous config saved to /var/cache/conftool/dbconfig/20221013-110028-ladsgroup.json [11:00:38] (03PS1) 10Jbond: smart_data_dump: call raid fact directly [puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) [11:01:07] (03PS1) 10Ayounsi: Re-generate wheels with pip list [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842417 (https://phabricator.wikimedia.org/T311052) [11:01:51] (03PS1) 10Cparle: Added structured data team [puppet] - 10https://gerrit.wikimedia.org/r/842418 (https://phabricator.wikimedia.org/T312235) [11:03:33] (03CR) 10Muehlenhoff: [C: 03+2] docker: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/839447 (owner: 10Muehlenhoff) [11:04:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842417 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [11:04:31] (03PS2) 10Muehlenhoff: tlsproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837096 (https://phabricator.wikimedia.org/T308013) [11:04:58] (03PS10) 10Jbond: systemd::override: Add new helper define for overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 [11:05:17] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Re-generate wheels with pip list [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842417 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [11:06:09] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels [11:06:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37525/console" [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [11:07:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [11:10:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:19] (03PS1) 10Cparle: Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) [11:12:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:15:19] (03CR) 10CI reject: [V: 04-1] Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [11:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318950)', diff saved to https://phabricator.wikimedia.org/P35457 and previous config saved to /var/cache/conftool/dbconfig/20221013-111534-ladsgroup.json [11:15:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:15:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T318950)', diff saved to https://phabricator.wikimedia.org/P35458 and previous config saved to /var/cache/conftool/dbconfig/20221013-111556-ladsgroup.json [11:16:11] !ops logmsgbot is trying to log stuff, but stashbot just quit [11:18:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318950)', diff saved to https://phabricator.wikimedia.org/P35460 and previous config saved to /var/cache/conftool/dbconfig/20221013-111814-ladsgroup.json [11:21:38] (03CR) 10Jbond: [C: 03+2] sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 
(owner: 10Jbond) [11:21:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::override: Add new helper define for overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [11:22:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::override: Add new helper define for overrides (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [11:23:00] (03PS1) 10Ayounsi: Netbox: fix TZdata bug [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842433 (https://phabricator.wikimedia.org/T311052) [11:24:10] I’ll repeat the five missed logs then [11:24:10] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842433 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [11:24:25] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Netbox: fix TZdata bug [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/842433 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [11:24:39] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels (duration: 18m 30s) [11:24:49] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels [11:25:49] Lucas_WMDE: I'm pretty sure the bot just crashed again, it's not picking up these logs [11:26:04] Sariboo: I see them on sal.toolforge.org [11:26:17] it’s just not acknowledging logmsgbot anymore, that was changed a few weeks (months?) 
ago [11:26:24] to reduce the amount of messages [11:26:35] Ok, I was unaware of that change [11:26:42] Good to know, thanks [11:26:52] !log repeating five messages that got missed due to stashbot quit [11:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:02] !log 11:15 ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T318950)', diff saved to https://phabricator.wikimedia.org/P35457 and previous config saved to /var/cache/conftool/dbconfig/20221013-111534-ladsgroup.json [11:27:02] !log 11:15 ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:27:02] !log 11:15 ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [11:27:02] !log 11:15 ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T318950)', diff saved to https://phabricator.wikimedia.org/P35458 and previous config saved to /var/cache/conftool/dbconfig/20221013-111556-ladsgroup.json [11:27:02] !log 11:18 ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318950)', diff saved to https://phabricator.wikimedia.org/P35460 and previous config saved to /var/cache/conftool/dbconfig/20221013-111814-ladsgroup.json [11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:06] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:18] there it is :) [11:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:28] \o/ [11:28:23] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: 
Deploy ntc-netbox-plugin-metrics-ext and updated wheels (duration: 03m 34s) [11:28:30] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels [11:30:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1005.eqiad.wmnet to drbd [11:31:43] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels (duration: 03m 12s) [11:31:48] (03PS2) 10David Caro: wmcs.ceph.upgrade_osds: allow specifying the osds to upgrade [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838835 (https://phabricator.wikimedia.org/T309786) [11:31:50] (03PS1) 10David Caro: ceph: add missing alert to downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/842435 [11:33:02] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels [11:33:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P35461 and previous config saved to /var/cache/conftool/dbconfig/20221013-113320-ladsgroup.json [11:37:44] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Deploy ntc-netbox-plugin-metrics-ext and updated wheels (duration: 04m 41s) [11:40:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1005.eqiad.wmnet to drbd [11:42:11] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Sorry had missed it." [homer/public] - 10https://gerrit.wikimedia.org/r/829558 (owner: 10Ayounsi) [11:45:07] (03CR) 10Cathal Mooney: [C: 03+1] "Good call, is a lot clearer this way." 
[homer/public] - 10https://gerrit.wikimedia.org/r/767476 (owner: 10Ayounsi) [11:46:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1005.eqiad.wmnet to plain [11:46:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1005.eqiad.wmnet to plain [11:48:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P35462 and previous config saved to /var/cache/conftool/dbconfig/20221013-114827-ladsgroup.json [11:55:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:05] (03CR) 10Jbond: "I have also often wondered the same however it seems that this is mostly used by analytics and wmcs who dont necessarily get the "Check sy" [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [11:58:42] !log installing curl security updates on buster [11:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:00] (03CR) 10David Caro: systemd: drop timer-specific alert in favor of generic alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [12:02:39] !log restarting FPM/Apache on mediawiki canaries 
to pick up new curl [12:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T318950)', diff saved to https://phabricator.wikimedia.org/P35464 and previous config saved to /var/cache/conftool/dbconfig/20221013-120334-ladsgroup.json [12:03:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:03:38] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:03:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:03:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T318950)', diff saved to https://phabricator.wikimedia.org/P35465 and previous config saved to /var/cache/conftool/dbconfig/20221013-120356-ladsgroup.json [12:06:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318950)', diff saved to https://phabricator.wikimedia.org/P35466 and previous config saved to /var/cache/conftool/dbconfig/20221013-120613-ladsgroup.json [12:14:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway) [12:17:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) Ok. Yeah there are some larger spikes around that time. Biggest on cloudcephosd1026 on Aug 18th. {F35565980} But still fairly comfortably wit... 
[12:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P35467 and previous config saved to /var/cache/conftool/dbconfig/20221013-122120-ladsgroup.json [12:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P35468 and previous config saved to /var/cache/conftool/dbconfig/20221013-123626-ladsgroup.json [12:37:21] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [12:37:27] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:43:50] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [12:45:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001" [12:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T318950)', diff saved to https://phabricator.wikimedia.org/P35469 and previous config saved to /var/cache/conftool/dbconfig/20221013-125133-ladsgroup.json [12:51:35] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [12:51:38] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:51:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [12:51:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T318950)', diff saved to https://phabricator.wikimedia.org/P35470 and previous config saved to /var/cache/conftool/dbconfig/20221013-125154-ladsgroup.json [12:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318950)', diff saved to https://phabricator.wikimedia.org/P35471 and previous config saved to /var/cache/conftool/dbconfig/20221013-125412-ladsgroup.json [12:56:32] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:57:42] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1300). [13:00:05] matthiasmullie and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:33] hello [13:00:54] o/ [13:05:55] I can get started with my patch already [13:06:42] starting [13:06:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/842374 (owner: 10Matthias Mullie) [13:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P35472 and previous config saved to /var/cache/conftool/dbconfig/20221013-130918-ladsgroup.json [13:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:46] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:15:48] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:16:02] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:16:54] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:46] (03Merged) 10jenkins-bot: Commons files can have thumbnails too [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/842374 (owner: 10Matthias Mullie) [13:22:10] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:842374|Commons files can have thumbnails too]] [13:22:35] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:842374|Commons files can have thumbnails too]] synced to the testservers: 
mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:22:36] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10BBlack) Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our repo? Or does it have to come straight from debian? [13:23:09] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ssingh) >>! In T319067#8313691, @MoritzMuehlenhoff wrote: >>, @MoritzMuehlenhoff wrote: >> Currently it doesn't run fully non-interactive yet, there's a... [13:23:20] (03CR) 10Matthias Mullie: [C: 03+2] Commons files can have thumbnails too [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842373 (owner: 10Matthias Mullie) [13:24:01] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:24:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P35473 and previous config saved to /var/cache/conftool/dbconfig/20221013-132425-ladsgroup.json [13:24:51] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:25:44] Any deployer around for sergi0_, or can you self-deploy? (if none of those options are available, I can probably do yours) [13:26:00] (03PS1) 10Jbond: Merge remote-tracking branch 'gerrit/3-2-2' [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/842445 [13:26:08] (note - still working though my own patches atm) [13:27:23] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10cmooney) @ayounsi thanks. And yes a good opportunity to change the MTU. For now I've changed the OSPF metric either side to drain the link until we can swap out the optic. 
` cmooney@lsw1-e3-eqiad> show configuration protocols... [13:27:26] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:842374|Commons files can have thumbnails too]] (duration: 05m 15s) [13:29:40] I moved my patch to the late UTC window since I would need assistance. And requested the training so I can make myself more helpful :) [13:29:46] o/ [13:30:04] I could deploy now (or once matthiasmullie is done) [13:31:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] Merge remote-tracking branch 'gerrit/3-2-2' [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/842445 (owner: 10Jbond) [13:31:13] CI for current patch ETA 9min [13:31:52] Lucas_WMDE: Let's deploy now then if you can help since the patch is relevant to unblock things in our team. ty [13:33:22] (03CR) 10FNegri: [C: 03+1] "Thanks, this is useful!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/842435 (owner: 10David Caro) [13:34:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842373 (owner: 10Matthias Mullie) [13:34:18] ok [13:34:22] * Lucas_WMDE looks [13:36:35] still waiting for matthiasmullie’s patch though (Zuul ETA now 4 min) [13:36:45] there should be enough time left in the window afterwards [13:36:47] (03CR) 10Urbanecm: [C: 03+1] "would work, a minor improvement suggested inline." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:37:59] great [13:38:42] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "AddContributeCardEntryPoint: Use RequestContext::getMain" (031 comment) [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841872 (https://phabricator.wikimedia.org/T319327) (owner: 10Urbanecm) [13:39:16] urbanecm: are you deploying? 
[13:39:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T318950)', diff saved to https://phabricator.wikimedia.org/P35477 and previous config saved to /var/cache/conftool/dbconfig/20221013-133931-ladsgroup.json [13:39:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:39:36] Lucas_WMDE: nope, i added C+2 on a different day [13:39:37] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:39:38] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) [13:39:45] * Lucas_WMDE looks [13:39:46] wikibugs confusingly displays it for some reason [13:39:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:39:50] oh right [13:39:50] just pushed urbanecm suggestion [13:39:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T318950)', diff saved to https://phabricator.wikimedia.org/P35478 and previous config saved to /var/cache/conftool/dbconfig/20221013-133953-ladsgroup.json [13:39:55] any new comment shows up with the same votes [13:39:57] ok then [13:40:03] makes sense [13:40:17] (03Merged) 10jenkins-bot: Commons files can have thumbnails too [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842373 (owner: 10Matthias Mullie) [13:40:27] (03CR) 10Sergio Gimeno: GrowthExperiments: enable the Vue version of the mentee overview in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:40:29] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) 
(owner: 10Sergio Gimeno) [13:40:36] thanks sergi0_! [13:40:40] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:842373|Commons files can have thumbnails too]] [13:40:52] (03CR) 10DCausse: "you probably to install promtools locally to ease testing (the doc at https://wikitech.wikimedia.org/wiki/Alertmanager is awesome btw)" [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [13:41:03] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:842373|Commons files can have thumbnails too]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318950)', diff saved to https://phabricator.wikimedia.org/P35479 and previous config saved to /var/cache/conftool/dbconfig/20221013-134211-ladsgroup.json [13:42:53] (03CR) 10Ottomata: [C: 03+2] charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [13:45:23] (03CR) 10Urbanecm: [C: 03+1] "two questions left inline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:45:33] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:842373|Commons files can have thumbnails too]] (duration: 04m 53s) [13:45:41] Lucas_WMDE: I'm done [13:45:45] thanks! 
[13:46:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:46:29] dammit [13:46:32] (03CR) 10Lucas Werkmeister (WMDE): GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:46:34] (03Merged) 10jenkins-bot: charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [13:46:40] I still need to get into a habit of using scap backport [13:46:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:46:56] (I still find the name confusing btw) [13:47:09] (03Merged) 10jenkins-bot: GrowthExperiments: enable the Vue version of the mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:47:34] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:842356|GrowthExperiments: enable the Vue version of the mentee overview in all wikis (T300532)]] [13:47:40] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532 [13:47:47] (03PS4) 10Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) [13:47:57] Lucas_WMDE: not sure what the dammit was meant to mean, but it's not an issue with scap backport to manually +2. in fact, it's sometimes helpful (esp. 
with actual backports, that take a while to finish). [13:47:58] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and sgimeno: Backport for [[gerrit:842356|GrowthExperiments: enable the Vue version of the mentee overview in all wikis (T300532)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:48:17] the only thing to be aware of is that scap backport will sync _all_ merged changes [13:48:55] (it will warn you if it is about to sync something else than you specified as the argument, but it doesn't know how to sync only the specified patch, regardless of what else got merged in the meanwhile) [13:49:01] mhm [13:49:25] (03CR) 10Stang: logos: Document how to update wordmark/tagline via manage.py (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:49:37] sergi0_: I don’t think I can test this myself [13:49:54] I would probably need to be a mentor ^^ [13:50:02] (03CR) 10Urbanecm: [C: 03+1] "Thanks for the quick change! LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [13:50:06] (03PS5) 10Stang: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) [13:50:38] i can help with testing if needed [13:50:42] You would need to enroll as a mentor to test the UI. I can do that. 
Checking MW logs will be helpful since we enable some php feature with this [13:51:02] I can look at the MW logs [13:51:18] if you’re loading the page and testing the UI [13:51:33] (nothing in logstash so far, it seems) [13:51:42] (03PS1) 10Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) [13:51:51] enwiki production is using vue [13:52:35] sergi0_: i found a regression though :/ [13:52:45] shouldn’t it still use the old version? (except on mwdebug) [13:52:58] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [13:53:00] ah, yes, i meant production as in "not beta enwiki" [13:53:01] urbanecm: is it? It's not loading for me [13:53:11] mwdebug only of course [13:53:14] ok :) [13:53:22] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [13:53:30] urbanecm: what's the regression? [13:53:35] sergi0_: it does load for me! 
https://usercontent.irccloud-cdn.com/file/3sQOecQq/image.png [13:53:46] (03CR) 10CI reject: [V: 04-1] wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:54:08] (03PS1) 10Ssingh: interface: add dependency on ethtool for interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/842455 [13:54:13] sergi0_: if you submit an empty string under "Maximum", it is replaced with 500 (the default) on the next load [13:54:17] urbanecm: also for me in mwdebug :) I don't have mentees though to test more [13:54:57] previously, an empty string for the editcount filters was persisted for next load [13:55:03] and resulted in the filter not being used [13:55:20] meh, I thought we had fixed that one :( [13:55:28] Lucas_WMDE: better not deploy it [13:55:36] the actual behavior of empty string in those filters is fine, it's just it doesn't persist :/ [13:55:39] perhaps it’s fixed but only for the next train? [13:55:55] sergi0_: alright, I’ll stop the deploy [13:56:02] let’s see what scap backport does then, haven’t done it before [13:56:14] ideally it should probably upload a fresh revert to Gerrit [13:56:17] !log lucaswerkmeister-wmde@deploy1002 Sync cancelled. 
[13:56:29] doesn’t look like it did lol [13:56:40] (03PS2) 10Raymond Ndibe: wmcs: format and refactor maintain-dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/842454 (https://phabricator.wikimedia.org/T304040) [13:56:46] Lucas_WMDE: you need to do scap backport --revert uri [13:56:51] ah ok [13:56:54] was about to revert in gerrit [13:57:03] it should do it for you, it just needs to be told [13:57:04] thanks [13:57:11] (03PS1) 10TrainBranchBot: Revert "GrowthExperiments: enable the Vue version of the mentee overview in all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842456 [13:57:13] (03CR) 10TrainBranchBot: "lucaswerkmeister-wmde@deploy1002 created a revert of this change as I37cc80f02d5ab6fc9c54e6eaee5b865f2b6d397e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842356 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P35480 and previous config saved to /var/cache/conftool/dbconfig/20221013-135718-ladsgroup.json [13:57:25] If we had some AI and ML, it'd know what you wanted it to do... [13:57:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842456 (owner: 10TrainBranchBot) [13:58:15] anyway, I'll go fill the regression in phab [13:58:23] (03Merged) 10jenkins-bot: Revert "GrowthExperiments: enable the Vue version of the mentee overview in all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842456 (owner: 10TrainBranchBot) [13:58:27] thanks [13:58:48] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:842456|Revert "GrowthExperiments: enable the Vue version of the mentee overview in all wikis"]] [13:58:57] yes must be another regression.. Thank you both for the assistance! 
[13:59:10] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and trainbranchbot: Backport for [[gerrit:842456|Revert "GrowthExperiments: enable the Vue version of the mentee overview in all wikis"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:59:38] (03CR) 10BBlack: [C: 03+1] interface: add dependency on ethtool for interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/842455 (owner: 10Ssingh) [13:59:45] sergi0_: can you quickly check that it’s undeployed again on mwdebug? [14:00:19] Lucas_WMDE: it is, all good. [14:00:23] thanks [14:01:45] sergi0_: filled as T320728 [14:01:46] T320728: Mentee overview(vue): Empty string in "Maximum"/"Minimum" filter options is not persisted - https://phabricator.wikimedia.org/T320728 [14:01:58] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:02:26] urbanecm: thank you! [14:03:35] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10gmodena) The patch has been reviewed and merged. @Ottomata and @dcausse helped me out with a deplo... 
[14:04:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:842456|Revert "GrowthExperiments: enable the Vue version of the mentee overview in all wikis"]] (duration: 05m 49s) [14:04:49] !log UTC afternoon backport+config window done [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:13] * urbanecm merges a docs-only change now [14:05:24] (03PS6) 10Urbanecm: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [14:05:27] (03CR) 10Urbanecm: [C: 03+2] "docs-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [14:06:41] !log urbanecm@deploy1002 backport aborted: (duration: 00m 04s) [14:06:56] !log running puppet/utils/pcc_update_facts.py to update nodes [14:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:08] (03Merged) 10jenkins-bot: logos: Document how to update wordmark/tagline via manage.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842010 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [14:07:31] * urbanecm done [14:07:54] thanks urbanecm [14:07:54] (03CR) 10JHathaway: [C: 03+2] rsyslog::conf remove trailing newline logic [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [14:08:15] koi: thanks for making T307705 happen :) [14:08:15] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - https://phabricator.wikimedia.org/T307705 [14:08:33] is there anything else to do, or can we close that task? 
[14:08:52] yeah, I thought it could be closed [14:09:19] done :) [14:09:35] (03CR) 10Ssingh: [C: 03+2] interface: add dependency on ethtool for interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/842455 (owner: 10Ssingh) [14:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P35481 and previous config saved to /var/cache/conftool/dbconfig/20221013-141224-ladsgroup.json [14:18:13] (03CR) 10Andrew Bogott: [C: 03+2] P:terraform: few minor bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/842348 (owner: 10Majavah) [14:25:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:25:44] here [14:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:25:57] here [14:26:29] here [14:26:36] o/ [14:26:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [14:27:02] mostly eqsin, but some eqiad too [14:27:30] brief hit on esams, shorter [14:27:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T318950)', diff saved to https://phabricator.wikimedia.org/P35482 and previous config saved to /var/cache/conftool/dbconfig/20221013-142730-ladsgroup.json [14:27:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:27:47] ACKed for now [14:27:56] maybe move over to -security [14:28:10] yep [14:30:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has 
reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:30:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:35:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:36:00] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4045.ulsfo.wmnet,service=ats-be [14:36:00] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4045.ulsfo.wmnet,service=ats-tls [14:36:01] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4045.ulsfo.wmnet,service=varnish-fe [14:36:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=ats-be [14:36:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=ats-tls [14:36:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4045.ulsfo.wmnet,service=varnish-fe [14:50:04] (03PS1) 10Elukey: ml-services: add EventGate settings for articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/842465 (https://phabricator.wikimedia.org/T320374) [14:50:47] (03PS3) 10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) [14:50:58] (03CR) 10CI reject: [V: 04-1] Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:52:39] (03PS4) 
10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) [14:53:35] (03CR) 10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:53:51] (03PS5) 10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) [14:55:46] (03CR) 10JHathaway: [C: 03+2] otrs_aliases.py: add postfix support [puppet] - 10https://gerrit.wikimedia.org/r/841950 (owner: 10JHathaway) [14:56:50] (03CR) 10Elukey: [C: 03+2] ml-services: add EventGate settings for articletopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/842465 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [14:59:15] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) [14:59:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:01:05] (03CR) 10Filippo Giunchedi: "Ben, I'd like to remove the timer-specific alert and rely on the generic "check systemd state" one instead. 
After an audit it looks like t" [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [15:02:34] (03Abandoned) 10Faidon Liambotis: Move tests/unit -> tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485714 (owner: 10Faidon Liambotis) [15:02:36] (03Abandoned) 10Faidon Liambotis: Add a bunch more tests [software/keyholder] - 10https://gerrit.wikimedia.org/r/485715 (owner: 10Faidon Liambotis) [15:02:38] (03Abandoned) 10Faidon Liambotis: Properly setup logging when /dev/log doesn't exist [software/keyholder] - 10https://gerrit.wikimedia.org/r/485724 (owner: 10Faidon Liambotis) [15:02:40] (03Abandoned) 10Faidon Liambotis: Test key and config file parsing using test data [software/keyholder] - 10https://gerrit.wikimedia.org/r/485716 (owner: 10Faidon Liambotis) [15:02:42] (03Abandoned) 10Faidon Liambotis: Add a (very basic) test using OpenSSH's ssh-add [software/keyholder] - 10https://gerrit.wikimedia.org/r/485717 (owner: 10Faidon Liambotis) [15:02:44] (03Abandoned) 10Faidon Liambotis: Add tests for OSError when loading config files [software/keyholder] - 10https://gerrit.wikimedia.org/r/485718 (owner: 10Faidon Liambotis) [15:02:46] (03Abandoned) 10Faidon Liambotis: Make all SshAgentConfig's methods instance methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/485719 (owner: 10Faidon Liambotis) [15:02:50] (03Abandoned) 10Faidon Liambotis: Add SshKeyBlob per RFC 4253 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485720 (owner: 10Faidon Liambotis) [15:03:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[15:03:00] (03Abandoned) 10Faidon Liambotis: Add a pylint tox environment [software/keyholder] - 10https://gerrit.wikimedia.org/r/485707 (owner: 10Faidon Liambotis) [15:03:02] (03Abandoned) 10Faidon Liambotis: Add a tox environment for Construct 2.8.16 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485708 (owner: 10Faidon Liambotis) [15:03:04] (03Abandoned) 10Faidon Liambotis: Update tox.ini to facilitate parallel builds [software/keyholder] - 10https://gerrit.wikimedia.org/r/485709 (owner: 10Faidon Liambotis) [15:03:06] (03Abandoned) 10Faidon Liambotis: protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [15:03:08] (03Abandoned) 10Faidon Liambotis: Bump minimum Python to 3.5; also test with 3.7 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485706 (owner: 10Faidon Liambotis) [15:03:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:07:29] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) @Clement_Goubert and I will deploy the rest on Monday. 
[15:08:01] (03Abandoned) 10Faidon Liambotis: authdns: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/354073 (owner: 10Faidon Liambotis) [15:08:35] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:08:36] (03Abandoned) 10Faidon Liambotis: wmflib/hiera: wrap long lines [puppet] - 10https://gerrit.wikimedia.org/r/403700 (owner: 10Faidon Liambotis) [15:12:30] (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #8 [puppet] - 10https://gerrit.wikimedia.org/r/842486 (https://phabricator.wikimedia.org/T317748) [15:14:46] (03CR) 10Faidon Liambotis: "I still think this idea has some kind of merit & value in de-duplicating, but given it's been sitting in Gerrit since 2015, I think I'll a" [dns] - 10https://gerrit.wikimedia.org/r/223059 (owner: 10Faidon Liambotis) [15:14:51] (03Abandoned) 10Faidon Liambotis: (WIP) Make project domains template-based/DRY [dns] - 10https://gerrit.wikimedia.org/r/223059 (owner: 10Faidon Liambotis) [15:17:05] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37528/console" [puppet] - 10https://gerrit.wikimedia.org/r/842486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [15:17:29] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:18:11] (03Abandoned) 10Faidon Liambotis: raid: split parts of raid into raid::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/357993 (owner: 10Faidon Liambotis) [15:19:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:19:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet 
with reboot policy FORCED [15:19:56] (03Abandoned) 10Jbond: Revert "role::exim: update config to drop ldap validation" [puppet] - 10https://gerrit.wikimedia.org/r/739497 (owner: 10Jbond) [15:20:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:51] (03Abandoned) 10Jbond: Add a default Apache 2.0 license [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [15:23:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:24:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:27:52] (03Abandoned) 10Faidon Liambotis: TLS settings for public exim4 [puppet] - 10https://gerrit.wikimedia.org/r/335232 (owner: 10BBlack) [15:28:59] (03Abandoned) 10Faidon Liambotis: mx: strengthen exim tls_require_ciphers [puppet] - 10https://gerrit.wikimedia.org/r/458061 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [15:29:18] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #8 [puppet] - 10https://gerrit.wikimedia.org/r/842486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [15:29:24] (03Abandoned) 10Faidon Liambotis: Add example systemd service file [software/keyholder] - 10https://gerrit.wikimedia.org/r/473270 (owner: 10Thcipriani) [15:29:32] !log partitioning the ATS cache in cp[2027-2028], cp[1075-1076], cp5007, cp[3050-3051] - T317748 [15:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:37] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [15:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident 
Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) @Vgutierrez Would like to move connections can You depooled Servers so connection can move this afternoon?? [15:30:09] (03PS2) 10Jbond: smart_data_dump: call raid fact directly [puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) [15:30:20] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10Jclark-ctr) @cmooney I will be available in 1 hour if you are still online [15:31:13] (03Abandoned) 10Faidon Liambotis: profile::base: run the apt configuration before anything else [puppet] - 10https://gerrit.wikimedia.org/r/404305 (owner: 10Giuseppe Lavagetto) [15:37:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Jclark-ctr we need to do lvs1020 first, and after it's done we can depool lvs1017, but we cannot depool both lvs instances at th... [15:38:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:43:48] (03PS1) 10Cwhite: Revert "smart: restore get_fact and deprecate get_raid_drivers" [puppet] - 10https://gerrit.wikimedia.org/r/842471 (https://phabricator.wikimedia.org/T320636) [15:47:04] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:47:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10cmooney) @Jclark-ctr yes I'm around, I'll drop you a line on irc too thanks. 
[15:49:04] (03PS15) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:49:06] (03PS4) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [15:49:52] (03PS3) 10Cwhite: raid: use raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) (owner: 10Jbond) [15:51:09] (03PS1) 10Elukey: admin_ng: set higher circuit breaking limits for EventGate on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/842494 (https://phabricator.wikimedia.org/T320374) [15:51:24] (03PS16) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:51:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:48] (03CR) 10David Caro: [C: 03+2] ceph: add missing alert to downtime (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/842435 (owner: 10David Caro) [15:55:41] (03CR) 10AOkoth: [C: 03+2] admin: add bmansurov to analytics-research-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761477 (https://phabricator.wikimedia.org/T301215) (owner: 10AOkoth) [15:56:37] (03CR) 10AOkoth: vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [15:57:21] (03CR) 10AOkoth: [C: 03+2] gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [15:57:27] (03CR) 10Cwhite: [C: 03+2] Revert "smart: restore get_fact and deprecate get_raid_drivers" [puppet] - 10https://gerrit.wikimedia.org/r/842471 
(https://phabricator.wikimedia.org/T320636) (owner: 10Cwhite) [15:58:11] (03CR) 10David Caro: [C: 03+2] wmcs.ceph.upgrade_osds: allow specifying the osds to upgrade [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838835 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [15:58:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:11] !Adjusting MTU on link from lsw1-e3-eqiad to lsw1-f1-eqiad (drained in advance) [15:59:19] !log Adjusting MTU on link from lsw1-e3-eqiad to lsw1-f1-eqiad (drained in advance) [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:01] (03Merged) 10jenkins-bot: wmcs.ceph.upgrade_osds: allow specifying the osds to upgrade [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/838835 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [16:02:03] (03Merged) 10jenkins-bot: ceph: add missing alert to downtime [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/842435 (owner: 10David Caro) [16:05:41] 10SRE, 10Traffic: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (10Vgutierrez) [16:05:53] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (10Vgutierrez) 05Open→03Resolved After partitioning the ATS cache in the whole fleet c... 
[16:06:52] (03PS1) 10Jbond: wmflib::ansi: add new ansi formatting function [puppet] - 10https://gerrit.wikimedia.org/r/842496 [16:06:54] (03PS1) 10Jbond: motd::script: update redfine to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [16:06:56] (03PS1) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [16:07:28] (03CR) 10CI reject: [V: 04-1] wmflib::ansi: add new ansi formatting function [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [16:07:53] 10SRE, 10Infrastructure-Foundations, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10cmooney) Just FYI I've adjusted one of the links on the row E/F switches now. Quick run-down of process: # Drain link by chaning OSPF interface cost both sides: ** `set protocols ospf area 0.0.... [16:08:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37529/console" [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [16:08:57] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:09:41] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:10:36] (03PS2) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [16:11:58] (03CR) 10Cwhite: [C: 03+1] "The new fact interface appears identical to the legacy fact." 
[puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) (owner: 10Jbond) [16:14:12] (03PS1) 10BBlack: Add wikifunctions.org to exim domains [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) [16:17:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [16:19:31] (03PS2) 10Jbond: wmflib::ansi: add new ansi formatting function [puppet] - 10https://gerrit.wikimedia.org/r/842496 [16:20:11] (03PS2) 10Jbond: motd::script: update redfine to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [16:20:16] (03PS3) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [16:21:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37531/console" [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [16:22:31] (03PS3) 10Jbond: wmflib::ansi: add new ansi formatting function [puppet] - 10https://gerrit.wikimedia.org/r/842496 [16:22:35] (03PS1) 10BBlack: wikifunctions.org: add temp DCV TXT record [dns] - 10https://gerrit.wikimedia.org/r/842500 (https://phabricator.wikimedia.org/T313227) [16:22:57] (03PS3) 10Jbond: motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [16:23:04] (03PS4) 10Jbond: motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [16:23:23] (03PS4) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [16:23:45] (03CR) 
10BBlack: [C: 03+2] wikifunctions.org: add temp DCV TXT record [dns] - 10https://gerrit.wikimedia.org/r/842500 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [16:23:59] (03CR) 10JHathaway: "unfortunately there is a bit more, you need to add the domain to our secret puppet repo in /srv/private/modules/privateexim/manifests/init" [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [16:24:07] (03CR) 10CI reject: [V: 04-1] motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [16:24:13] (03CR) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [16:24:36] (03PS5) 10Jbond: motd::script: update define to all interpreted strings [puppet] - 10https://gerrit.wikimedia.org/r/842497 (https://phabricator.wikimedia.org/T320696) [16:24:46] (03PS5) 10Jbond: P:netbox::host: create a motd for the status [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [16:26:57] !log draining ganeti1008 T320419 [16:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:01] T320419: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 [16:27:34] (03PS2) 10Cparle: Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) [16:27:41] (03PS1) 10BBlack: Revert "wikifunctions.org: add temp DCV TXT record" [dns] - 10https://gerrit.wikimedia.org/r/842501 (https://phabricator.wikimedia.org/T313227) [16:28:38] (03CR) 10BBlack: [C: 03+2] Revert "wikifunctions.org: add temp DCV TXT record" [dns] - 10https://gerrit.wikimedia.org/r/842501 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [16:29:17] PROBLEM 
- MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:30:06] (03CR) 10CI reject: [V: 04-1] Alert for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/842420 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [16:30:43] 10SRE, 10LDAP, 10SecTeam-Processed: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10mmartorana) >>! In T240870#5764088, @Peachey88 wrote: > Do we need a over-all wmf group at all? Would a group per service be better for a granularized access point of view and... [16:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:40:03] (03CR) 10BBlack: Add wikifunctions.org to exim domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [16:46:53] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10Jclark-ctr) a:03Jclark-ctr [16:47:14] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10Jclark-ctr) Replaced optic on E3 port 55 [16:50:30] 
!log disable Puppet and stop Pybal on lvs1020: T286881 [16:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:35] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 [16:50:51] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:52:20] this is expected, see above ^ [16:53:12] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-10-13-111722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842504 [16:54:07] PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:59:39] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=120) https://wikitech.wikimedia.org/wiki/PyBal [17:00:04] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1700). 
[17:02:52] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-10-13-111722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842504 (owner: 10BryanDavis) [17:03:13] PROBLEM - Host lvs1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:47] RECOVERY - Host lvs1020 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:06:51] * bd808 waits patiently for his patch to show up on deployment.eqiad.wmnet [17:07:13] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-10-13-111722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842504 (owner: 10BryanDavis) [17:07:41] PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:09:29] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:09:50] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:09:57] RECOVERY - pybal on lvs1020 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:10:12] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:10:31] !log disable Puppet and stop Pybal on lvs1017: T286881 [17:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:35] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 [17:10:41] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:10:51] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:11:19] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:11:53] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:06] !log 
disabling puppet on gitlab-runner1002 to debug jwt auth failure [17:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:09] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 120 connections established with conf1007.eqiad.wmnet:4001 (min=120) https://wikitech.wikimedia.org/wiki/PyBal [17:12:14] please ignore lvs1017 alerts. I don't want to downtime it so let's just pretend they are not there :) [17:16:43] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:16:57] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:16:59] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:19:53] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:22:41] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T320645 (10cmooney) 05Open→03Resolved Interface has been brought back up and traffic put on it again, showing error free after ~30mins so closing out task. Thanks @Jclark-ctr for getting it sorted out quickly :) [17:24:16] 10SRE, 10Infrastructure-Foundations, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10cmooney) Ok I've fixed the MTUs for all the underlay / switch to switch links in the new cage now. All that remains on those are the uplink sub-ints to the CRs, which for some reason are at 9174... 
[17:25:59] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:26:01] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:26:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) Moved two connections cableid stayed the same lvs1017 is connected to asw2-b4-eqiad on old port xe-4/0/15 Cableid 4801, New asw... [17:26:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) a:03Jclark-ctr [17:28:17] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Jclark-ctr) @Papaul I can make myself available if @Cmjohnson cant [17:29:11] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:32:11] 10SRE, 10observability, 10serviceops, 10Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10RLazarus) p:05Triage→03Medium [17:36:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:25] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:16] 10SRE, 10Maps, 10observability, 10serviceops: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10RLazarus) [17:47:31] 10SRE, 10Maps, 10observability, 
10serviceops: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10RLazarus) p:05Triage→03Medium [17:48:03] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:51:25] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:05] (03PS1) 10Dbrant: Add parameters for Reading Lists landing page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 [17:56:11] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti1008 [puppet] - 10https://gerrit.wikimedia.org/r/842510 (https://phabricator.wikimedia.org/T320419) [17:57:54] 10SRE, 10Maps, 10observability, 10serviceops: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10herron) @RLazarus thanks for putting a task together. Totally makes sense and is something I've wanted to do for some time as well. Yes I'm game, especially now that more SLO... [17:58:13] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:05] dduvall and ^demon: May I have your attention please! MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T1800) [18:03:07] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842511 (https://phabricator.wikimedia.org/T314194) [18:03:09] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842511 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:04:40] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842511 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:08:53] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.5 refs T314194 [18:08:59] T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194 [18:09:45] (03PS2) 10Dbrant: Add parameters for Reading Lists landing page. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 (https://phabricator.wikimedia.org/T313269) [18:14:37] (03PS1) 10QChris: Add .gitreview [debs/benthos] - 10https://gerrit.wikimedia.org/r/842513 [18:14:39] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/benthos] - 10https://gerrit.wikimedia.org/r/842513 (owner: 10QChris) [18:18:21] 10SRE, 10Maps, 10observability, 10serviceops: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10RLazarus) Oh, one more angle to think about! There are two Maps services we're writing SLOs for. For Kartotherian, we're just planning a request latency target at the 50th and... [18:19:36] (03CR) 10Jdlrobson: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [18:24:19] 10SRE, 10Infrastructure-Foundations, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10cmooney) Actually I've discovered something odd on those sub-interfaces between switches and cr's. 
Firstly the value I was seeing was the protocol mtu (i.e. payload mtu) as I was looking at the... [18:27:30] (03PS2) 10Ebernhardson: beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) [18:28:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:33:01] (03PS1) 104nn1l2: commonswiki: add editcontentmodel right to interface-admin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842516 (https://phabricator.wikimedia.org/T320752) [18:33:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:12] (03CR) 10Andrew Bogott: "I don't know what this does :) Can you expand the commit message?" 
[puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook) [18:37:50] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah) [18:37:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:37:55] (03PS5) 10Andrew Bogott: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah) [18:41:25] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:44] (03PS3) 10Andrew Bogott: dynamicproxy: include prometheus redis exporter [puppet] - 10https://gerrit.wikimedia.org/r/831080 (owner: 10Majavah) [18:45:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:34] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: include prometheus redis exporter [puppet] - 10https://gerrit.wikimedia.org/r/831080 (owner: 10Majavah) [18:48:44] (03CR) 10Andrew Bogott: [C: 03+2] openstack::horizon: Remove proxy config from local_settings [puppet] - 10https://gerrit.wikimedia.org/r/834700 (owner: 10Majavah) [18:50:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:34] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::nova: remove stretch specific code [puppet] - 10https://gerrit.wikimedia.org/r/800009 (owner: 10Majavah) [18:50:39] (03PS2) 10Andrew Bogott: P:openstack::nova: remove stretch specific code [puppet] - 10https://gerrit.wikimedia.org/r/800009 (owner: 10Majavah) [18:52:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:11] (03CR) 10Andrew Bogott: [C: 03+1] add wmflib::is_active to pick a single active host [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [19:06:46] (03PS1) 10Andrew Bogott: Install Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842521 (https://phabricator.wikimedia.org/T309407) [19:06:49] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:08:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:11:28] (03PS1) 10PleaseStand: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842522 [19:12:35] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:13:45] (03PS2) 10Andrew Bogott: Install Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842521 (https://phabricator.wikimedia.org/T309407) [19:15:39] (03PS2) 10Vivian Rook: Allow cloud_provider_enabled [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) [19:16:05] (03CR) 10Vivian Rook: Allow cloud_provider_enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825676 (https://phabricator.wikimedia.org/T280792) (owner: 10Vivian Rook) [19:18:05] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:19:15] (03PS3) 10Andrew Bogott: Install Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842521 (https://phabricator.wikimedia.org/T309407) [19:19:58] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [19:22:18] (03PS3) 10Ryan Kemper: elasticsearch: Elasticsearch 7 does not need to specify number of masters [puppet] - 10https://gerrit.wikimedia.org/r/836912 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:23:07] (03PS4) 10Ryan Kemper: elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:24:05] (03PS1) 10Jbond: systemd::override: ensure we also pass override => true [puppet] - 
10https://gerrit.wikimedia.org/r/842523 [19:24:28] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/836912 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:25:17] (03PS4) 10Andrew Bogott: Install Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842521 (https://phabricator.wikimedia.org/T309407) [19:26:40] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: Elasticsearch 7 does not need to specify number of masters [puppet] - 10https://gerrit.wikimedia.org/r/836912 (https://phabricator.wikimedia.org/T313431) (owner: 10Gehel) [19:26:55] (03CR) 10Jbond: [C: 03+2] systemd::override: ensure we also pass override => true [puppet] - 10https://gerrit.wikimedia.org/r/842523 (owner: 10Jbond) [19:28:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:29:38] (03CR) 10Andrew Bogott: [C: 03+2] Install Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842521 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [19:33:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:36:31] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:54] !log rsyncing /srv/repos from phab1001 to 3 other phab servers (with bw limit) - T313360 
[19:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:59] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [19:43:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:27] (03PS1) 10Andrew Bogott: Add rabbit/heat dummy password [labs/private] - 10https://gerrit.wikimedia.org/r/842526 [19:50:06] (03PS1) 10Lucas Werkmeister: commonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) [19:50:44] (03CR) 10CI reject: [V: 04-1] commonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) (owner: 10Lucas Werkmeister) [19:51:27] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:47] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:52:04] (03PS2) 10Andrew Bogott: Add rabbit/heat dummy password [labs/private] - 10https://gerrit.wikimedia.org/r/842526 [19:52:17] How long after a train deployment might rolling back the 
train be reasonably considered to be necessary? I'm asking because of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842522 (config change that depends on core change) [19:53:08] lucaswerkmeister: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842528/ is partly duplicating https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842516 if you're not aware ^^ (ref T320752) [19:53:09] T320752: Add editcontentmodel right to interface-admin group on Commons - https://phabricator.wikimedia.org/T320752 [19:53:22] yes, I said I was going to upload the config change but nn1l2 also uploaded one [19:53:32] ^^' [19:53:33] mine includes testcommonswiki, which I think makes sense [19:53:58] * TheresNoTime is just checking what's in the next deploy window and noticed [19:54:04] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add rabbit/heat dummy password [labs/private] - 10https://gerrit.wikimedia.org/r/842526 (owner: 10Andrew Bogott) [19:54:12] (03PS1) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [19:54:13] we should probably decide which one goes ^^ [19:54:16] (03PS8) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [19:54:30] (pls, as currently 842516 is scheduled) [19:54:55] (03CR) 10CI reject: [V: 04-1] Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [19:55:22] well, koi also has reservations about the change in general it seems [19:56:31] (03PS9) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [19:56:58] (KubernetesAPILatency) 
resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:57:03] lucaswerkmeister: yeah, some concern as I see similar task (on zhwiki) got rejected [19:57:06] PleaseStand: I would guess that change would be safe to roll out next week. It is pretty uncommon for us to revert to a prior version of MediaWiki after it has been on enwiki for 48 hours or so. (Mostly because we tend to see the bad problems within a few hours of enwiki load) [19:57:06] T272473 [19:57:07] T272473: Interface admin can not edit page in Mediawiki namespace when page has full (sysop) protection - https://phabricator.wikimedia.org/T272473 [19:57:30] would’ve been good to link that in your comment… [19:57:32] * lucaswerkmeister reads [19:58:03] it seems that task was about protection, not content models [19:58:13] so I don’t think that’s a similar case [19:58:15] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:37] (03PS23) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [19:58:41] (03CR) 10Dduvall: [C: 03+1] P:ci::docker: Install upstream docker packages for all CI agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [19:58:45] (03PS3) 10Samtar: Add parameters for Reading Lists landing page. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [19:58:59] it's also about adding a permission, so I thought it's similar 0 0 [19:59:48] (03PS2) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:00:05] brennen and TheresNoTime: gettimeofday() says it's time for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221013T2000) [20:00:05] dbrant, ebernhardson, nn1l2, and Lucas_WMDE: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] * TheresNoTime can deploy! o/ [20:00:11] wait what [20:00:15] did I write Lucas_WMDE out of habit? [20:00:21] :p [20:00:24] lmao I did [20:00:36] idiot [20:00:45] here but busy for about 20 mins, please let me to be the last. Thanks [20:00:55] (03CR) 10CI reject: [V: 04-1] prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:01:03] (03CR) 10CI reject: [V: 04-1] Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:01:07] dbrant: will do yours first, I see its a beta-only [20:01:07] ok, so let’s postpone mine then as well ^^ [20:01:13] \o [20:01:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [20:02:15] (03Merged) 10jenkins-bot: Add parameters for Reading Lists landing page. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/842509 (https://phabricator.wikimedia.org/T313269) (owner: 10Dbrant) [20:02:56] @TheresNoTime thx! [20:03:27] (03PS24) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [20:04:39] dbrant: https://integration.wikimedia.org/ci/view/Beta/ looks a little behind, so let me know if there's any issues when it finally syncs there [20:04:58] (03PS3) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:05:13] ebernhardson: will move onto yours now, I'll leave the beta one until after to let it catch up. I'm starting with 838276 [20:05:20] TheresNoTime: kk [20:05:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838276 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:05:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838269 (owner: 10Ebernhardson) [20:07:10] (03PS2) 10Lucas Werkmeister: commonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) [20:07:45] (03Merged) 10jenkins-bot: cirrus: remove cross-dc poolcounter increases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838269 (owner: 10Ebernhardson) [20:07:49] (03Merged) 10jenkins-bot: cirrus: Drop client side connect timeout config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838276 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:08:06] !log samtar@deploy1002 Started scap: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], 
[[gerrit:838269|cirrus: remove cross-dc poolcounter increases]] [20:08:11] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [20:08:20] (03PS4) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:08:27] !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], [[gerrit:838269|cirrus: remove cross-dc poolcounter increases]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:08:35] ebernhardson: those two are live on mwdebug now, can you test? [20:08:44] TheresNoTime: sortof, sec [20:08:56] (03PS3) 10Lucas Werkmeister: testcommonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) [20:09:12] I rebased by change onto nn1l2’s so it’s only about testcommonswiki [20:09:13] TheresNoTime: i can still do cross-dc queries, thats about all i can test for now :) Will find out more next time we do a dc switchover test [20:09:29] ebernhardson: syncing :) [20:10:42] (03PS3) 10Samtar: beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:11:29] (03PS5) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:13:37] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], [[gerrit:838269|cirrus: remove cross-dc poolcounter increases]] (duration: 05m 31s) [20:13:43] T143553: Switching search traffic between 
datacenters should be faster - https://phabricator.wikimedia.org/T143553 [20:13:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:14:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:14:55] (03Merged) 10jenkins-bot: beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) (owner: 10Ebernhardson) [20:15:46] ebernhardson: everything merged :) that beta config one should be live in ~10 minutes or so [20:16:18] nn1l2: around for 842516? [20:16:26] yes [20:16:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842516 (https://phabricator.wikimedia.org/T320752) (owner: 104nn1l2) [20:16:45] \o/ [20:16:50] TheresNoTime: thanks [20:16:59] you're welcome :) [20:17:35] nn1l2 / lucaswerkmeister: any idea how you're going to test this? :) [20:17:48] userrights API, I guess? 
[20:17:49] https://commons.wikimedia.org/wiki/Special:ListGroupRights [20:17:54] or that [20:18:03] oh yeahh [20:18:35] (03Merged) 10jenkins-bot: commonswiki: add editcontentmodel right to interface-admin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842516 (https://phabricator.wikimedia.org/T320752) (owner: 104nn1l2) [20:18:39] not sure why I thought you'd need to test with an int-admin account ^^' [20:18:51] !log samtar@deploy1002 Started scap: Backport for [[gerrit:842516|commonswiki: add editcontentmodel right to interface-admin group (T320752)]] [20:18:57] T320752: Add editcontentmodel right to interface-admin group on Commons - https://phabricator.wikimedia.org/T320752 [20:19:02] I like the API because the JSON is easy to diff ^^ [20:19:11] !log samtar@deploy1002 samtar and nn1l2: Backport for [[gerrit:842516|commonswiki: add editcontentmodel right to interface-admin group (T320752)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:19:15] TheresNoTime: well, I could test like that too I suppose :P [20:19:16] nn1l2: live on mwdebug [20:19:44] OK [20:19:50] It's okay [20:19:52] diff of https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2 looks good to me [20:19:56] great, syncin' [20:20:13] * lucaswerkmeister pressed Ctrl+W on the wrong window [20:21:02] * TheresNoTime should have done those two patches together.. sorry for the wait!
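[editor's note] The check discussed above (diffing the siteinfo `usergroups` JSON before and after the config change) could be sketched roughly as follows. This is only an illustration, not what anyone in the channel ran: the helper names `rights_by_group` and `diff_rights` are hypothetical, the two sample payloads are made up and heavily trimmed, and the assumed response shape (`query.usergroups[].name` / `.rights`) is what `formatversion=2` is expected to return. No network access is used; in practice the two payloads would come from the production backend and from a debug backend.

```python
import json


def rights_by_group(siteinfo: dict) -> dict:
    """Map each user group name to the set of rights it grants."""
    groups = siteinfo.get("query", {}).get("usergroups", [])
    return {g["name"]: set(g.get("rights", [])) for g in groups}


def diff_rights(before: dict, after: dict) -> dict:
    """Return, per group, the rights added and removed between two responses."""
    old, new = rights_by_group(before), rights_by_group(after)
    changes = {}
    for name in sorted(set(old) | set(new)):
        added = sorted(new.get(name, set()) - old.get(name, set()))
        removed = sorted(old.get(name, set()) - new.get(name, set()))
        if added or removed:
            changes[name] = {"added": added, "removed": removed}
    return changes


# Made-up, trimmed payloads standing in for the two API responses.
BEFORE = json.loads("""
{"query": {"usergroups": [
  {"name": "sysop", "rights": ["editcontentmodel", "protect"]},
  {"name": "interface-admin", "rights": ["editsitecss", "editsitejs"]}
]}}
""")

AFTER = json.loads("""
{"query": {"usergroups": [
  {"name": "sysop", "rights": ["editcontentmodel", "protect"]},
  {"name": "interface-admin",
   "rights": ["editsitecss", "editsitejs", "editcontentmodel"]}
]}}
""")

if __name__ == "__main__":
    # Only groups whose rights actually changed are reported.
    print(json.dumps(diff_rights(BEFORE, AFTER), indent=2))
```

Comparing sets per group rather than raw text keeps the result independent of the order in which the API lists rights, which a plain textual diff would not be.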
[20:22:59] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@b5b51fa]: 0.3.117 and adding eu knowledge graph to whitelist [20:23:02] (03PS4) 10Samtar: testcommonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) (owner: 10Lucas Werkmeister) [20:23:55] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:842516|commonswiki: add editcontentmodel right to interface-admin group (T320752)]] (duration: 05m 03s) [20:23:59] T320752: Add editcontentmodel right to interface-admin group on Commons - https://phabricator.wikimedia.org/T320752 [20:24:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) (owner: 10Lucas Werkmeister) [20:25:30] (03Merged) 10jenkins-bot: testcommonswiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842528 (https://phabricator.wikimedia.org/T320752) (owner: 10Lucas Werkmeister) [20:25:48] !log samtar@deploy1002 Started scap: Backport for [[gerrit:842528|testcommonswiki: Add editcontentmodel to interface-admin (T320752)]] [20:26:07] !log samtar@deploy1002 samtar and lucaswerkmeister: Backport for [[gerrit:842528|testcommonswiki: Add editcontentmodel to interface-admin (T320752)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:26:12] lucaswerkmeister: same again ^ :) [20:26:13] (03PS6) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:26:22] hm, not seeing a diff on mwdebug yet [20:26:25] unless I’m testing wrong [20:26:27] one second [20:26:43] oh, of course I need to change the URL to test commons now [20:26:51] :D [20:26:58] yup, `diff 
<(curl -s 'https://test-commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2' | jq .) <(curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' -s 'https://test-commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&format=json&formatversion=2' | jq .)` looks good [20:27:15] syncin' [20:27:22] (03PS1) 10BryanDavis: striker: Bump container version to 2022-10-03-154059-production [puppet] - 10https://gerrit.wikimedia.org/r/842533 (https://phabricator.wikimedia.org/T316991) [20:27:51] (03CR) 10Dzahn: [C: 03+2] P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [20:28:14] (03CR) 10CI reject: [V: 04-1] Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:31:12] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:842528|testcommonswiki: Add editcontentmodel to interface-admin (T320752)]] (duration: 05m 24s) [20:31:17] T320752: Add editcontentmodel right to interface-admin group on Commons - https://phabricator.wikimedia.org/T320752 [20:31:52] (03PS7) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:33:17] all deployed :) [20:33:34] !log close UTC late backport window [20:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:53] * Sariboo gives TheresNoTime a barnstar [20:34:09] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF) 05Open→03Resolved [20:34:10] *this* is the easy bit :p 
[20:34:15] 10SRE, 10serviceops, 10Patch-For-Review, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10JArguello-WMF) [20:34:21] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-10-03-154059-production [puppet] - 10https://gerrit.wikimedia.org/r/842533 (https://phabricator.wikimedia.org/T316991) (owner: 10BryanDavis) [20:34:27] TheresNoTime, nn1l2: it works \o/ https://commons.wikimedia.org/w/index.php?title=MediaWiki:Gadget-Cat-a-lot.js/ne&diff=696216598&oldid=656740940 [20:34:29] thanks! [20:34:36] (03CR) 10Andrew Bogott: [C: 03+2] Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [20:34:43] and now it turns out that the edit request actually has an issue that needs to be resolved first ^^ [20:34:43] (03PS8) 10Andrew Bogott: Heat: use rabbitmq_nodes instead of openstack_controllers for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842529 (https://phabricator.wikimedia.org/T309407) [20:35:00] Thanks fro me too :) [20:35:02] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@b5b51fa]: 0.3.117 and adding eu knowledge graph to whitelist (duration: 12m 02s) [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:51] (03PS4) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) [20:48:23] (03CR) 10Cwhite: [C: 03+2] raid: use raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) (owner: 10Jbond) [20:48:31] (03PS4) 10Cwhite: raid: use raid_mgmt_tools 
fact [puppet] - 10https://gerrit.wikimedia.org/r/842416 (https://phabricator.wikimedia.org/T251293) (owner: 10Jbond) [20:53:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:01:46] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, 10GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10demon) [21:06:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Dzahn) Hi @Damilare one thing that is needed to move this forward is approval from the manager on the ticket here. Can you ask them about that? [21:06:47] (03PS1) 10Andrew Bogott: Add haproxy entry for Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842536 (https://phabricator.wikimedia.org/T309407) [21:06:58] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10Dzahn) Hi @thcipriani This would need your approval. 
[21:08:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [21:08:50] (03CR) 10Andrew Bogott: [C: 03+2] Add haproxy entry for Openstack Heat in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842536 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [21:14:18] (03CR) 10JHathaway: wmflib::ansi: add new ansi formatting function (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [21:17:31] (03PS8) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [21:18:50] (03CR) 10JHathaway: [C: 03+1] Add wikifunctions.org to exim domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842499 (https://phabricator.wikimedia.org/T313227) (owner: 10BBlack) [21:27:02] (03PS10) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [21:28:06] (03CR) 10Dduvall: [C: 03+1] "This is good to merge from our end. 
I'll just want to coordinate with you on the merge timing as I'll need to depool contints in jenkins f" [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [21:29:34] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/842539 [21:33:04] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/842539 [21:33:11] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:44:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:51:47] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [21:58:35] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [22:48:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient 
[23:08:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [23:14:39] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:06] 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10CAS-SSO, 10GitLab (Auth & Access): migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) [23:30:33] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:37:54] (03CR) 10Cwhite: "Changes seem sensible to me, although it's a fairly significant change to enforce the switch from G1GC to ConcMarkSweepGC. Smoketesting o" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [23:40:25] (03CR) 10Cwhite: [C: 03+1] "Changes look sensible to me, including the ones proposed earlier." [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [23:43:57] (03CR) 10Cwhite: [C: 03+1] "Regardless, we can always improve it later as needed. Thanks for this!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [23:44:20] (03PS3) 10Tim Starling: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) [23:44:22] (03PS3) 10Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) [23:44:24] (03CR) 10Jbond: "thanks for the review comments responses inline" [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [23:50:45] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:51:05] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:52:17] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status