[00:16:38] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:18:36] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:34] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:28:20] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:29:48] (03PS1) 10Eevans: hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) [00:29:50] (03PS1) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [00:29:52] (03PS1) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [00:30:58] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:27] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [00:33:02] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:16] RECOVERY - 
Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924138 [00:39:32] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924138 (owner: 10TrainBranchBot) [00:46:50] (03PS2) 10Eevans: hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) [00:46:52] (03PS2) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [00:46:54] (03PS2) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [00:47:23] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [00:58:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/924138 (owner: 10TrainBranchBot) [00:58:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:03:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:13:39] 
10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10KFrancis) The NDA is complete. Please proceed with the access request. Thanks! [01:23:22] PROBLEM - Disk space on idp1002 is CRITICAL: DISK CRITICAL - free space: / 1450 MB (3% inode=97%): /tmp 1450 MB (3% inode=97%): /var/tmp 1450 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=idp1002&var-datasource=eqiad+prometheus/ops [01:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:06] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [03:05:32] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:41:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10Marostegui) 05Resolved→03Open We just got an alert of the same password being used in production and wmcs for this user. [04:48:28] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (cloudcontrol2005-dev), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:55:18] RECOVERY - MariaDB Replica IO: s1 on clouddb1017 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:56:50] (03PS1) 10Marostegui: db1212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924621 [04:57:31] (03CR) 10Marostegui: [C: 03+2] db1212: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924621 (owner: 10Marostegui) [04:57:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48639 and previous config saved to /var/cache/conftool/dbconfig/20230531-045754-root.json [04:59:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1221 (sanitarium s4 master) T337446', diff saved to https://phabricator.wikimedia.org/P48640 and previous config saved to /var/cache/conftool/dbconfig/20230531-045927-root.json [04:59:32] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [05:00:47] (03PS1) 10Marostegui: db1221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924622 [05:03:03] (03CR) 10Marostegui: [C: 03+2] db1221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924622 (owner: 10Marostegui) [05:03:37] 
(LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:13:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48642 and previous config saved to /var/cache/conftool/dbconfig/20230531-051259-root.json [05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:21:12] (03PS1) 10Marostegui: db1156: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924636 [05:21:42] (03CR) 10Marostegui: [C: 03+2] db1156: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924636 (owner: 10Marostegui) [05:21:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48643 and previous config saved to /var/cache/conftool/dbconfig/20230531-052156-root.json [05:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48644 and previous config saved to /var/cache/conftool/dbconfig/20230531-052804-root.json [05:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48645 and previous config saved to /var/cache/conftool/dbconfig/20230531-053700-root.json [05:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48646 and previous config saved to /var/cache/conftool/dbconfig/20230531-054308-root.json [05:43:20] 
SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Marostegui)
[05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48647 and previous config saved to /var/cache/conftool/dbconfig/20230531-055205-root.json
[05:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48648 and previous config saved to /var/cache/conftool/dbconfig/20230531-055813-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T0600)
[06:06:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48649 and previous config saved to /var/cache/conftool/dbconfig/20230531-060710-root.json
[06:09:16] (PS1) Marostegui: db1154: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/924772
(https://phabricator.wikimedia.org/T337446) [06:10:00] (03CR) 10Marostegui: [C: 03+2] db1154: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924772 (https://phabricator.wikimedia.org/T337446) (owner: 10Marostegui) [06:13:11] (03PS1) 10Marostegui: db1159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924773 [06:13:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48650 and previous config saved to /var/cache/conftool/dbconfig/20230531-061318-root.json [06:13:41] (03CR) 10Marostegui: [C: 03+2] db1159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/924773 (owner: 10Marostegui) [06:22:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48651 and previous config saved to /var/cache/conftool/dbconfig/20230531-062216-root.json [06:28:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48652 and previous config saved to /var/cache/conftool/dbconfig/20230531-062823-root.json [06:33:34] (03PS3) 10JMeybohm: Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 [06:36:00] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [06:37:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48653 and previous config saved to /var/cache/conftool/dbconfig/20230531-063721-root.json [06:43:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Repooling after maintenance', diff saved to 
https://phabricator.wikimedia.org/P48654 and previous config saved to /var/cache/conftool/dbconfig/20230531-064327-root.json [06:47:31] (03CR) 10JMeybohm: [C: 03+2] profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [06:52:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48655 and previous config saved to /var/cache/conftool/dbconfig/20230531-065225-root.json [06:53:10] (03PS1) 10JMeybohm: Fix owner of admin kubeconfig certs [puppet] - 10https://gerrit.wikimedia.org/r/924871 (https://phabricator.wikimedia.org/T325268) [06:54:13] (03CR) 10Slyngshede: [C: 03+2] sre.ganeti.makevm call reimage after VM creation [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [06:54:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41441/console" [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [06:55:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41442/console" [puppet] - 10https://gerrit.wikimedia.org/r/924871 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [06:55:29] (03CR) 10CI reject: [V: 04-1] Fix owner of admin kubeconfig certs [puppet] - 10https://gerrit.wikimedia.org/r/924871 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [06:57:13] (03PS1) 10JMeybohm: profile::imagecatalog: Fix permissions for client certs [puppet] - 10https://gerrit.wikimedia.org/r/924872 (https://phabricator.wikimedia.org/T325268) [06:58:13] (03PS2) 10JMeybohm: Fix owner of admin kubeconfig certs [puppet] - 10https://gerrit.wikimedia.org/r/924871 
(https://phabricator.wikimedia.org/T325268)
[06:58:15] (PS2) JMeybohm: profile::imagecatalog: Fix permissions for client certs [puppet] - https://gerrit.wikimedia.org/r/924872 (https://phabricator.wikimedia.org/T325268)
[06:59:38] (CR) CI reject: [V: -1] profile::imagecatalog: Fix permissions for client certs [puppet] - https://gerrit.wikimedia.org/r/924872 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:00:09] Amir1, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T0700).
[07:00:09] No Gerrit patches in the queue for this window AFAICS.
[07:03:38] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41443/console" [puppet] - https://gerrit.wikimedia.org/r/924872 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:03:53] oh hm, if I stay up late enough there's a deployment window
[07:04:31] (PS2) Legoktm: Remove GWToolset configuration (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911)
[07:04:33] (PS2) Legoktm: Remove GWToolset configuration (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911)
[07:05:31] (CR) JMeybohm: [V: +1 C: +2] profile::imagecatalog: Fix permissions for client certs [puppet] - https://gerrit.wikimedia.org/r/924872 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:05:35] (CR) JMeybohm: [C: +2] Fix owner of admin kubeconfig certs [puppet] - https://gerrit.wikimedia.org/r/924871 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:06:31] (CR) TrainBranchBot: [C: +2] "Approved by legoktm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/921253
(https://phabricator.wikimedia.org/T270911) (owner: Legoktm)
[07:07:16] (Merged) jenkins-bot: Remove GWToolset configuration (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911) (owner: Legoktm)
[07:07:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48656 and previous config saved to /var/cache/conftool/dbconfig/20230531-070730-root.json
[07:07:51] !log legoktm@deploy1002 Started scap: Backport for [[gerrit:921253|Remove GWToolset configuration (1/2) (T270911)]]
[07:07:55] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[07:08:15] (CR) Slyngshede: [C: +2] sre.ganeti.reimage: Remove specialised cookbook. [cookbooks] - https://gerrit.wikimedia.org/r/922065 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[07:09:21] can I add a patch to the window?
[07:09:49] kostajh: yes
[07:09:57] !log legoktm@deploy1002 legoktm: Backport for [[gerrit:921253|Remove GWToolset configuration (1/2) (T270911)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[07:10:05] do you need someone to deploy it for you or do you have perms yourself?
[07:10:33] I can do it myself
[07:10:40] SRE-tools, Infrastructure-Foundations, Spicerack: Merge reimaging cookbooks - https://phabricator.wikimedia.org/T336491 (SLyngshede-WMF) Open→Resolved
[07:10:40] RECOVERY - Disk space on idp1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=idp1002&var-datasource=eqiad+prometheus/ops
[07:10:43] I'll add to the calendar now
[07:10:56] (PS1) Kosta Harlan: NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/924571 (https://phabricator.wikimedia.org/T337320)
[07:10:59] I'll ping you once I'm done, I have one more patch after this
[07:12:39] ok
[07:14:41] (PS1) ArielGlenn: fix up regex comparisons in dumps nfs share testing script [puppet] - https://gerrit.wikimedia.org/r/924874 (https://phabricator.wikimedia.org/T325232)
[07:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:17:42] !log legoktm@deploy1002 Finished scap: Backport for [[gerrit:921253|Remove GWToolset configuration (1/2) (T270911)]] (duration: 09m 51s)
[07:17:47] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[07:17:57] (CR) TrainBranchBot: [C: +2] "Approved by legoktm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) (owner: Legoktm)
[07:18:07] (CR) Muehlenhoff: [C: +2] Setup debmonitor2003 as bookworm debmonitor VM [puppet] - https://gerrit.wikimedia.org/r/924517 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[07:18:47] (Merged) jenkins-bot: Remove
GWToolset configuration (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) (owner: Legoktm)
[07:19:12] !log legoktm@deploy1002 Started scap: Backport for [[gerrit:921254|Remove GWToolset configuration (2/2) (T270911)]]
[07:20:35] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[07:20:59] (CR) Filippo Giunchedi: "LGTM, however I'll let Traffic folks vote" [puppet] - https://gerrit.wikimedia.org/r/924550 (owner: JMeybohm)
[07:21:12] (PS1) JMeybohm: k8s::base_dirs: Ensure /etc/kubernetes/pki with proper permissions [puppet] - https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268)
[07:21:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:22:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:10] (CR) Vgutierrez: [C: +1] run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[07:24:47] (CR) Muehlenhoff: [C: +2] Add debmonitor[12]003 as additional scap targets [puppet] - https://gerrit.wikimedia.org/r/922126 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff)
[07:24:50] (CR) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:26:04] (CR) Legoktm: Remove GWToolset configuration (2/2) (1 comment)
[mediawiki-config] - https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) (owner: Legoktm)
[07:27:44] (PS5) Legoktm: wmcs: Update URL to Cloud Services Introduction to bypass redirect [puppet] - https://gerrit.wikimedia.org/r/876293 (owner: Nintendofan885)
[07:27:56] (CR) Legoktm: [C: +2] "Congrats on your first merged Gerrit patch!!" [puppet] - https://gerrit.wikimedia.org/r/876293 (owner: Nintendofan885)
[07:28:01] legoktm: still deploying?
[07:28:07] yes :|
[07:28:14] no worries
[07:28:37] stepping away for a few minutes
[07:28:49] I didn't realize `scap backport` would automatically rebuild the l10n cache so it's taking longer than I thought, should've just old school scap sync-file'd it
[07:31:35] (CR) CI reject: [V: -1] NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/924571 (https://phabricator.wikimedia.org/T337320) (owner: Kosta Harlan)
[07:34:07] (CR) Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/924571 (https://phabricator.wikimedia.org/T337320) (owner: Kosta Harlan)
[07:35:54] !log apache2 restarted on logstash1032 before I could get a backtrace to debug logstash lag
[07:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:28] (PS2) JMeybohm: k8s::base_dirs: Ensure /etc/kubernetes/pki with proper permissions [puppet] - https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268)
[07:38:51] (CR) CI reject: [V: -1] k8s::base_dirs: Ensure /etc/kubernetes/pki with proper permissions [puppet] - https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:39:48] (CR) Volans: "Nice! Did you follow https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook ?"
[cookbooks] - 10https://gerrit.wikimedia.org/r/922065 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [07:39:55] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41445/console" [puppet] - 10https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [07:41:24] !log legoktm@deploy1002 legoktm: Backport for [[gerrit:921254|Remove GWToolset configuration (2/2) (T270911)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:41:29] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911 [07:41:44] syncing everywhere now [07:47:37] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/Debmonitor [07:51:02] (03PS1) 10Matthias Mullie: [ImageSuggestions] Process suggestions via job queue rather than sync [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) [07:51:13] (03PS1) 10Marostegui: Revert "db1221: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/924572 [07:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48657 and previous config saved to /var/cache/conftool/dbconfig/20230531-075126-root.json [07:51:55] (03CR) 10Marostegui: [C: 03+2] Revert "db1221: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/924572 (owner: 10Marostegui) [07:53:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:54:59] 
(PS3) JMeybohm: k8s::base_dirs: Ensure /etc/kubernetes/pki with proper permissions [puppet] - https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268)
[07:58:11] !log legoktm@deploy1002 Finished scap: Backport for [[gerrit:921254|Remove GWToolset configuration (2/2) (T270911)]] (duration: 38m 58s)
[07:58:16] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[07:59:21] phew
[07:59:23] kostajh: done!
[08:01:02] legoktm: ok
[08:01:41] (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/924571 (https://phabricator.wikimedia.org/T337320) (owner: Kosta Harlan)
[08:01:45] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1009.eqiad.wmnet with OS bullseye
[08:04:33] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye
[08:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48658 and previous config saved to /var/cache/conftool/dbconfig/20230531-080631-root.json
[08:08:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:11:05] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:30] (03CR) 10Ladsgroup: Enable parser cache warming jobs for parsoid on some top wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [08:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage [08:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48659 and previous config saved to /var/cache/conftool/dbconfig/20230531-082135-root.json [08:22:08] (03CR) 10Fabfur: [C: 03+2] run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - 10https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:22:16] (03PS4) 10JMeybohm: k8s::base_dirs: Ensure /etc/kubernetes/pki with proper permissions [puppet] - 10https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) [08:22:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage [08:23:34] (03Merged) 10jenkins-bot: NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924571 (https://phabricator.wikimedia.org/T337320) (owner: 10Kosta Harlan) [08:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:24:01] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:924571|NewImpact: Cache empty user impact on account creation (T337320)]]
[08:24:06] T337320: [Spike] Investigate A/B test results of Growth Experiments impact module - https://phabricator.wikimedia.org/T337320
[08:24:55] (Merged) jenkins-bot: run-puppet-restart-varnish: Add dry_run support to check function [cookbooks] - https://gerrit.wikimedia.org/r/924590 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[08:25:39] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:924571|NewImpact: Cache empty user impact on account creation (T337320)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:25:49] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41447/console" [puppet] - https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[08:28:08] verifying the change
[08:28:56] (CR) Elukey: [C: +2] ml-services: add autoscaling capabilities to revert risk la [deployment-charts] - https://gerrit.wikimedia.org/r/924544 (owner: Elukey)
[08:29:01] (CR) Elukey: [C: +2] services: raise auth-users rate limit for Lift Wing in the API Gateway [deployment-charts] - https://gerrit.wikimedia.org/r/924545 (owner: Elukey)
[08:30:46] syncing
[08:32:37] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:09] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche) When you're
ready to publish your docs/coverage with [[ https://gitlab.wikimedia.org/repos/releng/docpub | docpub ]], your project members wi... [08:36:11] (03PS1) 10Hashar: zuul: add a gerrit-reporter gerrit connection [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) [08:36:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48660 and previous config saved to /var/cache/conftool/dbconfig/20230531-083640-root.json [08:37:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:37:49] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:924571|NewImpact: Cache empty user impact on account creation (T337320)]] (duration: 13m 48s) [08:37:54] T337320: [Spike] Investigate A/B test results of Growth Experiments impact module - https://phabricator.wikimedia.org/T337320 [08:39:32] right, all done [08:39:55] !log UTC morning deploys done [08:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:27] (03PS1) 10Ladsgroup: Remove legacy encoding option from dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) [08:40:47] (03PS2) 10Ladsgroup: Remove legacy encoding option from dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) [08:41:08] (03CR) 10Elukey: [C: 03+1] Decommission an-worker1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/922841 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [08:41:21] (03PS2) 10Jelto: microsites: remove annualreport, migrated to k8s [puppet] - 
10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:41:33] (03PS5) 10JMeybohm: Ensure /etc/kubernetes/pki has proper permissions [puppet] - 10https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) [08:41:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1009.eqiad.wmnet with OS bullseye [08:41:38] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye completed: - ms-fe1009 (**WARN**) - Downtim... [08:42:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:42:45] (03PS3) 10Jelto: microsites: remove annualreport, migrated to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:43:38] (03PS1) 10Muehlenhoff: Revert "Add debmonitor[12]003 as additional scap targets" [puppet] - 10https://gerrit.wikimedia.org/r/924886 (https://phabricator.wikimedia.org/T241049) [08:44:00] (03CR) 10Jcrespo: [C: 03+1] gerrit/bacula: adjust Gerrit file paths to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/924608 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [08:44:14] (03PS2) 10Hashar: zuul: add a gerrit-reporter gerrit connection [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) [08:45:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41448/console" [puppet] - 10https://gerrit.wikimedia.org/r/924875 
(https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:50:07] (03Abandoned) 10Arturo Borrero Gonzalez: lvs: remove wikireplicas S3 definition [puppet] - 10https://gerrit.wikimedia.org/r/924481 (https://phabricator.wikimedia.org/T337721) (owner: 10Arturo Borrero Gonzalez) [08:50:30] (03PS1) 10ArielGlenn: fix up more things in the docs for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/924887 (https://phabricator.wikimedia.org/T325232) [08:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48661 and previous config saved to /var/cache/conftool/dbconfig/20230531-085145-root.json [08:51:57] (03PS1) 10Jelto: microsites: move blackbox checks to dedicated monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/924888 (https://phabricator.wikimedia.org/T300171) [08:52:29] (03PS6) 10JMeybohm: Ensure /etc/kubernetes/pki has proper permissions [puppet] - 10https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) [08:52:40] !log manually run puppet node clean/deactivate for labstore1004/1005 (which run into a traceback in the decom script) T337269 [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:44] T337269: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 [08:55:05] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: labstore1004.eqiad.wmnet [08:55:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: labstore1004.eqiad.wmnet [08:55:12] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1004.eqiad.wmnet [08:55:14] !log jmm@cumin2002 START - Cookbook 
sre.debmonitor.remove-hosts for 1 hosts: labstore1005.eqiad.wmnet [08:55:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: labstore1005.eqiad.wmnet [08:55:20] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission labstore100[45].eqiad.wmne - https://phabricator.wikimedia.org/T337269 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1005.eqiad.wmnet [08:56:08] (03CR) 10JMeybohm: [C: 03+2] Ensure /etc/kubernetes/pki has proper permissions [puppet] - 10https://gerrit.wikimedia.org/r/924875 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:56:16] (03CR) 10Fabfur: [C: 03+2] cache::upload: Switch HTTPS redirection from Varnish to HAProxy only on cp2042 [puppet] - 10https://gerrit.wikimedia.org/r/924444 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:56:27] (03CR) 10Jelto: [C: 03+2] microsites: remove annualreport, migrated to k8s [puppet] - 10https://gerrit.wikimedia.org/r/923652 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [08:56:59] 10SRE, 10Infrastructure-Foundations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) [08:57:16] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2005-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336564 (10aborrero) 05In progress→03Resolved [08:57:17] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:57:32] (03CR) 10Jelto: [C: 03+2] microsites: move blackbox checks to dedicated monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/924888 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [08:57:39] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Add debmonitor[12]003 as additional scap targets" [puppet] - 
10https://gerrit.wikimedia.org/r/924886 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [08:58:44] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1009.eqiad.wmnet [08:58:53] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1009.eqiad.wmnet [08:59:35] !log Testing new cookbook to switch port 80 from Varnish to HAProxy on cp2042 [08:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:55] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on P{cp2042.codfw.wmnet} and A:cp [09:01:26] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: bast2002.wikimedia.org [09:01:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: bast2002.wikimedia.org [09:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48662 and previous config saved to /var/cache/conftool/dbconfig/20230531-090649-root.json [09:07:03] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:46] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [09:14:47] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:09] (03PS1) 10Fabfur: hiera: Applying port 80 redirection 
on upload cluster in codfw [puppet] - 10https://gerrit.wikimedia.org/r/924894 (https://phabricator.wikimedia.org/T323557) [09:19:03] (03PS1) 10Slyngshede: R:idp_test Add Netbox next as OIDC consumer. [puppet] - 10https://gerrit.wikimedia.org/r/924895 (https://phabricator.wikimedia.org/T308002) [09:20:18] (03CR) 10Slyngshede: "Secret not yet added to private repo." [puppet] - 10https://gerrit.wikimedia.org/r/924895 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [09:20:57] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10jnuche) [09:21:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48663 and previous config saved to /var/cache/conftool/dbconfig/20230531-092154-root.json [09:23:11] (03PS1) 10Ayounsi: Initial gNMI support for network automation cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/924896 [09:23:26] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10TheresNoTime) [09:23:53] !log aborrero@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2004-dev.wikimedia.org [09:23:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on P{cp2042.codfw.wmnet} and A:cp [09:25:04] (03PS2) 10Fabfur: hiera: Swap port 80 from varnish to haproxy on codfw upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/924894 (https://phabricator.wikimedia.org/T323557) [09:25:40] (03PS1) 10Jbond: SreBaseClass: dont sleep on the last host in a batch [cookbooks] - 10https://gerrit.wikimedia.org/r/924897 [09:25:55] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:25:55] (03CR) 10CI reject: [V: 
04-1] Initial gNMI support for network automation cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/924896 (owner: 10Ayounsi) [09:26:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924894 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:26:38] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/924897 (owner: 10Jbond) [09:26:46] (03PS1) 10Jelto: miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) [09:27:35] (03PS2) 10Majavah: P:toolforge::proxy: remove absented logster resources [puppet] - 10https://gerrit.wikimedia.org/r/919803 [09:27:37] (03PS2) 10Majavah: logster: remove classes [puppet] - 10https://gerrit.wikimedia.org/r/919804 [09:27:41] (03CR) 10CI reject: [V: 04-1] miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) (owner: 10Jelto) [09:28:22] (03PS2) 10Jelto: miscweb: add bienvenida release to miscweb staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/924898 (https://phabricator.wikimedia.org/T337047) [09:30:18] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: remove traces of cloudcontrol2004-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/924899 (https://phabricator.wikimedia.org/T337828) [09:34:09] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:35:06] (03CR) 10Vgutierrez: [C: 03+1] SreBaseClass: dont sleep on the last host in a batch [cookbooks] - 10https://gerrit.wikimedia.org/r/924897 (owner: 10Jbond) [09:35:28] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [09:35:52] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [09:36:15] 
(03CR) 10Jbond: [C: 03+2] SreBaseClass: dont sleep on the last host in a batch [cookbooks] - 10https://gerrit.wikimedia.org/r/924897 (owner: 10Jbond) [09:36:24] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:36:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48665 and previous config saved to /var/cache/conftool/dbconfig/20230531-093659-root.json [09:38:06] !next [09:38:36] (03Merged) 10jenkins-bot: SreBaseClass: dont sleep on the last host in a batch [cookbooks] - 10https://gerrit.wikimedia.org/r/924897 (owner: 10Jbond) [09:40:02] wrong cmd apparently :) [09:41:29] (03PS3) 10Hashar: zuul: add a gerrit-reporter gerrit connection [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) [09:41:40] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [09:41:41] jouncebot: nowandnext [09:41:41] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [09:41:41] In 0 hour(s) and 18 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1000) [09:41:47] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:48] taavi: thx <3 [09:48:28] (03PS1) 10Filippo Giunchedi: Deprecate nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/924901 (https://phabricator.wikimedia.org/T337831) [09:49:13] (03PS1) 10Jcrespo: backups: Add cloudcontrol2005 to the list of ignored backup errors [puppet] - 10https://gerrit.wikimedia.org/r/924902 [09:49:50] !log 
aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [09:50:09] (03PS2) 10Jcrespo: backups: Add cloudcontrol2005 to the list of ignored backup errors [puppet] - 10https://gerrit.wikimedia.org/r/924902 [09:51:06] (03CR) 10Vgutierrez: [C: 03+1] service: move rest-gateway to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920664 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [09:51:43] (03CR) 10Kamila Součková: [C: 03+1] "LGTM (as good as it gets)" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [09:53:26] (03CR) 10Hnowlan: [C: 03+2] service: move rest-gateway to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/920664 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [09:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:54:09] if there are whines for dumpsdata1006, please ignore, I am working on it. 
[09:54:26] (03CR) 10Vgutierrez: [C: 03+1] hiera: Swap port 80 from varnish to haproxy on codfw upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/924894 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:55:05] PROBLEM - NFS on dumpsdata1006 is CRITICAL: connect to address 10.64.130.3 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Dumps/Dumpsdata_hosts [09:55:19] (03PS1) 10Jbond: admin: remove ssh key for ksarabia [puppet] - 10https://gerrit.wikimedia.org/r/924903 [09:56:28] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin2002" [09:56:28] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:56:30] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2004-dev.wikimedia.org [09:56:41] PROBLEM - Check systemd state on dumpsdata1006 is CRITICAL: CRITICAL - degraded: The following units failed: nfs-mountd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:13] !log eoghan@cumin1001 START - Cookbook sre.hosts.decommission for hosts doc1002.eqiad.wmnet [09:57:40] (03CR) 10Fabfur: [C: 03+2] hiera: Swap port 80 from varnish to haproxy on codfw upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/924894 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [09:58:15] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.82:4113]) https://wikitech.wikimedia.org/wiki/PyBal [09:58:19] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.82:4113]) https://wikitech.wikimedia.org/wiki/PyBal [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1000) [10:01:30] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [10:01:41] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on P{cp[2028,2030,2032,2034,2036,2038,2040].codfw.wmnet} and A:cp [10:02:13] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 76 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [10:02:23] (03PS1) 10MVernon: Revert "swift: disable free inode btree at mkfs time" [puppet] - 10https://gerrit.wikimedia.org/r/924575 (https://phabricator.wikimedia.org/T199198) [10:02:49] (03CR) 10CI reject: [V: 04-1] Revert "swift: disable free inode btree at mkfs time" [puppet] - 10https://gerrit.wikimedia.org/r/924575 (https://phabricator.wikimedia.org/T199198) (owner: 10MVernon) [10:03:19] RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:01] (03PS2) 10MVernon: Revert "swift: disable free inode btree at mkfs time" [puppet] - 10https://gerrit.wikimedia.org/r/924575 (https://phabricator.wikimedia.org/T199198) [10:04:07] (ProbeDown) firing: (2) Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:07] RECOVERY - NFS on dumpsdata1006 is OK: TCP OK - 0.000 second response time on 10.64.130.3 port 2049 https://wikitech.wikimedia.org/wiki/Dumps/Dumpsdata_hosts [10:06:07] (03PS1) 10JMeybohm: k8s::kubeconfig: Don't specify certificate-authority [puppet] - 10https://gerrit.wikimedia.org/r/924905 (https://phabricator.wikimedia.org/T325268) [10:06:09] (03PS1) 10JMeybohm: profile::calico::kubernetes: Address potentially undefined variable [puppet] - 
10https://gerrit.wikimedia.org/r/924906 (https://phabricator.wikimedia.org/T325268) [10:06:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: remove traces of cloudcontrol2004-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/924899 (https://phabricator.wikimedia.org/T337828) (owner: 10Arturo Borrero Gonzalez) [10:07:08] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [10:07:51] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dumpsdata1006.eqiad.wmnet [10:08:05] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 94 connections established with conf2004.codfw.wmnet:4001 (min=95) https://wikitech.wikimedia.org/wiki/PyBal [10:08:38] (03CR) 10CI reject: [V: 04-1] profile::calico::kubernetes: Address potentially undefined variable [puppet] - 10https://gerrit.wikimedia.org/r/924906 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:08:49] (03CR) 10Jbond: "personally I'm fine with this and happy to +1 for all the I/F services. 
however i think some teams have historically used this (or system" [puppet] - 10https://gerrit.wikimedia.org/r/924901 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [10:08:50] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [10:11:00] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41449/console" [puppet] - 10https://gerrit.wikimedia.org/r/924905 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:11:06] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [10:11:06] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:07] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc1002.eqiad.wmnet [10:11:51] (03PS4) 10Hashar: zuul: add a gerrit-reporter gerrit connection [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) [10:12:00] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [10:12:32] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dumpsdata1006.eqiad.wmnet [10:12:52] and that's two fails via the cookbook. 
sigh [10:13:20] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [10:16:22] (03CR) 10Jbond: [C: 04-1] planet: restrict firewall source range for port 443 to envoy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/924604 (owner: 10Dzahn) [10:16:30] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [10:17:03] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dumpsdata1006.eqiad.wmnet [10:17:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/924895 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [10:17:55] grrr [10:18:05] 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10aborrero) a:05aborrero→03Jhancock.wm Please @Jhancock.wm re-rack this server into [[ https://netbox.wikimedia.org/dcim/racks/51/ | rack B1 ]... 
[10:18:09] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:25] jouncebot: nowandnext [10:20:25] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1000) [10:20:25] In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1300) [10:20:30] cool [10:20:49] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [10:21:03] last try before I give up and try it manually [10:21:21] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dumpsdata1006.eqiad.wmnet [10:22:05] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10klausman) [10:22:19] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.82:4113]) https://wikitech.wikimedia.org/wiki/PyBal [10:22:45] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 79 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [10:23:56] (03CR) 10Ladsgroup: [C: 03+2] mwscript: Avoid prepending maintenance/ if >= 2 dots in argument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920788 (https://phabricator.wikimedia.org/T336819) (owner: 10Ladsgroup) [10:24:07] (ProbeDown) resolved: (2) Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:43] (03Merged) 10jenkins-bot: 
mwscript: Avoid prepending maintenance/ if >= 2 dots in argument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920788 (https://phabricator.wikimedia.org/T336819) (owner: 10Ladsgroup) [10:24:59] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 125 connections established with conf1007.eqiad.wmnet:4001 (min=126) https://wikitech.wikimedia.org/wiki/PyBal [10:25:04] (03CR) 10Hashar: "I think that will do it. This configures a new connection to Gerrit as gerrit-reporter but should however not be used at all until https:/" [puppet] - 10https://gerrit.wikimedia.org/r/924884 (https://phabricator.wikimedia.org/T309376) (owner: 10Hashar) [10:25:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:920788|mwscript: Avoid prepending maintenance/ if >= 2 dots in argument (T336819)]] [10:25:23] T336819: Maintenance script designed for run.php syntax cannot be executed in Wikimedia production - https://phabricator.wikimedia.org/T336819 [10:25:30] (03PS4) 10Jbond: proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) [10:26:05] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:26:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41450/console" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:27:11] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:920788|mwscript: Avoid prepending maintenance/ if >= 2 dots in argument (T336819)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [10:31:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:10] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.82:4113]) https://wikitech.wikimedia.org/wiki/PyBal [10:34:05] !log eoghan@cumin1001 START - Cookbook sre.hosts.decommission for hosts doc2001.codfw.wmnet [10:34:09] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:920788|mwscript: Avoid prepending maintenance/ if >= 2 dots in argument (T336819)]] (duration: 08m 50s) [10:34:13] T336819: Maintenance script designed for run.php syntax cannot be executed in Wikimedia production - https://phabricator.wikimedia.org/T336819 [10:36:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:49] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10aborrero) +1 to `wmcs-roots` access. The mentioned cumin access has more implications, so I'm hoping that others with more knowledge of the whole access request worklow can co... 
[10:55:52] PROBLEM - Check systemd state on doc1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:42] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "swift: disable free inode btree at mkfs time" [puppet] - 10https://gerrit.wikimedia.org/r/924575 (https://phabricator.wikimedia.org/T199198) (owner: 10MVernon) [10:57:22] (03CR) 10MVernon: [C: 03+2] Revert "swift: disable free inode btree at mkfs time" [puppet] - 10https://gerrit.wikimedia.org/r/924575 (https://phabricator.wikimedia.org/T199198) (owner: 10MVernon) [10:58:38] (03CR) 10Jbond: [C: 04-1] "-1: looks good but see inline for comments and one issue" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [10:59:08] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) [10:59:21] 10SRE-swift-storage, 10Patch-For-Review: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) 05Open→03Resolved All prod swift nodes running bullseye; xfs free inode btree restored, so we can close this. [11:00:51] (03CR) 10Slyngshede: [C: 03+2] R:idp_test Add Netbox next as OIDC consumer. [puppet] - 10https://gerrit.wikimedia.org/r/924895 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:02:24] (03PS6) 10Jbond: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [11:02:26] (03CR) 10Jbond: ferm::service: allow passing array of hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [11:03:02] jbond: thanks for the review! 
[11:07:07] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [11:08:12] (03PS7) 10Jbond: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [11:08:14] (03PS1) 10Jbond: P:cumin::cloud_targets: use array for srange [puppet] - 10https://gerrit.wikimedia.org/r/924909 [11:08:24] (03CR) 10Jbond: ferm::service: allow passing array of hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [11:09:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/924903 (owner: 10Jbond) [11:09:07] (ProbeDown) firing: (2) Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:12:43] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [11:14:10] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1001" [11:14:11] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:14:11] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc2001.codfw.wmnet [11:14:59] (03CR) 10Jbond: [C: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/924909" [puppet] - 10https://gerrit.wikimedia.org/r/924909 (owner: 10Jbond) [11:18:55] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T329049) [11:19:00] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 
[11:19:07] (ProbeDown) resolved: (2) Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:31] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 95 connections established with conf2004.codfw.wmnet:4001 (min=95) https://wikitech.wikimedia.org/wiki/PyBal [11:19:33] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:20:22] !log rebooted dumpsdata1006 manually after several timeouts trying to use the cookbook; in the end, forced to powercycle the host via mgmt console [11:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 (owner: 10Slyngshede) [11:25:31] (03CR) 10David Caro: [C: 03+2] P:toolforge::proxy: remove absented logster resources [puppet] - 10https://gerrit.wikimedia.org/r/919803 (owner: 10Majavah) [11:25:54] (03CR) 10David Caro: [C: 03+2] logster: remove classes [puppet] - 10https://gerrit.wikimedia.org/r/919804 (owner: 10Majavah) [11:26:07] (03CR) 10David Caro: [C: 03+2] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/919804 (owner: 10Majavah) [11:26:13] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10cmooney) [11:26:55] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 126 connections established with conf1007.eqiad.wmnet:4001 (min=126) https://wikitech.wikimedia.org/wiki/PyBal [11:26:57] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:27:36] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM in terms of the overall
logic." [puppet] - 10https://gerrit.wikimedia.org/r/924899 (https://phabricator.wikimedia.org/T337828) (owner: 10Arturo Borrero Gonzalez) [11:33:34] (03PS11) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [11:36:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T329049) [11:36:48] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 [11:37:37] RECOVERY - Check systemd state on doc1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:03] (03PS1) 10KartikMistry: MinT: Update to 2023-05-31-082339-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/924915 [11:42:35] (03PS1) 10Daimona Eaytoy: DeleteAction: Replace remaining OOUI fields [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) [11:42:50] (03PS2) 10Func: DeleteAction: Replace remaining OOUI fields [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [11:43:14] (03CR) 10Daimona Eaytoy: [C: 03+1] DeleteAction: Replace remaining OOUI fields [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [11:43:39] (03CR) 10Func: "lol I also clicked cherry-pick" [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [11:44:09] (03CR) 10Daimona Eaytoy: [C: 03+1] DeleteAction: Replace remaining OOUI fields (031 comment) [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [11:46:48] 10SRE, 
10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10eoghan) [11:54:12] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::kubeconfig: Don't specify certificate-authority [puppet] - 10https://gerrit.wikimedia.org/r/924905 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:54:42] !log disabled puppet on all kubernetes hosts apart from staging-codfw for https://gerrit.wikimedia.org/r/c/operations/puppet/+/924905 [11:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:59:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:14] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:03:21] !log re-enabling puppet on all kubernetes hosts [12:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:17] (03PS12) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [12:05:44] (03PS2) 10Daniel Kinzler: Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924358 (owner: 10D3r1ck01) [12:06:12] Do config patches just always show a merge conflict? Is gerrit just choking on the large files? [12:06:39] I just rebased this one, and it applied cleanly. And it's not the first time... 
[12:08:05] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:08:17] (03CR) 10JMeybohm: [C: 03+1] "Given the "best effort"/"recreate" SLO in https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [12:09:12] (03PS2) 10JMeybohm: profile::calico::kubernetes: Address potentially undefined variable [puppet] - 10https://gerrit.wikimedia.org/r/924906 (https://phabricator.wikimedia.org/T325268) [12:10:16] (03PS3) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on some top wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [12:11:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41453/console" [puppet] - 10https://gerrit.wikimedia.org/r/924906 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:12:00] 10SRE, 10Observability-Metrics, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Stop cadvisor from collecting extra metrics from docker - https://phabricator.wikimedia.org/T337856 (10fgiunchedi) [12:16:01] (03PS1) 10Filippo Giunchedi: prometheus: disable docker collection in cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/924917 (https://phabricator.wikimedia.org/T337856) [12:18:36] (03CR) 10Muehlenhoff: "Thanks for the quick review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:22:23] (03PS13) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [12:24:42] (03CR) 10CI reject: [V: 04-1] Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:26:41] (03PS14) 10Muehlenhoff: Add a cookbook to drain a Ganeti node [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) [12:27:24] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] profile::calico::kubernetes: Address potentially undefined variable [puppet] - 10https://gerrit.wikimedia.org/r/924906 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:28:00] (03PS11) 10JMeybohm: prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) [12:29:49] (03CR) 10Volans: dhcp: reword some exception messages (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [12:29:54] (03PS5) 10Volans: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 [12:33:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41454/console" [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:35:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on P{cp[2028,2030,2032,2034,2036,2038,2040].codfw.wmnet} and A:cp [12:35:56] (03CR) 10Volans: [C: 03+2] dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [12:40:22] (03Merged) 
10jenkins-bot: dhcp: reword some exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/920225 (owner: 10Volans) [12:43:00] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: use sshkey for git-ssh public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:45:32] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10chadhat) [12:46:54] (03PS1) 10Vgutierrez: hiera: Swap port 80 from varnish to haproxy on codfw@text [puppet] - 10https://gerrit.wikimedia.org/r/924919 (https://phabricator.wikimedia.org/T323557) [12:48:36] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:02] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41456/console" [puppet] - 10https://gerrit.wikimedia.org/r/924919 (https://phabricator.wikimedia.org/T323557) (owner: 10Vgutierrez) [12:49:56] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:51:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/924921 [12:51:53] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/924921 (owner: 10Volans) [12:51:58] (03PS7) 10Ilias Sarantopoulos: ORES: add model versions configuration and thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) [12:52:06] (03PS1) 10Vgutierrez: service::catalog: Set schema depool_threshold to 0.5 [puppet] - 10https://gerrit.wikimedia.org/r/924922 [12:52:32] (03PS1) 10Ladsgroup: wikireplicas: Only try to update section databases in 
maintain_replica_indexes.py [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) [12:54:53] (03CR) 10CI reject: [V: 04-1] wikireplicas: Only try to update section databases in maintain_replica_indexes.py [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [12:55:39] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/924921 (owner: 10Volans) [12:56:07] (03CR) 10Ilias Sarantopoulos: "I also added the configuration for the thresholds instead of loading them from files" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/922512 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos) [12:57:03] (03PS1) 10Volans: Upstream release v7.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924925 [12:57:05] (03PS2) 10Ladsgroup: wikireplicas: Only try to update section databases in maintain_replica_indexes.py [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) [12:57:23] (03CR) 10Volans: [C: 03+2] Upstream release v7.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924925 (owner: 10Volans) [12:57:50] jouncebot: nowandnext [12:57:50] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [12:57:51] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1300) [12:57:53] (03PS3) 10Ladsgroup: wikireplicas: Only try to update section databases in maintain_replica_indexes.py [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) [12:58:57] (03PS1) 10Gehel: Cleanup unused hiera variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/924946 [12:59:13] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1300). [13:00:05] duesen, Urbanecm, Func, and MdsShakil: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [13:00:09] * urbanecm waves [13:00:11] i can deploy today [13:00:18] o/ I'm around too [13:00:19] (03CR) 10CI reject: [V: 04-1] wikireplicas: Only try to update section databases in maintain_replica_indexes.py [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [13:00:20] o/ [13:00:24] (03PS1) 10Urbanecm: Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924939 (https://phabricator.wikimedia.org/T322452) [13:00:35] (03PS1) 10Urbanecm: Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) [13:01:00] urbanecm, taavi: i'm here [13:01:17] let's do it [13:01:24] cool [13:01:35] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:01:38] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Fix first-ever notifications 
[extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924939 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:01:49] (03CR) 10Urbanecm: [C: 03+2] DeleteAction: Replace remaining OOUI fields [core] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [13:01:54] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s: Use kubernetes::clusters_defaults [puppet] - 10https://gerrit.wikimedia.org/r/913114 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:01:59] (03CR) 10Elukey: [C: 03+1] prometheus: disable docker collection in cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/924917 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi) [13:02:03] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924358 (owner: 10D3r1ck01) [13:02:10] (03Merged) 10jenkins-bot: Upstream release v7.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/924925 (owner: 10Volans) [13:02:34] (03PS4) 10Ladsgroup: wikireplicas: Only add index in dbs defined in sections dblist [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) [13:02:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924358 (owner: 10D3r1ck01) [13:02:56] (03CR) 10Gehel: [C: 04-1] "There are actually changes to PCC, so this isn't what I expected." [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [13:03:37] MdsShakil: hi, just checking whether you're around for your patch as well (I'll ping you once it's ready for testing). 
[13:03:47] (03Merged) 10jenkins-bot: Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924358 (owner: 10D3r1ck01) [13:04:10] Hello :/ [13:04:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924358|Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis""]] [13:05:08] MdsShakil: thanks for confirming your presence, and hi! [13:06:07] !log urbanecm@deploy1002 urbanecm and d3r1ck01: Backport for [[gerrit:924358|Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis""]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:06:17] duesen: your patch is at mwdebug1001, can you test please? [13:06:46] (03PS4) 10Urbanecm: [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) [13:06:51] (03CR) 10Urbanecm: [C: 03+2] [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [13:09:04] (03Merged) 10jenkins-bot: [Growth] Enable new Impact for 10 additional wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924060 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [13:09:15] urbanecm: trying now [13:09:17] (03PS1) 10Filippo Giunchedi: prometheus: sort output for class/resource targets [puppet] - 10https://gerrit.wikimedia.org/r/924948 [13:09:20] let me know how it goes [13:10:38] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:50] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:35] (03PS1) 10Hokwelum: Make dumpsdata1006 the nfs primary 
for xmldumps and dumpsdata1005 a spare [puppet] - 10https://gerrit.wikimedia.org/r/924949 (https://phabricator.wikimedia.org/T330573) [13:12:40] urbanecm: tested on https://eo.wikipedia.org/wiki/Bruger:DKinzler_(WMF)/Sandbox2 [13:12:42] looks good. [13:12:46] great, syncing [13:13:06] confirmed that i'm getting an etag consistent with "direct mode" being enabled for the VE backend. [13:13:32] 👍 [13:13:39] 10SRE-swift-storage, 10Cloud-VPS (Debian Stretch Deprecation): Cloud VPS "swift" project Stretch deprecation - https://phabricator.wikimedia.org/T306098 (10MatthewVernon) [13:15:09] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_product@5a38fbf]: (no justification provided) [13:15:15] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_product@5a38fbf]: (no justification provided) (duration: 00m 06s) [13:15:56] !log uploaded spicerack_7.2.0 to apt.wikimedia.org bullseye-wikimedia [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:59] (03PS11) 10JMeybohm: prometheus::k8s switch staging-codfw to client cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) [13:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:26] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924358|Revert "Revert "Switch VisualEditor to not use RESTbase on small and medium wikis""]] (duration: 15m 10s) [13:19:44] duesen: should be live! 
[13:20:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924060|[Growth] Enable new Impact for 10 additional wikis (T336203)]] [13:20:07] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [13:20:45] (03PS2) 10Vgutierrez: hiera: Swap port 80 from varnish to haproxy on text@codfw [puppet] - 10https://gerrit.wikimedia.org/r/924919 (https://phabricator.wikimedia.org/T323557) [13:21:40] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:924060|[Growth] Enable new Impact for 10 additional wikis (T336203)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:21:45] testing... [13:21:47] urbanecm: logs and metrics look good so far [13:21:51] awesome [13:22:21] (03PS4) 10Ottomata: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) [13:22:42] (03CR) 10Fabfur: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/924919 (https://phabricator.wikimedia.org/T323557) (owner: 10Vgutierrez) [13:23:19] (03PS2) 10Hokwelum: Make dumpsdata1006 the nfs primary for xmldumps and dumpsdata1005 a spare [puppet] - 10https://gerrit.wikimedia.org/r/924949 (https://phabricator.wikimedia.org/T325232) [13:23:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41459/console" [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:27:29] (03CR) 10CI reject: [V: 04-1] Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] 
(wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:27:57] ... [13:28:16] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924060|[Growth] Enable new Impact for 10 additional wikis (T336203)]] (duration: 08m 13s) [13:28:21] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [13:28:48] (03PS4) 10Urbanecm: Enable wgMinervaEnableSiteNotice for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923645 (https://phabricator.wikimedia.org/T337683) (owner: 10MdsShakil) [13:29:41] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-text_codfw [13:29:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923645 (https://phabricator.wikimedia.org/T337683) (owner: 10MdsShakil) [13:30:04] (03PS5) 10Ottomata: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) [13:30:30] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:31:07] (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923645 (https://phabricator.wikimedia.org/T337683) (owner: 10MdsShakil) [13:31:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923645|Enable wgMinervaEnableSiteNotice for bnwikiquote (T337683)]] [13:31:41] T337683: Enable wgMinervaEnableSiteNotice for bnwikiquote - https://phabricator.wikimedia.org/T337683 [13:32:37] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s switch staging-codfw to client 
cert auth [puppet] - 10https://gerrit.wikimedia.org/r/913149 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:33:10] !log urbanecm@deploy1002 mdsshakil and urbanecm: Backport for [[gerrit:923645|Enable wgMinervaEnableSiteNotice for bnwikiquote (T337683)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:33:35] MdsShakil: your patch is available for testing at mwdebug1001. can you test it there please? [13:34:32] urbanecm: I have no edit interface access on this wiki, so how can i live test it? [13:34:38] (03CR) 10Stevemunene: Cleanup unused hiera variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924946 (owner: 10Gehel) [13:35:14] MdsShakil: i see. for some reason i thought the sitenotice is currently active there. i'll deploy it as-is then. [13:35:20] !log rm cadvisor.service symlink/alias and restart kubelet on affected hosts - T337836 [13:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] T337836: Cadvisor may be breaking Kubernetes worker nodes - https://phabricator.wikimedia.org/T337836 [13:36:00] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [13:36:37] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - deploy in eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/922839 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [13:37:40] (03PS1) 10Urbanecm: NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924941 (https://phabricator.wikimedia.org/T337320) [13:38:17] (03CR) 10Urbanecm: [C: 03+2] NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/924941 (https://phabricator.wikimedia.org/T337320) (owner: 10Urbanecm) [13:39:21] !log destroy mw-page-content-change-enrich deployment in dse-k8s-eqiad in order to deploy in wikikube - T330507 [13:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:25] T330507: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 [13:40:07] urbanecm: Although i tried to contact with an administrator for active site notice but they are not respond yet [13:40:38] it should be fully deployed within a few seconds, and if it doesn't help, the task can be always reopened and we can investigate. [13:41:02] (03CR) 10Urbanecm: [C: 03+2] Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:41:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923645|Enable wgMinervaEnableSiteNotice for bnwikiquote (T337683)]] (duration: 10m 01s) [13:41:42] T337683: Enable wgMinervaEnableSiteNotice for bnwikiquote - https://phabricator.wikimedia.org/T337683 [13:41:47] MdsShakil: and your patch is deployed! 
[13:41:57] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:42:02] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:42:32] urbanecm Thanks [13:42:35] np [13:44:47] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:44:51] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:45:48] (03PS1) 10JMeybohm: prometheus::k8s Fix removal of bearer_token_file [puppet] - 10https://gerrit.wikimedia.org/r/924953 (https://phabricator.wikimedia.org/T325268) [13:46:28] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T329049) [13:46:32] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 [13:47:18] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 77 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [13:47:20] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:47:39] (03PS1) 10Majavah: dnsrecursor: add reverse DNS for cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/924954 (https://phabricator.wikimedia.org/T335759) [13:48:07] (03CR) 10CI reject: [V: 04-1] prometheus::k8s Fix removal of bearer_token_file [puppet] - 10https://gerrit.wikimedia.org/r/924953 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:50:12] (03Merged) 10jenkins-bot: Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/924939 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:50:16] finally [13:50:17] (03Merged) 10jenkins-bot: DeleteAction: Replace remaining OOUI fields [core] (wmf/1.41.0-wmf.11) - 
10https://gerrit.wikimedia.org/r/924576 (https://phabricator.wikimedia.org/T337809) (owner: 10Daimona Eaytoy) [13:50:20] (03CR) 10CI reject: [V: 04-1] Personalized praise: Fix first-ever notifications [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:50:47] (03CR) 10Urbanecm: [C: 03+2] "third time the charm..." [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [13:51:25] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41460/console" [puppet] - 10https://gerrit.wikimedia.org/r/924953 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:51:31] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924939|Personalized praise: Fix first-ever notifications (T322452)]], [[gerrit:924576|DeleteAction: Replace remaining OOUI fields (T337809)]] [13:51:38] T322452: Personalized praise Echo notification - https://phabricator.wikimedia.org/T322452 [13:51:38] T337809: Error deleting pages on MediaWiki.org: Cannot use object of type OOUI\FieldLayout as array - https://phabricator.wikimedia.org/T337809 [13:51:48] Func: your patch is about to go next [13:51:53] (03CR) 10Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/919300/41452/, all of the changes and failures look unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [13:52:00] ack [13:52:44] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:53:03] !log urbanecm@deploy1002 daimona and urbanecm: Backport for [[gerrit:924939|Personalized praise: Fix first-ever notifications (T322452)]], [[gerrit:924576|DeleteAction: Replace remaining OOUI fields (T337809)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:53:05] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:53:12] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:53:20] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:53:47] urbanecm: Could you help to test by opening the delete form on a page with talk page? The form should be displayed without error, and with a checkbox to delete the talk page too. [13:53:54] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:54:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:55:06] Func: sure. i don't see an error when viewing outside of the debug interface though, the checkbox seems to be visible anyway. [13:55:25] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:55:39] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:55:44] trying on https://en.wikipedia.org/w/index.php?title=George_S._Morrison_(diplomat)&action=delete [13:55:49] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:55:55] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:56:02] (03PS2) 10EoghanGaffney: Apply puppet role to new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/923429 [13:56:08] (03CR) 10Cathal Mooney: "LGTM! I will leave to someone in traffic to +1 as I'm not 100% but this ought to do what we need." [puppet] - 10https://gerrit.wikimedia.org/r/924954 (https://phabricator.wikimedia.org/T335759) (owner: 10Majavah) [13:56:12] urbanecm: only for wmf.11, which has not reached enwiki [13:56:16] ah [13:56:18] of course [13:56:34] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 80 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [13:56:34] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:56:45] Func: now it works. sorry, didn't realize it was a wmf.11-specific thing. proceeding! [13:56:45] (03CR) 10EoghanGaffney: [C: 03+2] Apply puppet role to new releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/923429 (owner: 10EoghanGaffney) [13:56:49] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:56:59] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:57:00] urbanecm: thanks [13:57:48] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:58:04] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:58:12] (03PS2) 10JMeybohm: prometheus::k8s Fix removal of bearer_token_file [puppet] - 10https://gerrit.wikimedia.org/r/924953 (https://phabricator.wikimedia.org/T325268) [13:58:18] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:17] (03CR) 10Majavah: [C: 03+1] "the changes you made LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [14:00:35] (03CR) 10Ssingh: [C: 03+1] "LGTM based on what you shared earlier and matches up to that! Checked PCC output as well." [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [14:00:51] (03CR) 10CI reject: [V: 04-1] NewImpact: Cache empty user impact on account creation [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924941 (https://phabricator.wikimedia.org/T337320) (owner: 10Urbanecm) [14:01:11] ...something's wrong with wmf.10 [14:01:12] (03CR) 10JMeybohm: [C: 03+2] prometheus::k8s Fix removal of bearer_token_file [puppet] - 10https://gerrit.wikimedia.org/r/924953 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:01:17] (03CR) 10Jbond: [C: 03+2] admin: remove ssh key for ksarabia [puppet] - 10https://gerrit.wikimedia.org/r/924903 (owner: 10Jbond) [14:01:56] jbond: okay to merge? 
[14:02:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T329049) [14:02:10] T329049: Configure REST Gateway - https://phabricator.wikimedia.org/T329049 [14:02:16] (03PS1) 10Jbond: Revert "admin: remove ssh key for ksarabia" [puppet] - 10https://gerrit.wikimedia.org/r/924942 [14:02:19] (03PS4) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on some top wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [14:02:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "admin: remove ssh key for ksarabia" [puppet] - 10https://gerrit.wikimedia.org/r/924942 (owner: 10Jbond) [14:02:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924939|Personalized praise: Fix first-ever notifications (T322452)]], [[gerrit:924576|DeleteAction: Replace remaining OOUI fields (T337809)]] (duration: 11m 11s) [14:02:48] jayme: can you cancel and then re-merge [14:02:49] T322452: Personalized praise Echo notification - https://phabricator.wikimedia.org/T322452 [14:02:49] T337809: Error deleting pages on MediaWiki.org: Cannot use object of type OOUI\FieldLayout as array - https://phabricator.wikimedia.org/T337809 [14:02:49] (03PS5) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) [14:02:52] jbond: I see revert [14:02:52] i have sent a revert for mine [14:03:02] ack, canceled [14:03:02] yes you can merge them if the revert is there [14:03:09] or i can merge yours [14:03:22] {done} [14:03:27] thanks [14:03:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:14]
(03PS3) 10Eevans: hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) [14:04:14] urbanecm: parser test issues? [14:04:16] (03PS3) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [14:04:18] (03PS3) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [14:04:37] (03CR) 10CI reject: [V: 04-1] hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [14:04:44] urbanecm: I guess we should backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/923701 to unbreak CI on wmf.10 [14:04:46] kostajh: yes, one of the things that failed in the last few attempts on this patch or the other one. are the parser issues known? [14:04:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Jhancock.wm) @Clement_Goubert It's been a week and I'm not seeing any errors from the lifecycle controller. Do you think this could be resolved now? [14:05:07] thanks Func. [14:05:29] ^ +1 [14:05:34] (03CR) 10Jbond: [C: 03+2] admin: remove ssh key for ksarabia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924903 (owner: 10Jbond) [14:05:49] my guess was https://gerrit.wikimedia.org/r/c/integration/config/+/923562 but I imagine we'd see many more failures if that was the case [14:05:50] i'll bypass the ci instead, wmf.10 will be not used for much longer, and both patches passed in .11 and master. 
[14:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:32] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "bypassing CI: I513703b4c1f002c75afd7d4792d47aa3cca0e726 is not in wmf.10, patch passed in .11 and master." [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924941 (https://phabricator.wikimedia.org/T337320) (owner: 10Urbanecm) [14:06:38] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "bypassing CI: I513703b4c1f002c75afd7d4792d47aa3cca0e726 is not in wmf.10, patch passed in .11 and master." [extensions/GrowthExperiments] (wmf/1.41.0-wmf.10) - 10https://gerrit.wikimedia.org/r/924940 (https://phabricator.wikimedia.org/T322452) (owner: 10Urbanecm) [14:06:51] (03CR) 10Jbond: ferm::service: allow passing array of hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [14:07:02] (03PS4) 10Eevans: hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) [14:07:05] (03PS4) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [14:07:09] thanks both :) [14:07:11] (03PS4) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [14:07:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924941|NewImpact: Cache empty user impact on account creation (T337320)]], [[gerrit:924940|Personalized praise: Fix first-ever notifications (T322452)]] [14:07:21] T337320: [Spike] Investigate A/B test results of Growth Experiments impact module - https://phabricator.wikimedia.org/T337320 
[14:08:01] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [14:08:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:08:49] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:924941|NewImpact: Cache empty user impact on account creation (T337320)]], [[gerrit:924940|Personalized praise: Fix first-ever notifications (T322452)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:08:56] T322452: Personalized praise Echo notification - https://phabricator.wikimedia.org/T322452 [14:10:24] (03CR) 10Ssingh: [C: 03+1] dnsrecursor: add reverse DNS for cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/924954 (https://phabricator.wikimedia.org/T335759) (owner: 10Majavah) [14:10:52] (03CR) 10Ssingh: [C: 03+1] "Let me know if you want someone to deploy this." [puppet] - 10https://gerrit.wikimedia.org/r/924954 (https://phabricator.wikimedia.org/T335759) (owner: 10Majavah) [14:11:02] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/924954/41463/dns1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/924954 (https://phabricator.wikimedia.org/T335759) (owner: 10Majavah) [14:13:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Clement_Goubert) 05Open→03Resolved Yes, thank you, resolving. 
[14:14:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924941|NewImpact: Cache empty user impact on account creation (T337320)]], [[gerrit:924940|Personalized praise: Fix first-ever notifications (T322452)]] (duration: 07m 26s) [14:14:47] * urbanecm done [14:14:49] T337320: [Spike] Investigate A/B test results of Growth Experiments impact module - https://phabricator.wikimedia.org/T337320 [14:14:49] T322452: Personalized praise Echo notification - https://phabricator.wikimedia.org/T322452 [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:01] (03PS1) 10JMeybohm: prometheus::k8s Make client certs readably by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) [14:21:12] (03PS1) 10Bking: rdf-streaming-updater: New docker image and jemalloc config [deployment-charts] - 10https://gerrit.wikimedia.org/r/924958 (https://phabricator.wikimedia.org/T334244) [14:22:01] (03PS2) 10JMeybohm: prometheus::k8s Make client certs readably by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) [14:22:10] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: New docker image and jemalloc config [deployment-charts] - 10https://gerrit.wikimedia.org/r/924958 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [14:22:19] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: New docker image and jemalloc config [deployment-charts] - 10https://gerrit.wikimedia.org/r/924958 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [14:23:21] (03Merged) 10jenkins-bot: rdf-streaming-updater: New docker image and jemalloc config [deployment-charts] - 10https://gerrit.wikimedia.org/r/924958 
(https://phabricator.wikimedia.org/T334244) (owner: 10Bking) [14:25:01] jouncebot: nowandnext [14:25:01] No deployments scheduled for the next 2 hour(s) and 34 minute(s) [14:25:01] In 2 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1700) [14:25:20] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:25:22] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:25:30] (03CR) 10Vgutierrez: [C: 03+2] service::catalog: Set schema depool_threshold to 0.5 [puppet] - 10https://gerrit.wikimedia.org/r/924922 (owner: 10Vgutierrez) [14:25:37] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:27:12] !log bking@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:27:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41464/console" [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:27:54] !log bking@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:28:03] (03PS3) 10JMeybohm: prometheus::k8s Make client certs readably by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) [14:28:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:43] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:35] 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 
(10BBlack) We've got a pair of patches to review now which configure this on the pybal and safe-service-... [14:30:00] (03CR) 10Klausman: [C: 03+1] prometheus: disable docker collection in cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/924917 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi) [14:30:42] (03PS5) 10Eevans: hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) [14:30:44] (03PS5) 10Eevans: hieradata: upgrade cassandra-dev2002 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924611 (https://phabricator.wikimedia.org/T313814) [14:30:46] (03PS5) 10Eevans: hieradata: upgrade cassandra-dev2003 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924612 (https://phabricator.wikimedia.org/T313814) [14:32:06] PROBLEM - Check systemd state on releases2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:08] RECOVERY - Check systemd state on releases2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:28] (03PS1) 10Jbond: puppetdb: create function to check how other hosts have been configured [puppet] - 10https://gerrit.wikimedia.org/r/924961 [14:34:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41465/console" [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:36:22] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:36] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:38:31] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s Make client certs readably by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/924957 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [14:38:33] (03CR) 10Effie Mouzeli: [C: 03+1] Enable parser cache warming jobs for parsoid on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923588 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [14:38:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one final nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [14:39:10] (03CR) 10Ssingh: [C: 03+1] "[Reminder for us]: We should upgrade Pybal on all hosts to 1.15.13 before merging this change." [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [14:39:40] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:39:52] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:08] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:43:21] !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:43:43] (03PS1) 10Elukey: debian: remove cadvisor from the kubelet's systemd unit [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) [14:43:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:44:41] !log bking@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:44:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:46:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:48:23] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10Jhancock.wm) [14:48:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:48:50] 10SRE, 10ops-codfw, 
10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2004-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T337828 (10Jhancock.wm) @aborrero server has been re-racked in B1 - U21 and connected to the cloudsw-b1 switch on port ge-1/0/21. [14:49:49] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:50:35] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:50:55] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:51:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:52:36] (03PS1) 10JMeybohm: prometheus::k8s Enable client cert auth for mlstaging and staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/924965 (https://phabricator.wikimedia.org/T325268) [14:53:16] (03PS2) 10JMeybohm: prometheus::k8s Enable client cert auth for mlstaging and staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/924965 (https://phabricator.wikimedia.org/T325268) [14:53:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:54:19] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:54:43] 
!log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:55:18] !log klausman@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:55:30] !log klausman@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:56:06] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/924596/41470/mw1421.eqiad.wmnet/index.html looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [14:56:18] (03PS3) 10JMeybohm: prometheus::k8s Enable client cert auth for mlstaging and staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/924965 (https://phabricator.wikimedia.org/T325268) [14:58:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:59:34] (03CR) 10Elukey: "original commit: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/477ac9742258bf26348b269befb06db828978b98%5E%2" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [15:00:13] (03CR) 10Ssingh: [C: 03+1] safe-service-restart: use failover i13n (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:01:12] (03PS10) 10BBlack: pybal: configure failover i13n IPs [puppet] - 10https://gerrit.wikimedia.org/r/924593 (https://phabricator.wikimedia.org/T334703) [15:01:14] (03PS4) 10BBlack: safe-service-restart: use failover i13n [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) [15:01:25] (03CR) 10BBlack: safe-service-restart: use failover i13n (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:01:32] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41471/console" [puppet] - 10https://gerrit.wikimedia.org/r/924965 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:02:27] (03PS2) 10Hnowlan: service: move rest-gateway to production [puppet] - 10https://gerrit.wikimedia.org/r/920667 (https://phabricator.wikimedia.org/T329049) [15:03:13] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/924596/41472/mw1421.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:03:57] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s Enable client cert auth for mlstaging and staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/924965 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:04:21] (03CR) 10Klausman: [C: 03+1] debian: remove cadvisor from the kubelet's systemd unit [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [15:04:26] jayme: wow nice [15:05:51] trying to get all the things out of the door before pto :D [15:06:38] launch the granade and hide your hand, got it [15:07:19] (03CR) 10Filippo Giunchedi: [C: 03+1] debian: remove cadvisor from the kubelet's systemd unit [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/924963 (https://phabricator.wikimedia.org/T337836) (owner: 10Elukey) [15:09:31] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: disable docker collection in cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/924917 (https://phabricator.wikimedia.org/T337856) (owner: 10Filippo Giunchedi) [15:11:42] (03CR) 10Cathal Mooney: [C: 03+2] dnsrecursor: add reverse DNS for cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/924954 
(https://phabricator.wikimedia.org/T335759) (owner: 10Majavah) [15:12:49] (03PS1) 10EoghanGaffney: releases: Add new hosts to failover servers list [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) [15:12:51] (03CR) 10BBlack: "Output looks correct now (array form): https://puppet-compiler.wmflabs.org/output/924596/41473/mw1419.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/924596 (https://phabricator.wikimedia.org/T334703) (owner: 10BBlack) [15:13:31] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) @Vgutierrez we are now running bullseye everywhere, so if you are wanting to look at TLS termination on the swift frontends, I think we're not longer blocking you by having... [15:14:11] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41474/console" [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [15:14:55] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase κ – Clean-up): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Jdforrester-WMF) 05Open→03Resolved I believe this is long-since done, thanks to SRE Traffic. 
[15:23:27] (03PS6) 10Elukey: varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 [15:23:29] (03PS10) 10Elukey: profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 [15:23:31] (03PS6) 10Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 [15:24:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41475/console" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [15:25:39] (03PS2) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [15:26:06] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:22] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [15:26:27] (03PS1) 10BBlack: Fix some confusing typos in safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/924973 [15:26:29] (03PS1) 10BBlack: safe-service-restart: pre-verify the verifier [puppet] - 10https://gerrit.wikimedia.org/r/924974 (https://phabricator.wikimedia.org/T334703) [15:26:43] (03CR) 10Elukey: "Thanks a lot for the in depth review John, I tried to apply all the suggestions!" 
[puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [15:27:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41476/console" [puppet] - 10https://gerrit.wikimedia.org/r/924509 (owner: 10Elukey) [15:31:55] (03PS2) 10Jbond: puppetdb: create function to check how other hosts have been configured [puppet] - 10https://gerrit.wikimedia.org/r/924961 [15:31:57] (03PS1) 10Jbond: puppetmaster: create4 function to check parameteres on the puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/924975 (https://phabricator.wikimedia.org/T268344) [15:32:42] (03PS1) 10JMeybohm: prometheus::k8s Reload prometheus on certificate change [puppet] - 10https://gerrit.wikimedia.org/r/924976 (https://phabricator.wikimedia.org/T325268) [15:33:37] (03CR) 10Jbond: puppetmaster: add new function to check for local files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922877 (https://phabricator.wikimedia.org/T268344) (owner: 10Jbond) [15:34:06] PROBLEM - Webrequests Varnishkafka log producer on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [15:34:16] PROBLEM - eventlogging Varnishkafka log producer on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [15:34:19] jouncebot: nowandnext [15:34:19] No deployments scheduled for the next 1 hour(s) and 25 minute(s) [15:34:19] In 1 hour(s) and 25 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1700) [15:34:48] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:35:00] PROBLEM - 
statsv Varnishkafka log producer on cp2035 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [15:35:00] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:35:03] er? [15:35:22] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:35:40] PROBLEM - Varnish HTTP text-frontend - port 80 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:35:50] jouncebot: nowandnext [15:35:50] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [15:35:50] In 1 hour(s) and 24 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1700) [15:35:58] sukhe: -sre [15:35:58] sukhe: probably related with the cookbook run [15:36:08] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:36:09] yeah thanks, saw it on other channel just now! 
[15:36:36] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:36:44] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:36:45] (03PS1) 10Muehlenhoff: debmonitor: Install Debian Django packages on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) [15:36:46] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:36:50] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp2035 is CRITICAL: connect to address 10.192.32.18 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:37:08] (03CR) 10CI reject: [V: 04-1] debmonitor: Install Debian Django packages on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [15:37:29] (03CR) 10Ssingh: [C: 03+1] pybal: Switch codfw LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/924559 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:38:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wikireplicas: Only add index in dbs defined in sections dblist [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [15:39:18] (03PS2) 10Muehlenhoff: debmonitor: Install Debian Django packages on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) [15:39:40] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=1) rolling custom on A:cp-text_codfw [15:40:40] !log vgutierrez@cumin1001 START - Cookbook 
sre.hosts.downtime for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: ipmi/mgmt console issues [15:40:53] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: ipmi/mgmt console issues [15:42:01] !log Maglev LVS scheduler rollout finished in codfw - T263797 [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:06] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [15:42:10] Oh shit, *began* [15:42:26] !log Maglev LVS scheduler rollout began IN PROGRESS, not finished - T263797 [15:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:16] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump image to 1.12.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/924979 (https://phabricator.wikimedia.org/T325303) [15:44:23] (03PS2) 10Ottomata: mw-page-content-change-enrich - bump image to 1.12.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/924979 (https://phabricator.wikimedia.org/T325303) [15:44:57] 10SRE-swift-storage, 10Cloud-VPS (Debian Stretch Deprecation): Cloud VPS "swift" project Stretch deprecation - https://phabricator.wikimedia.org/T306098 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I've now completed the bullseye upgrade for the remaining stretch nodes, and have decommissioned t... 
[15:45:48] !log cp2035 depooled as puppet is unable to run due to ipmi issues - T337247 [15:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:53] T337247: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 [15:46:17] (03CR) 10BCornwall: [C: 03+2] pybal: Switch codfw LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/924559 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall) [15:47:11] !log delete virtual machines from "swift" WMCS project [15:47:12] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 (10Vgutierrez) >>! In T337247#8883045, @Jhancock.wm wrote: > @jcrespo thank you for the insight! > > @BBlack could you assist me with this? when would be a good time for this? I know we're about to go into a holiday weekend. but the serv... [15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:18] (03PS2) 10JMeybohm: prometheus::k8s Reload prometheus on certificate change [puppet] - 10https://gerrit.wikimedia.org/r/924976 (https://phabricator.wikimedia.org/T325268) [15:47:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:48:27] Emperor: would that be better in the project's SAL as well? 
[15:48:33] !log brett@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 [15:48:37] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [15:48:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [15:50:57] !log brett@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 02m 24s) [15:51:03] !log brett@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 [15:51:25] !log swift delete virtual machines from "swift" WMCS project [15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] PROBLEM - pybal on lvs2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:51:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:52:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:52:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41478/console" [puppet] - 10https://gerrit.wikimedia.org/r/924976 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:53:03] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] prometheus::k8s Reload prometheus on certificate change [puppet] - 10https://gerrit.wikimedia.org/r/924976 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [15:53:04] RECOVERY - pybal on lvs2010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal 
https://wikitech.wikimedia.org/wiki/PyBal [15:53:18] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:53:28] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:55:18] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:55:35] ^ expected [15:57:30] (03CR) 10Dzahn: "Membership of ops group in LDAP and YAML are not identical: ['fabfur', 'nskaggs']" [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [15:57:52] (03CR) 10Dzahn: admin: Add fabfur user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [15:57:54] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:58:27] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10Dzahn) Membership of ops group in LDAP and YAML are not identical: ['fabfur', 'nskaggs'] ^ this would still need to be fixed [15:58:35] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] admin: Add fabfur user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920209 (owner: 10Fabfur) [15:58:44] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [15:58:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:59:12] PROBLEM - pybal on lvs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal 
[16:01:21] (03PS1) 10Majavah: ldap: Drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/924981 [16:01:23] (03PS1) 10Majavah: sudo: remove sudoldap support [puppet] - 10https://gerrit.wikimedia.org/r/924982 [16:01:25] (03PS1) 10Majavah: P:wmcs::instance: drop stretch-backports config [puppet] - 10https://gerrit.wikimedia.org/r/924983 [16:01:27] (03PS1) 10Majavah: ldap: inline yamlconfig [puppet] - 10https://gerrit.wikimedia.org/r/924984 [16:01:29] (03PS1) 10Majavah: ldap::client::sssd: use strongly typed parameters [puppet] - 10https://gerrit.wikimedia.org/r/924985 [16:02:01] (03PS3) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [16:02:54] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:02:56] (03PS2) 10Effie Mouzeli: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) [16:03:23] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 425 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:03:38] (03PS3) 10Ottomata: mw-page-content-change-enrich - bump image to 1.20.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/924979 (https://phabricator.wikimedia.org/T325303) [16:03:41] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 429 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:03:47] RECOVERY - Webrequests Varnishkafka log producer on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:03:47] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2035 is OK: 
HTTP OK: HTTP/1.1 200 OK - 429 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:03:51] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 430 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:03:52] (03CR) 10Ottomata: [V: 03+2 C: 03+2] mw-page-content-change-enrich - bump image to 1.20.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/924979 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [16:03:57] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 430 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:03:57] RECOVERY - eventlogging Varnishkafka log producer on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:04:03] (03CR) 10CI reject: [V: 04-1] ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:04:17] RECOVERY - statsv Varnishkafka log producer on cp2035 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:04:39] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [16:04:58] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:05:41] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 430 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:06:57] (03PS1) 10Ottomata: mw-page-content-change-enrich - topics should be a list [deployment-charts] - 10https://gerrit.wikimedia.org/r/924986 [16:07:48] (03CR) 10Ottomata: [C: 03+2] 
mw-page-content-change-enrich - topics should be a list [deployment-charts] - 10https://gerrit.wikimedia.org/r/924986 (owner: 10Ottomata) [16:08:27] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [16:08:39] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:09:26] (03PS4) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [16:10:31] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:10:53] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [16:10:58] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:11:55] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 430 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:11:55] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2035 is OK: HTTP OK: HTTP/1.1 200 OK - 430 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:11:58] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 3 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10Kappakayala) @LSobanski will connect on this to gather more information and discuss on next steps. 
[16:12:19] RECOVERY - pybal on lvs2009 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:12:55] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp2035.codfw.wmnet [16:12:55] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2035.codfw.wmnet [16:12:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:42] !log repool cp2035 - T337247 T323557 [16:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:47] T337247: ManagementSSHDown - https://phabricator.wikimedia.org/T337247 [16:13:47] T323557: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 [16:14:05] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 77 connections established with conf2005.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [16:17:11] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:17:47] (03PS5) 10Jbond: proffile::firewall: create new firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) [16:17:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:06] (03CR) 10Jbond: "thanks ill look to merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [16:18:21] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args 
/usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:18:33] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:20:10] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on P{cp[2037,2039,2041].codfw.wmnet} and A:cp [16:20:12] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10bking) Per today's SRE meeting, the larger SRE org is working on [[ https://etherpad.wikimedia.org/p/alert-review-may-2... [16:21:36] (03CR) 10Jbond: "Looks good to me cheers" [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [16:21:47] (03CR) 10Jbond: [C: 03+1] profile::cache::kafka: add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/924507 (owner: 10Elukey) [16:21:57] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:30] !log `systemctl reset-failed session-c6111.scope session-c7230.scope` on stat1005 to clear old alerts [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:46] (transient units) [16:28:21] (03PS5) 10Ladsgroup: wikireplicas: Only add index in dbs defined in sections dblist [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) [16:28:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wikireplicas: Only add index in dbs defined in sections dblist [puppet] - 10https://gerrit.wikimedia.org/r/924923 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [16:32:32] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:32:36] RECOVERY - BGP status on 
cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/922815 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [16:37:25] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@5379d83]: (no justification provided) [16:37:59] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@5379d83]: (no justification provided) (duration: 00m 34s) [16:50:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:51:04] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:53:12] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:53:36] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [16:53:38] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:54:06] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:54:50] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [16:55:12] Starting work now to swap in dumpsdata1006 as primary nfs dumps server, replacing dumpsdata1005 [16:55:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:55:27] (03PS5) 10Effie Mouzeli: ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) [16:56:06] (03CR) 10CI reject: [V: 04-1] ipoid: Create iPoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/921700 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [16:56:40] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:59:31] !log brett@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 [16:59:36] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [16:59:51] ^Continuing LVS rollout, had lost shell connection (and wasn't using tmux like a good boy) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1700) [17:00:25] tsk! [17:00:53] brett: nothing planned in the MW infra window, you're fine to continue [17:01:04] thank you! 
[17:05:22] (03PS1) 10Kimberly Sarabia: Enables ab test for multiple languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) [17:06:45] jouncebot: nowandnext [17:06:45] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1700) [17:06:45] In 0 hour(s) and 53 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1800) [17:06:45] In 0 hour(s) and 53 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1800) [17:07:15] brett: can you ping me once you're done? [17:07:28] need to deploy some stuff [17:07:32] Amir1: Sure! I'm finishing the last one now [17:07:54] RECOVERY - pybal on lvs2012 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:08:02] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:08:03] no rush, a late deploy is always preferred to a fully depooled appserver cluster [17:08:09] ha [17:08:28] Amir1: this is a scary joke [17:08:30] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:08:38] I know :P [17:09:22] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [17:09:43] Amir1: better now than last year! 
[17:10:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on P{cp[2037,2039,2041].codfw.wmnet} and A:cp [17:10:30] !log Maglev LVS scheduler rollout in codfw finished - T263797 [17:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:35] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [17:10:38] !log brett@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 (duration: 11m 07s) [17:10:40] Amir1: Good to go [17:10:42] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [17:10:54] awesome. I wait ten minutes on top. Just in case [17:12:19] Amir1: I keep https://gerrit.wikimedia.org/r/c/operations/dns/+/924561 prepped for LVS work. the "just in case" part :) [17:12:36] thankfully bblack's recent patches should help alleviate this [17:12:58] ah nice [17:15:53] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 3 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10Dzahn) Maybe the admins listed here would be able to help: https://wikitech.wikimedia.org/wiki/Techblog.wikimedia.org#Git... 
[17:15:58] (03PS3) 10Ladsgroup: Remove legacy encoding option from dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) [17:16:01] (03CR) 10Ladsgroup: [C: 03+2] Remove legacy encoding option from dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) (owner: 10Ladsgroup) [17:16:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) (owner: 10Ladsgroup) [17:16:46] (03Merged) 10jenkins-bot: Remove legacy encoding option from dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924885 (https://phabricator.wikimedia.org/T128155) (owner: 10Ladsgroup) [17:17:11] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:924885|Remove legacy encoding option from dawiktionary (T128155)]] [17:17:16] T128155: Migrate all old DB rows from windows-1252 to UTF-8 on dawiktionary - https://phabricator.wikimedia.org/T128155 [17:18:42] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:924885|Remove legacy encoding option from dawiktionary (T128155)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [17:19:37] (03PS1) 10Ladsgroup: Depool half of wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924995 (https://phabricator.wikimedia.org/T337734) [17:24:27] Amir1: that patch ^ isn't related to T337700 is it...? 
[17:24:28] T337700: Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T337700 [17:24:45] (only ask as da.wikipedia has just had a lot of problems) [17:24:50] no, it's not [17:24:53] that LQT [17:25:01] *that's LQT [17:25:23] it's happening in huwiki too, that's not even on legacy encoding [17:25:41] seems like it might not be LQT, further looking found https://phabricator.wikimedia.org/T337700#8893177 (I had to delete some MediaWiki messages) [17:26:35] but that's dawiki, not dawiktionary [17:28:52] Amir1: it's more than LQTA [17:28:58] LQT* [17:30:05] sure but regardless, it's not related to the legacy encoding clean up. huwiki never was on legacy encoding and I've been doing dawiktionary and not dawiki [17:30:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:924885|Remove legacy encoding option from dawiktionary (T128155)]] (duration: 12m 54s) [17:30:11] T128155: Migrate all old DB rows from windows-1252 to UTF-8 on dawiktionary - https://phabricator.wikimedia.org/T128155 [17:30:19] (03CR) 10Nskaggs: [C: 03+1] Depool half of wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924995 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [17:30:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:30:54] ? [17:31:16] nah, just a deploy bleep. Good [17:31:28] (03CR) 10Ladsgroup: [C: 03+2] Depool half of wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/924995 (https://phabricator.wikimedia.org/T337734) (owner: 10Ladsgroup) [17:31:55] !log ladsgroup@deploy1002 Backport cancelled. 
[17:33:45] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T337705 (10wiki_willy) a:03Jhancock.wm [17:35:03] (03CR) 10CDanis: [C: 03+2] Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [17:35:05] (03CR) 10ArielGlenn: [C: 03+2] Make dumpsdata1006 the nfs primary for xmldumps and dumpsdata1005 a spare [puppet] - 10https://gerrit.wikimedia.org/r/924949 (https://phabricator.wikimedia.org/T325232) (owner: 10Hokwelum) [17:35:19] (03CR) 10CDanis: [C: 03+2] Allow query parameters in network probe url [puppet] - 10https://gerrit.wikimedia.org/r/923448 (https://phabricator.wikimedia.org/T337317) (owner: 10Jameel Kaisar) [17:35:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:35:45] cdanis: ok to merge yer stuff? [17:35:58] apergos: please do [17:36:05] just got the message that you have the lock :D [17:36:24] done [17:36:41] ty [17:36:44] yep, amazing timing there [17:37:02] 10SRE, 10Release-Engineering-Team, 10Security-Team, 10Wikimedia-GitHub, and 3 others: Add github.com/wikimedia as an SCM for Semgrep Cloud - https://phabricator.wikimedia.org/T337561 (10Aklapper) > because there doesn't seem to be a list of who has access in Github For the records, https://github.com/orgs... 
[17:37:12] apergos: it's actually even better than you thought, you got one of my patches but not the second [17:37:19] lolol [17:40:33] 10SRE, 10Traffic, 10Epic: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10bd808) [17:43:35] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 7 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [17:43:59] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Aklapper) Hi @chadhat, thanks for taking the time to report this and welcome to Wikimedia Phabricator! > For our current project we are trying to identify the edits of a sub... [17:46:14] (03CR) 10BryanDavis: [C: 03+1] signup:blocklist Expand blocklist feature (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/919005 (owner: 10Slyngshede) [17:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:58:02] (03PS1) 10Ottomata: EventStreamConfig - page_change - Remove unused streams and settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924998 (https://phabricator.wikimedia.org/T336817) [18:00:05] dduvall and ^demon: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1800). [18:00:05] dduvall and ^demon: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T1800) [18:01:01] dduvall: ^demon i have a non-urgent config change to deploy, can you let me know when the train is clear to do so? [18:01:20] !log ran `wikiadmin2023@10.64.32.139(huwiki)> UPDATE thread SET thread_signature = ' [[User:Gubbubu|Γουββος Θιλο' WHERE thread_id = 1288;` (with `BEGIN`/`COMMIT`) for T337700 [18:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:26] T337700: Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T337700 [18:16:10] (03CR) 10Volans: debmonitor: Install Debian Django packages on Bookworm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:19:56] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:21:20] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:24:26] ottomata: will do. 
deploying momentarily [18:25:01] (03CR) 10Muehlenhoff: debmonitor: Install Debian Django packages on Bookworm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:26:50] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925001 (https://phabricator.wikimedia.org/T337525) [18:26:52] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925001 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:27:34] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925001 (https://phabricator.wikimedia.org/T337525) (owner: 10TrainBranchBot) [18:30:08] (03CR) 10Volans: debmonitor: Install Debian Django packages on Bookworm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/924977 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:34:52] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.11 refs T337525 [18:34:58] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [18:38:06] (03CR) 10Volans: [C: 04-1] "Did a first pass, it still needs some adjustments and to move some bits to spicerack." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [18:38:16] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_product@b3eb622]: (no justification provided) [18:38:23] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_product@b3eb622]: (no justification provided) (duration: 00m 07s) [18:40:54] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.11 refs T337525 (duration: 06m 02s) [18:40:59] T337525: 1.41.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T337525 [18:51:17] (03PS1) 10Dwisehaupt: Shift names that reference frdata to codfw host [dns] - 10https://gerrit.wikimedia.org/r/925004 (https://phabricator.wikimedia.org/T335446) [18:51:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) >>! In T326685#8816875, @Jclark-ctr wrote: > dns1004. A6. U.8 PORT. 11 CABLEID 1038 > dns1005. B6 U.5 PORT. 0 CABLEID 1969 > dns1006. C6 U27. PORT.27 CABLEID 3249 cabl... [18:53:37] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:54:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) [18:55:29] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10mpopov) **Note for SRE**: KC is already in `analytics-privatedata-users` group (and has gone through all the steps necessary for that) – this ticke... 
[18:56:43] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new dns100[345] - robh@cumin1001" [18:57:48] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new dns100[345] - robh@cumin1001" [18:57:48] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:58:02] (03CR) 10Jgreen: [C: 03+2] Shift names that reference frdata to codfw host [dns] - 10https://gerrit.wikimedia.org/r/925004 (https://phabricator.wikimedia.org/T335446) (owner: 10Dwisehaupt) [18:58:40] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dns1004 [18:59:53] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1004 [19:05:42] (03CR) 10Eevans: [C: 03+2] hieradata: upgrade cassandra-dev2001 to Cassandra 4.1 [puppet] - 10https://gerrit.wikimedia.org/r/924610 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:05:56] swapping in dumpsdata1006 as primary nfs dumps server, replacing dumpsdata1005 now completed! [19:07:20] (03CR) 10Cwhite: [C: 03+2] hiera: disable security plugin on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/912391 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:09:59] !log swapping in dumpsdata1006 as primary nfs dumps server, replacing dumpsdata1005 now completed! 
[19:10:01] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dns1005 [19:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) [19:11:21] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1005 [19:11:31] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host dns1006 [19:11:59] (03PS1) 10Cwhite: opensearch: add disable_security_plugin to instanceparams [puppet] - 10https://gerrit.wikimedia.org/r/924143 (https://phabricator.wikimedia.org/T333732) [19:12:39] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1006 [19:12:56] (03CR) 10Cwhite: [C: 03+2] opensearch: add disable_security_plugin to instanceparams [puppet] - 10https://gerrit.wikimedia.org/r/924143 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [19:13:10] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [19:13:43] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1005'] [19:13:48] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1006'] [19:14:06] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns1004'] [19:14:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) [19:14:32] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns1005'] [19:14:37] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns1006'] [19:15:30] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) 05Open→03In progress a:05Jclark-ctr→03RobH ran network port setup steps, bios/idrac setup steps/dns/network steps applying firmware updates [19:16:28] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [19:16:55] !log robh@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['dns1004'] [19:17:50] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:59] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns1005.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:01] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dns1006.mgmt.eqiad.wmnet with reboot policy FORCED [19:20:47] !log we started swapping in dumpsdata1006 as primary nfs dumps server, replacing dumpsdata1005 at 16:55 UTC and completed at 19:09 UTC [19:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:58] (03CR) 10Jdrewniak: [C: 03+1] Enables ab test for multiple languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [19:32:44] (03PS1) 10Arlolra: Fix description link icon positioning [extensions/ImageMap] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925007 (https://phabricator.wikimedia.org/T329364) [19:33:46] dduvall: how goes? 
[19:34:25] ottomata: all clear [19:36:28] (03CR) 10Subramanya Sastry: [C: 03+1] Fix description link icon positioning [extensions/ImageMap] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925007 (https://phabricator.wikimedia.org/T329364) (owner: 10Arlolra) [19:36:39] ty [19:38:32] (03PS28) 10Jameel Kaisar: Set DoProbe cookie to initiate a probe [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) [19:41:12] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [19:48:25] (03PS2) 10Kimberly Sarabia: Enables ab test for multiple languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) [19:51:07] (03PS1) 10Ladsgroup: Pin AbuseFilterEnableBlockedExternalDomain to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925013 (https://phabricator.wikimedia.org/T337431) [19:54:42] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:54:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns1005.mgmt.eqiad.wmnet with reboot policy FORCED [19:54:47] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns1006.mgmt.eqiad.wmnet with reboot policy FORCED [19:54:54] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [19:55:29] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1004'] [19:55:47] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [19:55:53] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns1004'] [19:57:05] !log robh@cumin1001 START - 
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [19:57:49] (03PS1) 10RLazarus: opentelemetry-collector: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/925015 (https://phabricator.wikimedia.org/T324117) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230531T2000). [20:00:05] Sohom_Datta, kimberly_sarabia, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hello [20:00:14] i can deploy today [20:00:24] o/ [20:01:10] (03CR) 10Urbanecm: Enable EditInSequence for beta-testing on napwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:01:22] arlolra: hi, around for your patch? 
[20:01:57] (03CR) 10Urbanecm: [C: 03+2] Enables ab test for multiple languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:01:59] yup [20:02:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:02:14] arlolra: great, +2'ing it and i'll let you know once it's ready [20:02:17] (03CR) 10Urbanecm: [C: 03+2] Fix description link icon positioning [extensions/ImageMap] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925007 (https://phabricator.wikimedia.org/T329364) (owner: 10Arlolra) [20:02:20] thank you [20:02:37] (03PS2) 10RLazarus: opentelemetry-collector: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/925015 (https://phabricator.wikimedia.org/T324117) [20:02:45] (03CR) 10Sohom Datta: Enable EditInSequence for beta-testing on napwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:02:47] (03Merged) 10jenkins-bot: Enables ab test for multiple languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924994 (https://phabricator.wikimedia.org/T336969) (owner: 10Kimberly Sarabia) [20:03:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:924994|Enables ab test for multiple languages (T336969)]] [20:03:21] T336969: [Zebra AB test] Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:04:40] (03CR) 10Urbanecm: Enable EditInSequence for beta-testing on napwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:05:38] !log urbanecm@deploy1002 ksarabia and urbanecm: Backport for [[gerrit:924994|Enables 
ab test for multiple languages (T336969)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:06:39] (03PS4) 10Urbanecm: Enable EditInSequence for beta-testing on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:06:57] kimberly_sarabia: hi, your patch is available at mwdebug1002. can you test please? [20:07:07] (03CR) 10Urbanecm: [C: 03+2] Enable EditInSequence for beta-testing on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:07:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1004'] [20:07:47] urbanecm: LGTM [20:07:51] thanks, syncing [20:07:53] (03Merged) 10jenkins-bot: Enable EditInSequence for beta-testing on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/923751 (https://phabricator.wikimedia.org/T337472) (owner: 10Sohom Datta) [20:07:54] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1005'] [20:07:56] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1006'] [20:09:27] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [20:10:10] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns1004'] [20:15:12] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:924994|Enables ab test for multiple languages (T336969)]] (duration: 11m 56s) [20:15:16] T336969: [Zebra AB test] Fix the mixing of global and user IDs for AB Test Enrollment Bucketing - https://phabricator.wikimedia.org/T336969 [20:16:08] kimberly_sarabia: your patch should be live [20:16:13] thanks! 
[20:16:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:923751|Enable EditInSequence for beta-testing on napwikisource (T337472)]] [20:16:31] T337472: Enable Edit-in-Sequence on napwikisource - https://phabricator.wikimedia.org/T337472 [20:16:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1005'] [20:17:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1006'] [20:18:08] !log urbanecm@deploy1002 soda and urbanecm: Backport for [[gerrit:923751|Enable EditInSequence for beta-testing on napwikisource (T337472)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:19:18] (03Merged) 10jenkins-bot: Fix description link icon positioning [extensions/ImageMap] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/925007 (https://phabricator.wikimedia.org/T329364) (owner: 10Arlolra) [20:19:34] Sohom_Datta: your patch is available at mwdebug1002. can you test please? [20:20:00] Yep just tested, LGTM [20:20:47] thanks, proceeding [20:22:00] !log mforns@deploy1002 Started deploy [analytics/refinery@04c11e6]: Regular analytics weekly train [analytics/refinery@04c11e6] [20:26:36] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:923751|Enable EditInSequence for beta-testing on napwikisource (T337472)]] (duration: 10m 09s) [20:26:40] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1004'] [20:26:41] T337472: Enable Edit-in-Sequence on napwikisource - https://phabricator.wikimedia.org/T337472 [20:26:44] Sohom_Datta: deployed. thanks! 
[20:27:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:925007|Fix description link icon positioning (T329364)]] [20:27:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1004'] [20:27:17] T329364: Description link no longer displayed when $wgParserEnableLegacyMediaDOM set to false - https://phabricator.wikimedia.org/T329364 [20:27:54] !log mforns@deploy1002 Finished deploy [analytics/refinery@04c11e6]: Regular analytics weekly train [analytics/refinery@04c11e6] (duration: 05m 53s) [20:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) [20:28:04] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:28] !log mforns@deploy1002 Started deploy [analytics/refinery@04c11e6] (thin): Regular analytics weekly train THIN [analytics/refinery@04c11e6] [20:28:32] !log mforns@deploy1002 Finished deploy [analytics/refinery@04c11e6] (thin): Regular analytics weekly train THIN [analytics/refinery@04c11e6] (duration: 00m 04s) [20:28:41] !log mforns@deploy1002 Started deploy [analytics/refinery@04c11e6] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@04c11e6] [20:28:56] !log urbanecm@deploy1002 arlolra and urbanecm: Backport for [[gerrit:925007|Fix description link icon positioning (T329364)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:28:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) a:05RobH→03ssingh These are now ready for imaging! =] >>! In T326685#8876909, @ssingh wrote: > @Jclark-ctr: Hi John, Traffic has completed its work on the dns hosts in codfw,... 
[20:29:10] arlolra: your patch is available at mwdebug1002. can you test? [20:29:15] ok [20:30:10] !log mforns@deploy1002 Finished deploy [analytics/refinery@04c11e6] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@04c11e6] (duration: 01m 29s) [20:30:50] urbanecm: looks good [20:30:59] Thank you :) looks good :) [20:31:03] thanks, deploying [20:31:20] no problem Sohom_Datta, great to hear that. [20:31:29] (03PS1) 10TChin: Fix overlapping names edge case in flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) [20:37:29] (03CR) 10Ottomata: [C: 03+1] "Ah cool, don't forget chart version bump!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [20:40:04] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:925007|Fix description link icon positioning (T329364)]] (duration: 12m 51s) [20:40:09] T329364: Description link no longer displayed when $wgParserEnableLegacyMediaDOM set to false - https://phabricator.wikimedia.org/T329364 [20:40:17] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10nskaggs) I presume resolving {T337848} would unblock the situation you described, yes? And if I'm reading correctly, your desire would be to help in WMCS spaces longer term yes... 
[20:40:34] (03PS1) 10Zabe: manage-dblist: Add close command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925017 [20:40:43] arlolra: and deployed :) [20:41:04] thank you very much [20:41:20] (03CR) 10FNegri: [C: 03+2] backy2: switch from sqlite to postgres backend [puppet] - 10https://gerrit.wikimedia.org/r/923410 (https://phabricator.wikimedia.org/T332734) (owner: 10Andrew Bogott) [20:41:27] np [20:41:27] (03PS2) 10TChin: Fix overlapping names edge case in flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/925016 (https://phabricator.wikimedia.org/T336185) [20:43:59] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925017 (owner: 10Zabe) [20:45:35] urbanecm: can I merge this patch? only rebase needed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/925013 [20:45:40] (no deploy) [20:45:45] Amir1: go ahead, i'm done [20:45:51] (03PS2) 10Ladsgroup: Pin AbuseFilterEnableBlockedExternalDomain to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925013 (https://phabricator.wikimedia.org/T337431) [20:45:55] (03CR) 10Ladsgroup: [C: 03+2] Pin AbuseFilterEnableBlockedExternalDomain to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925013 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [20:46:00] awesome. 
Thanks [20:46:40] (03Merged) 10jenkins-bot: Pin AbuseFilterEnableBlockedExternalDomain to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925013 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [20:46:57] done [20:47:08] (03CR) 10Zabe: [C: 03+2] manage-dblist: Add close command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925017 (owner: 10Zabe) [20:48:02] (03Merged) 10jenkins-bot: manage-dblist: Add close command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925017 (owner: 10Zabe) [20:49:33] (03PS2) 10Zabe: Start reading from rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924605 (https://phabricator.wikimedia.org/T299954) [20:51:01] (03CR) 10Zabe: [C: 03+2] Start reading from rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924605 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [20:51:41] (03Merged) 10jenkins-bot: Start reading from rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924605 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [20:52:41] !log zabe@deploy1002 Started scap: Backport for [[gerrit:924605|Start reading from rev_comment_id everywhere (T299954)]] [20:52:46] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [20:54:15] !log zabe@deploy1002 zabe: Backport for [[gerrit:924605|Start reading from rev_comment_id everywhere (T299954)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:54:55] (03PS5) 10Catrope: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064) [20:54:57] (03PS5) 10Catrope: beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913002 
(https://phabricator.wikimedia.org/T319064) [20:54:59] (03PS3) 10Catrope: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) [20:55:01] (03PS3) 10Catrope: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) [20:55:22] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 52 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:55:56] !log foreachwikiindblist group0 extensions/AbuseFilter/maintenance/MigrateActorsAF.php (T336224) [20:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:01] T336224: Run MigrateActorsAF on all wikis - https://phabricator.wikimedia.org/T336224 [20:57:25] !log foreachwikiindblist group1 extensions/AbuseFilter/maintenance/MigrateActorsAF.php (T336224) [20:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:42] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for nskaggs - https://phabricator.wikimedia.org/T337571 (10nskaggs) >>! In T337571#8886056, @Volans wrote: > I'm wondering if this was the right long term approach. In general we're trying to reduce the need for global root, not expand it. > I see th... 
[20:58:18] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:50] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:26] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:924605|Start reading from rev_comment_id everywhere (T299954)]] (duration: 07m 44s) [21:00:31] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [21:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:04:42] Is everyone done deploying now? [21:04:52] yup [21:05:10] I need to do some testing that requires staging stuff on the deployment server briefly (on its way to mwdebug1001) [21:05:12] Ok great [21:05:28] have fun [21:06:28] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 34 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:11] (03PS1) 10Zabe: Stop writing to revision_comment_temp in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) [21:15:37] (03CR) 10Zabe: [C: 04-2] "need to wait until wiki replica 
views are dropped" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925047 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [21:22:12] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:34:35] (03CR) 10Catrope: [C: 03+2] beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [21:34:44] (03CR) 10Catrope: [C: 03+2] beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [21:35:26] (03Merged) 10jenkins-bot: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [21:35:29] (03Merged) 10jenkins-bot: beta: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [21:48:25] (03PS9) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [21:48:42] (03CR) 10BCornwall: Create cookbook to upgrade Apache Traffic Server (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [21:51:08] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [21:53:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: 
wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:10:43] (03CR) 10BCornwall: sre.cdn: move common functions to base class (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/923662 (owner: 10Jbond) [22:16:50] PROBLEM - Check systemd state on doc1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:54] RECOVERY - Check systemd state on doc1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:24] (03PS20) 10Cwhite: prometheus: generate swagger targets from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) [22:27:43] (03PS10) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [22:35:07] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [22:36:23] (03PS2) 10EoghanGaffney: releases: Add new hosts to failover servers list [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) [22:36:25] (03PS3) 10EoghanGaffney: releases: Ensure rsync jobs get removed on the non-active machine [puppet] - 10https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) [22:37:14] (03CR) 10Cwhite: [C: 03+2] prometheus: generate swagger targets from service catalog [puppet] - 10https://gerrit.wikimedia.org/r/916914 
(https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:37:34] (CR) Cwhite: [C: +2] prometheus: generate swagger targets from service catalog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/916914 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[22:38:40] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41479/console" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[22:39:32] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 33 probes of 797 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:43:02] (PS2) BCornwall: sre.cdn: move common functions to base class [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[22:45:48] (CR) CI reject: [V: -1] sre.cdn: move common functions to base class [cookbooks] - https://gerrit.wikimedia.org/r/923662 (owner: Jbond)
[22:46:40] (PS4) EoghanGaffney: releases: Ensure rsync jobs get removed on the non-active machine [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435)
[22:47:55] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41480/console" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[22:49:13] (CR) BCornwall: Create cookbook to upgrade Apache Traffic Server (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall)
[22:49:36] (PS1) Cwhite: prometheus: remove invalid cluster key [puppet] - https://gerrit.wikimedia.org/r/924144 (https://phabricator.wikimedia.org/T320620)
[22:58:31] (CR) Cwhite: [C: +2] "PCC OK: https://puppet-compiler.wmflabs.org/output/924144/41481/" [puppet] - https://gerrit.wikimedia.org/r/924144 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[23:07:07] (PS1) Cwhite: prometheus: disable new swagger job [puppet] - https://gerrit.wikimedia.org/r/924145 (https://phabricator.wikimedia.org/T320620)
[23:13:54] (CR) Cwhite: [C: +2] "PCC OK: https://puppet-compiler.wmflabs.org/output/924145/41482/" [puppet] - https://gerrit.wikimedia.org/r/924145 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[23:27:06] (CR) Dzahn: [C: +1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[23:28:42] (CR) Dzahn: [C: +2] releases: Ensure rsync jobs get removed on the non-active machine [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[23:35:21] (PS1) Cwhite: prometheus: ensure absent invalid swagger targets file [puppet] - https://gerrit.wikimedia.org/r/925106 (https://phabricator.wikimedia.org/T320620)
[23:36:30] (CR) Cwhite: [C: +2] prometheus: ensure absent invalid swagger targets file [puppet] - https://gerrit.wikimedia.org/r/925106 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[23:41:42] (CR) Reedy: deployment_server: Migrate tools/release to gitlab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: Jeena Huneidi)
[23:46:41] (CR) CI reject: [V: -1] releases: clone repos/releng/release from gitlab [puppet] - https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: Reedy)
[23:46:48] behave
[23:47:27] We can't have follows-up in puppet?
[23:48:17] Reedy: the commit message validator is rather strict :D
[23:48:17] (CR) Dzahn: [C: +2] "Confirmed this was a noop on all 3 non-active hosts and on releases1002 rsync and ferm config snippets were removed!" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[23:48:43] apparently so
[23:49:26] (CR) Dzahn: [C: +2] "while doing that noticed an entirely unrelated problem:" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[23:49:29] (PS3) Reedy: releases: clone repos/releng/release from gitlab [puppet] - https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260)
[23:49:40] Reedy: an example to add a new header https://gerrit.wikimedia.org/r/c/integration/commit-message-validator/+/862233 :D
[23:50:59] I'm not even 100% sure what the correct header is... Follow-Up?
[23:51:02] (CR) Dzahn: [C: +2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/925033/" [puppet] - https://gerrit.wikimedia.org/r/924085 (https://phabricator.wikimedia.org/T334435) (owner: EoghanGaffney)
[23:56:15] Reedy: https://gerrit.wikimedia.org/r/q/message%253DFollows-Up versus https://gerrit.wikimedia.org/r/q/message%253DFollow-Up
[23:57:10] and with the fuzzy search there are way more at https://gerrit.wikimedia.org/r/q/message:%2522follow+up%2522 which indicates people usually says "Follow-up to xxxxx"
[23:57:16] rather than as a meta header
[23:57:39] (CR) Dzahn: [C: +2] "In hindsight I don't get why I +2ed this with the "ensure latest" in it (which is considered an anti-pattern) but also claimed in the comm" [puppet] - https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: Jeena Huneidi)
[23:57:49] or maybe I got the search queries wrong
[23:57:52] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units
failed: rsync-srv-org-wikimedia-releases-releases2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:59:14] I think both get used on occasion....
[23:59:28] Gerrit has the followup button... but it doesn't leave anything behind