[00:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945831 [00:38:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945831 (owner: 10TrainBranchBot) [00:54:04] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945831 (owner: 10TrainBranchBot) [01:25:06] 10SRE, 10Observability-Logging, 10Observability-Metrics, 10serviceops: Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10Krinkle) [01:29:27] 10SRE, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Krinkle) [01:29:36] 10SRE, 10Traffic, 10WikimediaDebug, 10Developer Productivity: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794 (10Krinkle) [01:47:26] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet de... 
- https://phabricator.wikimedia.org/T172497 [01:47:54] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, 10Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Krinkle) [02:02:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:04:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:08] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:12] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:12] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:18] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:33] 
(JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:57:35] Is it just me or are API requests taking unusually long? Seeing issues with an IRC bot and NavPopups on enwiki [02:58:15] *and pageload times [03:01:16] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:26:52] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [04:12:06] (03PS1) 10Marostegui: db21[88-95]: Notifications disabled [puppet] - 10https://gerrit.wikimedia.org/r/945963 (https://phabricator.wikimedia.org/T342174) [04:13:50] (03CR) 10Marostegui: [C: 03+2] db21[88-95]: Notifications disabled [puppet] - 10https://gerrit.wikimedia.org/r/945963 (https://phabricator.wikimedia.org/T342174) (owner: 10Marostegui) [04:15:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install db21[88-95] - 
https://phabricator.wikimedia.org/T342174 (10Marostegui) >>! In T342174#9069908, @Papaul wrote: > @Marostegui all your's. Have fun Thank you!! [04:23:42] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:28:42] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:28:55] (03PS1) 10Marostegui: install_server: Do not reimage pc2015 [puppet] - 10https://gerrit.wikimedia.org/r/945964 [04:29:55] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage pc2015 [puppet] - 10https://gerrit.wikimedia.org/r/945964 (owner: 10Marostegui) [05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:29:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "minor issue with the code, but LGTM otherwise." [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [05:47:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Some small tweaks, overall seems the right direction to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [05:48:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [05:49:55] (03PS2) 10Giuseppe Lavagetto: profile::cache::base: add netmapper file for proxies [puppet] - 10https://gerrit.wikimedia.org/r/945818 (https://phabricator.wikimedia.org/T343294) [05:50:09] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ayounsi) Unfortunately the errors are still happening: https://librenms.wikimedia.org/graphs/to=1691387100/id=11592/type=port_errors/from=1690782300/ [05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:59:39] (03PS1) 10Marostegui: db12[34-49]: Add with notifications disabled [puppet] - 10https://gerrit.wikimedia.org/r/946340 (https://phabricator.wikimedia.org/T342166) [06:03:01] (03CR) 10Marostegui: [C: 03+2] db12[34-49]: Add with notifications disabled [puppet] - 10https://gerrit.wikimedia.org/r/946340 (https://phabricator.wikimedia.org/T342166) (owner: 10Marostegui) [06:04:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, 10Patch-For-Review: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) [06:08:12] (03CR) 10Ayounsi: [C: 03+2] Update wheels [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/945748 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [06:09:15] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wheels for Aerleon 1.6.0 upgrade - ayounsi@cumin1001 [06:10:53] 
!log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wheels for Aerleon 1.6.0 upgrade - ayounsi@cumin1001 [06:14:22] * kart_ updating cxserver; minor changes. [06:14:34] (03PS2) 10KartikMistry: Update cxserver to 2023-08-03-132800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945697 (https://phabricator.wikimedia.org/T338602) [06:16:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1224 upgrade to mariadb 10.6', diff saved to https://phabricator.wikimedia.org/P50149 and previous config saved to /var/cache/conftool/dbconfig/20230807-061653-root.json [06:18:02] (03PS1) 10Marostegui: db1224.yaml: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/946342 (https://phabricator.wikimedia.org/T334650) [06:20:14] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:20:19] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-08-03-132800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945697 (https://phabricator.wikimedia.org/T338602) (owner: 10KartikMistry) [06:21:16] (03Merged) 10jenkins-bot: Update cxserver to 2023-08-03-132800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945697 (https://phabricator.wikimedia.org/T338602) (owner: 10KartikMistry) [06:21:48] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:22:14] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:22:35] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:22:38] (03CR) 10Marostegui: [C: 03+2] db1224.yaml: Migrate 
to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/946342 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [06:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:25:44] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:26:16] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:28:04] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:28:32] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/946343 (https://phabricator.wikimedia.org/T334650) [06:28:37] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50150 and previous config saved to /var/cache/conftool/dbconfig/20230807-063104-root.json [06:33:54] !log Updated cxserver to 2023-08-03-132800-production (T338602, T333969, T343211) [06:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:01] T338602: Make MinT the default service for Zulu in Content Translation - https://phabricator.wikimedia.org/T338602 [06:34:01] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [06:34:01] T333969: Enable Opus models for 
languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969 [06:46:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 3%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50151 and previous config saved to /var/cache/conftool/dbconfig/20230807-064608-root.json [06:51:36] (03CR) 10Giuseppe Lavagetto: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [07:00:05] Amir1, Urbanecm, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:13] (03PS1) 10Ayounsi: Fix changelog's formating [software/homer] - 10https://gerrit.wikimedia.org/r/946504 [07:00:49] (03PS2) 10Ayounsi: Fix changelog's formatting [software/homer] - 10https://gerrit.wikimedia.org/r/946504 [07:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50152 and previous config saved to /var/cache/conftool/dbconfig/20230807-070113-root.json [07:02:53] (03CR) 10Ayounsi: [C: 03+2] "Thanks I sent https://gerrit.wikimedia.org/r/c/operations/software/homer/+/946504 to fix those." 
[software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [07:09:31] (03CR) 10Btullis: [C: 03+1] dbproxy1019: Depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/946343 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:09:44] (03PS1) 10Giuseppe Lavagetto: requestctl: fix log match for wikimedia_nets [software/conftool] - 10https://gerrit.wikimedia.org/r/946505 [07:09:46] (03PS1) 10Giuseppe Lavagetto: Release 2.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/946506 [07:11:18] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/946343 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:11:34] !log Depool clouddb1015 T334650 [07:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:38] T334650: Migrate s6 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334650 [07:13:03] (03PS1) 10Marostegui: clouddb1015: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/946507 (https://phabricator.wikimedia.org/T334650) [07:13:32] (03CR) 10Btullis: [C: 03+1] eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: 10Elukey) [07:13:34] (03CR) 10Marostegui: [C: 03+2] clouddb1015: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/946507 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:15:35] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/945797 [07:16:11] (03PS1) 10Ayounsi: sonic-ssh: minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [07:16:18] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 15 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:16:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 
(re)pooling @ 10%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50153 and previous config saved to /var/cache/conftool/dbconfig/20230807-071618-root.json [07:16:24] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 15 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [07:17:04] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/945797 (owner: 10Marostegui) [07:17:09] (03CR) 10Ayounsi: [C: 03+2] "Thanks, a question and addressed the other points." [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [07:17:46] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:21:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::cache::base: add netmapper file for proxies [puppet] - 10https://gerrit.wikimedia.org/r/945818 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [07:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50154 and previous config saved to /var/cache/conftool/dbconfig/20230807-073123-root.json [07:33:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: fix log match for wikimedia_nets [software/conftool] - 10https://gerrit.wikimedia.org/r/946505 (owner: 10Giuseppe Lavagetto) [07:35:42] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:36:02] (03Merged) 10jenkins-bot: requestctl: fix log match for wikimedia_nets [software/conftool] - 10https://gerrit.wikimedia.org/r/946505 (owner: 10Giuseppe Lavagetto) [07:37:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 2.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/946506 (owner: 10Giuseppe Lavagetto) 
[07:40:07] (03Merged) 10jenkins-bot: Release 2.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/946506 (owner: 10Giuseppe Lavagetto) [07:46:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50155 and previous config saved to /var/cache/conftool/dbconfig/20230807-074628-root.json [07:57:42] (03PS1) 10Elukey: ext-ORES: force cswiki to use the ORES settings/backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946510 (https://phabricator.wikimedia.org/T343308) [07:58:07] jouncebot: next [07:58:07] In 2 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1000) [07:58:17] (03PS1) 10Filippo Giunchedi: dispatch: prune old docker images [puppet] - 10https://gerrit.wikimedia.org/r/946511 (https://phabricator.wikimedia.org/T329939) [07:59:14] (03CR) 10Samtar: [C: 03+2] shell: Always wrap maintenance scripts in mwscript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945883 (https://phabricator.wikimedia.org/T343291) (owner: 10Gergő Tisza) [07:59:55] (03Merged) 10jenkins-bot: shell: Always wrap maintenance scripts in mwscript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945883 (https://phabricator.wikimedia.org/T343291) (owner: 10Gergő Tisza) [08:00:10] I didn't mean to +2 that. 
[08:01:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50156 and previous config saved to /var/cache/conftool/dbconfig/20230807-080133-root.json [08:03:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] dispatch: prune old docker images [puppet] - 10https://gerrit.wikimedia.org/r/946511 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi) [08:03:48] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: prune old docker images [puppet] - 10https://gerrit.wikimedia.org/r/946511 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi) [08:04:02] * TheresNoTime will deploy https://gerrit.wikimedia.org/r/945883 shortly then, slightly sooner than desired [08:05:19] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10adee_wmde) @fgiunchedi do you mean adding it on wikitech preferences? If so, then it's already there. 
[08:07:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945831 (owner: 10TrainBranchBot) [08:08:26] RECOVERY - Disk space on alert1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops [08:08:39] !log start docker-image-prune-old on alert hosts - T329939 [08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:43] T329939: alert hosts short of root disk space - https://phabricator.wikimedia.org/T329939 [08:09:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by elukey@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946510 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [08:09:57] (03Merged) 10jenkins-bot: ext-ORES: force cswiki to use the ORES settings/backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946510 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [08:14:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945833 [08:14:35] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945833 (owner: 10TrainBranchBot) [08:16:03] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) >>! In T342969#9072774, @adee_wmde wrote: > @fgiunchedi do you mean adding it on wikitech preferences? If so, then it's already there. Production ssh keys (i.e. the on... 
[08:16:16] !log elukey@deploy1002 Started scap: Backport for [[gerrit:946510|ext-ORES: force cswiki to use the ORES settings/backend (T343308)]] [08:16:19] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [08:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50157 and previous config saved to /var/cache/conftool/dbconfig/20230807-081639-root.json [08:23:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42779/console" [puppet] - 10https://gerrit.wikimedia.org/r/945608 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) [08:24:53] !log elukey@deploy1002 elukey: Backport for [[gerrit:946510|ext-ORES: force cswiki to use the ORES settings/backend (T343308)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:24:59] !log elukey@deploy1002 elukey: Continuing with sync [08:25:04] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [08:30:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/945833 (owner: 10TrainBranchBot) [08:30:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:31:07] !log elukey@deploy1002 Finished scap: Backport for [[gerrit:946510|ext-ORES: force cswiki to use the ORES settings/backend (T343308)]] (duration: 14m 50s) [08:31:11] T343308: fiwiki 
RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [08:31:11] (03CR) 10Elukey: [V: 03+1 C: 03+1] "Steve: let's roll this out asap :)" [puppet] - 10https://gerrit.wikimedia.org/r/945608 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) [08:34:58] (03CR) 10Stevemunene: [C: 03+2] Prevent removal of py2 on bullseye hadoop client and worker [puppet] - 10https://gerrit.wikimedia.org/r/945608 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) [08:35:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:38:53] (03CR) 10Volans: [C: 03+1] "LGTM,thanks for the fixes" [software/homer] - 10https://gerrit.wikimedia.org/r/946504 (owner: 10Ayounsi) [08:39:42] (03CR) 10Volans: "thanks for the fixes" [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi) [08:40:33] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10adee_wmde) >>! In T342969#9072793, @fgiunchedi wrote: >>>! In T342969#9072774, @adee_wmde wrote: >> @fgiunchedi do you mean adding it on wikitech preferences? If so, then it's alre... [08:42:47] (03PS2) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [08:45:40] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) apologies, @fgiunchedi , I may have confused it. I added it to https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack . 
[08:52:05] (03PS3) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [08:59:05] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [09:07:51] (03CR) 10Ayounsi: [C: 03+2] Fix changelog's formatting [software/homer] - 10https://gerrit.wikimedia.org/r/946504 (owner: 10Ayounsi) [09:08:12] (03CR) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:09:29] (03Merged) 10jenkins-bot: Fix changelog's formatting [software/homer] - 10https://gerrit.wikimedia.org/r/946504 (owner: 10Ayounsi) [09:09:39] (03CR) 10Vgutierrez: [C: 03+1] cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:10:37] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) [09:12:57] (03PS1) 10Filippo Giunchedi: admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969) [09:14:03] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) >>! In T342968#9072890, @darthmon_wmde wrote: > apologies, @fgiunchedi , I may have misunderstood. I added it to https://wikitech.wikimedia.org/wiki/Special:Prefere... 
[09:16:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) [09:21:22] (03CR) 10Volans: Add cookbook to manage users SSH keys on SONiC devices (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [09:23:22] !log restarting blazegraph on wdqs1004 [09:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:26:14] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:27:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "lgtm, see the small nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [09:28:05] (03PS2) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:30:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:30:58] (03CR) 10CI reject: [V: 04-1] sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:31:23] (03PS2) 10Majavah: P:wmcs::graphite: disable incoming metrics [puppet] - 10https://gerrit.wikimedia.org/r/942691 (https://phabricator.wikimedia.org/T326266) [09:31:25] (03PS2) 10Majavah: wmcs: Disable Graphite query access [puppet] - 
10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) [09:31:32] (03CR) 10Majavah: wmcs: Disable Graphite query access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [09:32:16] (03PS3) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:33:16] (03PS4) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:34:00] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: Disable Graphite query access [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [09:34:06] (03PS3) 10Filippo Giunchedi: wmcs: Disable Graphite query access [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [09:34:17] (03PS5) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:34:22] taavi: I'll deploy your graphite patches now [09:34:27] thx! 
[09:36:44] (03CR) 10CI reject: [V: 04-1] sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:36:55] (03CR) 10Filippo Giunchedi: [C: 03+2] P:wmcs::graphite: disable incoming metrics [puppet] - 10https://gerrit.wikimedia.org/r/942691 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [09:37:41] taavi: all done [09:37:52] (03PS6) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:37:54] I mean, in 30 mins when puppet runs [09:45:47] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:46:30] (03PS1) 10Filippo Giunchedi: otel-collector: export traces to jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) [09:46:52] (03PS7) 10Ayounsi: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 [09:47:16] (03CR) 10Ayounsi: sonic-ssh: use local homer-public files (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:49:38] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:49:47] (03CR) 10Filippo Giunchedi: "I've outlined here what I think would be needed to get otel-collector to send traces to jaeger." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/946518 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi) [09:49:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:18] (03PS4) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [09:50:20] (03PS1) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [09:50:47] (03CR) 10CI reject: [V: 04-1] cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:50:58] (03CR) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:51:19] (03CR) 10Ayounsi: [C: 03+2] sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:52:04] (03PS7) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [09:52:08] (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [09:52:14] (03CR) 10CI reject: [V: 04-1] cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:53:33] (03PS5) 10Giuseppe Lavagetto: 
cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [09:53:35] (03PS2) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [09:53:42] (03Merged) 10jenkins-bot: sonic-ssh: use local homer-public files [cookbooks] - 10https://gerrit.wikimedia.org/r/946508 (owner: 10Ayounsi) [09:53:55] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:54:02] (03CR) 10CI reject: [V: 04-1] cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:55:33] (03CR) 10CI reject: [V: 04-1] cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [09:56:07] (03PS8) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1000) [10:00:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:07] (03PS6) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [10:02:09] (03PS3) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [10:02:42] (03CR) 10CI 
reject: [V: 04-1] cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:04:58] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:07:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:07:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [10:08:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [10:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [10:08:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T342617)', diff saved to https://phabricator.wikimedia.org/P50158 and previous config saved to /var/cache/conftool/dbconfig/20230807-100805-ladsgroup.json [10:08:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:09:01] (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:10:32] (03PS4) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [10:10:52] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:11:57] (03CR) 10MVernon: "Are you planning on doing the same to the equivalent swift roles, too, 
please?" [puppet] - 10https://gerrit.wikimedia.org/r/945752 (owner: 10Muehlenhoff) [10:23:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:24:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:24:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:24:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:25:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:25:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:25:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:25:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:25:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:25:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:26:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:26:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:31:05] (03CR) 10Vgutierrez: cache: load ip reputation data and add request header (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:31:20] (03CR) 10Vgutierrez: [C: 03+1] cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:33:35] (03CR) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:36:54] (03PS1) 10Ladsgroup: Stop writing to the old externallinks columns in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946521 (https://phabricator.wikimedia.org/T342683) [10:36:56] (03PS7) 10Giuseppe Lavagetto: cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) [10:36:58] (03PS5) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [10:40:24] (03CR) 10Vgutierrez: [C: 03+1] cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [10:40:26] jouncebot: nowandnext [10:40:26] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1000) [10:40:26] In 2 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1300) [10:44:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946521 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [10:45:17] (03Merged) 10jenkins-bot: Stop writing to the old externallinks 
columns in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946521 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [10:45:44] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:946521|Stop writing to the old externallinks columns in testwiki (T342683)]] [10:45:47] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [10:46:28] 10SRE, 10SEO: Bad Wikisource favicon at BING - https://phabricator.wikimedia.org/T343696 (10Fuzzy) [10:47:06] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:946521|Stop writing to the old externallinks columns in testwiki (T342683)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [10:48:18] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [10:49:15] (03CR) 10LSobanski: "Daniel is on sabbatical leave. If you need someone from SRE Collaboration Services to review this patch, please add the wmf-sre-collab gro" [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński) [10:49:28] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10darthmon_wmde) >>! In T342968#9072997, @fgiunchedi wrote: >>>! In T342968#9072890, @darthmon_wmde wrote: >> apologies, @fgiunchedi , I may have misunderstood. I added it to htt... 
[10:53:50] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:946521|Stop writing to the old externallinks columns in testwiki (T342683)]] (duration: 08m 06s) [10:53:54] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [10:54:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:23] jouncebot: nowandnext [10:57:23] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1000) [10:57:23] In 2 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1300) [10:59:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:24] (03PS1) 10Dreamy Jazz: Write new for event table migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946527 (https://phabricator.wikimedia.org/T330158) [11:12:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:50] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10darthmon_wmde) [11:19:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:22:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.146 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.377 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:29:42] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10darthmon_wmde) [11:51:58] 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10Clement_Goubert) [11:54:03] (03PS1) 10Clément Goubert: Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 [11:54:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [11:54:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [11:54:25] (03PS2) 10Clément Goubert: Revert "mediawiki: Reduce requests for canaries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945798 [12:12:43] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag 
[12:17:29] !log repooling wdqs1004 [12:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:26] (03CR) 10Ayounsi: [C: 03+1] Allow HTTP return traffic from apt to network devices on TCP 8080 [homer/public] - 10https://gerrit.wikimedia.org/r/942639 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [12:21:27] (03CR) 10David Caro: [C: 03+2] wmcs: enable isort and black [puppet] - 10https://gerrit.wikimedia.org/r/936231 (owner: 10David Caro) [12:29:07] (03CR) 10Ayounsi: "Not sure I'm the best person to review this change as I don't have much Puppet infra knowledge" [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:29:47] (03PS6) 10Ayounsi: Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [12:30:25] (03CR) 10Ayounsi: [C: 03+1] Update mappings for subregions of CA/US based on the Probenet data [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [12:31:20] (03CR) 10Ayounsi: [C: 03+1] sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:31:41] (03CR) 10Ayounsi: [C: 03+1] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:33:50] (03CR) 10Ayounsi: [C: 03+1] Update border-in firewall filter to set DSCP bits to DE [homer/public] - 10https://gerrit.wikimedia.org/r/931262 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:34:04] (03PS1) 10Stang: zhwiki: Grant "suppressredirect"to autoreviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946540 (https://phabricator.wikimedia.org/T343711) [12:34:06] (03CR) 10CI reject: [V: 04-1] 
sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:37:54] (03CR) 10Ayounsi: Varnish: prefix 403 and 429 with a unique ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973) (owner: 10Ayounsi) [12:39:34] (03PS1) 10Stang: wikifunctions: Allow transwiki import from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) [12:42:12] (03PS1) 10ArielGlenn: for pmatch arg to regexec, malloc one more [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/946542 (https://phabricator.wikimedia.org/T340096) [12:42:35] (03CR) 10Ayounsi: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [12:43:27] (03CR) 10Ayounsi: [C: 03+2] Add cookbook to manage users SSH keys on SONiC devices (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi) [12:46:22] (03PS1) 10ArielGlenn: version 0.1.4 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/946543 (https://phabricator.wikimedia.org/T340096) [12:57:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:57:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1300). [13:00:04] Dreamy_Jazz, koi, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] \o [13:00:15] I can’t deploy, I’m in a meeting, sorry [13:00:24] eo/ [13:00:36] I can deploy [13:00:49] o/ [13:01:25] hi all [13:02:18] aanzx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/945799/ has a CI failure. can you have a look and fix it please? [13:02:57] (03PS2) 10Urbanecm: Update knwiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945939 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:03:00] (03CR) 10Urbanecm: [C: 03+2] Update knwiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945939 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:03:09] (03PS2) 10Urbanecm: Write new for event table migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946527 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [13:03:12] (03CR) 10Urbanecm: [C: 03+2] Write new for event table migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946527 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [13:03:32] (03PS2) 10Urbanecm: zhwiki: Grant "suppressredirect"to autoreviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946540 (https://phabricator.wikimedia.org/T343711) (owner: 10Stang) [13:03:35] (03CR) 10Urbanecm: [C: 03+2] zhwiki: Grant "suppressredirect"to autoreviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946540 (https://phabricator.wikimedia.org/T343711) (owner: 10Stang) [13:03:40] (03Merged) 10jenkins-bot: Update knwiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945939 (https://phabricator.wikimedia.org/T343662) (owner: 10Anzx) [13:03:53] (03Merged) 10jenkins-bot: Write new for event table migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946527 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) 
[13:04:02] urbanecm: looking [13:04:29] (03Merged) 10jenkins-bot: zhwiki: Grant "suppressredirect"to autoreviewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946540 (https://phabricator.wikimedia.org/T343711) (owner: 10Stang) [13:04:31] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) @ayounsi i will be in early tomorrow working on unboxing 15 pallets of servers that have arrived will you be available tomorrow to assist. I would like to try another optic [13:05:24] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:945939|Update knwiktionary logos (T343662)]], [[gerrit:946527|Write new for event table migration on all wikis (T330158)]], [[gerrit:946540|zhwiki: Grant "suppressredirect"to autoreviewer (T343711)]] [13:05:31] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [13:05:31] T343711: Grant "suppressredirect" to autoreviewer on zhwiki - https://phabricator.wikimedia.org/T343711 [13:05:31] T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662 [13:06:09] (03PS5) 10Anzx: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) [13:06:37] urbanecm: fixed [13:06:49] !log urbanecm@deploy1002 anzx and dreamyjazz and stang and urbanecm: Backport for [[gerrit:945939|Update knwiktionary logos (T343662)]], [[gerrit:946527|Write new for event table migration on all wikis (T330158)]], [[gerrit:946540|zhwiki: Grant "suppressredirect"to autoreviewer (T343711)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:07:07] Testing knwiktionary logos [13:07:11] aanzx: Dreamy_Jazz: koi: your patches are at mwdebug1001; can you test? 
[13:07:17] Sure. Testing now. [13:07:29] looking [13:07:44] (03CR) 10Urbanecm: "Adding James as a reviewer; Wikifunctions is a new project under active development; giving Abstract Wikipedia team a chance to object sho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang) [13:07:52] (03PS1) 10Samtar: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945800 [13:08:00] (03PS2) 10Samtar: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945800 [13:08:32] koi: fyi, i'm not going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946541/ today, as Wikifunctions is a new project and adding import ability might not be desirable at this point. I added James as a reviewer; feel free to reschedule once he/Abstract Wikipedia +1's :) [13:08:53] TheresNoTime: want me to ping you once i finish handling b&c? [13:09:02] ok, got it! [13:09:07] urbanecm, i checked special:usergrouprights and it looks fine [13:09:14] great, thanks! [13:09:29] urbanecm: I'm on the train at the moment, probably shouldn't deploy.. [13:09:56] TheresNoTime: ack. seems like the matching PS patch is out now, so the deploy should be harmless? [13:09:59] in that case, i can do it for you :) [13:10:37] My part of testing complete for my change. Please check that on enwiki there are rows in cu_log_event and cu_private_event. [13:10:52] Should be one row in both [13:11:02] that is right [13:11:14] aanzx: how is your testing going please? [13:12:06] Urbanecm: knwiktionary logos looks good , i wasn't able to see change in legacy vector skin [13:13:23] aanzx: i can see the change there, probably cache on your end :) [13:13:32] anyway, seems all changes are good to go, proceeding. 
[13:13:36] !log urbanecm@deploy1002 anzx and dreamyjazz and stang and urbanecm: Continuing with sync [13:14:07] Ok [13:16:29] (03CR) 10Urbanecm: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:16:34] aanzx: please see my comment above [13:16:53] (03CR) 10Urbanecm: [C: 03+2] idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:16:56] (03CR) 10Urbanecm: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:17:14] (03CR) 10Urbanecm: [C: 03+2] Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945800 (owner: 10Samtar) [13:17:55] (03Merged) 10jenkins-bot: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945800 (owner: 10Samtar) [13:19:07] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ayounsi) Yep, ping me. 
[13:19:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:945939|Update knwiktionary logos (T343662)]], [[gerrit:946527|Write new for event table migration on all wikis (T330158)]], [[gerrit:946540|zhwiki: Grant "suppressredirect"to autoreviewer (T343711)]] (duration: 13m 54s) [13:19:25] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [13:19:25] T343711: Grant "suppressredirect" to autoreviewer on zhwiki - https://phabricator.wikimedia.org/T343711 [13:19:25] T343662: update knwiktionary logos - https://phabricator.wikimedia.org/T343662 [13:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:38] Thanks! [13:19:40] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:945800|Revert "enwiki: temp enable emergencyCaptcha"]] [13:19:41] Dreamy_Jazz: koi: aanzx: the three patches are now live :) [13:19:43] and no problem :) [13:21:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: load ip reputation data and add request header [puppet] - 10https://gerrit.wikimedia.org/r/945819 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [13:21:27] (03CR) 10Anzx: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:23:04] (03PS1) 10Elukey: ext-ORES: revert all wikis to use ORES instead of Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) [13:23:28] (03CR) 10Anzx: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 
(https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:23:33] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:38] aanzx: i'm sorry, but i don't see any aliases in the patch you linked? [13:23:57] (03PS1) 10Bking: flink-zk: Add hostnames for CODFW cluster [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) [13:24:06] (03CR) 10AikoChou: [C: 03+1] ext-ORES: revert all wikis to use ORES instead of Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [13:24:18] liberica? [13:24:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:55] (03CR) 10Ladsgroup: [C: 04-1] "let me try to fix the thresholds config. If I don't get to fix it by tomorrow, feel free to deploy this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [13:26:38] Urbanecm: it was done on https://phabricator.wikimedia.org/T337696 [13:26:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:945800|Revert "enwiki: temp enable emergencyCaptcha"]] (duration: 06m 59s) [13:26:49] TheresNoTime: taavi: done :) [13:26:59] vgutierrez: liberica is some kind of load balancing monitoring? 
[13:27:17] 10SRE, 10Infrastructure-Foundations: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) [13:27:20] ah, found https://phabricator.wikimedia.org/T332027 [13:27:28] jynus: nope, it's our L4LB based on katran [13:27:37] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) [13:27:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:41] (03CR) 10DCausse: flink-zk: Add hostnames for CODFW cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [13:27:56] sorry, what I think it failed is the monitoring job? [13:28:10] (only) [13:28:14] PROBLEM - Check systemd state on mw2312 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:25] (03CR) 10Urbanecm: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:28:28] the prometheus job [13:28:34] aanzx: see patch comment, hopefully i clarified it :) [13:29:00] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) [13:29:47] jouncebot: next [13:29:47] In 2 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1530) [13:31:35] jynus: don't worry about it, it's a beta exporter and I restarted it 
manually [13:31:52] (and beta is too generous TBH) [13:31:53] ok, sorry, I was caught by surprise about the name [13:32:05] no problem :) [13:32:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:24] aanzx: ping :). can i help with clarifying something (or maybe do the change that i'm trying to describe in my comments myself)? [13:34:12] (03CR) 10Ladsgroup: [C: 03+1] ext-ORES: revert all wikis to use ORES instead of Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [13:35:16] Urbanecm: i will make change now [13:35:20] okay, thank you [13:35:40] urbanecm: o/ [13:35:45] (03PS1) 10Filippo Giunchedi: aux: add grpc/http ports for jaeger collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) [13:35:49] I'd need to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946546 [13:36:02] afaics all green from the scap point of view right? [13:36:09] (namely your deployment finished etc..) 
[13:36:37] elukey: one more change left doing :) [13:36:42] i can ping you once finished [13:36:47] super thanks [13:38:08] (03PS6) 10Anzx: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) [13:38:16] (03CR) 10Bking: flink-zk: Add hostnames for CODFW cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [13:38:36] Urbanecm: made change is it what you meant [13:39:01] nope, let me fix it :) [13:39:56] (03PS7) 10Urbanecm: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:40:33] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/945799/5..7 is what i meant -- merely changing wgMetaNamespace is sufficient to change the namespace itself :) [13:40:45] (and the alias from Wiktionary [current namespace name] is the default) [13:40:50] (03PS8) 10Urbanecm: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:40:53] (03CR) 10Urbanecm: [C: 03+2] idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:40:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:41:25] Ok urbanecm thanks for fix [13:41:32] (03Merged) 10jenkins-bot: idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/945799 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [13:41:32] no problem [13:41:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:945799|idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (T341175)]] [13:41:51] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [13:42:24] (03PS6) 10Giuseppe Lavagetto: cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) [13:43:10] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:945799|idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (T341175)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:43:16] Testing [13:43:20] ty [13:48:08] on my end, the patch seems to be working [13:48:08] urbanecm: looks good [13:48:09] Does it need to run namespacedupes [13:48:09] proceeding, ty [13:48:09] !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync [13:48:09] yup, i'll do that :) [13:48:09] (03PS2) 10Bking: flink-zk: Add hostnames for CODFW cluster [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) [13:50:15] (03CR) 10Bking: "Per IRC conversation with David, we do need a CODFW cluster. I just forgot to prefix it with "2" instead of "1"...per WMF standards, our e" [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [13:50:45] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [warning] - https://phabricator.wikimedia.org/T343318 (10BTullis) @gmodena and the rest of #event-platform will probably want to know about this. 
https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change... [13:51:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:945799|idwiktionary change wgSiteName, wgMetaNamespace and add project namespace alias (T341175)]] (duration: 09m 12s) [13:51:06] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [13:51:08] elukey: the floor is yours :) [13:51:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [13:51:25] 10sre-alert-triage, 10Data-Platform-SRE: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (10BTullis) [13:51:26] urbanecm: thank youuuu [13:51:34] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php idwiktionary --fix --add-prefix=BROKEN # T341175 [13:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by elukey@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [13:52:15] (03Merged) 10jenkins-bot: ext-ORES: revert all wikis to use ORES instead of Lift Wing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946546 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [13:52:23] thanks a lot urbanecm [13:52:29] no problem [13:52:30] !log elukey@deploy1002 Started scap: Backport for [[gerrit:946546|ext-ORES: revert all wikis to use ORES instead of Lift Wing (T343308)]] [13:52:33] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [13:52:42] (03CR) 10DCausse: [C: 03+1] flink-zk: Add hostnames for CODFW cluster [puppet] - 10https://gerrit.wikimedia.org/r/946547 
(https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [13:53:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10Jclark-ctr) [13:53:52] !log elukey@deploy1002 elukey: Backport for [[gerrit:946546|ext-ORES: revert all wikis to use ORES instead of Lift Wing (T343308)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:53:58] !log elukey@deploy1002 elukey: Continuing with sync [13:54:26] (03PS6) 10Volans: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:54:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10Jclark-ctr) [13:55:06] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [13:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) [13:56:46] RECOVERY - Check systemd state on mw2312 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [13:57:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cache: load ip reputation data everywhere [puppet] - 10https://gerrit.wikimedia.org/r/946519 (https://phabricator.wikimedia.org/T343294) (owner: 10Giuseppe Lavagetto) [13:57:28] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Initial commit [software/pampinus] - 10https://gerrit.wikimedia.org/r/817294 
(https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:57:39] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add json output when adding the ?format=json GET parameter [software/pampinus] - 10https://gerrit.wikimedia.org/r/818508 (owner: 10Jcrespo) [13:57:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10Jclark-ctr) [13:57:54] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add absolute number (bytes) changed & max staleness for backup status [software/pampinus] - 10https://gerrit.wikimedia.org/r/818538 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:58:06] (03CR) 10Bking: [C: 03+2] flink-zk: Add hostnames for CODFW cluster [puppet] - 10https://gerrit.wikimedia.org/r/946547 (https://phabricator.wikimedia.org/T341705) (owner: 10Bking) [13:58:10] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Attempt to follow Wikimedia's Design Style Guide [software/pampinus] - 10https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:58:20] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add the possibility of searching racks for instances, too [software/pampinus] - 10https://gerrit.wikimedia.org/r/820073 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:58:29] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add missing analytics backups monitoring [software/pampinus] - 10https://gerrit.wikimedia.org/r/870549 (owner: 10Jcrespo) [13:58:35] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1078.eqiad.wmnet with OS bullseye [13:58:38] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Improvements on css [software/pampinus] - 10https://gerrit.wikimedia.org/r/829858 (owner: 10Ladsgroup) [13:58:47] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] pampinus: Fix bugs with codfw-only sections & very small backups [software/pampinus] - 10https://gerrit.wikimedia.org/r/870550 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) 
[13:58:57] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Add the "very_stale" HTML style as a red label [software/pampinus] - 10https://gerrit.wikimedia.org/r/884820 (owner: 10Jcrespo) [13:59:19] !log elukey@deploy1002 Finished scap: Backport for [[gerrit:946546|ext-ORES: revert all wikis to use ORES instead of Lift Wing (T343308)]] (duration: 06m 49s) [13:59:23] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [14:00:37] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Make backups statistics optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932388 (https://phabricator.wikimedia.org/T339894) (owner: 10Jcrespo) [14:01:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [14:06:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:25] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1078.eqiad.wmnet with OS bullseye [14:08:30] !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1078.eqiad.wmnet'] [14:08:33] (JobUnavailable) firing: (3) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:47] !log btullis@cumin1001 END 
(PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1078.eqiad.wmnet'] [14:10:36] !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host an-worker1078.eqiad.wmnet [14:14:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-worker1078.eqiad.wmnet [14:14:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1078.eqiad.wmnet with OS bullseye [14:16:15] (03PS1) 10Anzx: update idwiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946555 (https://phabricator.wikimedia.org/T341175) [14:18:00] (03PS1) 10Ladsgroup: Drop old externallinks columns [software/schema-changes] - 10https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) [14:18:23] (03PS2) 10Anzx: update idwiktionary legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946555 (https://phabricator.wikimedia.org/T341175) [14:18:33] (JobUnavailable) firing: (3) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:22] vgutierrez: I think the jobunavailable for liberica can be silenced/acked ^ ? [14:23:35] uh... [14:23:40] I'll do that if that seems sensible [14:23:40] that shouldn't be firing :) [14:23:47] but yes, go ahead please [14:24:27] ah, shouldn't be firing as in you weren't expecting prometheus to fail to scrape metrics now ? 
[14:24:43] but yeah I'll ack [14:24:59] indeed [14:25:35] got it, started ~1h ago, dashboard is https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1 [14:27:34] jouncebot: nowandnext [14:27:34] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [14:27:34] In 1 hour(s) and 2 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1530) [14:28:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:25] !log zabe@deploy1002 Started scap: T343294 [14:33:22] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/946559 [14:33:45] (03CR) 10JMeybohm: [C: 04-1] "This requires a chart version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [14:34:47] (03PS9) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [14:35:46] (03PS10) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [14:35:49] (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [14:36:36] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:38] !log zabe@deploy1002 Finished 
scap: T343294 (duration: 07m 13s) [14:38:13] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10fnegri) [14:39:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:40:07] (03CR) 10JMeybohm: [C: 03+1] mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [14:41:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:36] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:44:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 6.519 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:11] 10sre-alert-triage, 10Data-Platform-SRE, 10Discovery-Search (Current work): Alert triage: overdue alert [warning] - https://phabricator.wikimedia.org/T343319 (10bking) a:03bking [14:53:31] 10sre-alert-triage, 10Data-Platform-SRE, 10Discovery-Search (Current work): Alert triage: overdue alert [warning] - https://phabricator.wikimedia.org/T343319 (10bking) [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/454788 | This is how we provisioned the certificate in 2018 ]] . Will check with more... 
[14:54:11] 10sre-alert-triage, 10Data-Platform-SRE, 10Discovery-Search (Current work): search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10bking) [14:58:26] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) 05Open→03In progress [14:58:32] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [15:00:02] 10sre-alert-triage, 10Data-Platform-SRE, 10Discovery-Search (Current work): search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10bking) Checking the Puppet repo... `modules/profile/files/ssl/search.discovery.wmnet.crt` is valid for `search.discov... [15:01:31] (03PS1) 10Giuseppe Lavagetto: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 [15:01:33] (03PS4) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [15:03:23] (03CR) 10CI reject: [V: 04-1] WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [15:07:28] 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability: WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater - https://phabricator.wikimedia.org/T337801 (10bking) a:05dcausse→03bking [15:08:34] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T343722 (10phaultfinder) [15:13:02] XioNoX: wut? 
^^^ [15:13:15] ^ the duplicate IP was me replacing a motherboard [15:13:41] already updated and should clear [15:13:47] ah ok, thanks JennH :) [15:13:48] 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10Gehel) [15:13:59] thx! [15:19:34] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T343722 (10Jhancock.wm) my mistake. was replacing a motherboard for T341546. [15:21:00] 10sre-alert-triage, 10Data-Platform-SRE: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10Gehel) [15:22:25] (03PS1) 10Elukey: admin_ng: set target-burst-capacity to zero for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/946569 [15:23:22] (03PS2) 10Elukey: admin_ng: set target-burst-capacity to zero for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/946569 [15:24:59] (03CR) 10Klausman: [C: 03+1] admin_ng: set target-burst-capacity to zero for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/946569 (owner: 10Elukey) [15:26:52] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) 05Open→03Resolved previously on: we attempted to replace the ram again, with no results. tried multiple hardware switch outs. but it kept hanging on loading the bios. motherboard has been replaced, idrac... [15:29:20] (03CR) 10Elukey: [C: 03+2] admin_ng: set target-burst-capacity to zero for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/946569 (owner: 10Elukey) [15:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1530). 
[15:32:33] 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10bking) [15:34:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:34:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:35:14] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1078.eqiad.wmnet with OS bullseye [15:35:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:35:39] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1078.eqiad.wmnet with OS bullseye [15:35:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:38:07] 10sre-alert-triage, 10Data-Platform-SRE: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10bking) T162037 might also have more context. [15:41:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:42:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:46:34] (03CR) 10Herron: "taking a stab at switching this to cfssl as prep for adding SANs for pyrra/slo services" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron) [15:49:09] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42790/console" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron) [15:50:19] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1078.eqiad.wmnet with OS bullseye [15:52:22] (03PS5) 10Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) [15:53:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1078.eqiad.wmnet with OS bullseye [15:54:08] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks Bartosz!" [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński) [15:54:12] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Seddon) Approved (direct manager) [15:55:11] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1078.eqiad.wmnet with reason: host reimage [15:58:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1078.eqiad.wmnet with reason: host reimage [15:59:36] (03PS1) 10BCornwall: Release 0.5 for bookworm [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/946578 (https://phabricator.wikimedia.org/T342154) [16:03:39] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T343722 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm idrac was updated on the replacement MB. alert has cleared. 
[16:03:57] 10SRE, 10Wikimedia-Mailing-lists, 10cloud-services-team: auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (10JJMC89) [16:07:10] jouncebot: nowandnext [16:07:10] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [16:07:11] In 0 hour(s) and 52 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1700) [16:07:11] In 0 hour(s) and 52 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1700) [16:08:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945624 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [16:10:14] (03PS2) 10Jforrester: Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945624 (https://phabricator.wikimedia.org/T343400) [16:10:18] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945624 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [16:11:17] (03Merged) 10jenkins-bot: Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945624 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [16:11:32] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:945624|Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions (T343400)]] [16:11:42] T343400: Allow editing of "about" with less rights - https://phabricator.wikimedia.org/T343400 [16:13:02] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:945624|Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions (T343400)]] 
synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:13:07] !log jforrester@deploy1002 jforrester: Continuing with sync [16:16:12] (03PS2) 10Jforrester: Wikifunctions: Add oathauth-enable to wikifunctions-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945808 (https://phabricator.wikimedia.org/T342868) [16:16:14] (03PS1) 10Jforrester: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) [16:18:44] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:945624|Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions (T343400)]] (duration: 07m 11s) [16:18:47] T343400: Allow editing of "about" with less rights - https://phabricator.wikimedia.org/T343400 [16:22:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1078.eqiad.wmnet with OS bullseye [16:27:12] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10cloud-services-team: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10fnegri) @Andrew depooled dbproxy1018 on Friday when it went down, then repooled it one hour later when it was back online, and traffic seemed to flow... 
[16:34:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:34:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [16:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T342617)', diff saved to https://phabricator.wikimedia.org/P50163 and previous config saved to /var/cache/conftool/dbconfig/20230807-163421-ladsgroup.json [16:34:25] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:35:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1079.eqiad.wmnet with OS bullseye [16:47:20] !log bking@puppetmaster1001 removing unused(?) puppet cert search.svc.codfw.wmnet T343319 [16:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:24] T343319: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 [16:49:50] (03PS1) 10Elukey: admin_ng: allow host headers for base domain in istio mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946593 (https://phabricator.wikimedia.org/T343740) [16:52:39] (03CR) 10Fabfur: [C: 03+1] "Builds on separate host, tested with lintian and piuparts, all fine!" 
[software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/946578 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:53:57] (03PS1) 10Ladsgroup: Stop writing to old columns of externallinks in ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946597 (https://phabricator.wikimedia.org/T342683) [16:55:31] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10cloud-services-team: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10Marostegui) It was showing down on haproxy itself. So it wasn't reloaded until I did it manually. [16:56:54] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1079.eqiad.wmnet with reason: host reimage [16:59:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1079.eqiad.wmnet with reason: host reimage [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1700) [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T1700). [17:02:14] !log bking@puppetmaster1001 removing unused(?) puppet cert search.svc.eqiad.wmnet T343319 [17:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:19] T343319: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 [17:08:06] (03CR) 10Btullis: [C: 03+2] Bump up mediawiki_history_snapshot to 2023-07 [puppet] - 10https://gerrit.wikimedia.org/r/945852 (owner: 10Mforns) [17:10:16] 10SRE, 10ops-eqiad, 10sre-alert-triage, 10DC-Ops, 10cloud-services-team: dbproxy1018 network interface down - https://phabricator.wikimedia.org/T343560 (10fnegri) @Marostegui you're right, I was confusing dbproxy with clouddb* proxies. 
Puppet only reloaded haproxy on `clouddb-wikireplicas-proxy-2.clouddb... [17:16:20] (03CR) 10JHathaway: profile::mirrors::serve: Remove Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944768 (owner: 10Muehlenhoff) [17:19:54] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:22:09] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: civi1001.frack.eqiad.wmnet - jgreen@cumin1001" [17:22:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1079.eqiad.wmnet with OS bullseye [17:22:57] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: civi1001.frack.eqiad.wmnet - jgreen@cumin1001" [17:22:57] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:49] (03CR) 10BCornwall: [C: 03+2] Release 0.5 for bookworm [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/946578 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:31:42] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:32:41] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [17:33:42] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frdev1001 from DNS for decommissioning - jgreen@cumin1001" [17:34:29] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frdev1001 from DNS for decommissioning - jgreen@cumin1001" [17:34:29] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:35:59] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:37:48] (03PS1) 10BryanDavis: striker: Bump 
container version to 2023-08-07-172444-production [puppet] - 10https://gerrit.wikimedia.org/r/946601 (https://phabricator.wikimedia.org/T342082) [17:42:45] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:46:13] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:46:59] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frmon1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:46:59] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:57] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:54:58] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:55:16] !log jgreen@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [17:55:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:55:53] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342693 (10Jgreen) a:03Jclark-ctr [17:56:18] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [17:58:27] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frmon2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:58:38] 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (10Jgreen) a:03Papaul [17:59:13] !log jgreen@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frmon2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [17:59:13] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:45] 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T342694 (10Jgreen) Ready for disk wipe! [18:03:06] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (10Jgreen) [18:04:22] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (10Jgreen) [18:10:50] (03PS3) 10Krinkle: mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604) [18:11:32] (03CR) 10BryanDavis: "PCC output at https://puppet-compiler.wmflabs.org/output/946601/42792/ matches expected change." 
[puppet] - 10https://gerrit.wikimedia.org/r/946601 (https://phabricator.wikimedia.org/T342082) (owner: 10BryanDavis) [18:11:35] (03CR) 10Krinkle: [C: 03+2] mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle) [18:11:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T342617)', diff saved to https://phabricator.wikimedia.org/P50164 and previous config saved to /var/cache/conftool/dbconfig/20230807-181151-ladsgroup.json [18:11:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:12:18] (03Merged) 10jenkins-bot: mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604) (owner: 10Krinkle) [18:12:43] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:937197|mc: Remove mcrouter-with-onhost-tier from ParserCache (T264604)]] [18:12:46] T264604: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 [18:13:26] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2023-08-07-172444-production [puppet] - 10https://gerrit.wikimedia.org/r/946601 (https://phabricator.wikimedia.org/T342082) (owner: 10BryanDavis) [18:14:06] !log krinkle@deploy1002 krinkle: Backport for [[gerrit:937197|mc: Remove mcrouter-with-onhost-tier from ParserCache (T264604)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [18:16:11] !log krinkle@deploy1002 krinkle: Continuing with sync [18:20:51] PROBLEM - SSH on bast3006 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:21:17] o.O [18:21:20] hmmm [18:21:50] !log 
krinkle@deploy1002 Finished scap: Backport for [[gerrit:937197|mc: Remove mcrouter-with-onhost-tier from ParserCache (T264604)]] (duration: 09m 07s) [18:21:55] T264604: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 [18:22:21] RECOVERY - SSH on bast3006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:22:28] * TheresNoTime just SSH'd to bast3006 so.. [18:26:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P50165 and previous config saved to /var/cache/conftool/dbconfig/20230807-182657-ladsgroup.json [18:30:11] (03PS1) 10BCornwall: Release 1.9-4 to target bullseye [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) [18:42:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P50166 and previous config saved to /var/cache/conftool/dbconfig/20230807-184204-ladsgroup.json [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:44:07] 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10Gehel) [18:45:16] (03PS1) 10Andrew Bogott: Remove unused file for horizon scap deploy [puppet] - 10https://gerrit.wikimedia.org/r/946605 (https://phabricator.wikimedia.org/T341640) [18:47:12] (03PS1) 10Ahmon Dancy: testing [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/945838 [18:47:20] (03CR) 10CI reject: [V: 04-1] 
testing [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/945838 (owner: 10Ahmon Dancy) [18:48:16] (03Abandoned) 10Ahmon Dancy: testing [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/945838 (owner: 10Ahmon Dancy) [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:54:10] (03PS1) 10Jgreen: Remove fundraising hosts civi1001,frbast1001,frbast2001,frdev1001,frmon1001,frmon2001 for decom. [dns] - 10https://gerrit.wikimedia.org/r/946606 [18:55:01] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/945554 (owner: 10Muehlenhoff) [18:56:26] (03CR) 10Jgreen: [C: 03+1] Remove fundraising hosts civi1001,frbast1001,frbast2001,frdev1001,frmon1001,frmon2001 for decom. [dns] - 10https://gerrit.wikimedia.org/r/946606 (owner: 10Jgreen) [18:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T342617)', diff saved to https://phabricator.wikimedia.org/P50167 and previous config saved to /var/cache/conftool/dbconfig/20230807-185710-ladsgroup.json [18:57:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:57:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:57:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:57:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T342617)', diff saved to https://phabricator.wikimedia.org/P50168 and previous config saved to /var/cache/conftool/dbconfig/20230807-185732-ladsgroup.json [18:59:16] 10sre-alert-triage, 10Data-Platform-SRE: search.svc.eqiad.wmnet, 
search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10bking) After some help from #wikimedia-sre , I was able to get this solved. Basically, the alert is from a check that runs locally on the puppetmaster. Th... [18:59:36] (03CR) 10Dwisehaupt: [C: 03+2] "This all looks correct. Shipit." [dns] - 10https://gerrit.wikimedia.org/r/946606 (owner: 10Jgreen) [18:59:44] 10sre-alert-triage, 10Data-Platform-SRE: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10bking) 05Open→03Resolved [19:01:43] (03CR) 10Andrew Bogott: [C: 03+2] Remove unused file for horizon scap deploy [puppet] - 10https://gerrit.wikimedia.org/r/946605 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [19:05:12] (03PS2) 10Jdlrobson: Fix finnish projects, remove unused SVG/PNGs, resize wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944318 (https://phabricator.wikimedia.org/T343278) [19:09:20] (03PS1) 10Jdlrobson: Wikivoyage logos should always be on a single line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946608 (https://phabricator.wikimedia.org/T343279) [19:09:28] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [19:10:07] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340155 (10Jgreen) a:03Jclark-ctr [19:11:29] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frbast1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1001" [19:12:13] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frbast1001.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1001" [19:12:13] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[19:12:44] !log jgreen@cumin1001 START - Cookbook sre.dns.netbox [19:14:46] !log jgreen@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frbast2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [19:15:32] !log jgreen@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove frbast2001.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1001" [19:15:33] !log jgreen@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:14] 10ops-codfw, 10decommission-hardware, 10fundraising-tech-ops: decommission frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T340156 (10Jgreen) a:03Papaul [19:35:48] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [19:42:28] (03PS1) 10Bartosz Dziewoński: ThreadItemStore: Ignore duplicates caused by duplicate executions [extensions/DiscussionTools] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945803 (https://phabricator.wikimedia.org/T323080) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T2000). [20:00:05] Jdlrobson, aanzx, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] hello [20:00:15] hi [20:00:18] i can deploy today [20:00:33] (03CR) 10Urbanecm: [C: 03+2] Fix finnish projects, remove unused SVG/PNGs, resize wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944318 (https://phabricator.wikimedia.org/T343278) (owner: 10Jdlrobson) [20:00:34] o/ [20:00:36] (03CR) 10Urbanecm: [C: 03+2] Wikivoyage logos should always be on a single line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946608 (https://phabricator.wikimedia.org/T343279) (owner: 10Jdlrobson) [20:00:49] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wcqs2003.codfw.wmnet with OS bullseye [20:01:07] aanzx: you've scheduled https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/945799/, but that seems to be already deployed? [20:01:27] (03Merged) 10jenkins-bot: Fix finnish projects, remove unused SVG/PNGs, resize wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944318 (https://phabricator.wikimedia.org/T343278) (owner: 10Jdlrobson) [20:01:31] (03Merged) 10jenkins-bot: Wikivoyage logos should always be on a single line [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946608 (https://phabricator.wikimedia.org/T343279) (owner: 10Jdlrobson) [20:01:40] (03CR) 10Urbanecm: [C: 03+2] ThreadItemStore: Ignore duplicates caused by duplicate executions [extensions/DiscussionTools] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945803 (https://phabricator.wikimedia.org/T323080) (owner: 10Bartosz Dziewoński) [20:02:31] Urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946555 I will update in deployment page [20:02:36] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:944318|Fix finnish projects, remove unused SVG/PNGs, resize wikiversity (T343278)]], [[gerrit:946608|Wikivoyage logos should always be on a single line (T343279)]] [20:02:40] T343279: Confirm Wikivoyage logos inconsistency is within brand guidelines - https://phabricator.wikimedia.org/T343279 [20:02:40] 
T343278: Follow up recent logo deploys (Finnish projects and taglines) - https://phabricator.wikimedia.org/T343278 [20:03:00] (03PS3) 10Urbanecm: update idwiktionary legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946555 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [20:03:02] (03CR) 10Urbanecm: [C: 03+2] update idwiktionary legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946555 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [20:03:44] (03Merged) 10jenkins-bot: update idwiktionary legacy vector logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946555 (https://phabricator.wikimedia.org/T341175) (owner: 10Anzx) [20:03:59] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:944318|Fix finnish projects, remove unused SVG/PNGs, resize wikiversity (T343278)]], [[gerrit:946608|Wikivoyage logos should always be on a single line (T343279)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimen [20:03:59] tal XWD option) [20:04:09] Jdlrobson: please test your patches [20:05:53] urbanecm: looking [20:06:39] (03Merged) 10jenkins-bot: ThreadItemStore: Ignore duplicates caused by duplicate executions [extensions/DiscussionTools] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945803 (https://phabricator.wikimedia.org/T323080) (owner: 10Bartosz Dziewoński) [20:06:49] wikivoyage one LGTM [20:07:54] ack [20:08:02] urbanecm: as does the other one [20:08:06] you can sync those now! [20:08:07] proceeding [20:08:10] !log urbanecm@deploy1002 jdlrobson and urbanecm: Continuing with sync [20:08:14] I also have one more patch if we have time. 
[20:08:57] (03PS1) 10Jdlrobson: Update wikisource wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946616 (https://phabricator.wikimedia.org/T341255) [20:09:03] Jdlrobson: sure [20:09:08] is this the one you just uploaded? [20:09:28] yep [20:09:31] just adding to wikitech [20:09:41] (done) [20:09:44] ty [20:09:58] (03CR) 10Urbanecm: [C: 03+2] Update wikisource wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946616 (https://phabricator.wikimedia.org/T341255) (owner: 10Jdlrobson) [20:11:15] (03Merged) 10jenkins-bot: Update wikisource wordmarks and taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946616 (https://phabricator.wikimedia.org/T341255) (owner: 10Jdlrobson) [20:13:54] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:944318|Fix finnish projects, remove unused SVG/PNGs, resize wikiversity (T343278)]], [[gerrit:946608|Wikivoyage logos should always be on a single line (T343279)]] (duration: 11m 18s) [20:13:59] T343279: Confirm Wikivoyage logos inconsistency is within brand guidelines - https://phabricator.wikimedia.org/T343279 [20:13:59] T343278: Follow up recent logo deploys (Finnish projects and taglines) - https://phabricator.wikimedia.org/T343278 [20:14:02] Jdlrobson: deployed :) [20:14:07] urbanecm: yipee [20:14:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:945803|ThreadItemStore: Ignore duplicates caused by duplicate executions (T323080 T341811)]], [[gerrit:946616|Update wikisource wordmarks and taglines (T341255)]], [[gerrit:946555|update idwiktionary legacy vector logo (T341175)]] [20:14:29] T341255: Design: Provide wordmarks/taglines for Wikisource projects - https://phabricator.wikimedia.org/T341255 [20:14:30] T323080: "Duplicate entry 'XXX-YYY' for key 'itr_itemid_id_revision_id'" in "INSERT INTO `discussiontools_item_revisions`" - https://phabricator.wikimedia.org/T323080 [20:14:30] T341811: DBQueryError: "Duplicate entry … for key 
'itp_items_id_page_id'" in "INSERT INTO `discussiontools_item_pages`" - https://phabricator.wikimedia.org/T341811 [20:14:31] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [20:15:45] !log urbanecm@deploy1002 urbanecm and jdlrobson and anzx and matmarex: Backport for [[gerrit:945803|ThreadItemStore: Ignore duplicates caused by duplicate executions (T323080 T341811)]], [[gerrit:946616|Update wikisource wordmarks and taglines (T341255)]], [[gerrit:946555|update idwiktionary legacy vector logo (T341175)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, [20:15:45] mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:16:01] Jdlrobson: MatmaRex: aanzx: please test your patches :) [20:16:03] Testing [20:16:14] looking [20:16:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10taavi) Hey - anything I can do to move this forward? [20:16:48] urbanecm: looking now [20:16:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2003.codfw.wmnet with reason: host reimage [20:17:41] urbanecm: idwiktionary logo looks good [20:17:46] ack [20:18:21] urbanecm: nothing seems broken, i will look at the logs later to see if the errors that this should fix have disappeared [20:18:35] urbanecm: LGTM [20:18:37] ack [20:18:39] syncing [20:18:41] !log urbanecm@deploy1002 urbanecm and jdlrobson and anzx and matmarex: Continuing with sync [20:19:17] MatmaRex: regarding the script at s1, as you mentioned in Slack... is there a way to determine how far it actually got? 
[20:19:38] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs2003.codfw.wmnet with reason: host reimage [20:20:05] yeah, i was going to reply to you [20:20:27] urbanecm: it should have printed "Finished in ..." at the end, so if it hasn't, that means it died somehow [20:20:50] (03PS1) 10CDanis: Filter tracing headers from the outside [puppet] - 10https://gerrit.wikimedia.org/r/946617 (https://phabricator.wikimedia.org/T320559) [20:20:55] or that the output tracking stopped working :) [20:21:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1080.eqiad.wmnet with OS bullseye [20:21:24] (03CR) 10CDanis: [C: 03+2] Filter tracing headers from the outside [puppet] - 10https://gerrit.wikimedia.org/r/946617 (https://phabricator.wikimedia.org/T320559) (owner: 10CDanis) [20:21:33] MatmaRex: i can re-start it with the latest --start it printed, if that would be benefitial. [20:22:01] urbanecm: yes please [20:22:06] ok, doing [20:22:15] i'd be surprised if it kept running but stopped printing somehow. i've never seen that happen [20:22:32] and nothing bad will happen if it processes some data again. it will just take longer [20:23:17] (i was going to schedule that for tomorrow, but we might as well start now!) 
[20:24:10] (03PS1) 10CDanis: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/946619 [20:24:27] (03CR) 10CDanis: [V: 03+2 C: 03+2] fix typo [puppet] - 10https://gerrit.wikimedia.org/r/946619 (owner: 10CDanis) [20:24:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:945803|ThreadItemStore: Ignore duplicates caused by duplicate executions (T323080 T341811)]], [[gerrit:946616|Update wikisource wordmarks and taglines (T341255)]], [[gerrit:946555|update idwiktionary legacy vector logo (T341175)]] (duration: 10m 22s) [20:24:50] !log mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki=enwiki --current --all --start '["18618299"]' # T315510 [20:24:52] T341255: Design: Provide wordmarks/taglines for Wikisource projects - https://phabricator.wikimedia.org/T341255 [20:24:53] T323080: "Duplicate entry 'XXX-YYY' for key 'itr_itemid_id_revision_id'" in "INSERT INTO `discussiontools_item_revisions`" - https://phabricator.wikimedia.org/T323080 [20:24:53] T341811: DBQueryError: "Duplicate entry … for key 'itp_items_id_page_id'" in "INSERT INTO `discussiontools_item_pages`" - https://phabricator.wikimedia.org/T341811 [20:24:53] T341175: Change the Indonesian Wiktionary's name and project namespace from Wiktionary to Wikikamus - https://phabricator.wikimedia.org/T341175 [20:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:56] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:25:02] and we're synced now :) [20:25:27] Thanks [20:26:11] MatmaRex: and script's running (see my !log) [20:26:24] thanks [20:30:36] (03PS2) 10CDanis: cache: move vendor proxy lookup to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/946566 (owner: 10Giuseppe Lavagetto) [20:31:36] thanks urbanecm ! 
[20:31:42] np [20:38:14] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1080.eqiad.wmnet with reason: host reimage [20:39:23] urbanecm: still around? [20:39:27] yes [20:39:28] Small follow up for large logo :) https://pa.wikisource.org/?useskin=vector-2022 [20:39:38] hehe [20:39:43] link and i'll deploy :) [20:40:08] just resizing now [20:41:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1080.eqiad.wmnet with reason: host reimage [20:42:44] (03PS1) 10Jdlrobson: unset orwikisource logo and resize pawikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946623 (https://phabricator.wikimedia.org/T341255) [20:42:44] ^ pushed! [20:43:20] and on calendar urbanecm [20:43:30] (03CR) 10Urbanecm: [C: 03+2] unset orwikisource logo and resize pawikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946623 (https://phabricator.wikimedia.org/T341255) (owner: 10Jdlrobson) [20:44:21] (03Merged) 10jenkins-bot: unset orwikisource logo and resize pawikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946623 (https://phabricator.wikimedia.org/T341255) (owner: 10Jdlrobson) [20:44:35] (03CR) 10Urbanecm: [C: 04-1] "I352622e336c6cf4e96a1e29165876acb60fd0744 has this effect automatically. 
granting oathauth-enable separately is redundant (and creates a r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945808 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [20:45:00] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:946623|unset orwikisource logo and resize pawikisource logo (T341255)]] [20:45:03] T341255: Design: Provide wordmarks/taglines for Wikisource projects - https://phabricator.wikimedia.org/T341255 [20:45:10] (03CR) 10Urbanecm: [C: 03+1] Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [20:45:17] (03CR) 10Jforrester: Wikifunctions: Add oathauth-enable to wikifunctions-staff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945808 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [20:45:24] (03Abandoned) 10Jforrester: Wikifunctions: Add oathauth-enable to wikifunctions-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945808 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester) [20:45:30] (03PS2) 10Jforrester: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) [20:46:15] that was quick :) [20:46:34] !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:946623|unset orwikisource logo and resize pawikisource logo (T341255)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:46:43] Jdlrobson: pulled for testing :) [20:47:12] urbanecm: much better! please sync! 
[20:47:17] on it [20:47:18] !log urbanecm@deploy1002 jdlrobson and urbanecm: Continuing with sync [20:53:10] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:946623|unset orwikisource logo and resize pawikisource logo (T341255)]] (duration: 08m 09s) [20:53:13] T341255: Design: Provide wordmarks/taglines for Wikisource projects - https://phabricator.wikimedia.org/T341255 [20:53:15] Jdlrobson: and live :) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230807T2100). [21:03:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10andrea.denisse) Hi @taavi ! Apologies for the delay this is taking. Unfortunately the patch I sent to grant you access to ops (#940269) hasn't been reviewed yet and I can't me... [21:03:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1080.eqiad.wmnet with OS bullseye [21:03:36] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wcqs2003.codfw.wmnet with OS bullseye [21:03:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1081.eqiad.wmnet with OS bullseye [21:05:52] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:16:33] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:17:50] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1081.eqiad.wmnet with reason: host reimage [21:20:17] (03CR) 10Eevans: [C: 03+2] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/943511 (owner: 10WMDE-leszek) [21:20:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1081.eqiad.wmnet with reason: host reimage [21:27:17] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) [21:38:37] (03PS1) 10Eevans: admin: add darthmon to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) [21:41:31] (03CR) 10Eevans: [C: 03+1] admin: add adri to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946515 (https://phabricator.wikimedia.org/T342969) (owner: 10Filippo Giunchedi) [21:42:21] 10sre-alert-triage, 10Data-Platform-SRE: search.svc.eqiad.wmnet, search.svc.codfw.wmnet certs about to expire - https://phabricator.wikimedia.org/T343319 (10RKemper) Just some investigation we did to understand where the metrics come from: `probe_ssl_earliest_cert_expiry` comes from the blackbox exporter. That... 
[21:43:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1081.eqiad.wmnet with OS bullseye
[21:49:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:59] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:51:05] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:29] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:51:33] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:52:15] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:01:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:59] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wcqs2003.codfw.wmnet
[22:17:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1010.eqiad.wmnet
[22:21:56] (CR) BCornwall: [V: +1] "Passes both lintian and piuparts tests" [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/946604 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[22:22:42] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wcqs-blazegraph.service Failed on wcqs2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:24:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1010.eqiad.wmnet
[22:24:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1011.eqiad.wmnet
[22:30:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1011.eqiad.wmnet
[22:30:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1012.eqiad.wmnet
[22:38:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1012.eqiad.wmnet
[22:38:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1013.eqiad.wmnet
[22:43:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1013.eqiad.wmnet
[22:43:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1014.eqiad.wmnet
[22:49:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1014.eqiad.wmnet
[22:49:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1015.eqiad.wmnet
[22:56:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1015.eqiad.wmnet
[23:05:44] (PS1) BCornwall: Rebuild against Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154)
[23:11:41] (PS2) BCornwall: Rebuild against Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154)
[23:11:51] (CR) CI reject: [V: -1] Rebuild against Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[23:12:37] (PS3) BCornwall: Rebuild against Bullseye [debs/varnish-modules] - https://gerrit.wikimedia.org/r/946635 (https://phabricator.wikimedia.org/T342154)
[23:19:00] (CR) Krinkle: [C: +2] api: Fix broken /api/index.html rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/941969 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[23:19:13] (CR) TrainBranchBot: [C: +2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/941969 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[23:19:43] (Merged) jenkins-bot: api: Fix broken /api/index.html rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/941969 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[23:19:57] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:941969|api: Fix broken /api/index.html rendering (T113114)]]
[23:20:00] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[23:21:24] !log krinkle@deploy1002 krinkle: Backport for [[gerrit:941969|api: Fix broken /api/index.html rendering (T113114)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[23:23:15] !log krinkle@deploy1002 krinkle: Continuing with sync
[23:27:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:28:58] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:941969|api: Fix broken /api/index.html rendering (T113114)]] (duration: 09m 00s)
[23:29:01] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[23:32:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency