[00:20:51] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:14:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Aklapper) [01:15:20] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Aklapper) [01:15:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Aklapper) [01:36:39] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:49] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:14:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:15:34] 10SRE-swift-storage, 10Commons, 
10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Base) The deletion link should be remove... [02:19:51] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:45:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 2.786 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:07:22] 10SRE, 10SRE-swift-storage, 10Commons: File not found in Commons - https://phabricator.wikimedia.org/T328889 (10Peachey88) [03:08:58] 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Peachey88) [03:24:37] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:57:05] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:04:07] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 166 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:04:17] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:07:47] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:09:17] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:12:53] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:45:21] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:13:53] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:11] (03PS2) 10Hashar: scap: remove plugins/.eslintrc.json before promote [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) [07:04:00] (03CR) 10Hashar: [C: 03+2] "Look" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [07:04:31] (03Merged) 10jenkins-bot: scap: remove plugins/.eslintrc.json before promote [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/884317 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [07:06:50] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e09efc0]: remove plugins/.eslintrc.json | T328134 [07:06:54] T328134: Remove plugins/.eslintrc.json - https://phabricator.wikimedia.org/T328134 [07:07:01] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e09efc0]: remove plugins/.eslintrc.json | T328134 (duration: 00m 10s) [07:07:31] (03PS1) 10Marostegui: dbproxy1016,dbproxy1020: Place db1164 as secondary [puppet] - 10https://gerrit.wikimedia.org/r/886624 (https://phabricator.wikimedia.org/T328404) [07:09:19] (03CR) 10Marostegui: [C: 03+2] dbproxy1016,dbproxy1020: Place db1164 as secondary [puppet] - 10https://gerrit.wikimedia.org/r/886624 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [07:12:05] I am going to restart Gerrit for a deployment [07:14:27] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e09efc0]: remove plugins/.eslintrc.json [07:14:32] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e09efc0]: remove plugins/.eslintrc.json (duration: 00m 05s) [07:15:18] <_joe_> I was about to ask [07:17:43] !log Restarted Gerrit for deployment [07:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:13] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depool es2020 es2024 es2026 es2027 es2028 T327925', diff saved to https://phabricator.wikimedia.org/P43586 and previous config saved to /var/cache/conftool/dbconfig/20230206-071913-root.json [07:19:32] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [07:20:01] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:09] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:28:00] (03CR) 10Marostegui: [C: 03+1] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [07:30:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2094 db2097 db2103 db2104 db2105 db2106 db2121 db2122 db2132 db2133 db2136 db2142 db2145 db2146 db2153 db2154 db2155 db2156 db2157 db2158 db2175 db2176 db2183 T327925', diff saved to https://phabricator.wikimedia.org/P43587 and previous config saved to /var/cache/conftool/dbconfig/20230206-073015-root.json [07:30:19] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [07:31:02] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:41:32] (03CR) 10Giuseppe Lavagetto: sre: add alerting for mediawiki on k8s (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [07:41:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [07:42:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] icinga: remove 
mediawiki alerts [puppet] - 10https://gerrit.wikimedia.org/r/885288 (owner: 10Giuseppe Lavagetto) [07:43:27] (03Merged) 10jenkins-bot: sre: add alerting for mediawiki on k8s [alerts] - 10https://gerrit.wikimedia.org/r/797315 (owner: 10Giuseppe Lavagetto) [07:49:12] 10SRE, 10Icinga, 10SRE Observability, 10serviceops: High average POST latency for mw requests on api_appserver in codfw on alert1001 - https://phabricator.wikimedia.org/T326544 (10Joe) 05Open→03Resolved [07:56:52] 10SRE, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [07:57:22] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) 05Open→03Invalid ATS is no longer being used to terminate TLS so we can't close this one [08:00:04] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T0800) [08:00:04] WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:01:27] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:23] 10SRE, 10Traffic: varnish test text/02-frontend-headers.vtc is currently failing in production - https://phabricator.wikimedia.org/T328898 (10Vgutierrez) [08:03:58] 10SRE, 10Traffic: varnish test text/02-frontend-headers.vtc is currently failing in production - https://phabricator.wikimedia.org/T328898 (10Vgutierrez) p:05Triage→03Medium [08:04:56] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] Revert "Request high-entropy Sec-CH-UA* client hints" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx) [08:05:32] (03PS1) 10Muehlenhoff: Remove access for dannyh [puppet] - 10https://gerrit.wikimedia.org/r/886798 [08:08:41] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for dannyh [puppet] - 10https://gerrit.wikimedia.org/r/886798 (owner: 10Muehlenhoff) [08:12:46] (03PS1) 10Slyngshede: P:openldap Extend wmf-user schema with SUL account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 [08:13:52] (03PS2) 10Slyngshede: P:openldap Extend wmf-user schema with SUL account. 
[puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [08:16:47] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:19:47] (03PS1) 10Hashar: scap: use proper path when deleting .eslintrc.json [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/886801 (https://phabricator.wikimedia.org/T328134) [08:20:14] (03CR) 10Hashar: [C: 03+2] scap: use proper path when deleting .eslintrc.json [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/886801 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [08:20:46] (03Merged) 10jenkins-bot: scap: use proper path when deleting .eslintrc.json [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/886801 (https://phabricator.wikimedia.org/T328134) (owner: 10Hashar) [08:23:31] 10SRE, 10Infrastructure-Foundations: IDM integration into CAS SSO - https://phabricator.wikimedia.org/T320799 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [08:23:35] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [08:24:11] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [08:24:16] 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10SLyngshede-WMF) [08:30:06] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) Missing actual email sending [08:30:20] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) a:03SLyngshede-WMF [08:41:05] Damn I missed my deployment. 
[08:41:12] Anyone still there? [08:43:29] hey WMDE-Fisch I usually have the backport window on Thurs but I could be here for this, question: will the merge finish in time? with only 15 minutes left I mean [08:43:36] * urbanecm waves too [08:43:56] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) [08:44:03] oh cool, urbanecm you're actually listed, happy to step back [08:44:07] apergos: it fails CI anyway, we'd have to forcemerge [08:44:44] up to you (as actually managing the window) if you want to do that in the last 15 minutes [08:44:53] WMDE-Fisch: let's do that :) [08:45:02] \o/ [08:45:16] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "forcemerging per Thiemo" [extensions/Kartographer] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886105 (https://phabricator.wikimedia.org/T328739) (owner: 10WMDE-Fisch) [08:46:07] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886105|Fix and add mising parser test for maplink with suppressed text="" (T328739)]] [08:46:10] T328739: Kartographer maplink presents senseless coordinates in Wikivoyage for geoline external data - https://phabricator.wikimedia.org/T328739 [08:48:54] (03CR) 10Hashar: phabricator: change phd home dir to /var/lib/phd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:49:02] (03PS9) 10Hashar: phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) [08:49:13] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:49:40] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff) [08:51:23] (03CR) 10Hashar: "I have cherry picked the patch against the tip of production branch after the original parent change got merged." [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:51:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [08:54:34] ...i really wish scap backport was faster :-/ [08:54:52] *sigh* [08:54:56] (03CR) 10JMeybohm: [C: 04-1] "`failure-domain.beta.kubernetes.io` annotations are deprecated and will be removed (see https://phabricator.wikimedia.org/T325066). Please" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [08:56:03] !log urbanecm@deploy1002 wmde-fisch and urbanecm: Backport for [[gerrit:886105|Fix and add mising parser test for maplink with suppressed text="" (T328739)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:56:06] T328739: Kartographer maplink presents senseless coordinates in Wikivoyage for geoline external data - https://phabricator.wikimedia.org/T328739 [08:56:12] WMDE-Fisch: can you test please? [08:56:26] Yep [08:57:59] urbanecm: Works! [08:58:05] great, syncing! [08:58:27] * urbanecm has doubts it'll finish in 2 minutes, but we also don't have anything immediately after this window [08:58:36] ^^' [08:59:21] good good [09:04:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:04:20] (03CR) 10Majavah: [C: 04-1] P:openldap Extend wmf-user schema with SUL account. 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [09:04:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:05:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:05:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886105|Fix and add mising parser test for maplink with suppressed text="" (T328739)]] (duration: 18m 56s) [09:05:06] T328739: Kartographer maplink presents senseless coordinates in Wikivoyage for geoline external data - https://phabricator.wikimedia.org/T328739 [09:05:10] WMDE-Fisch: it's live now [09:05:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:05:28] urbanecm: Thanks so much! [09:05:37] but...19 minutes is too long :-/ [09:05:40] no problem :) [09:05:49] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/875266/1599/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [09:06:32] thanks from me too :-) [09:06:44] np [09:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:10:17] (03PS1) 10Alexandros Kosiaris: Include wikivoyage in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/886828 (https://phabricator.wikimedia.org/T226931) [09:10:51] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:12:54] (03CR) 10Volans: [C: 
04-1] "I've spotted some issues, see inline for the details." [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [09:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:16:55] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39390/console" [puppet] - 10https://gerrit.wikimedia.org/r/852906 (https://phabricator.wikimedia.org/T293055) (owner: 10Hashar) [09:17:44] (03CR) 10David Caro: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/852906 (https://phabricator.wikimedia.org/T293055) (owner: 10Hashar) [09:18:57] (03PS2) 10Vgutierrez: varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) [09:18:59] (03PS2) 10Jaime Nuche: jenkins: remove hardcoded values from sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) [09:19:01] (03PS11) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [09:20:48] (03CR) 10David Caro: [V: 03+1 C: 03+2] extdist: remove integration/composer.git [puppet] - 10https://gerrit.wikimedia.org/r/852906 (https://phabricator.wikimedia.org/T293055) (owner: 10Hashar) [09:26:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [09:32:50] (03PS1) 10Jcrespo: dbbackups: Add a way to disable mydumper runs and reschedule them [puppet] - 10https://gerrit.wikimedia.org/r/886833 [09:32:59] ACKNOWLEDGEMENT - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Singtel maintenance CHG000000132280 - The acknowledgement expires at: 2023-02-15 09:32:32. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:32:59] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Singtel maintenance CHG000000132280 - The acknowledgement expires at: 2023-02-15 09:32:32. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:33:39] (03CR) 10Jelto: [C: 03+2] idp: add gitlab-replica-old to gitlab-replica service_id [puppet] - 10https://gerrit.wikimedia.org/r/886333 (https://phabricator.wikimedia.org/T328635) (owner: 10Jelto) [09:33:56] (03CR) 10Jelto: [C: 03+2] gitlab: get gitlab url from config while restoring [puppet] - 10https://gerrit.wikimedia.org/r/886336 (https://phabricator.wikimedia.org/T328635) (owner: 10Jelto) [09:34:56] (03PS3) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [09:35:32] (03PS2) 10Jcrespo: dbbackups: Add a way to disable mydumper runs and reschedule them [puppet] - 10https://gerrit.wikimedia.org/r/886833 [09:37:07] (03CR) 10Slyngshede: P:openldap Extend wmf-user schema with global account. 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [09:42:13] (03PS1) 10Jcrespo: dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) [09:42:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:42:51] (03CR) 10Marostegui: [C: 03+1] dbbackups: Add a way to disable mydumper runs and reschedule them [puppet] - 10https://gerrit.wikimedia.org/r/886833 (owner: 10Jcrespo) [09:43:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:28] (03CR) 10Majavah: P:openldap Extend wmf-user schema with global account. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [09:44:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:48:13] (03PS3) 10Jcrespo: dbbackups: Add a way to disable mydumper runs and reschedule them [puppet] - 10https://gerrit.wikimedia.org/r/886833 [09:50:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:17] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add a way to disable mydumper runs and reschedule them [puppet] - 10https://gerrit.wikimedia.org/r/886833 (owner: 10Jcrespo) [09:51:39] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/886833/39393/" [puppet] - 10https://gerrit.wikimedia.org/r/886833 (owner: 10Jcrespo) [09:51:59] (03PS2) 10Jcrespo: dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) [09:53:50] (03PS1) 10Clément Goubert: sre: add missing servergroup to MediaWikiHighErrorRate summary [alerts] - 10https://gerrit.wikimedia.org/r/886836 [09:54:02] (03CR) 10JMeybohm: "During the reimages of the wikikube staging clusters I also created a downtime/silence in alertmanager with a selector like "site=codfw,pr" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:57:06] 10SRE, 10Traffic: varnish test text/02-frontend-headers.vtc is currently failing in production - https://phabricator.wikimedia.org/T328898 (10Vgutierrez) This could be an intermittent issue as my latest run against a text node worked as expected: `0 tests failed, 0 tests 
skipped, 35 tests passed` [09:57:12] (03PS5) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [09:57:51] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10akosiaris) >>! In T327925#8587186, @Marostegui wrote: >>>! In T327925#8587104, @Joe wrote: >> I would suggest that instead of handling individual systems... [09:58:03] (03CR) 10Filippo Giunchedi: [C: 03+1] sre: add missing servergroup to MediaWikiHighErrorRate summary [alerts] - 10https://gerrit.wikimedia.org/r/886836 (owner: 10Clément Goubert) [09:58:23] (03CR) 10Clément Goubert: [C: 03+2] sre: add missing servergroup to MediaWikiHighErrorRate summary [alerts] - 10https://gerrit.wikimedia.org/r/886836 (owner: 10Clément Goubert) [09:59:32] (03Merged) 10jenkins-bot: sre: add missing servergroup to MediaWikiHighErrorRate summary [alerts] - 10https://gerrit.wikimedia.org/r/886836 (owner: 10Clément Goubert) [09:59:37] (03CR) 10Vgutierrez: [C: 03+2] varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/886337 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [10:00:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:01:26] (03CR) 10Jcrespo: [C: 04-1] "Wrong role, amending: https://puppet-compiler.wmflabs.org/output/886834/39395/" [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [10:01:49] (03CR) 10Klausman: Add sre.k8s.upgrade-cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:04:26] (03PS6) 10Giuseppe Lavagetto: Add 
sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:04:28] (03CR) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:05:15] 10SRE, 10Fundraising-Backlog, 10LDAP-Access-Requests: LDAP access to the wmf group for Anil Kanji - https://phabricator.wikimedia.org/T328805 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi {{done}} [10:05:37] (03PS3) 10Jcrespo: dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) [10:08:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling calendar - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [10:09:11] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/886834/39396/" [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [10:09:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [10:09:47] !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@b798462] (releasing): (no justification provided) [10:10:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) p:05Triage→03Medium [10:10:26] !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@b798462] (releasing): (no justification provided) (duration: 00m 38s) [10:10:50] (03CR) 10JMeybohm: "Wouldn't it be possible to select the cookbook for etcd and control planes based on netbox_server.is_virtual as well, like you do for work" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [10:10:55] (03CR) 10Marostegui: [C: 03+1] dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [10:11:31] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) Cool! I am going to repool the hosts then :) [10:13:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43591 and previous config saved to /var/cache/conftool/dbconfig/20230206-101301-root.json [10:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43592 and previous config saved to /var/cache/conftool/dbconfig/20230206-101308-root.json [10:13:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43593 and previous config saved to /var/cache/conftool/dbconfig/20230206-101315-root.json [10:13:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43594 and previous config saved to /var/cache/conftool/dbconfig/20230206-101332-root.json [10:13:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43595 and previous config saved to /var/cache/conftool/dbconfig/20230206-101336-root.json [10:13:48] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) p:05Triage→03High [10:21:07] 10SRE, 10Traffic, 
10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) @Jcross @Htriedman https://gerrit.wikimedia.org/r/886337 got merged a few minutes ago, initial tests in cp6016 look good: ` vgutierrez@cp6016:~$ curl -v -o /dev/null "https://... [10:22:35] (03CR) 10Clément Goubert: [C: 03+2] configmaster: Remove disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [10:23:34] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:24:30] (03PS4) 10Jcrespo: dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) [10:24:32] (03PS1) 10Jcrespo: bacula: Reschedule run of es backups codfw -> eqiad [puppet] - 10https://gerrit.wikimedia.org/r/886837 [10:24:55] (03CR) 10Jcrespo: [C: 04-1] "Probably shouldn't be merged." [puppet] - 10https://gerrit.wikimedia.org/r/886837 (owner: 10Jcrespo) [10:26:15] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. - cmooney@cumin1001" [10:27:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudsw1-b1-codfw mgmt IP. 
- cmooney@cumin1001" [10:27:34] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:28:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43596 and previous config saved to /var/cache/conftool/dbconfig/20230206-102806-root.json [10:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43597 and previous config saved to /var/cache/conftool/dbconfig/20230206-102812-root.json [10:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43598 and previous config saved to /var/cache/conftool/dbconfig/20230206-102820-root.json [10:28:27] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Delay codfw es (db content) backups by one day [puppet] - 10https://gerrit.wikimedia.org/r/886834 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [10:28:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43599 and previous config saved to /var/cache/conftool/dbconfig/20230206-102837-root.json [10:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43600 and previous config saved to /var/cache/conftool/dbconfig/20230206-102841-root.json [10:30:57] (03PS1) 10Jcrespo: Revert "dbbackups: Delay codfw es (db content) backups by one day" [puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925) [10:31:33] .8 [10:31:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:31:58] claime: it did to me [10:33:55] (03CR) 10Jcrespo: [C: 04-2] "Not to be merged until Wednesday morning (otherwise, backups won't run this week)" 
[puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [10:36:37] !log Upgrade db1115 (db_inventory master) to 10.6. T328408 [10:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:40] T328408: Migrate db_inventory section to MariaDB 10.6 - https://phabricator.wikimedia.org/T328408 [10:37:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:37:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:07] (03PS1) 10Marostegui: db1115: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/886838 (https://phabricator.wikimedia.org/T328408) [10:38:32] (03CR) 10Marostegui: [C: 03+2] db1115: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/886838 (https://phabricator.wikimedia.org/T328408) (owner: 10Marostegui) [10:39:06] (03CR) 10Jbond: [C: 03+1] phabricator: create phd home directory on service start [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [10:41:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43601 and previous config saved to /var/cache/conftool/dbconfig/20230206-104310-root.json [10:43:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43602 and previous config saved to /var/cache/conftool/dbconfig/20230206-104317-root.json [10:43:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P43603 and previous config saved to /var/cache/conftool/dbconfig/20230206-104324-root.json [10:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43604 and previous config saved to /var/cache/conftool/dbconfig/20230206-104341-root.json [10:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43605 and previous config saved to /var/cache/conftool/dbconfig/20230206-104346-root.json [10:45:59] (03CR) 10Clément Goubert: [C: 03+1] Include wikivoyage in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/886828 (https://phabricator.wikimedia.org/T226931) (owner: 10Alexandros Kosiaris) [10:46:42] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/886839 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [10:48:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:19] (03PS4) 10Slyngshede: P:openldap Extend wmf-user schema with global account. 
[puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [10:49:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] Include wikivoyage in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/886828 (https://phabricator.wikimedia.org/T226931) (owner: 10Alexandros Kosiaris) [10:49:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [10:51:43] 10SRE, 10Traffic: varnish test text/02-frontend-headers.vtc is currently failing in production - https://phabricator.wikimedia.org/T328898 (10Vgutierrez) 05Open→03Declined hmmm this is triggered by https://gerrit.wikimedia.org/r/c/operations/puppet/+/886401, error was caused by running a non rebased test s... [10:52:10] (03CR) 10Jbond: [C: 03+2] monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [10:52:18] (03PS6) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [10:52:32] (03PS1) 10Vgutierrez: varnish: Remove upload.wm.o test from text test [puppet] - 10https://gerrit.wikimedia.org/r/886840 (https://phabricator.wikimedia.org/T262996) [10:54:08] (03Merged) 10jenkins-bot: Include wikivoyage in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/886828 (https://phabricator.wikimedia.org/T226931) (owner: 10Alexandros Kosiaris) [10:55:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [10:56:10] (03PS1) 10Jelto: sre.gitlab.upgrade: check for unknown version [cookbooks] - 10https://gerrit.wikimedia.org/r/886841 (https://phabricator.wikimedia.org/T323569) [10:56:26] 10SRE, 
10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (10Clement_Goubert) [10:56:50] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:57:09] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (10Clement_Goubert) p:05Triage→03Low [10:58:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43606 and previous config saved to /var/cache/conftool/dbconfig/20230206-105815-root.json [10:58:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43607 and previous config saved to /var/cache/conftool/dbconfig/20230206-105822-root.json [10:58:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43608 and previous config saved to /var/cache/conftool/dbconfig/20230206-105829-root.json [10:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43609 and previous config saved to /var/cache/conftool/dbconfig/20230206-105846-root.json [10:58:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43610 and previous config saved to /var/cache/conftool/dbconfig/20230206-105851-root.json [10:58:55] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:58:59] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:59:16] !log akosiaris@deploy1002 helmfile [staging] START 
helmfile.d/services/changeprop: apply [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1100) [11:00:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (and very welcome change with my o11y hat on)" [puppet] - 10https://gerrit.wikimedia.org/r/883151 (https://phabricator.wikimedia.org/T239862) (owner: 10Clément Goubert) [11:01:26] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [11:01:39] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:02:20] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:03:16] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:03:28] !log deploy changeprop 0.10.19, adding wikivoyage to list of domains the mobile-sections get rerendered for. T226931 [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:32] T226931: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 [11:03:35] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:03:53] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:04:24] (03PS1) 10Vgutierrez: aptrepo: Add missing Suite for ceph-quincy [puppet] - 10https://gerrit.wikimedia.org/r/886842 (https://phabricator.wikimedia.org/T326945) [11:05:52] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There are 3869 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [11:05:54] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There are 3872 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [11:06:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886842 (https://phabricator.wikimedia.org/T326945) (owner: 10Vgutierrez) [11:07:22] 3800 unmerged cahnges? [11:07:25] *changes [11:08:03] mmhh claime ^ FYI [11:08:32] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:08:39] vgutierrez: wth [11:08:52] Ok so crashing in the middle of puppet-merge is *bad* [11:09:07] (03CR) 10Btullis: [C: 03+1] aptrepo: Add missing Suite for ceph-quincy [puppet] - 10https://gerrit.wikimedia.org/r/886842 (https://phabricator.wikimedia.org/T326945) (owner: 10Vgutierrez) [11:09:40] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:08] claime: yes it is :) i can take a look [11:10:26] do you still have the terminal output you can paste somewhere [11:11:46] jbond: can we discuss this in one place ? 
we're already talking about it on -sre [11:11:55] ahh sorry let me go there [11:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43612 and previous config saved to /var/cache/conftool/dbconfig/20230206-111320-root.json [11:13:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43613 and previous config saved to /var/cache/conftool/dbconfig/20230206-111327-root.json [11:13:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43614 and previous config saved to /var/cache/conftool/dbconfig/20230206-111334-root.json [11:13:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43615 and previous config saved to /var/cache/conftool/dbconfig/20230206-111351-root.json [11:13:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43616 and previous config saved to /var/cache/conftool/dbconfig/20230206-111356-root.json [11:21:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10fgiunchedi) @Ottomata @odimitrijevic what do you think re: the request above? thank you ! 
[11:23:49] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (10Clement_Goubert) [11:26:02] 10SRE, 10SRE-tools, 10Spicerack, 10serviceops, 10Datacenter-Switchover: Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (10Clement_Goubert) p:05Triage→03Low [11:27:58] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on puppetmaster2002.codfw.wmnet,puppetmaster1002.eqiad.wmnet with reason: Decom [11:28:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on puppetmaster2002.codfw.wmnet,puppetmaster1002.eqiad.wmnet with reason: Decom [11:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43617 and previous config saved to /var/cache/conftool/dbconfig/20230206-112825-root.json [11:28:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43618 and previous config saved to /var/cache/conftool/dbconfig/20230206-112832-root.json [11:28:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43619 and previous config saved to /var/cache/conftool/dbconfig/20230206-112839-root.json [11:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43620 and previous config saved to /var/cache/conftool/dbconfig/20230206-112856-root.json [11:29:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43621 and previous config saved to /var/cache/conftool/dbconfig/20230206-112900-root.json 
[11:29:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43622 and previous config saved to /var/cache/conftool/dbconfig/20230206-112942-root.json [11:29:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43623 and previous config saved to /var/cache/conftool/dbconfig/20230206-112948-root.json [11:30:05] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: decomission puppetmaster[12]00[12] and replace them with puppetmaster[12]00[45] - https://phabricator.wikimedia.org/T314136 (10jbond) I'm going to add the puppetmasters[12]002 back into services. Now that puppetserver 7 is out it would be nice to bui... [11:30:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43624 and previous config saved to /var/cache/conftool/dbconfig/20230206-113053-root.json [11:31:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43625 and previous config saved to /var/cache/conftool/dbconfig/20230206-113104-root.json [11:31:32] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:31:41] (03PS9) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:33:00] (03PS1) 10Jbond: Revert "puppetmaster: remove puppetmaster[12]002 for decom" [puppet] - 10https://gerrit.wikimedia.org/r/886813 [11:33:46] (03PS2) 10Jbond: Revert "puppetmaster: remove puppetmaster[12]002 for decom" [puppet] - 
10https://gerrit.wikimedia.org/r/886813 [11:34:04] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [11:35:22] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster: remove puppetmaster[12]002 for decom" [puppet] - 10https://gerrit.wikimedia.org/r/886813 (owner: 10Jbond) [11:35:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [11:38:37] (03PS1) 10Jbond: puppetmaster: add load factor [puppet] - 10https://gerrit.wikimedia.org/r/886867 [11:39:00] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster: add load factor [puppet] - 10https://gerrit.wikimedia.org/r/886867 (owner: 10Jbond) [11:41:20] (03CR) 10Volans: [C: 03+1] "LGTM, just one general question. Is there any alert on alertmanager that might be affected too?" [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43626 and previous config saved to /var/cache/conftool/dbconfig/20230206-114446-root.json [11:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43627 and previous config saved to /var/cache/conftool/dbconfig/20230206-114453-root.json [11:45:02] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:45:15] (03PS1) 10Jbond: add whitespace to test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/886868 [11:45:17] (03PS10) 10Clément Goubert: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries 
[cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:45:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43628 and previous config saved to /var/cache/conftool/dbconfig/20230206-114558-root.json [11:46:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43629 and previous config saved to /var/cache/conftool/dbconfig/20230206-114609-root.json [11:46:26] (03CR) 10Jbond: [C: 03+2] add whitespace to test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/886868 (owner: 10Jbond) [11:46:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host db1108.eqiad.wmnet [11:46:49] (03PS1) 10Jbond: Revert "add whitespace to test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/886814 [11:47:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "add whitespace to test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/886814 (owner: 10Jbond) [11:47:28] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.03555 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:47:32] !log puppetmaster[12]002 reintroduced to services [11:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:35] * jbond looking [11:47:56] I've made a right mess haven't I [11:48:26] claime: no no this was me deploying the change to the puppetmasters. 
it causes an apache restart i forgot about [11:49:18] ack [11:50:52] (03CR) 10Marostegui: "I don't think we have any alerts on alertmanager" [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:51:13] (03PS12) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:51:31] (03PS1) 10Ilias Sarantopoulos: ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 [11:53:20] (03PS13) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [11:53:34] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:54:10] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:54:17] (03CR) 10CI reject: [V: 04-1] ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (owner: 10Ilias Sarantopoulos) [11:54:25] (03PS2) 10Ilias Sarantopoulos: ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 [11:54:51] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [11:55:19] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:58:11] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single 
(exit_code=1) for host db1108.eqiad.wmnet [11:58:36] PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:58:36] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:36] PROBLEM - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:36] PROBLEM - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:36] PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:38] PROBLEM - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:58:38] PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:59:22] weird, reboot didn't downtime those? [11:59:33] all good? 
[11:59:43] it is not me [11:59:52] so I don't know [11:59:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43630 and previous config saved to /var/cache/conftool/dbconfig/20230206-115951-root.json [11:59:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43631 and previous config saved to /var/cache/conftool/dbconfig/20230206-115958-root.json [12:00:10] it was done by ben [12:00:13] btullis: ^ [12:00:25] I just saw the comment in the phab task [12:00:45] (JobUnavailable) firing: Reduced availability for job k8s-api in k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43633 and previous config saved to /var/cache/conftool/dbconfig/20230206-120103-root.json [12:01:05] I'm on it. Thanks. db1108 came back from a reboot, but the systemd services for the two database instances no longer exist. [12:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43634 and previous config saved to /var/cache/conftool/dbconfig/20230206-120114-root.json [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state could not connect Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect Btullis T304492 Issues after rebooting. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica Lag: matomo on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:50] ACKNOWLEDGEMENT - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:01:51] ACKNOWLEDGEMENT - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 Btullis T304492 Issues after rebooting. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:01:51] ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T304492 Issues after rebooting. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:08:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/886841 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [12:13:29] (03CR) 10Majavah: P:openldap Extend wmf-user schema with global account. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [12:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43635 and previous config saved to /var/cache/conftool/dbconfig/20230206-121456-root.json [12:15:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43636 and previous config saved to /var/cache/conftool/dbconfig/20230206-121503-root.json [12:15:15] (03PS5) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [12:15:23] (03CR) 10Slyngshede: P:openldap Extend wmf-user schema with global account. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [12:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43637 and previous config saved to /var/cache/conftool/dbconfig/20230206-121608-root.json [12:16:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43638 and previous config saved to /var/cache/conftool/dbconfig/20230206-121619-root.json [12:20:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:23:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST configurations) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10nfraison) [12:25:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10nfraison) [12:25:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:26:59] (03PS2) 10Jelto: sre.gitlab.upgrade: check for unknown version [cookbooks] - 10https://gerrit.wikimedia.org/r/886841 
(https://phabricator.wikimedia.org/T323569) [12:28:58] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST configurations) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43639 and previous config saved to /var/cache/conftool/dbconfig/20230206-123001-root.json [12:30:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43640 and previous config saved to /var/cache/conftool/dbconfig/20230206-123007-root.json [12:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43641 and previous config saved to /var/cache/conftool/dbconfig/20230206-123112-root.json [12:31:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43642 and previous config saved to /var/cache/conftool/dbconfig/20230206-123124-root.json [12:34:06] (03PS3) 10Ilias Sarantopoulos: ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328526) [12:35:54] (03PS6) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [12:36:19] (03CR) 10AikoChou: "I created a Phab task T328916 for this, could you add it to the patch? 
Thanks:)" [deployment-charts] - https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328526) (owner: Ilias Sarantopoulos) [12:37:36] (CR) Jelto: [C: +2] sre.gitlab.upgrade: check for unknown version [cookbooks] - https://gerrit.wikimedia.org/r/886841 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [12:38:13] (CR) Slyngshede: P:openldap Extend wmf-user schema with global account. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: Slyngshede) [12:38:25] SRE, Data-Persistence, serviceops, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Clement_Goubert) Downtime part dry-runs correctly. I will reopen if I hit issues in the live-test. [12:38:33] SRE, Data-Persistence, serviceops, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert) [12:38:40] SRE, Data-Persistence, serviceops, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Clement_Goubert) Open→Resolved [12:38:49] SRE, Datacenter-Switchover, Patch-For-Review, User-notice-archive: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (Clement_Goubert) [12:39:52] (Merged) jenkins-bot: sre.gitlab.upgrade: check for unknown version [cookbooks] - https://gerrit.wikimedia.org/r/886841 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [12:41:10] (CR) Ilias Sarantopoulos: ml-services: remove additional drafttopic deployments (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328526) (owner: Ilias Sarantopoulos) [12:41:48] RECOVERY - 
MariaDB read only matomo on db1108 is OK: Version 10.4.22-MariaDB-log, Uptime 75s, read_only: True, event_scheduler: True, 13.66 QPS, connection latency: 0.003573s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:41:55] btullis: ^ [12:42:21] Wut? [12:43:28] RECOVERY - mysqld processes on db1108 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:43:30] RECOVERY - MariaDB Replica IO: analytics_meta on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:43:30] RECOVERY - MariaDB Replica SQL: analytics_meta on db1108 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:43:34] btullis: ^ [12:43:36] RECOVERY - MariaDB read only analytics_meta on db1108 is OK: Version 10.4.22-MariaDB-log, Uptime 65s, read_only: True, event_scheduler: True, 64.55 QPS, connection latency: 0.004865s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:45:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43643 and previous config saved to /var/cache/conftool/dbconfig/20230206-124506-root.json [12:45:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43644 and previous config saved to /var/cache/conftool/dbconfig/20230206-124513-root.json [12:45:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:46:18] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db2105 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43645 and previous config saved to /var/cache/conftool/dbconfig/20230206-124617-root.json [12:46:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43646 and previous config saved to /var/cache/conftool/dbconfig/20230206-124629-root.json [12:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43647 and previous config saved to /var/cache/conftool/dbconfig/20230206-124725-root.json [12:47:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43648 and previous config saved to /var/cache/conftool/dbconfig/20230206-124730-root.json [12:47:36] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10dcaro) [12:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43649 and previous config saved to /var/cache/conftool/dbconfig/20230206-124745-root.json [12:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43650 and previous config saved to /var/cache/conftool/dbconfig/20230206-124751-root.json [12:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43651 and previous config saved to /var/cache/conftool/dbconfig/20230206-124808-root.json [12:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43652 and previous config saved to 
/var/cache/conftool/dbconfig/20230206-124814-root.json [12:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43653 and previous config saved to /var/cache/conftool/dbconfig/20230206-124841-root.json [12:49:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43654 and previous config saved to /var/cache/conftool/dbconfig/20230206-124909-root.json [12:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43655 and previous config saved to /var/cache/conftool/dbconfig/20230206-124924-root.json [12:49:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43656 and previous config saved to /var/cache/conftool/dbconfig/20230206-124937-root.json [12:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43657 and previous config saved to /var/cache/conftool/dbconfig/20230206-125017-root.json [12:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2155 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43658 and previous config saved to /var/cache/conftool/dbconfig/20230206-125025-root.json [12:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43659 and previous config saved to /var/cache/conftool/dbconfig/20230206-125029-root.json [12:50:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43660 and previous config saved to /var/cache/conftool/dbconfig/20230206-125037-root.json [12:50:42] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43661 and previous config saved to /var/cache/conftool/dbconfig/20230206-125042-root.json [12:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43662 and previous config saved to /var/cache/conftool/dbconfig/20230206-125059-root.json [12:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43663 and previous config saved to /var/cache/conftool/dbconfig/20230206-125103-root.json [12:51:44] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) I am repooling all the databases since we are going to fully depool codfw for reads. [12:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43664 and previous config saved to /var/cache/conftool/dbconfig/20230206-125228-root.json [12:55:17] (03PS4) 10Ilias Sarantopoulos: ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328916) [12:57:00] (03PS1) 10Muehlenhoff: Point DHCP server in eqiad to install1004 [homer/public] - 10https://gerrit.wikimedia.org/r/886888 (https://phabricator.wikimedia.org/T327867) [12:57:57] (03PS1) 10Muehlenhoff: Move webproxy in eqiad to install1004 [dns] - 10https://gerrit.wikimedia.org/r/886889 (https://phabricator.wikimedia.org/T327867) [12:58:40] (03CR) 10AikoChou: [C: 03+1] ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328916) (owner: 10Ilias Sarantopoulos) [13:00:46] (JobUnavailable) firing: (2) Reduced 
availability for job k8s-api in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:02:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43665 and previous config saved to /var/cache/conftool/dbconfig/20230206-130230-root.json [13:02:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43666 and previous config saved to /var/cache/conftool/dbconfig/20230206-130235-root.json [13:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43667 and previous config saved to /var/cache/conftool/dbconfig/20230206-130250-root.json [13:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43668 and previous config saved to /var/cache/conftool/dbconfig/20230206-130256-root.json [13:03:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43669 and previous config saved to /var/cache/conftool/dbconfig/20230206-130313-root.json [13:03:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43670 and previous config saved to /var/cache/conftool/dbconfig/20230206-130319-root.json [13:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43671 and previous config saved to /var/cache/conftool/dbconfig/20230206-130346-root.json [13:04:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 
10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43672 and previous config saved to /var/cache/conftool/dbconfig/20230206-130414-root.json [13:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43673 and previous config saved to /var/cache/conftool/dbconfig/20230206-130429-root.json [13:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43674 and previous config saved to /var/cache/conftool/dbconfig/20230206-130442-root.json [13:05:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43675 and previous config saved to /var/cache/conftool/dbconfig/20230206-130521-root.json [13:05:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43676 and previous config saved to /var/cache/conftool/dbconfig/20230206-130530-root.json [13:05:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43677 and previous config saved to /var/cache/conftool/dbconfig/20230206-130534-root.json [13:05:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43678 and previous config saved to /var/cache/conftool/dbconfig/20230206-130542-root.json [13:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43679 and previous config saved to /var/cache/conftool/dbconfig/20230206-130547-root.json [13:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43680 and previous config 
saved to /var/cache/conftool/dbconfig/20230206-130603-root.json [13:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43681 and previous config saved to /var/cache/conftool/dbconfig/20230206-130608-root.json [13:06:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2329 is CRITICAL: etcd last index (2370987) is outdated compared to the master one (2370999) https://wikitech.wikimedia.org/wiki/Etcd [13:06:10] PROBLEM - MediaWiki EtcdConfig up-to-date on parse2007 is CRITICAL: etcd last index (2370993) is outdated compared to the master one (2370999) https://wikitech.wikimedia.org/wiki/Etcd [13:07:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43682 and previous config saved to /var/cache/conftool/dbconfig/20230206-130733-root.json [13:07:58] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2329 is OK: etcd last index (2371005) matches the master one (2371005) https://wikitech.wikimedia.org/wiki/Etcd [13:07:58] RECOVERY - MediaWiki EtcdConfig up-to-date on parse2007 is OK: etcd last index (2371005) matches the master one (2371005) https://wikitech.wikimedia.org/wiki/Etcd [13:12:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/886888 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:15:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:16:11] (03CR) 10Muehlenhoff: [C: 04-1] P:openldap Extend wmf-user schema with global account. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [13:17:08] (03CR) 10Klausman: [C: 03+2] ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328916) (owner: 10Ilias Sarantopoulos) [13:17:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43683 and previous config saved to /var/cache/conftool/dbconfig/20230206-131734-root.json [13:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43684 and previous config saved to /var/cache/conftool/dbconfig/20230206-131740-root.json [13:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43685 and previous config saved to /var/cache/conftool/dbconfig/20230206-131755-root.json [13:18:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43686 and previous config saved to /var/cache/conftool/dbconfig/20230206-131801-root.json [13:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43687 and previous config saved to /var/cache/conftool/dbconfig/20230206-131818-root.json [13:18:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43688 and previous config saved to /var/cache/conftool/dbconfig/20230206-131824-root.json [13:18:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43689 and previous config saved to 
/var/cache/conftool/dbconfig/20230206-131851-root.json [13:19:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43690 and previous config saved to /var/cache/conftool/dbconfig/20230206-131918-root.json [13:19:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43691 and previous config saved to /var/cache/conftool/dbconfig/20230206-131934-root.json [13:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43692 and previous config saved to /var/cache/conftool/dbconfig/20230206-131947-root.json [13:19:51] jouncebot: nowandnext [13:19:52] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [13:19:52] In 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1400) [13:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43693 and previous config saved to /var/cache/conftool/dbconfig/20230206-132026-root.json [13:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43694 and previous config saved to /var/cache/conftool/dbconfig/20230206-132035-root.json [13:20:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43695 and previous config saved to /var/cache/conftool/dbconfig/20230206-132039-root.json [13:20:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43696 and previous config saved to /var/cache/conftool/dbconfig/20230206-132047-root.json [13:20:52] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43697 and previous config saved to /var/cache/conftool/dbconfig/20230206-132051-root.json [13:21:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43698 and previous config saved to /var/cache/conftool/dbconfig/20230206-132108-root.json [13:21:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43699 and previous config saved to /var/cache/conftool/dbconfig/20230206-132113-root.json [13:22:18] (03Merged) 10jenkins-bot: ml-services: remove additional drafttopic deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886869 (https://phabricator.wikimedia.org/T328916) (owner: 10Ilias Sarantopoulos) [13:22:22] (03PS1) 10Nicolas Fraison: admin: create nfraison user and add it to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/886890 [13:22:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43700 and previous config saved to /var/cache/conftool/dbconfig/20230206-132238-root.json [13:23:01] (03CR) 10CI reject: [V: 04-1] admin: create nfraison user and add it to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/886890 (owner: 10Nicolas Fraison) [13:23:10] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info/home (redirect to the home page) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond bad request for an unsupported format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:23:55] (03PS1) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 
https://gerrit.wikimedia.org/r/886891 [13:23:55] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:24:31] (CR) CI reject: [V: -1] Add nfraison to ops group [puppet] - https://gerrit.wikimedia.org/r/886891 (owner: Nicolas Fraison) [13:24:36] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [13:26:04] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:26:18] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:26:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:29:05] (CR) Muehlenhoff: Add nfraison to ops group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886891 (owner: Nicolas Fraison) [13:29:10] (CR) Vgutierrez: [C: +2] aptrepo: Add missing Suite for ceph-quincy [puppet] - https://gerrit.wikimedia.org/r/886842 (https://phabricator.wikimedia.org/T326945) (owner: Vgutierrez) [13:29:38] jbond: ok to merge f2d9f2c9b2? 
[13:31:18] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005008 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:31:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:32:23] (03CR) 10Alexandros Kosiaris: DNM: Showcase row-level mesh in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [13:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43701 and previous config saved to /var/cache/conftool/dbconfig/20230206-133239-root.json [13:32:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43702 and previous config saved to /var/cache/conftool/dbconfig/20230206-133247-root.json [13:33:00] (03CR) 10Alexandros Kosiaris: DNM: Showcase row-level mesh in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [13:33:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43703 and previous config saved to /var/cache/conftool/dbconfig/20230206-133300-root.json [13:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43704 and previous config saved to /var/cache/conftool/dbconfig/20230206-133306-root.json 
[13:33:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43705 and previous config saved to /var/cache/conftool/dbconfig/20230206-133323-root.json [13:33:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43706 and previous config saved to /var/cache/conftool/dbconfig/20230206-133329-root.json [13:33:44] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:33:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43707 and previous config saved to /var/cache/conftool/dbconfig/20230206-133356-root.json [13:34:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43708 and previous config saved to /var/cache/conftool/dbconfig/20230206-133423-root.json [13:34:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43709 and previous config saved to /var/cache/conftool/dbconfig/20230206-133439-root.json [13:34:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43710 and previous config saved to /var/cache/conftool/dbconfig/20230206-133451-root.json [13:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43711 and previous config saved to /var/cache/conftool/dbconfig/20230206-133531-root.json [13:35:34] !log add confd to bookworm repos [13:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:40] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db2155 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43712 and previous config saved to /var/cache/conftool/dbconfig/20230206-133540-root.json [13:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43713 and previous config saved to /var/cache/conftool/dbconfig/20230206-133544-root.json [13:35:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43714 and previous config saved to /var/cache/conftool/dbconfig/20230206-133552-root.json [13:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43715 and previous config saved to /var/cache/conftool/dbconfig/20230206-133556-root.json [13:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43716 and previous config saved to /var/cache/conftool/dbconfig/20230206-133613-root.json [13:36:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43717 and previous config saved to /var/cache/conftool/dbconfig/20230206-133618-root.json [13:36:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:37:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10Ottomata) Approved! 
We might need some other SRE access too, but not sure if that belongs in a different ticket or not. [13:37:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43718 and previous config saved to /var/cache/conftool/dbconfig/20230206-133743-root.json [13:37:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:41:18] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39398/console" [puppet] - https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche) [13:41:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:42:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (fgiunchedi) [13:42:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:43:57] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - 
https://phabricator.wikimedia.org/T328733 (10fgiunchedi) Just to highlight the fact that @AKhatun_WMF had access previously, I'm going to be bold here and assume approvals also carried over. I'll submit a patch fo... [13:44:09] (03PS3) 10Sharvaniharan: New config entries for migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881918 (https://phabricator.wikimedia.org/T324167) [13:45:04] (03PS4) 10Sharvaniharan: New config entries for migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881918 (https://phabricator.wikimedia.org/T324167) [13:45:22] (03PS2) 10Nicolas Fraison: admin: create nfraison user and add it to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/886890 [13:45:24] (03PS2) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 [13:47:35] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) So currently we can't take down all the osds on rack C8 (14), as we don't have enough space to allocate t... 
[13:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43719 and previous config saved to /var/cache/conftool/dbconfig/20230206-134744-root.json [13:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43720 and previous config saved to /var/cache/conftool/dbconfig/20230206-134752-root.json [13:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43721 and previous config saved to /var/cache/conftool/dbconfig/20230206-134805-root.json [13:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43722 and previous config saved to /var/cache/conftool/dbconfig/20230206-134811-root.json [13:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43723 and previous config saved to /var/cache/conftool/dbconfig/20230206-134828-root.json [13:48:31] (03PS1) 10Filippo Giunchedi: admin: add access for Aisha Khatun [puppet] - 10https://gerrit.wikimedia.org/r/886894 (https://phabricator.wikimedia.org/T328733) [13:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43724 and previous config saved to /var/cache/conftool/dbconfig/20230206-134833-root.json [13:49:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43725 and previous config saved to /var/cache/conftool/dbconfig/20230206-134901-root.json [13:49:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43726 and 
previous config saved to /var/cache/conftool/dbconfig/20230206-134928-root.json [13:49:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) [13:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43727 and previous config saved to /var/cache/conftool/dbconfig/20230206-134944-root.json [13:49:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43728 and previous config saved to /var/cache/conftool/dbconfig/20230206-134956-root.json [13:50:03] (03PS7) 10Slyngshede: P:openldap Extend wmf-user schema with global account. [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) [13:50:12] (03CR) 10Slyngshede: P:openldap Extend wmf-user schema with global account. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [13:50:34] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AKhatun - https://phabricator.wikimedia.org/T328734 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi {{done}} thank you @Dzahn ! 
[13:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43729 and previous config saved to /var/cache/conftool/dbconfig/20230206-135036-root.json [13:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43730 and previous config saved to /var/cache/conftool/dbconfig/20230206-135044-root.json [13:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43731 and previous config saved to /var/cache/conftool/dbconfig/20230206-135049-root.json [13:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43732 and previous config saved to /var/cache/conftool/dbconfig/20230206-135057-root.json [13:51:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43733 and previous config saved to /var/cache/conftool/dbconfig/20230206-135101-root.json [13:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43734 and previous config saved to /var/cache/conftool/dbconfig/20230206-135118-root.json [13:51:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43735 and previous config saved to /var/cache/conftool/dbconfig/20230206-135122-root.json [13:51:26] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, related change needs some more discussion. One nit in-line." 
[puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:51:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2365 is CRITICAL: etcd last index (2371317) is outdated compared to the master one (2371323) https://wikitech.wikimedia.org/wiki/Etcd [13:52:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) a:03BTullis I'll add Nicolas' account as I'm working closely with him. [13:52:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43736 and previous config saved to /var/cache/conftool/dbconfig/20230206-135248-root.json [13:52:55] btullis: ^ cheers! happy to help/review/etc as part of clinic duty [13:53:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) [13:53:18] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2365 is OK: etcd last index (2371329) matches the master one (2371329) https://wikitech.wikimedia.org/wiki/Etcd [13:54:47] (03CR) 10Btullis: "Nicolas, you can add yourself to the ops group as well in this CR, or you can do it with a subsequent patch." [puppet] - 10https://gerrit.wikimedia.org/r/886890 (owner: 10Nicolas Fraison) [13:55:22] godog: Thanks. 
Have added you as a reviewer to https://gerrit.wikimedia.org/r/c/operations/puppet/+/886890 [13:55:59] (03CR) 10Jelto: [V: 03+1 C: 03+2] jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:56:32] (03PS12) 10Jelto: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:56:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3300 [13:56:57] btullis: *nod* will take a look [13:57:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3300 [13:58:01] (03PS1) 10Jbond: postgresql::user: add documentation and fix minor lint errors [puppet] - 10https://gerrit.wikimedia.org/r/886895 (https://phabricator.wikimedia.org/T321783) [13:58:41] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39399/console" [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1400). [14:00:04] sharvani_ : A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] i can deploy today [14:00:22] sharvani_: hi! are you here? [14:00:44] here! 
[14:00:58] awesome [14:01:04] Thank you :) [14:01:10] (03CR) 10Urbanecm: [C: 03+2] New config entries for migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881918 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [14:01:14] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline re: commit message" [puppet] - 10https://gerrit.wikimedia.org/r/886890 (owner: 10Nicolas Fraison) [14:01:34] sharvani_: can this patch be tested in some way, please? [14:01:53] (03Merged) 10jenkins-bot: New config entries for migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881918 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [14:01:55] yes i can test it [14:01:58] okay [14:02:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:881918|New config entries for migrated android schemas (T324167)]] [14:02:23] T324167: Convert remaining Android app eventlogging schemas to MEP. - https://phabricator.wikimedia.org/T324167 [14:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43737 and previous config saved to /var/cache/conftool/dbconfig/20230206-140249-root.json [14:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43738 and previous config saved to /var/cache/conftool/dbconfig/20230206-140257-root.json [14:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43739 and previous config saved to /var/cache/conftool/dbconfig/20230206-140310-root.json [14:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43740 and previous config saved to /var/cache/conftool/dbconfig/20230206-140316-root.json [14:03:33] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43741 and previous config saved to /var/cache/conftool/dbconfig/20230206-140333-root.json [14:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43742 and previous config saved to /var/cache/conftool/dbconfig/20230206-140338-root.json [14:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43743 and previous config saved to /var/cache/conftool/dbconfig/20230206-140405-root.json [14:04:07] !log urbanecm@deploy1002 urbanecm and sharvaniharan: Backport for [[gerrit:881918|New config entries for migrated android schemas (T324167)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:04:19] sharvani_: your patch is at mwdebug1001. can you test it there please? [14:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43744 and previous config saved to /var/cache/conftool/dbconfig/20230206-140433-root.json [14:04:37] tested... looking good! thank you. [14:04:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43745 and previous config saved to /var/cache/conftool/dbconfig/20230206-140449-root.json [14:05:01] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@b798462] (releasing): (no justification provided) [14:05:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43746 and previous config saved to /var/cache/conftool/dbconfig/20230206-140501-root.json [14:05:27] great, syncing! 
[14:05:35] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@b798462] (releasing): (no justification provided) (duration: 00m 33s) [14:05:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43747 and previous config saved to /var/cache/conftool/dbconfig/20230206-140541-root.json [14:05:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43748 and previous config saved to /var/cache/conftool/dbconfig/20230206-140549-root.json [14:05:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43749 and previous config saved to /var/cache/conftool/dbconfig/20230206-140554-root.json [14:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43750 and previous config saved to /var/cache/conftool/dbconfig/20230206-140602-root.json [14:06:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43751 and previous config saved to /var/cache/conftool/dbconfig/20230206-140606-root.json [14:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43752 and previous config saved to /var/cache/conftool/dbconfig/20230206-140623-root.json [14:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43753 and previous config saved to /var/cache/conftool/dbconfig/20230206-140627-root.json [14:07:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43754 and previous config saved to 
/var/cache/conftool/dbconfig/20230206-140753-root.json [14:09:08] (03CR) 10Ayounsi: DNM: Showcase row-level mesh in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) (owner: 10Alexandros Kosiaris) [14:09:25] !log fetch HAProxy 2.4.21 for buster and bullseye (apt.wm.o) [14:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:49] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ssingh) [14:10:11] (03CR) 10Ayounsi: [C: 03+1] "You will have to run homer on mr1-eqiad as well as the core routers for mgmt provisioning automation." [homer/public] - 10https://gerrit.wikimedia.org/r/886888 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [14:10:24] (03PS11) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) [14:10:52] Thank you Martin! I will be signing off. [14:11:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:881918|New config entries for migrated android schemas (T324167)]] (duration: 09m 19s) [14:11:41] (03PS1) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [14:11:42] T324167: Convert remaining Android app eventlogging schemas to MEP. - https://phabricator.wikimedia.org/T324167 [14:12:14] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) omething and so just writing it here: - The PXE boot worked fine for us in cases with the old firmware as well; the DHCP in d-i failed and t... 
[14:12:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886894 (https://phabricator.wikimedia.org/T328733) (owner: 10Filippo Giunchedi) [14:12:32] sharvani_: no problem. it should be live now. [14:12:52] (03CR) 10Jbond: [C: 03+2] postgresql::user: add documentation and fix minor lint errors [puppet] - 10https://gerrit.wikimedia.org/r/886895 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:12:59] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add access for Aisha Khatun [puppet] - 10https://gerrit.wikimedia.org/r/886894 (https://phabricator.wikimedia.org/T328733) (owner: 10Filippo Giunchedi) [14:13:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886799 (https://phabricator.wikimedia.org/T148048) (owner: 10Slyngshede) [14:13:17] !log testing HAProxy 2.4.21 in cp4052 and cp4044 [14:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:24] jbond: merging your change too [14:16:34] (03CR) 10David Caro: [C: 04-1] replica_cnf_web: add functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [14:17:11] godog: thanks [14:18:07] (03PS50) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [14:18:29] (03PS3) 10Nicolas Fraison: admin: create nfraison user and add it to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/886890 (https://phabricator.wikimedia.org/T328915) [14:18:31] (03PS3) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915) [14:20:02] (03PS4) 10Btullis: admin: create nfraison user and add it to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/886890 
(https://phabricator.wikimedia.org/T328915) (owner: 10Nicolas Fraison) [14:20:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:42] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:20:45] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:21:01] (03CR) 10Btullis: [C: 03+2] admin: create nfraison user and add it to analytics_privatedata_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886890 (https://phabricator.wikimedia.org/T328915) (owner: 10Nicolas Fraison) [14:22:34] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:25:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) I have merged the patch to `data.yaml` so the `nfraison` account will be created on all servers now, with `ops` and `analytics-privatedata-users... [14:27:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (10fgiunchedi) I've created the Kerberos principal, @AKhatun_WMF you should have the temporary password in your inbox. I believe all pieces are in... 
[14:28:10] (03CR) 10Filippo Giunchedi: [C: 03+1] configmaster: Cleanup disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/886839 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [14:29:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) I have added `nfraison` to the `cn=wmf` `cn=ops` LDAP groups. ` btullis@mwmaint1002:~$ ldapsearch -x member=uid=nfraison,ou=people,dc=wikimedia,... [14:29:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) [14:30:50] (03PS1) 10Andrew Bogott: puppet-enc.py: remove a newline to make Black happy [puppet] - 10https://gerrit.wikimedia.org/r/886898 [14:31:58] (03PS1) 10Marostegui: Revert "dbproxy1016,dbproxy1020: Place db1164 as secondary" [puppet] - 10https://gerrit.wikimedia.org/r/886822 [14:32:28] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1016,dbproxy1020: Place db1164 as secondary" [puppet] - 10https://gerrit.wikimedia.org/r/886822 (owner: 10Marostegui) [14:32:52] (03PS14) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [14:34:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10BTullis) I have created Nicolas' kerberos principal. ` btullis@krb1001:~$ sudo manage_principals.py get nfraison get_principal: Principal does not exist... 
[14:37:16] (03CR) 10Jbond: [C: 03+1] puppet-enc.py: remove a newline to make Black happy [puppet] - 10https://gerrit.wikimedia.org/r/886898 (owner: 10Andrew Bogott) [14:37:43] !log installing imagemagick security updates on buster [14:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:33] (03PS2) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [14:39:42] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MatthewVernon) If we're "just" depooling codfw it's worth noting we will still need to depool the affected ms-fe* nodes (since mw always tries to write t... [14:45:16] (03PS3) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [14:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:18] (03PS1) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [14:48:23] (03PS2) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [14:49:01] (03PS15) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [14:49:05] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:49:08] (03PS1) 10Marostegui: mariadb: Promote db1164 to m3 
master [puppet] - 10https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) [14:50:35] (03CR) 10CI reject: [V: 04-1] Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:42] (03PS2) 10Marostegui: mariadb: Promote db1164 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) [14:51:48] (03CR) 10Clément Goubert: [C: 03+2] configmaster: Cleanup disc_desired_state [puppet] - 10https://gerrit.wikimedia.org/r/886839 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [14:52:33] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [14:53:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10Ottomata) Oh! Approved. Sorry this one slipped by my notice somehow. Thanks for the extra ping. 
[14:54:52] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:55:30] (03PS4) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [14:55:31] (03PS3) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [14:55:35] (03PS4) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915) [14:57:04] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [14:57:10] RECOVERY - IPMI Sensor Status on mw2326 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:57] (03CR) 10CDanis: [C: 03+1] "+10 to what Filippo said. Thank you!!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/883151 (https://phabricator.wikimedia.org/T239862) (owner: 10Clément Goubert) [15:00:47] (03PS16) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [15:05:23] (03PS5) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [15:05:24] (03PS4) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [15:08:09] (03PS1) 10Daniel Kinzler: Bump parsoid parser cache writes to 50%. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) [15:10:52] !log rolling upgrade to HAProxy 2.4.21 in ulsfo cp nodes [15:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:26] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:17:44] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:19:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) [15:19:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) [15:23:02] (03PS5) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [15:27:00] (03CR) 10Clément Goubert: [C: 03+2] P:monitoring: Absent hardcoded statsd host entry [puppet] - 
https://gerrit.wikimedia.org/r/883151 (https://phabricator.wikimedia.org/T239862) (owner: Clément Goubert) [15:28:12] (PS17) Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) [15:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:36] rzl, topranks, volans, heads up, merged removal of the hardcoded statsd entry, I'm watching the DNS dashboards [15:31:43] (CR) Elukey: "Tested on my local env on cumin1001 with dry-run, everything seems working nicely." [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [15:31:59] claime: ack [15:32:09] do you need a hand for anything? [15:32:30] I'm watching number of questions and slow answers on the dash [15:32:42] Do you see anything else I should be looking out for? [15:33:20] (PS1) Urbanecm: Add a new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/886910 (https://phabricator.wikimedia.org/T328929) [15:33:29] but now you just merged the removal of the check? [15:33:47] since when is mediawiki using the IP?
[15:35:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:28] volans: I removed a /etc/hosts entry for statsd, mediawiki uses the IP since https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661732 (merged 2 years ago) [15:37:56] got it [15:38:37] (PS1) Jelto: jenkins: fix directory and restrict sudo rules to jenkins jars [puppet] - https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) [15:38:41] Everything looks peachy for now [15:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:39:14] ack good to hear :) [15:41:13] (PS6) Jbond: postgres::user: add hostname support to postgres user define [puppet] - https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [15:41:15] (PS1) Jbond: postgrers::user::hba: drop hba_label and use title instead [puppet] - https://gerrit.wikimedia.org/r/886912 (https://phabricator.wikimedia.org/T321783) [15:41:40] (CR) CI reject: [V: -1] postgrers::user::hba: drop hba_label and use title instead [puppet] - https://gerrit.wikimedia.org/r/886912 (https://phabricator.wikimedia.org/T321783) (owner: Jbond) [15:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:44:51] SRE, Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603
(10SLyngshede-WMF) [15:44:53] 10SRE, 10Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (10SLyngshede-WMF) 05Open→03In progress [15:45:10] (03PS2) 10Jbond: postgrers::user::hba: drop hba_label and use title instead [puppet] - 10https://gerrit.wikimedia.org/r/886912 (https://phabricator.wikimedia.org/T321783) [15:45:12] (03PS7) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [15:45:14] (03PS6) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [15:45:52] 10SRE, 10Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (10SLyngshede-WMF) 05In progress→03Open [15:45:55] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [15:46:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39406/console" [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:46:41] 10SRE, 10Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (10SLyngshede-WMF) 05Open→03In progress [15:46:43] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [15:48:48] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:49:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics cluster for Nicolas Fraison - https://phabricator.wikimedia.org/T328915 (10nfraison) @BTullis I confirm that I have well received the instruction for kerberos and have been able to kinit from one of the hadoop client [15:52:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [15:52:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, although we could just as well directly remove the user/group and save one removal cycle, it's fine for unused system users to" [puppet] - 10https://gerrit.wikimedia.org/r/886478 (https://phabricator.wikimedia.org/T111899) (owner: 10Majavah) [15:53:07] (03PS1) 10AOkoth: vrts: enable/disable daemon depending on active host [puppet] - 10https://gerrit.wikimedia.org/r/886914 (https://phabricator.wikimedia.org/T323515) [15:53:19] (03PS8) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [15:53:39] (03CR) 10JMeybohm: Add sre.k8s.upgrade-cluster (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:57:14] (03PS2) 10AOkoth: vrts: enable/disable daemon depending on active host [puppet] - 10https://gerrit.wikimedia.org/r/886914 (https://phabricator.wikimedia.org/T323515) [15:58:32] (03PS9) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [15:58:34] (03PS7) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 
10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [15:58:41] (03PS2) 10Alexandros Kosiaris: DNM: Showcase row-level mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/886321 (https://phabricator.wikimedia.org/T328523) [15:59:07] (03CR) 10Dzahn: [C: 03+1] sre.gitlab.upgrade: check for unknown version [cookbooks] - 10https://gerrit.wikimedia.org/r/886841 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:59:12] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/886914/39410/" [puppet] - 10https://gerrit.wikimedia.org/r/886914 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [16:02:01] 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Htriedman) @Vgutierrez thanks so much! taking a look now [16:03:23] (03PS1) 10Elukey: ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) [16:10:24] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:11:27] (03PS1) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [16:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:18] (03PS18) 10Elukey: Add sre.k8s.upgrade-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 
(https://phabricator.wikimedia.org/T327767) [16:16:22] (03CR) 10Elukey: Add sre.k8s.upgrade-cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:42] (03PS2) 10Elukey: ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) [16:20:44] (03PS2) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [16:20:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:40] (03PS4) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [16:30:05] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1630). [16:41:07] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:42:13] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice work!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:42:48] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:45:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911 (10Clement_Goubert) [16:49:09] (03CR) 10Elukey: ml-services: add autoscaling to en/wikidata for goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:51:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [16:53:34] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:53:39] (03CR) 10BCornwall: [C: 03+1] "😭" [puppet] - 10https://gerrit.wikimedia.org/r/886840 (https://phabricator.wikimedia.org/T262996) (owner: 10Vgutierrez) [16:55:16] (03CR) 10Klausman: [C: 03+1] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:55:55] (03CR) 10Klausman: ml-services: add autoscaling to en/wikidata for goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:55:57] (03CR) 10BCornwall: [C: 03+1] mediawiki: drop pybal-check user [puppet] - 10https://gerrit.wikimedia.org/r/886478 
(https://phabricator.wikimedia.org/T111899) (owner: 10Majavah) [16:57:19] (03PS3) 10Elukey: ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) [16:57:22] (03PS3) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [16:57:35] (03CR) 10Elukey: ml-services: add autoscaling to en/wikidata for goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [17:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:05] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Jhancock.wm) @Papaul I've swapped CPU1 and CPU2. The problem is still on CPU2. CPU 2 MEM345 VPP voltage is outside of range. MON 06 Feb 2023 16:58:31 The chassis is closed while the power is off. MON... [17:11:27] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Papaul) @Jhancock.wm please remove CPU2 and try to boot the server without CPU2 [17:16:29] 10SRE: followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10CDanis) [17:21:57] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jbond) [17:23:48] (03CR) 10JMeybohm: [C: 03+1] "👏 Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:25:54] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:32:04] (03CR) 10Jaime Nuche: [C: 04-1] jenkins: fix directory and restrict sudo rules to jenkins jars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [17:41:21] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) Task follow-up: * Tech news announcement: https://meta.wikimedia.org/w/index.php?title=Tech/News/2023/0... [17:41:46] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) 05Open→03In progress [17:41:48] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF) [17:42:20] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) Apart from multi-DC, the other possibly notable thing is that a Gitlab switchover will also be per... 
[17:45:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:29] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Krinkle) [17:54:56] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (10kostajh) >>! In T324675#858... [17:56:03] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (10kostajh) >>! In T324675#856... [17:56:45] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (10kostajh) [17:59:44] (03PS5) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1800) [18:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T1800). [18:08:42] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Jhancock.wm) After pulling out CPU2 the server boots. We replaced CPU2 and and pulled all of the DIMM for CPU2. The server boots like this. Tested each DIMM to find the one that is causing the server to not boot.... [18:09:56] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) > I guess the main question is for the hosts in eqsin that failed where you restarted the cookbook, was there a firmware upgrade in betwee... [18:30:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:14] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10BBlack) 05Open→03Resolved a:03BBlack [18:34:07] (03PS7) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [18:35:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:40] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Papaul) @Jhancock.wm thank you I will request a replacement DIMM from Dell. 
[18:39:54] (03PS2) 10Andrew Bogott: puppet-enc.py: remove a newline to make Black happy [puppet] - 10https://gerrit.wikimedia.org/r/886898 [18:42:16] (03CR) 10Andrew Bogott: [C: 03+2] Bump pontoon default openstack version to 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/886471 (owner: 10Andrew Bogott) [18:48:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:48:16] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:48:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:50:27] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mw2420 [18:50:38] (03CR) 10Milimetric: [C: 03+1] "late, but just signaling that we checked downstream and this change (or its possible future re-reverting) is safe for the rest of the webr" [puppet] - 10https://gerrit.wikimedia.org/r/886119 (https://phabricator.wikimedia.org/T257893) (owner: 10Phuedx) [18:51:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw2420 [18:51:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2420 DNS - pt1979@cumin2002" [18:52:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2420 DNS - pt1979@cumin2002" [18:52:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2420.mgmt.codfw.wmnet with reboot policy FORCED [18:53:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [18:56:43] (03PS2) 10Andrew Bogott: Remove unused files for OpenStack version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/886472 [18:57:13] (03CR) 10CI reject: [V: 04-1] Remove unused files for 
OpenStack version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/886472 (owner: 10Andrew Bogott) [19:06:17] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Aklapper) [19:08:29] (03PS3) 10Andrew Bogott: Remove unused files for OpenStack version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/886472 [19:09:33] (03CR) 10Andrew Bogott: [C: 03+2] Remove unused files for OpenStack version 'xena' [puppet] - 10https://gerrit.wikimedia.org/r/886472 (owner: 10Andrew Bogott) [19:14:28] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) It is worth communicating anything that disturbs one's habits. :) Better safe than sorry! [19:15:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:32] jouncebot: nowandnext [19:16:32] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [19:16:32] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T2100) [19:17:02] !log urbanecm@deploy1002 backport aborted: (duration: 00m 01s) [19:17:06] (03CR) 10Urbanecm: [C: 03+2] Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886910 (https://phabricator.wikimedia.org/T328929) (owner: 10Urbanecm) [19:17:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886910 (https://phabricator.wikimedia.org/T328929) (owner: 10Urbanecm) [19:17:49] (03Merged) 10jenkins-bot: Add a new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886910 (https://phabricator.wikimedia.org/T328929) 
(owner: 10Urbanecm) [19:18:06] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886910|Add a new throttle rule (T328929)]] [19:18:10] T328929: Request a throttle lift for a Czech course for students – 2023-02-07, 2023-02-08 - https://phabricator.wikimedia.org/T328929 [19:20:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:50] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886910|Add a new throttle rule (T328929)]] (duration: 07m 43s) [19:25:53] T328929: Request a throttle lift for a Czech course for students – 2023-02-07, 2023-02-08 - https://phabricator.wikimedia.org/T328929 [19:25:54] * urbanecm done [19:27:35] !log zabe@deploy1002 backport aborted: (duration: 00m 23s) [19:28:03] ? (I just tried out scap backport --list) [19:28:24] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again `00450: FAILED: stashfailed: An unknown error occurred in storage backend "local-swift-codfw".` while uploading https://co... [19:29:06] PROBLEM - Check systemd state on mw2350 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:40] !log [urbanecm@mwmaint1002 ~]$ mwscript resetAuthenticationThrottle.php --wiki=metawiki --signup --ip 92.62.231.190 # T328929 [19:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:43] zabe: what does that one do? 
[19:30:11] !log zabe@deploy1002 backport aborted: (duration: 00m 00s) [19:30:24] "list the available backports" [19:30:37] i mean what does it show :D [19:31:12] https://phabricator.wikimedia.org/P43755 [19:31:55] ty [19:32:16] !log zabe@deploy1002 say aborted: (duration: 00m 39s) [19:33:00] scap should not log any command just because it is being aborted, no ones cares that I ran scap say [19:33:53] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) On https://commons.wikimedia.org/wiki/File:Muzumdar_-_Gandhi_versus_the_Empire.pdf `00546: FAILED: internal_api_error_UploadChunk... [19:35:39] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) This is a really big problem. I have to upload each book 2 or 3 times... :(((( [19:39:37] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) [19:40:14] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) [19:44:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2420.mgmt.codfw.wmnet with reboot policy FORCED [19:45:49] 10SRE-swift-storage, 10Commons: stashfailed: An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T328905 (10Aklapper) Adding #SRE-swift-storage as this is about `swift` per error message (please add such project tags if possible - thanks). 
Also see {T328872} [19:49:40] Is someone looking into codfw swift issues, like https://phabricator.wikimedia.org/T328872 / https://phabricator.wikimedia.org/T328905 ? TIA [19:51:16] (03PS1) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [19:51:36] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [19:52:21] Emperor: see Andre’s message [20:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:45] (03PS4) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [20:03:01] (03PS2) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [20:03:21] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:04:40] 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) p:05Triage→03High [20:05:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:07] (03PS3) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [20:07:11] 
10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) [20:07:46] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:08:00] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:08:04] (03PS51) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [20:08:05] 10SRE-swift-storage, 10Commons: stashfailed: An unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T328905 (10Zabe) [20:08:30] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Zabe) [20:10:24] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:11:32] (03PS4) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [20:11:53] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:13:20] (03PS5) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [20:13:22] (03PS3) 10Raymond Ndibe: puppet-enc.py: remove a newline to make Black 
happy [puppet] - 10https://gerrit.wikimedia.org/r/886898 (owner: 10Andrew Bogott) [20:13:41] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:14:13] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (10bking) Thanks Alex! Closing out now... [20:14:22] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (10bking) 05Open→03Resolved [20:14:45] 10SRE-swift-storage, 10Commons: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Mike_Peel) Copying over my comment from the task that was merged in (I didn't see this one, wasn't expecting Yann to have created two on... [20:15:25] (03PS6) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [20:17:19] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:36:56] (03PS12) 10Raymond Ndibe: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [20:40:06] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:41:49] (03PS52) 10Raymond Ndibe: wmcs: changes to api 
service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [20:41:51] (03PS13) 10Raymond Ndibe: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [20:44:29] (03CR) 10jenkins-bot: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:45] (03CR) 10Raymond Ndibe: replica_cnf_web: add functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [20:49:21] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Aklapper) [20:50:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:23] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again `00075: FAILED: internal_api_error_UploadChunkFileException: [684a57fd-4431-45dd-9851-d7864... [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and tsepoThoabala : #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T2100). 
[21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:00:20] I agree! :) [21:01:23] A quiet evening for you then TheresNoTime [21:01:34] Well night, it’s 9pm if you’re back here [21:02:14] (03CR) 10Ottomata: [C: 03+1] Create scap deployment source for search airflow v2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [21:02:30] I am back in the UK ^^ finally over the jetlag I think! [21:03:17] TheresNoTime: nice [21:03:47] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Aklapper) Thanks, it's not needed to add more examples. [21:04:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:04:44] TheresNoTime with/without jetlag :P [21:05:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:09:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:10:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:20] RECOVERY - Check systemd state on grafana1002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:36] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:00:04] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230206T2200) [22:00:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2420.mgmt.codfw.wmnet with reboot policy FORCED [22:01:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:05:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2420.mgmt.codfw.wmnet with reboot policy FORCED [22:05:33] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2421 DNS - pt1979@cumin2002" [22:06:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2421 DNS - pt1979@cumin2002" [22:06:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:50] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mw2421 [22:08:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw2421 [22:09:18] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Enable leveling up features on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886342 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [22:09:22] (03PS1) 10Aklapper: Fix interwiki prefix for generic wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) [22:10:03] (03Merged) 10jenkins-bot: GrowthExperiments: Enable 
leveling up features on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886342 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [22:10:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2421.mgmt.codfw.wmnet with reboot policy FORCED [22:11:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [22:17:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2421.mgmt.codfw.wmnet with reboot policy FORCED [22:19:29] (03CR) 10Zabe: [C: 03+1] Fix interwiki prefix for generic wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886949 (https://phabricator.wikimedia.org/T327575) (owner: 10Aklapper) [22:27:54] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:42:19] !log bking@cumin2002 banning Elastic nodes from cluster in preparation for T327925 [22:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:23] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [22:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:27] !log T327925 Banned `elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076]` on codfw elastic [22:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:30] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [22:50:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: 
The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: switch upgrade [22:51:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: switch upgrade [22:51:53] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0e96453-af13-467f-a75e-ebd1c4122a32) set by bking@cumin2002 for 1 day, 0:00:00 on 13 ho... [22:55:13] !log T327925 Depooled codfw wdqs hosts: `ryankemper@cumin2002:~$ sudo -E cumin -b 3 'wdqs[2003-2004,2009]*' 'sudo depool'` [22:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:17] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [23:01:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2421.mgmt.codfw.wmnet with reboot policy FORCED [23:03:56] 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Dzahn) I can't really confirm the issue. For me opening open https://upload.wikimedia.org/wikipedia/commo... [23:12:25] 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) Ok, pulling some peak numbers off the idrac https interface which includes weekly graphing and peaks. cp4037... 
[23:13:20] (03CR) 10Dzahn: [C: 03+1] "has group approval on ticket now, can be merged, imho" [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron) [23:17:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2421.mgmt.codfw.wmnet with reboot policy FORCED [23:18:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [23:21:48] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Jclark-ctr) Confirmed: Service Request 161724698 was successfully submitted. [23:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [23:35:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:38] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:54:04] (03CR) 10Kevin Bazira: [C: 03+1] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)