[00:30:35] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T0100) [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:57] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:04:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:06:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.10 [core] (wmf/1.39.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/788450 [02:07:35] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.10 [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788450 (owner: 10TrainBranchBot) [02:07:47] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:10:05] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:04] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Milimetric) @BBlack: this was never our pipeline. It looks like @dr0ptp4kt's [[ https://lists.wikimedia.org/pipermail/analytics/2015-February/003426.html | original idea ]]... [02:24:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.10 [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788450 (owner: 10TrainBranchBot) [02:27:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:29:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:08:57] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:19:59] (03CR) 10Winston Sung: Localisation updates from https://translatewiki.net. (031 comment) [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) (owner: 10Winston Sung) [03:36:31] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:35] PROBLEM - Check unit status of geoip_update_main on puppetmaster1001 is CRITICAL: CRITICAL: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:45:22] (03PS1) 10Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) [04:47:07] (03PS1) 10Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] 
(wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788417 (https://phabricator.wikimedia.org/T299377) [04:47:19] (03CR) 10jerkins-bot: [V: 04-1] Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [04:48:21] (03PS1) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) [05:22:09] (03PS1) 10Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) [05:27:41] (03PS1) 10Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788608 (https://phabricator.wikimedia.org/T299377) [05:27:57] (03CR) 10jerkins-bot: [V: 04-1] Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [05:28:52] (03PS1) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) [05:30:45] (03PS2) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) [05:31:38] (03PS2) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) [05:34:38] (03PS1) 10Winston Sung: Add tests 
closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) [05:39:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:46:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:53:10] (03PS2) 10Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) [05:53:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [05:53:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [05:57:38] (03PS3) 10Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) [05:59:57] PROBLEM - Backup freshness on backup1001 is 
CRITICAL: Stale: 1 (gerrit1001), Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:04] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T0600). [06:01:58] (03CR) 10jerkins-bot: [V: 04-1] Add tests closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) (owner: 10Winston Sung) [06:09:32] (03Abandoned) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [06:09:45] (03Abandoned) 10Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [06:11:54] (03CR) 10Winston Sung: "recheck" [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [06:16:47] (03PS2) 10Winston Sung: Add tests closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) [06:23:16] (03CR) 10Winston Sung: "recheck" [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [06:24:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 107 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:26:15] (03Abandoned) 10Winston Sung: Temporarily 
disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788608 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [06:26:24] (03Abandoned) 10Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788417 (https://phabricator.wikimedia.org/T299377) (owner: 10Winston Sung) [06:33:10] (03CR) 10Ayounsi: "Awesome! that cut the load on the proxies by about half." [puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [06:36:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:41:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:49:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:49:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 73 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:59:39] (03PS5) 10Winston Sung: Localisation updates from https://translatewiki.net. [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) [07:00:04] Amir1, awight, and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T0700) [07:00:05] Winston_Sung: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:32] o/ I can deploy today [07:01:30] Winston_Sung[m]: may I ask why those patches are being backported? [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:42] The first one is to fix an issue and check if there's any relationship with another issue. [07:03:28] I think others might be cancelled due to the branching to 1.39.0-wmf.10. [07:03:44] 10SRE, 10CirrusSearch, 10Discovery-Search, 10Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (10Aklapper) All patches merged. Is this still an issue? Should this still remain open? 
[07:08:08] taavi: Please let me know if there's time left in the window, I have a minor config change left over from yesterday: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787689
[07:08:15] I can self-deploy...
[07:08:23] Emm... testing on patchdemo didn't work as the TOC didn't use the left one.
[07:08:23] https://patchdemo.wmflabs.org/wikis/78b1fc873f/wiki/Project:Main_Page?useskin=vector-2022
[07:09:14] Winston_Sung[m]: sorry, not sure if I understand this correctly - the first one is linked to a bug that just seems to be about missing translations? unless I'm missing something, that seems to be just a normal case of a new feature shipping before it was translated on twn, and those are usually not worth the pain of translation backports
[07:09:15] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:10:00] awight: sure, I'm still trying to figure out which of Winston_Sung[m]'s patches to backport and why, feel free to sneak your patch in right now and let me know when you're done with deploying
[07:10:14] taavi: kk will do, thanks!
[07:10:17] (PS1) Muehlenhoff: Remove access for nikkin [puppet] - https://gerrit.wikimedia.org/r/788667
[07:11:00] (PS1) Phedenskog: admin: Add ssh key for phedenskog. [puppet] - https://gerrit.wikimedia.org/r/788668 (https://phabricator.wikimedia.org/T307079)
[07:11:12] awight: taavi: hi, I am the one running the train this morning at 8:00 UTC (immediately after the backport window). If there is any need to extend the window it is fine to me
[07:11:22] Is it possible to backport on Patch Demo and let the TOC display in the place of sidebar to check whether it's related to
[07:11:23] https://phabricator.wikimedia.org/T306862
[07:11:23] ?
[07:11:25] we can delay the train a bit
[07:11:35] hashar: ack, thanks
[07:11:38] (CR) jerkins-bot: [V: -1] Remove access for nikkin [puppet] - https://gerrit.wikimedia.org/r/788667 (owner: Muehlenhoff)
[07:11:59] (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/787689 (https://phabricator.wikimedia.org/T307110) (owner: Awight)
[07:12:20] SRE, Patch-For-Review: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (Peter) Thanks a lot @fgiunchedi !!!!
[07:12:43] (Merged) jenkins-bot: Enable the versioned mapdata API [mediawiki-config] - https://gerrit.wikimedia.org/r/787689 (https://phabricator.wikimedia.org/T307110) (owner: Awight)
[07:12:50] ooh that was fast.
[07:12:55] Winston_Sung[m]: as far as I'm aware, patch demo is independent of what's running on production, so you should be able to create a patch demo wiki for a patch without a production deployment
[07:15:15] https://patchdemo.wmflabs.org/wikis/78b1fc873f/wiki/Project:Main_Page?useskin=vector-2022
[07:15:15] Didn't work for the TOC.
[07:15:37] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:787689|Enable the versioned mapdata API (T307110)]] (duration: 00m 48s)
[07:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:42] T307110: Enable versioned maps "backend" support everywhere - https://phabricator.wikimedia.org/T307110
[07:16:05] taavi: Done, thank you!
[07:16:11] thanks!
[07:16:11] Is it okay to apply the fix to check the relationship with another Phabricator ticket?
[07:16:11] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/788329
[07:16:36] (Abandoned) Aqu: Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299398) (owner: Aqu)
[07:17:43] > the first one is linked to a bug that just seems to be about missing translations?
[07:17:43] The reason is to distinguish whether it's only caused by localization or caused by language conversion.
[07:18:17] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Nikki Nikkhoui out of all services on: 513 hosts
[07:18:19] I'm uncomfortable deploying anything (especially translations related, since those are much slower to deploy or rollback) to production to "check if it fixes something" without someone very familiar with that code base being present to troubleshoot if necessary
[07:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nikki Nikkhoui out of all services on: 513 hosts
[07:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:42] (Abandoned) Winston Sung: Localisation updates from https://translatewiki.net. [skins/Vector] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) (owner: Winston Sung)
[07:19:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:18] Okay. Then let's cancel it.
[07:19:27] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Nikki Nikkhoui out of all services on: 1224 hosts
[07:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:50] ok, sorry that I couldn't be more helpful :/
[07:20:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nikki Nikkhoui out of all services on: 1224 hosts
[07:20:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:17] !log UTC morning deploys done
[07:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:35] (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung)
[07:20:47] (Abandoned) Winston Sung: Add tests closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) (owner: Winston Sung)
[07:20:58] Winston_Sung[m]: Is it possible to test this on the beta cluster?
[07:21:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:21:03] (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung)
[07:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:23] awight: Should be possible.
[07:22:51] (PS2) Muehlenhoff: Remove access for nikkin [puppet] - https://gerrit.wikimedia.org/r/788667
[07:22:55] You could put your experimental feature behind a feature flag, and only enable that flag on testwiki or the beta cluster.
[07:23:09] If there's zh sites or page language set to zh on the beta cluster.
[07:25:27] (CR) Muehlenhoff: [C: +2] Remove access for nikkin [puppet] - https://gerrit.wikimedia.org/r/788667 (owner: Muehlenhoff)
[07:25:31] Emmm... I'm not familiar with beta cluster.
[07:26:43] Trying to figure out.
[07:27:41] Ok. I found it.
[07:28:10] Tested and confirmed it's only caused by outdated translations.
[07:30:17] Thanks again for pointing out.
[07:33:32] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores2001.codfw.wmnet with OS buster
[07:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:38] Excellent to hear!
[07:39:27] (PS3) Majavah: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911
[07:39:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:40:57] (CR) Majavah: "Thank you for your feedback! Adjusted the code based on that, and also changed the code (which was based on check_keyholder) a bit so the " [puppet] - https://gerrit.wikimedia.org/r/787911 (owner: Majavah)
[07:43:33] (PS4) Majavah: keyholder: Collect Prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/787911
[07:50:29] (PS4) Majavah: P:toolforge::prometheus: add toolsbeta support [puppet] - https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716)
[07:50:31] (PS1) Majavah: P:toolforge::prometheus: support Cinder volumes [puppet] - https://gerrit.wikimedia.org/r/788672
[07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:35] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2001.codfw.wmnet with reason: host reimage
[07:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] hashar and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T0800).
[08:01:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2001.codfw.wmnet with reason: host reimage [08:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:07] o/ [08:03:02] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (10MoritzMuehlenhoff) [08:04:04] there are so many blockers [08:04:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1132 T301879', diff saved to https://phabricator.wikimedia.org/P27350 and previous config saved to /var/cache/conftool/dbconfig/20220503-080421-marostegui.json [08:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:26] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [08:05:52] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10MoritzMuehlenhoff) [08:06:25] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10MoritzMuehlenhoff) [08:06:57] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10MoritzMuehlenhoff) [08:07:32] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10MoritzMuehlenhoff) [08:09:43] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: support Cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/788672 (owner: 10Majavah) [08:14:24] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: stop using 'document' to make tests pass [software] - 10https://gerrit.wikimedia.org/r/788297 (owner: 10Filippo Giunchedi) [08:14:27] !log Starting MediaWiki train deployment 
using `scap stage-train 1.39.0-wmf.10` # T305216 [08:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:32] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:14:51] hashar: good luck! [08:15:16] after that I guess I will find folks to assist with the 3 blockers :D [08:15:19] RECOVERY - Host ms-be1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [08:19:21] (03PS1) 10Hashar: testwikis wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788674 [08:19:23] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788674 (owner: 10Hashar) [08:19:44] jnuche: looks good so far [08:20:06] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788674 (owner: 10Hashar) [08:20:25] hashar: magnifique [08:21:06] !log hashar@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.10 refs T305216 [08:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:10] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:24:15] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1002.eqiad.wmnet [08:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:27:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:28:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:59] (03PS1) 10Majavah: openstack: update tools-redis to a 'new' style name [puppet] - 10https://gerrit.wikimedia.org/r/788675 (https://phabricator.wikimedia.org/T278541) [08:29:25] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1002.eqiad.wmnet [08:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:43:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10LSobanski) [08:43:42] ah sync-apaches progress in the terminal seems to be rather reactive that is great [08:44:02] (03CR) 10Volans: "As a general comment I'm wondering how much is worth to keep putting effort on this javascript that has to be run manually instead of sett" [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [08:44:33] !log rolling upgrade of HAProxy in esams [08:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ores2001.codfw.wmnet with OS buster [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:30] at cdb rebuild [08:50:53] (03CR) 10Gehel: "Minor comments inline. Otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [08:51:17] (03PS38) 10Gehel: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [08:51:36] (03PS39) 10Gehel: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [08:51:50] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.10 refs T305216 (duration: 30m 44s) [08:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:55] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [08:58:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:00] (03PS5) 10David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 [08:59:02] (03PS1) 10David Caro: wmcs: isort and black [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788676 [09:00:11] (03PS1) 10Jaime Nuche: add dummy ssh key pair for new scap Keyholder identity [labs/private] - 10https://gerrit.wikimedia.org/r/788677 (https://phabricator.wikimedia.org/T307351) [09:00:15] (03CR) 10Gehel: [C: 04-1] "Minor comment inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [09:00:22] (03CR) 10David Caro: wmcs: Fix types and associated code refactor (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) 
[09:02:07] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [09:03:55] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, and 2 others: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) 05Open→03Resolved a:03JMeybohm This is now used by miscweb and documented at https://wikitech.wikimedia.... [09:04:01] (03PS6) 10David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 [09:05:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:05:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:49] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) I finally managed to verify and document the steps needed to put a service under Ingress. I did also update the general https://wikitech.wikimedia.or... [09:11:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:03] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) @Jgiannelos something that occurred to me while deleting `swift-tegola-container` (still in progress, will take a whi... [09:12:20] (03PS1) 10Slyngshede: Adding my account and SSH key. 
[puppet] - 10https://gerrit.wikimedia.org/r/788678 [09:14:23] !log Disable puppet on clouddb1013 clouddb1016 clouddb1020T305974 [09:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:29] !log Disable puppet on clouddb1013 clouddb1016 clouddb1020 T305974 [09:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:33] T305974: Provide wmf-pt-kill on Debian Bullseye - https://phabricator.wikimedia.org/T305974 [09:18:05] (03CR) 10Muehlenhoff: "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/788678 (owner: 10Slyngshede) [09:18:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "Haven't tested it but LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [09:19:04] RECOVERY - Host ms-be1052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [09:19:04] RECOVERY - Host ms-be1053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [09:19:04] RECOVERY - Host ms-be1054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [09:19:04] RECOVERY - Host ms-be1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [09:19:04] RECOVERY - Host ms-be1056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [09:19:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [09:19:05] RECOVERY - Host ms-be1057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [09:19:05] RECOVERY - Host ms-be1058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [09:19:06] RECOVERY - Host ms-be1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:19:25] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: set dlq output and template_version [puppet] - 10https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [09:20:06] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [09:20:28] (03CR) 10Muehlenhoff: [C: 03+2] Adding my account and SSH key. [puppet] - 10https://gerrit.wikimedia.org/r/788678 (owner: 10Slyngshede) [09:30:36] PROBLEM - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:58] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 14.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:31:03] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10dcaro) Hey @cmooney, is there any input needed from WMCS on this? (just want to make sure you are not blocked) [09:31:10] (03PS1) 10Slyngshede: Adding myself to OPS group [puppet] - 10https://gerrit.wikimedia.org/r/788681 [09:32:31] (03CR) 10Slyngshede: "Adding myself (slyngshede / Simon Lyngshede) to OPS group." 
[puppet] - 10https://gerrit.wikimedia.org/r/788681 (owner: 10Slyngshede) [09:32:59] !log rolling upgrade of HAProxy in eqiad [09:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:06] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:35:19] (03PS3) 10Filippo Giunchedi: clinic-duty: add Orange support [software] - 10https://gerrit.wikimedia.org/r/788296 [09:37:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [09:38:52] !log resetting BMC on relforge1003 and relforge1004 - https://wikitech.wikimedia.org/wiki/Management_Interfaces#From_local_IPMI [09:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:24] (03CR) 10Filippo Giunchedi: clinic-duty: add Orange support (032 comments) [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [09:40:50] RECOVERY - Host relforge1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [09:40:50] RECOVERY - Host relforge1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [09:40:58] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host clouddb2001-dev.codfw.wmnet [09:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:26] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:47:00] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb2001-dev.codfw.wmnet [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:04] PROBLEM - ores on ores2001 is CRITICAL: HTTP 
CRITICAL: HTTP/1.0 500 Internal Server Error - 215 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [10:02:14] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [10:02:14] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:52] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) Yes filenames are kept the same. On each tile pregeneration we send a PUT request for the same filename but different... [10:11:40] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:27] ores2001 is me, downtiming [10:25:57] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! In T307184#7898948, @Jgiannelos wrote: > Yes filenames are kept the same. On each tile pregeneration we send a PU... 
[10:29:34] (03CR) 10Ladsgroup: [C: 03+1] Skip first line of output from `db.run_sql` [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788325 (owner: 10Kormat) [10:30:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/788681 (owner: 10Slyngshede) [10:35:33] (03CR) 10David Caro: "This broke stuff (the ensure seems to have issues):" [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:39:52] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10jbond) > As for the puppet-merge on the puppetmasters, does the datacenter-ops have +2 on the operations/puppet repository on Gerrit? To be explicit +... 
[10:45:35] (03PS9) 10Jbond: profile::installserver::proxy: update squid template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [10:46:55] !log downgrade haproxy 2.4 package to version 2.4.15 on apt.wm.o (buster-wikimedia) [10:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:05] (03PS1) 10Muehlenhoff: Add new profile to build the CAS debs [puppet] - 10https://gerrit.wikimedia.org/r/788691 [10:51:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [10:54:16] (03PS2) 10Muehlenhoff: Add new profile to build the CAS debs [puppet] - 10https://gerrit.wikimedia.org/r/788691 [10:55:46] (03CR) 10Majavah: maintain-views: Drop views on revision_actor_temp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [10:57:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (10jbond) [10:57:41] (03CR) 10Jbond: profile::installserver::proxy: update squid template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [10:57:44] (03CR) 10Jbond: [C: 03+2] profile::installserver::proxy: update squid template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [10:57:57] !log restrict ports allowed via squid [10:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] !log rolling downgrade of HAProxy to version 2.4.15 on text - T307444 [10:58:01] (03PS1) 10David Caro: openstack_exporter: don't use ensure twice for service [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) [10:58:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:04] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [10:59:47] (03CR) 10jerkins-bot: [V: 04-1] openstack_exporter: don't use ensure twice for service [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) (owner: 10David Caro) [11:00:17] (03PS2) 10David Caro: openstack_exporter: don't use ensure twice for service [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) [11:00:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:04] (03PS3) 10David Caro: openstack_exporter: don't use ensure twice for service [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) [11:04:07] (03CR) 10Ladsgroup: maintain-views: Drop views on revision_actor_temp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [11:06:09] (03CR) 10Muehlenhoff: "But wmf-laptop-sre has a dependency on colordiff, so I don't see the issue?" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 (owner: 10Andrea Denisse) [11:07:50] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35034/console" [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) (owner: 10David Caro) [11:11:42] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack_exporter: don't use ensure twice for service [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) (owner: 10David Caro) [11:11:56] (03CR) 10David Caro: [V: 03+1 C: 03+2] "Merging as this is making puppet fail on cloudcontrol nodes" [puppet] - 10https://gerrit.wikimedia.org/r/788694 (https://phabricator.wikimedia.org/T302178) (owner: 10David Caro) [11:12:38] (03PS2) 10Jbond: admin: Add ssh key for phedenskog. [puppet] - 10https://gerrit.wikimedia.org/r/788668 (https://phabricator.wikimedia.org/T307079) (owner: 10Phedenskog) [11:12:42] (03CR) 10Jbond: [C: 03+2] admin: Add ssh key for phedenskog. 
[puppet] - 10https://gerrit.wikimedia.org/r/788668 (https://phabricator.wikimedia.org/T307079) (owner: 10Phedenskog) [11:17:46] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:17:48] PROBLEM - haproxy process on cp3058 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [11:17:58] (03PS3) 10Jbond: Add new profile to build the CAS debs [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [11:18:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [11:18:42] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:19:30] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:19:30] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:46] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3058 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:20:07] hmm looking [11:21:38] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3058 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 279501 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-06-03 16:37:41 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:21:38] RECOVERY - Check systemd state on cp3058 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:54] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3058 is OK: SSL OK - OCSP staple validity for wikipedia.org has 590168 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2022-11-17 23:59:59 +0000 (expires in 198 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:21:58] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA on cp3058 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 279481 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2022-06-03 16:37:49 +0000 (expires in 31 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:22:02] RECOVERY - haproxy process on cp3058 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [11:22:56] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp3058 is OK: SSL OK - OCSP staple validity for wikipedia.org has 569585 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 191 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:25:34] (03PS12) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [11:39:00] (03PS4) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [11:39:02] (03PS5) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [11:39:04] (03PS7) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [11:44:10] (03PS1) 10Majavah: wikireplicas: add 'section' to meta_p.wiki [puppet] - 10https://gerrit.wikimedia.org/r/788697 [11:46:06] (03CR) 10jerkins-bot: [V: 04-1] wikireplicas: add 'section' to meta_p.wiki [puppet] - 10https://gerrit.wikimedia.org/r/788697 (owner: 10Majavah) 
[11:46:45] (03PS2) 10Majavah: wikireplicas: add 'section' to meta_p.wiki [puppet] - 10https://gerrit.wikimedia.org/r/788697 [11:48:21] (03PS1) 10Jbond: puppet_compiler: manage facts cache dir [puppet] - 10https://gerrit.wikimedia.org/r/788698 [11:48:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: manage facts cache dir [puppet] - 10https://gerrit.wikimedia.org/r/788698 (owner: 10Jbond) [11:53:01] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:08:01] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:22:17] (03PS7) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [12:22:32] (03CR) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [12:25:58] (03PS1) 10Marostegui: Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/788618 [12:26:49] (03CR) 10Marostegui: [C: 03+2] Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/788618 (owner: 10Marostegui) [12:27:47] (03PS1) 10Marostegui: Revert "db1132: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/788619 [12:28:02] (03CR) 10Kosta Harlan: [C: 03+1] Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [12:28:25] (03CR) 10Marostegui: [C: 03+2] Revert "db1132: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/788619 (owner: 10Marostegui) [12:30:38] (03CR) 10Muehlenhoff: [C: 03+2] Add new profile to build the CAS debs [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [12:36:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/788705 [12:42:01] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) >>! In T295691#7894680, @Papaul wrote: > @Jgreen hello do you think this can be done on May the 16th? @Papaul, yes that sounds good. We can plan for downt... 
[12:46:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (10jbond) this is likely related to https://wikitech.wikimedia.org/wiki/Performance/Graphite/Synthetic_Instance [12:48:42] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:37] (03PS1) 10Jcrespo: install_server: Wipe backup1002 completely [puppet] - 10https://gerrit.wikimedia.org/r/788706 (https://phabricator.wikimedia.org/T305446) [12:52:39] (03PS2) 10Jcrespo: install_server: Wipe backup1002 completely [puppet] - 10https://gerrit.wikimedia.org/r/788706 (https://phabricator.wikimedia.org/T305446) [12:53:21] (03CR) 10Herron: [C: 03+1] team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:54:33] (03CR) 10Jcrespo: [C: 03+2] install_server: Wipe backup1002 completely [puppet] - 10https://gerrit.wikimedia.org/r/788706 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [12:54:47] (03CR) 10Herron: [C: 03+1] logstash: add target index validation step [puppet] - 10https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [12:55:45] (03CR) 10Herron: [C: 03+1] logstash: set dlq output and template_version [puppet] - 10https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [12:56:02] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10ayounsi) Note that it's possible to do damages with Homer and Netbox write access, so it needs to be treated carefully. That said I'm fine with John... 
[12:56:30] (03CR) 10Herron: [C: 03+1] profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [12:57:40] !log rolling downgrade of HAProxy to version 2.4.15 on upload - T307444 [12:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:46] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [12:58:00] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1026.eqiad.wmnet [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:35] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2026.codfw.wmnet [12:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] 10SRE, 10Traffic: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) p:05High→03Medium Lowering the priority as after downgrading text we aren't experiencing more issues [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T1300). [13:00:05] nemo-yiannis and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] o/ [13:00:18] 👋 [13:01:12] o/ [13:01:51] nemo-yiannis: the commit message of your change sounds a bit strange to me, would you mind changing “deprecate” to “remove”? 
[13:01:56] or convince me that “deprecate” makes sense ^^ [13:02:00] sure [13:02:05] but to me that would imply that the stream is still there and functional [13:02:22] no its not used and should be deleted from the config [13:02:27] ok [13:02:54] (03PS1) 10Jcrespo: install_server: Update backup-format recipe to install on sdb/sdc [puppet] - 10https://gerrit.wikimedia.org/r/788707 (https://phabricator.wikimedia.org/T305446) [13:03:01] ok, codesearch suggests tiles_change is indeed the only one used in deployment-charts [13:03:16] event/primary still has references to tile_change but might be cleaned up later [13:03:33] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1026.eqiad.wmnet [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:58] (03PS6) 10Jgiannelos: Remove unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) [13:04:00] (03PS2) 10Jcrespo: install_server: Update backup-format recipe to install on sdb/sdc [puppet] - 10https://gerrit.wikimedia.org/r/788707 (https://phabricator.wikimedia.org/T305446) [13:04:02] (03PS3) 10Zabe: graphite: remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/751208 (https://phabricator.wikimedia.org/T273673) [13:04:39] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2026.codfw.wmnet [13:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:06] (03CR) 10Jcrespo: [C: 03+2] install_server: Update backup-format recipe to install on sdb/sdc [puppet] - 10https://gerrit.wikimedia.org/r/788707 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [13:05:09] Lucas_WMDE: done, do you want me to rebase the patch ? it looks like there is a conflict [13:05:44] (03PS1) 10Kormat: auto_schema: Supply -N to db-mysql. 
[software] - 10https://gerrit.wikimedia.org/r/788709 [13:05:47] I’ll try just clicking the rebase button [13:05:54] Gerrit sometimes reports conflicts where there really aren’t any [13:05:59] (03PS7) 10Lucas Werkmeister (WMDE): Remove unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [13:06:46] Lucas_WMDE: sounds good, thanks [13:06:46] (03CR) 10Lucas Werkmeister (WMDE): Remove unused maps event stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [13:06:56] ok, just waiting for CI now [13:07:20] (03CR) 10Muehlenhoff: [C: 03+2] "I've also updated" [puppet] - 10https://gerrit.wikimedia.org/r/788691 (owner: 10Muehlenhoff) [13:07:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [13:07:29] nemo-yiannis: do you know if the change can be tested on mwdebug? 
[13:08:03] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [13:08:04] (03Merged) 10jenkins-bot: Remove unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [13:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:09] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [13:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:51] ok, the stream change is on mwdebug1001 now [13:08:58] !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1002.eqiad.wmnet with OS bullseye [13:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:07] I’ll at least check that nothing obvious breaks [13:09:14] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [13:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:34] Lucas_WMDE: i am not sure how the eventstreamconfig extension works internally [13:10:05] * nemo-yiannis checks mwdebug [13:11:48] looks OK on mwdebug, the API doesnt list the old stream [13:11:59] I’m not noticing any obvious breakage, at least [13:12:08] tried visiting some pages with kartographer maps, looked fine [13:12:38] 👍 [13:13:13] ok, syncing [13:13:21] (03PS2) 10Lucas Werkmeister (WMDE): Use "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788356 [13:13:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747196|Remove 
unused maps event stream (T293366)]] (duration: 01m 04s) [13:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:16] T293366: Performance considerations about the current usage of EventPlatform from maps - https://phabricator.wikimedia.org/T293366 [13:14:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788356 (owner: 10Lucas Werkmeister (WMDE)) [13:16:10] (03PS9) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [13:16:28] (03PS1) 10Muehlenhoff: Add a few packages needed for CAS build [puppet] - 10https://gerrit.wikimedia.org/r/788711 [13:16:35] (03Merged) 10jenkins-bot: Use "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788356 (owner: 10Lucas Werkmeister (WMDE)) [13:16:51] (03CR) 10Jbond: [C: 04-1] "-1: this topic is still being discussed" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [13:17:10] (03CR) 10jerkins-bot: [V: 04-1] Add a few packages needed for CAS build [puppet] - 10https://gerrit.wikimedia.org/r/788711 (owner: 10Muehlenhoff) [13:19:04] (03CR) 10LMata: [C: 03+1] "lgtm ツ" [puppet] - 10https://gerrit.wikimedia.org/r/788681 (owner: 10Slyngshede) [13:19:09] seems to work, syncing my change [13:19:09] (03PS10) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [13:20:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:20:21] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [13:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:28] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788356|Use "unexpectedUnconnectedPage" page prop everywhere]] (duration: 00m 51s) [13:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2043.codfw.wmnet with OS bullseye [13:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:15] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2043.codfw.wmnet with OS bullseye [13:25:19] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1002.eqiad.wmnet with reason: host reimage [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:25] !log UTC afternoon backport window done [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:01] (03PS2) 10Muehlenhoff: Add a few packages needed for CAS build [puppet] - 10https://gerrit.wikimedia.org/r/788711 [13:26:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:11] (03PS2) 10Kormat: auto_schema: Supply -N to db-mysql. 
[software] - 10https://gerrit.wikimedia.org/r/788709 [13:27:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:09] (03CR) 10Kormat: "I've reviewed all the existing schema changes in software/schema-changes.git, and the only ones that need updating are the ones i'm workin" [software] - 10https://gerrit.wikimedia.org/r/788709 (owner: 10Kormat) [13:28:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:39] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1002.eqiad.wmnet with reason: host reimage [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] (03CR) 10Muehlenhoff: [C: 03+2] Add a few packages needed for CAS build [puppet] - 10https://gerrit.wikimedia.org/r/788711 (owner: 10Muehlenhoff) [13:36:47] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2043.codfw.wmnet with reason: host reimage [13:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:13] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2043.codfw.wmnet with reason: host reimage [13:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] (03PS2) 10Muehlenhoff: Apply role::webperf::processors_and_site to webperf1003/2003 [puppet] - 10https://gerrit.wikimedia.org/r/785115 (https://phabricator.wikimedia.org/T305460) [13:44:29] (03PS21) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 
10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [13:45:03] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1002.eqiad.wmnet with OS bullseye [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:26] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:45:47] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1027.eqiad.wmnet [13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2027.codfw.wmnet [13:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:55] !log stopped/masked coal/navtiming on webperf1001/webperf2001 T305460 [13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:59] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [13:48:46] (03CR) 10Muehlenhoff: [C: 03+2] Apply role::webperf::processors_and_site to webperf1003/2003 [puppet] - 10https://gerrit.wikimedia.org/r/785115 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [13:51:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:57] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2027.codfw.wmnet [13:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:45] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1027.eqiad.wmnet 
[13:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:04] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2028.codfw.wmnet [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:45] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:02:53] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2028.codfw.wmnet [14:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:36] !log upgrade haproxy to 2.4.16 on cp3050 - T307444 [14:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:40] T307444: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 [14:06:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:06:19] (03PS2) 10Muehlenhoff: Switch webperf1001/1003 for eventual removal [puppet] - 10https://gerrit.wikimedia.org/r/785116 (https://phabricator.wikimedia.org/T305460) [14:07:26] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase101[6-8].eqiad.wmnet: Restarting for cert refresh - hnowlan@cumin1001 [14:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:03] (03PS22) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [14:11:39] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1028.eqiad.wmnet [14:11:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:53] (03PS8) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [14:18:16] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1028.eqiad.wmnet [14:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] 10SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, thanks @jbond and @Peter [14:20:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (10fgiunchedi) Yes valid indeed, see also {T231870} and {T304583} [14:25:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2043.codfw.wmnet with OS bullseye [14:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:20] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2043.codfw.wmnet with OS bullseye completed: - ms-be2043 (**PASS**) - Downtim... 
[14:26:44] (03CR) 10Muehlenhoff: "One comment inline, otherwise looks good to me" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/786376 (https://phabricator.wikimedia.org/T306911) (owner: 10Bking) [14:27:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1041.eqiad.wmnet with OS bullseye [14:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:31] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be1041.eqiad.wmnet with OS bullseye [14:29:30] (03PS4) 10Filippo Giunchedi: clinic-duty: add Orange support [software] - 10https://gerrit.wikimedia.org/r/788296 [14:30:34] PROBLEM - Check systemd state on backup1002 is CRITICAL: CRITICAL - degraded: The following units failed: proc-sys-fs-binfmt_misc.automount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:47] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) >>! In T306181#7892221, @Ottomata wrote: >> perhaps this is a client browser opening a connection but send... 
[14:33:12] PROBLEM - Check systemd state on backup2002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:52] RECOVERY - Check systemd state on backup1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:13] (03CR) 10Muehlenhoff: postgresql: migrate backup crons to systemd timer jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:35:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase101[6-8].eqiad.wmnet: Restarting for cert refresh - hnowlan@cumin1001 [14:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2026.codfw.wmnet with reason: reboot [14:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2026.codfw.wmnet with reason: reboot [14:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:40:34] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2012.codfw.wmnet: Restarting for cert refresh - hnowlan@cumin1001 [14:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:30] (03PS40) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 
(https://phabricator.wikimedia.org/T299797) [14:42:44] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1029.eqiad.wmnet [14:42:46] RECOVERY - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-a valid until 2024-05-02 13:53:09 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:09] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:43:54] (03CR) 10Bking: Elastic: Use OS major version for GC flags (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:46:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet [14:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:12] RECOVERY - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is OK: SSL OK - Certificate restbase2012-c valid until 2024-05-02 13:53:14 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:46:14] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:46:18] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be1041.eqiad.wmnet with OS bullseye [14:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:25] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host 
ms-be1041.eqiad.wmnet with OS bullseye executed with errors: - ms-be1041 (**FAIL**)... [14:46:28] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:47:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1041.eqiad.wmnet with OS bullseye [14:47:17] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be1041.eqiad.wmnet with OS bullseye [14:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:05] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1029.eqiad.wmnet [14:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2012.codfw.wmnet: Restarting for cert refresh - hnowlan@cumin1001 [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:30] (03PS41) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:52:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet [14:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:56] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) It possible that the request aborted errors are actually requests being terminated mid-flight by the clie... [14:54:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:57:33] (03PS1) 10Filippo Giunchedi: sre: port NEL alert to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) [14:57:36] (03PS1) 10Herron: "private" add prometheus.wm.o placeholder key [labs/private] - 10https://gerrit.wikimedia.org/r/788721 (https://phabricator.wikimedia.org/T301944) [14:58:09] RECOVERY - Check systemd state on backup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1041.eqiad.wmnet with reason: host reimage [14:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] (03CR) 10Ahmon Dancy: [C: 03+1] add dummy ssh key pair for new scap Keyholder identity [labs/private] - 10https://gerrit.wikimedia.org/r/788677 (https://phabricator.wikimedia.org/T307351) (owner: 10Jaime Nuche) [15:01:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1041.eqiad.wmnet with reason: host reimage [15:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:49] (03CR) 10Andrew Bogott: [C: 03+1] "Thanks for the cleanup :)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788676 (owner: 10David Caro) [15:03:21] mforns: airflow sync? [15:03:26] (03CR) 10Andrew Bogott: [C: 03+1] "Function comments are just what I wanted, thank you!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [15:04:26] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10jhathaway) @Dzahn I mentioned over email, but I t... [15:05:21] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2029.codfw.wmnet [15:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:56] (03CR) 10Herron: [V: 03+2 C: 03+2] "private" add prometheus.wm.o placeholder key [labs/private] - 10https://gerrit.wikimedia.org/r/788721 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:07:30] * andrewbogott waves to sukhe [15:07:34] hi andrewbogott [15:07:59] starting by disabling puppet on dns-rec and wikidough hosts [15:08:30] !log disable puppet on A:dns-rec to deploy CR 779936 [15:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:44] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) [15:09:02] !log disable puppet on A:wikidough to deploy CR 779936 [15:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:50] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [15:10:27] 
andrewbogott: change merged [15:10:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2029.codfw.wmnet [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:00] great, will apply and test on my secondary [15:11:00] PROBLEM - very high load average likely xfs on ms-be2043 is CRITICAL: CRITICAL - load average: 103.44, 106.22, 92.03 https://wikitech.wikimedia.org/wiki/Swift [15:13:48] (03PS11) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:14:22] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:14:40] andrewbogott: everything looks fine on dnsrecursor and Wikidough hosts [15:14:58] I have merged on dns1002 [15:15:03] (puppet disabled elsewhere) [15:15:40] ok. I'm seeing some weird things but I suspect they predated... [15:15:48] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10Dzahn) @jhathaway Yes and no. What I definitely d... [15:16:10] nope, there's a problem on my end. Looks like a firewall issue [15:16:26] don't need to revert yet, give me a minute... 
[15:17:31] (03CR) 10Gehel: Elastic: Use OS major version for GC flags (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:17:39] sure [15:17:44] take your time please [15:17:48] PROBLEM - Disk space on backup1002 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops [15:18:26] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) >>! In T300977#7836272, @Volans wrote: > If I may add my use case too, I would like to be able to restrict the acce... [15:18:27] I guess not firewall, but my clients can't talk to the recursor after that patch is applied [15:18:33] (03CR) 10Gehel: Elastic: Use OS major version for GC flags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:18:35] oh? [15:18:37] (03PS12) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:19:01] just saw this in #wikimedia-cloud-feed: 18:16:20 <+icinga-wm> PROBLEM - Recursive DNS on 208.80.154.24 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:19:10] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:19:11] oh yep [15:19:19] checking [15:19:41] (03CR) 10Ahmon Dancy: [C: 03+2] "Looks the same as some other files in the same directory." 
[labs/private] - 10https://gerrit.wikimedia.org/r/788677 (https://phabricator.wikimedia.org/T307351) (owner: 10Jaime Nuche) [15:19:46] sukhe: that server is on cloudservices1004.wikimedia.org [15:20:17] service seems fine. let me see the pcc diff [15:21:16] you can see the issue yourself if you run "dig @208.80.154.24 util-abogott-bullseye.testlabs.eqiad1.wikimedia.cloud" on cloudservices1004 [15:21:16] override also applied [15:21:24] pdns_recursor is only bound on 127.0.0.1:53 [15:21:33] not sure why [15:22:12] May 03 15:11:40 cloudservices1004 pdns_recursor[226955]: May 03 15:11:40 Unable to open /etc/powerdns/recursor.conf [15:22:28] (03PS13) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:23:22] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:23:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:24:08] (03CR) 10Ahmon Dancy: [V: 03+2 C: 03+2] add dummy ssh key pair for new scap Keyholder identity [labs/private] - 10https://gerrit.wikimedia.org/r/788677 (https://phabricator.wikimedia.org/T307351) (owner: 10Jaime Nuche) [15:24:18] oh [15:24:20] I see what happened [15:24:21] fixing [15:25:00] (03PS14) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:25:04] (03PS1) 10Majavah: dnsrecursor: fix config file group [puppet] - 10https://gerrit.wikimedia.org/r/788724 [15:25:07] sukhe: ^^ should be fixed by that [15:25:11] ha thanks [15:25:42] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:25:50] but wait [15:26:33] (03PS15) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:26:34] I diffed working/broken config and don't see anything interesting... 
[15:26:45] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:03] RECOVERY - very high load average likely xfs on ms-be2043 is OK: OK - load average: 54.85, 67.01, 79.63 https://wikitech.wikimedia.org/wiki/Swift [15:27:04] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35044/console" [puppet] - 10https://gerrit.wikimedia.org/r/788724 (owner: 10Majavah) [15:27:06] oh yeah, permissions are different... [15:27:09] (03CR) 10jerkins-bot: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:27:17] sukhe: want me to merge taavi's change? [15:27:19] yeah, May 3 15:11:40 cloudservices1004 puppet-agent[225720]: (/Stage[main]/Dnsrecursor/File[/etc/powerdns/recursor.conf]/mode) mode changed '0444' to '0440' [15:27:54] (03PS16) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:28:06] let's merge, since taavi has taken care of the redundant line as well [15:28:21] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: fix config file group [puppet] - 10https://gerrit.wikimedia.org/r/788724 (owner: 10Majavah) [15:28:41] running agent on cloudservices1004 [15:29:21] I'm doing that too but looks like I started it a bit too early [15:29:31] yep, just merged on puppetmaster [15:29:38] you can try now [15:29:45] I will check the dnsrec hosts [15:29:53] (which were fine, but since we changed the perms...) 
[15:30:02] (03PS17) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:30:35] yep, better. Thank you taavi [15:30:36] and things seem to work fine again [15:30:40] thanks taavi! [15:30:47] checking on dnsrec [15:32:14] andrewbogott: will you take care of 1003 too? [15:32:25] all good on A:dns-rec and A:wikidough [15:32:30] I am going to wait for a bit to re-enable Puppet [15:32:33] (03PS18) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:32:38] please share here if there are any other concerns/issues [15:32:42] 10SRE, 10Traffic, 10Upstream: HAProxy 2.4.16 shows internal errors on text cluster - https://phabricator.wikimedia.org/T307444 (10Vgutierrez) reported to upstream: https://github.com/haproxy/haproxy/issues/1684 [15:33:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35048/console" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:35:25] (I will wait for the confirmation for 1003 since I notice Puppet is still disabled there) [15:35:42] icinga still sees ns-recursor1.eqiad1 (cloudservices1004) as down [15:36:58] taavi: which line are you seeing in icinga? I'm looking... 
[15:37:53] (03PS19) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:38:05] no clue how to link to an icinga entry, but for me it says "208.80.154.24 Recursive DNS" is "DNS_QUERY CRITICAL - query timed out (for 0d 0h 24m 42s)" [15:38:09] (03PS42) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [15:38:29] maybe you just need to re-schedule it somehow? [15:38:43] yeah, now it recovered [15:38:43] yeah, trying... [15:39:05] ok, there we go [15:39:14] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:39:28] (03PS20) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:39:41] sukhe: I think we're all good here [15:39:47] (03CR) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:39:52] nice, thank you andrewbogott and thanks taavi for the patch [15:39:56] much appreciated, both [15:40:10] thanks for the advance notice :) [15:40:12] I know refactors are always tricky so thanks for your patience :) [15:40:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35050/console" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:41:13] (03CR) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) 
(owner: 10Jbond) [15:41:45] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:02] !log enable puppet on A:dns-rec and A:wikidough [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:23] (03PS21) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:43:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35051/console" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:44:46] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:45:53] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10jhathaway) @Dzahn that makes sense, so I assume i... 
[15:46:58] (03PS22) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [15:47:21] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:53] (03PS43) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [15:48:42] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:51:22] (03PS1) 10AikoChou: ml-services: update values.yaml for articlequality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/788747 (https://phabricator.wikimedia.org/T301766) [15:51:40] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1041.eqiad.wmnet with OS bullseye [15:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:43] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be1041.eqiad.wmnet with OS bullseye completed: - ms-be1041 (**WARN**) - Removed... [15:51:56] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) >>! In T306181#7899762, @Ottomata wrote: > It possible that the request aborted errors are actually reques... 
[15:53:09] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7883435, @elukey wrote: > We have discussed this issue in the #serviceops channel yesterday, and the i... [15:54:20] (03PS44) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [15:55:06] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:58:11] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:40] (03CR) 10Btullis: [C: 03+1] "Looks OK to me." [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [15:59:56] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Vgutierrez) to be accurate, the remote client talks to HAProxy over a TLS connection and HAProxy handles the traffi... [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:08] (03CR) 10CDanis: sre: port NEL alert to Alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:04:26] (03CR) 10JHathaway: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [16:05:14] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) It does seem that a 400 bad request is being sent to the client. I think that perhaps the 500 reported b... [16:08:13] (03PS1) 10Filippo Giunchedi: thanos: aggregate varnish requests availability [puppet] - 10https://gerrit.wikimedia.org/r/788751 (https://phabricator.wikimedia.org/T305847) [16:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:09:45] (03PS3) 10Hnowlan: add image-suggestion.discovery.wmnet and point to ingress-wikikube [dns] - 10https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [16:13:06] (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 14.10 [puppet] - 10https://gerrit.wikimedia.org/r/788752 (https://phabricator.wikimedia.org/T307471) [16:16:02] (03CR) 10Razzi: [C: 03+2] Configure superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [16:17:03] (03CR) 10Hnowlan: add image-suggestion.discovery.wmnet and point to ingress-wikikube (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [16:17:17] (03CR) 
10Razzi: "Need this dns record as well." [dns] - 10https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [16:18:19] (03CR) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [16:20:19] (03CR) 10Filippo Giunchedi: sre: port NEL alert to Alertmanager (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/788720 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [16:32:56] (03PS1) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [16:34:30] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10Dzahn) The part that we don't (can't actually) re... [16:34:45] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Thanks @Vgutierrez for the clarification on that. I hadn't picked up on the progress of the HAProxy migrat... 
[16:34:49] (03PS2) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [16:35:29] (03CR) 10JHathaway: P:installserver::proxy: Add global whitelist and list mappings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [16:44:07] (03CR) 10Andrew Bogott: [C: 03+2] graphite: remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/751208 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:48:29] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:52:14] (03PS45) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [16:53:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [16:55:21] (03PS1) 10Stang: Re-enable disabled Special pages on medium wiks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788754 (https://phabricator.wikimedia.org/T48094) [16:56:19] (03CR) 10Bking: [C: 03+2] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [16:58:29] (03PS2) 10Stang: Re-enable disabled Special pages on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788754 (https://phabricator.wikimedia.org/T48094) [17:02:05] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdc1.mount,srv-swift\x2dstorage-sdg1.mount,srv-swift\x2dstorage-sdk1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:39] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), 
Fresh: 106 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:16:13] jynus: ^ interesting. should I look at gerrit1001 about that? [17:16:41] nah, I had an issue and backups are a bit slower than they should be [17:16:48] alrighty [17:17:15] problem is it is a single alert for all backups [17:17:22] will ack for 12 hours [17:17:38] *nod*, thanks [17:18:05] should be fixed in a few hours, when the queue unclogs [17:19:33] sounds good, yep [17:20:18] !log install1003 - apt-get remove geoip-database libgeoip1 and running puppet. I don't see why these are installed here [17:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:13] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [17:23:37] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:28] ^ May 03 17:18:13 webperf1002 arclamp-compress-logs[30667]: Object HEAD failed: https://ms-fe.svc.eqiad.wmnet/v1/AUTH_performance/arclamp-logs-hourly/2022-04-29_16.excimer-wall.load.log.gz 503 Service Unavailab [17:25:06] !log [webperf1002:~] $ sudo systemctl status arclamp_compress_logs [17:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:45] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:58] !log [webperf1002:~] $ sudo systemctl start arclamp_compress_logs (was failed with https://ms-fe.svc.eqiad.wmnet/...
returning 503) but worked fine when manually starting it [17:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:36] (03PS1) 10Ssingh: P:wikidough: do not automatically restart the pdns service [puppet] - 10https://gerrit.wikimedia.org/r/788758 [17:29:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35052/console" [puppet] - 10https://gerrit.wikimedia.org/r/788758 (owner: 10Ssingh) [17:30:01] !log install2003 - apt-get remove geoip-databases [17:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:53] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: do not automatically restart the pdns service [puppet] - 10https://gerrit.wikimedia.org/r/788758 (owner: 10Ssingh) [17:30:58] dpifke: godog: ^ arclamp_compress_logs failed because ms-fe.svc temporarily returned 503, but when I started it manually everything was OK again. [17:32:54] !log removing geoip-database from all install hosts with [cumin2002:~] $ sudo cumin 'install*' 'apt-get remove geoip-database' [17:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:38] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1030.eqiad.wmnet [17:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:00] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2030.codfw.wmnet [17:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:31] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2030.codfw.wmnet [17:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:43] (03CR) 10Ottomata: [C: 03+1] Add superset-next domain CNAME [dns] - 10https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [17:56:01] (03CR) 
10Razzi: [C: 03+2] Add superset-next domain CNAME [dns] - 10https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [17:57:05] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1030.eqiad.wmnet [17:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:36] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1031.eqiad.wmnet [17:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:40] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2031.codfw.wmnet [17:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] hashar and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T1800). [18:00:57] o/ [18:03:17] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1031.eqiad.wmnet [18:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:23] !log train 1.39.0-wmf.10 (T305216): train is still blocked on T307019, although in practice that blocker doesn't prevent us from going ahead safely. i'm going unavoidably afk for a couple of hours; plan to move train to group1 on my return. 
[18:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:27] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [18:04:28] T307019: PHP Notice: Undefined offset: 2 in WikimediaEvents\PageSplitter\PageSplitterInstrumentation->getBucket - https://phabricator.wikimedia.org/T307019 [18:04:46] !log start ttmserver-export.php from Translate against codfw search cluster for T306811 [18:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:51] T306811: Check for indices that are not compatible with elastic 7.x in production clusters - https://phabricator.wikimedia.org/T306811 [18:08:22] !log train 1.39.0-wmf.10 (T305216): amending prior logline: planning to move to _group0_ on return [18:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:57] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:14:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:14:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P27354 and previous config saved to /var/cache/conftool/dbconfig/20220503-181457-ladsgroup.json [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:10] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:16:16] (03PS1) 10Majavah: P:openstack::puppetmaster: add 8143 to ferm rules [puppet] - 
10https://gerrit.wikimedia.org/r/788761 [18:17:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:17:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:09] PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_1@production-elk7-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:53] PROBLEM - OpenSearch health check for shards on 9200 on logstash2031 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [18:20:24] (03PS1) 10Razzi: superset: set x_forwarded_proto to https on staging [puppet] - 10https://gerrit.wikimedia.org/r/788762 (https://phabricator.wikimedia.org/T275575) [18:21:54] (03CR) 10Razzi: [C: 03+2] superset: set x_forwarded_proto to https on staging [puppet] - 10https://gerrit.wikimedia.org/r/788762 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [18:22:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: T306560', diff saved to https://phabricator.wikimedia.org/P27355 and previous config saved to /var/cache/conftool/dbconfig/20220503-182357-ladsgroup.json [18:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:30:02] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Eevans) >>! In T305568#7897589, @Papaul wrote: > @Eevans this is complete Thank you! 
[18:33:38] (03PS1) 10Razzi: superset: add caching for superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/788766 (https://phabricator.wikimedia.org/T275575) [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:39:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: T306560', diff saved to https://phabricator.wikimedia.org/P27356 and previous config saved to /var/cache/conftool/dbconfig/20220503-183901-ladsgroup.json [18:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:41:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:41:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:32] (03CR) 10Razzi: "The domain is working and does the auth redirect but static files don't load. 
I noticed there was a key for superset.wikimedia.org in thes" [puppet] - 10https://gerrit.wikimedia.org/r/788766 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:42:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [18:42:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [18:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:07] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:45] (03CR) 10Ladsgroup: [C: 03+1] auto_schema: Supply -N to db-mysql. 
[software] - 10https://gerrit.wikimedia.org/r/788709 (owner: 10Kormat) [18:57:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:57:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:21] (03PS1) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:01:04] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:03:09] (03PS2) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:03:24] (03CR) 10Razzi: [C: 03+2] superset: add caching for superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/788766 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [19:04:49] (03PS3) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:04:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:08:48] (03PS1) 10Ssingh: dnsrecursor: test no restart change (do not merge) [puppet] - 
10https://gerrit.wikimedia.org/r/788771 [19:09:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [19:09:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [19:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:37] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35055/console" [puppet] - 10https://gerrit.wikimedia.org/r/788771 (owner: 10Ssingh) [19:11:08] (03Abandoned) 10Ssingh: dnsrecursor: test no restart change (do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/788771 (owner: 10Ssingh) [19:14:19] (03PS4) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:15:04] (03PS5) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:15:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:16:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [19:16:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [19:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:06] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 69): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35054/console" [puppet] - 
10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:19:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:19:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P27357 and previous config saved to /var/cache/conftool/dbconfig/20220503-191909-ladsgroup.json [19:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:13] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P27358 and previous config saved to /var/cache/conftool/dbconfig/20220503-192119-ladsgroup.json [19:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:13] (03PS6) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:25:46] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:28:24] (03PS7) 10Bking: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) [19:29:25] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 
10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:33:41] RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:32] !log herron@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host logstash2031.codfw.wmnet [19:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:35] RECOVERY - OpenSearch health check for shards on 9200 on logstash2031 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 10, discovered_master: True, active_primary_shards: 463, active_shards: 1065, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [19:35:35] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27359 and previous config saved to /var/cache/conftool/dbconfig/20220503-193624-ladsgroup.json [19:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:35] (03PS8) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:40:21] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1032.eqiad.wmnet [19:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:30] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2033.codfw.wmnet [19:40:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:37] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:41:40] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:42:00] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:44:10] (03PS9) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:44:44] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:45:17] (03PS10) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:45:53] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:46:00] (03PS11) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:46:19] (03PS1) 10Ebernhardson: translate: Move ttmserver queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788773 
(https://phabricator.wikimedia.org/T306811) [19:46:34] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:47:07] (03PS12) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:47:40] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:48:35] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2033.codfw.wmnet [19:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:23] (03PS13) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:49:31] (03PS1) 10Razzi: superset-next: disable require_u2f for now [puppet] - 10https://gerrit.wikimedia.org/r/788774 (https://phabricator.wikimedia.org/T275575) [19:49:31] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Tsevener) Hi team, Just a heads up - we are planning on going out on phased release with this today. Here are the rollout percentages we can expect for automatic updates. We won't be able t... 
[19:49:48] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host logstash1032.eqiad.wmnet [19:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:58] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:50:13] PROBLEM - Check systemd state on logstash1032 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:28] (03PS1) 10Gergő Tisza: Duplicate eswiki Growth campaign config to itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788776 [19:51:04] (03PS2) 10Ebernhardson: translate: Move ttmserver queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788773 (https://phabricator.wikimedia.org/T306811) [19:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27360 and previous config saved to /var/cache/conftool/dbconfig/20220503-195129-ladsgroup.json [19:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:27] (03PS14) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:57:24] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35058/console" [puppet] - 10https://gerrit.wikimedia.org/r/788768 
(https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:57:32] (03CR) 10jerkins-bot: [V: 04-1] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:59:16] (03PS1) 10EllenR: Set log level to 'debug' for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220503T2000). [20:00:04] luke_bow, koi, ebernhardson, and Juan_90264: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:03:35] for those around, I can deploy today [20:03:42] oh hi [20:05:28] here [20:05:35] luke_bow: if you're here, I can start with your patch - if not, I'll start with koi's (feel free to ping if/when you are here) [20:05:40] here [20:05:41] (03PS15) 10Ryan Kemper: elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:06:22] luke_bow: looks like a manual rebase is needed - can you take care of that? i couldn't do it from gerrit [20:06:35] sure, i'll take a look now. 
thanks [20:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P27361 and previous config saved to /var/cache/conftool/dbconfig/20220503-200634-ladsgroup.json [20:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:39] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:07:40] RECOVERY - Check systemd state on logstash1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:45] (03PS1) 10Andrew Bogott: prometheus-node-cloudvirt-libvirt-stats.py: handle newer VM xml data [puppet] - 10https://gerrit.wikimedia.org/r/788781 [20:08:16] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:08:44] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35059/console" [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:09:07] (03CR) 10Clare Ming: [C: 03+2] Re-enable disabled Special pages on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788754 (https://phabricator.wikimedia.org/T48094) (owner: 10Stang) [20:09:52] (03Merged) 10jenkins-bot: Re-enable disabled Special pages on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788754 (https://phabricator.wikimedia.org/T48094) (owner: 10Stang) [20:10:24] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1033.eqiad.wmnet [20:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:29] (03PS16) 10Ryan Kemper: elastic: use java version 
to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [20:10:32] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2034.codfw.wmnet [20:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:13] (03Abandoned) 10Ryan Kemper: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787106 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:11:13] koi: can you check your change on mwdebug1001? [20:11:28] hmm, nothing to test for this patch IMO [20:11:46] (03CR) 10Ryan Kemper: [C: 03+2] elastic: use java version to choose GC flags [puppet] - 10https://gerrit.wikimedia.org/r/788768 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [20:11:57] koi: ok then - syncing [20:12:38] (03CR) 10Clare Ming: [C: 03+2] translate: Move ttmserver queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788773 (https://phabricator.wikimedia.org/T306811) (owner: 10Ebernhardson) [20:13:06] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788754|Re-enable disabled Special pages on medium wikis (T48094)]] (duration: 00m 55s) [20:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:10] T48094: Re-enable disabled Special pages on medium wikis (wikis in medium.dblist) - https://phabricator.wikimedia.org/T48094 [20:13:25] (03Merged) 10jenkins-bot: translate: Move ttmserver queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788773 (https://phabricator.wikimedia.org/T306811) (owner: 10Ebernhardson) [20:14:39] hi ebernhardson: is it possible to test your change on mwdebug1001? 
otherwise i can go ahead and sync [20:14:55] koi: your change is live [20:15:02] cjming: checking [20:15:09] ack and thanks [20:16:12] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2034.codfw.wmnet [20:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:21] cjming: looks to be good [20:16:23] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1033.eqiad.wmnet [20:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:01] ebernhardson: cool - syncing [20:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:39] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:788773|translate: Move ttmserver queries to codfw (T306811)]] (duration: 00m 50s) [20:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:43] T306811: Check for indices that are not compatible with elastic 7.x in production clusters - https://phabricator.wikimedia.org/T306811 [20:18:25] ebernhardson: your stuff should be live [20:18:37] hi cjming - don't want to distract you during backport window, but when you're done we should chat about that undefined offset patch if you've got a minute. [20:18:43] cjming: thanks! [20:19:07] (grateful for the patch there, it looks to me like we can probably go ahead with train without rushing review on that though) [20:19:10] hi brennen: for sure [20:19:35] Juan_90264: are you around for your patches? 
[20:20:16] (03PS7) 10Ottomata: Image Suggestions Feedback Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787749 (owner: 10Luke Bowmaker) [20:20:32] !log start ttmserver-export.php from Translate against eqiad search cluster for T306811 [20:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:40] luke_bow: cjming i rebased that change [20:20:47] should be good to go now [20:20:55] luke_bow: great - lgtm [20:21:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:31] (03CR) 10Clare Ming: [C: 03+2] Image Suggestions Feedback Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787749 (owner: 10Luke Bowmaker) [20:22:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:14] (03Merged) 10jenkins-bot: Image Suggestions Feedback Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787749 (owner: 10Luke Bowmaker) [20:23:04] (03PS2) 10Andrew Bogott: prometheus-node-cloudvirt-libvirt-stats.py: handle newer VM xml data [puppet] - 10https://gerrit.wikimedia.org/r/788781 [20:23:14] luke_bow: is your patch verifiable on mwdebug1001? [20:23:42] no, it's not [20:24:06] then I will go ahead and sync [20:24:12] thanks! [20:25:13] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787749|Image Suggestions Feedback Stream]] (duration: 00m 50s) [20:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:23] luke_bow: should be live! [20:25:39] thanks, I see it. 
Looks good [20:27:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:55] Juan_90264: I'll hang out for a bit longer - please ping if/when you're here and we can do your patches - otherwise please schedule for next B&C window [20:30:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:30:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [20:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:31:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:36:03] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:03] ottomata: thanks for the rebase (correcting attribution) [20:38:27] yup! [20:38:42] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:39:14] Hello [20:39:38] Are you still deploying? [20:39:57] If yes, I'm ready for deployment [20:40:05] hi Juan_90264 - i was just about to close the window - nick of time lol [20:40:40] (03PS8) 10Clare Ming: Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) (owner: 10Juan90264) [20:42:57] (03CR) 10Clare Ming: [C: 03+2] Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) (owner: 10Juan90264) [20:43:39] (03Merged) 10jenkins-bot: Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) (owner: 10Juan90264) [20:43:50] PROBLEM - Check systemd state on cloudweb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:59] Excellent merged! [20:44:31] ACKNOWLEDGEMENT - Check systemd state on cloudweb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott this host shouldnt even be able to page https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:01] Juan_90264: can you check on mwdebug1001? 
[20:46:24] cjming: Yes, I can check and I will [20:47:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:15] (03PS1) 10Andrea Denisse: admin: Add denisse to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/788793 [20:52:19] Juan_90264: for your 2nd patch, I believe it just needs a merge - unless you want to backport, there's nothing to deploy [20:52:30] cjming: I tested and approved [20:52:43] cool - syncing now [20:53:56] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:785208|Fix: Enable '$wgCopyUploadsDomains' to viwiki (T303577)]] (duration: 00m 50s) [20:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:00] T303577: "uploader" group for viwiki - https://phabricator.wikimedia.org/T303577 [20:54:02] Juan_90264: 1st patch should be live now [20:54:16] (03CR) 10JHathaway: [C: 03+2] admin: Add denisse to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/788793 (owner: 10Andrea Denisse) [20:55:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [20:55:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:43] The first change is already working now [20:57:37] Juan_90264: great re: 1st patch -- did you see my msg about your 2nd patch above? [20:59:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:59:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:58] I'm going to go ahead and close this backport window [21:01:03] !log end of UTC late backport & config window [21:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:41] cjming: thanks [21:01:54] np! [21:02:40] !log train 1.39.0-wmf.10 (T305216): no current blockers, proceeding to group0 [21:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:45] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [21:06:41] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788797 [21:06:43] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788797 (owner: 10Brennen Bearnes) [21:07:22] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788797 (owner: 10Brennen Bearnes) [21:07:24] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:30] cjming:  I saw the message, and 
yes I want to backport [21:08:20] Juan_90264: you will need to get review on that patch and schedule for another window. [21:08:49] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.10 refs T305216 [21:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:54] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [21:09:59] rolling back to testwikis. [21:12:19] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.9" [21:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:13:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:15:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:52] Juan_90264: what brennen said -- once the patch is merged (as a non-Kashmiri(?) 
reader, I'm not qualified to +2), you can schedule a backport to the appropriate release branch [21:16:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:00] cjming: Okay [21:19:07] Thanks for deploying, cjming! At least the first change [21:19:44] np - ty! [21:27:21] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Dzahn) [21:28:57] 10SRE, 10Data-Engineering, 10Traffic, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464 (10Dzahn) But what isn't is that there seems to be a... [21:32:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:32:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:01] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The following units failed: session-328634.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:49] (03PS1) 10Stang: ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) [21:48:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:48:31] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:51:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [21:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:20] (03PS1) 10Brennen Bearnes: Add class alias for TitleBlacklist and bump cache version [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788745 (https://phabricator.wikimedia.org/T307513) [21:52:46] (03CR) 10Brennen Bearnes: [C: 03+2] Add class alias for TitleBlacklist and bump cache version [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788745 (https://phabricator.wikimedia.org/T307513) (owner: 10Brennen Bearnes) [21:54:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:10] zabe: thanks for the patch [21:55:36] yw [21:55:51] (03Merged) 10jenkins-bot: Add class alias for TitleBlacklist and bump cache version [extensions/TitleBlacklist] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/788745 (https://phabricator.wikimedia.org/T307513) (owner: 10Brennen Bearnes) [21:56:17] any thoughts on testing that? i guess we'll know just about immediately if it fixed the issue on rolling forward. 
[21:57:14] yes, it is untested and I don't really see a better way than trying it out [21:59:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:48] (03CR) 10Samtar: [C: 03+1] ptwikinews: Enable extension MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788803 (https://phabricator.wikimedia.org/T299872) (owner: 10Stang) [21:59:54] cool, syncing. [22:02:40] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/TitleBlacklist: Backport: [[gerrit:788745|Add class alias for TitleBlacklist and bump cache version (T307513)]] (duration: 00m 50s) [22:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:44] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. - https://phabricator.wikimedia.org/T307513 [22:03:09] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:45] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:05:23] (03PS1) 10Brennen Bearnes: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788808 [22:05:25] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788808 (owner: 10Brennen Bearnes) [22:05:32] helps to push the previous revert commit first. 
[22:06:08] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788808 (owner: 10Brennen Bearnes) [22:06:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:06:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:50] (03PS1) 10Andrew Bogott: profile::openstack::codfw1dev::db: install mariadb-server instead of mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788809 (https://phabricator.wikimedia.org/T301719) [22:07:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:33] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788810 [22:07:35] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788810 (owner: 10Brennen Bearnes) [22:08:15] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.10 refs T305216 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788810 (owner: 10Brennen Bearnes) [22:08:34] (03CR) 10jerkins-bot: [V: 04-1] profile::openstack::codfw1dev::db: install mariadb-server instead of mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788809 (https://phabricator.wikimedia.org/T301719) (owner: 10Andrew Bogott) [22:09:31] (03PS2) 10Andrew Bogott: profile::openstack::codfw1dev::db: mariadb-server instead of mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788809 (https://phabricator.wikimedia.org/T301719) [22:09:40] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.10 refs 
T305216 [22:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:44] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216 [22:10:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [22:10:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [22:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:29] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::codfw1dev::db: mariadb-server instead of mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788809 (https://phabricator.wikimedia.org/T301719) (owner: 10Andrew Bogott) [22:12:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:12:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:34] there were still a few wmf.9 errors, I guess we also need a forward alias [22:13:38] (03PS1) 10Andrew Bogott: profile::openstack::codfw1dev::db: remove a ref to mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788811 (https://phabricator.wikimedia.org/T301719) [22:13:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:36] zabe: seeing a bunch in .10 as well [22:15:33] (03CR) 10Andrew Bogott: [C: 03+2] 
profile::openstack::codfw1dev::db: remove a ref to mysql-server [puppet] - 10https://gerrit.wikimedia.org/r/788811 (https://phabricator.wikimedia.org/T301719) (owner: 10Andrew Bogott) [22:15:36] ah yes, hmm [22:16:06] (03PS1) 10Stang: id_internalwikimedia: Enable extension UploadWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T299872) [22:17:28] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.9" [22:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:45] (03PS2) 10Stang: id_internalwikimedia: Enable extension UploadWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T304291) [22:18:02] (03PS1) 10Dzahn: add service records for new service image-suggestion [dns] - 10https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) [22:20:25] (03PS1) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [22:20:59] (03CR) 10jerkins-bot: [V: 04-1] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [22:21:15] (03PS1) 10Zabe: wmf.9 HACK: add forward class alias for TitleBlacklist [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788816 (https://phabricator.wikimedia.org/T307513) [22:23:07] (03PS2) 10Bking: elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) [22:23:16] (03PS2) 10Dzahn: add service records for new service image-suggestion [dns] - 10https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) [22:23:32] (03CR) 10Dzahn: "replacing this with 
https://gerrit.wikimedia.org/r/c/operations/dns/+/788814" [dns] - 10https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [22:23:41] (03CR) 10jerkins-bot: [V: 04-1] elastic: enable/disable ssl_ecdhe_curve based on OS version [puppet] - 10https://gerrit.wikimedia.org/r/788815 (https://phabricator.wikimedia.org/T307510) (owner: 10Bking) [22:24:36] (03CR) 10Dzahn: [C: 04-2] add image-suggestion.discovery.wmnet and point to ingress-wikikube (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [22:25:48] (03PS3) 10Dzahn: add svc and discovery records for new service image-suggestion [dns] - 10https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) [22:26:41] (03PS1) 10Andrea Denisse: admin: Add Andrea Denisse to icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/788817 [22:27:41] (03PS4) 10Dzahn: add svc and discovery records for new service image-suggestion [dns] - 10https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) [22:28:11] (03PS1) 10Ladsgroup: Set mediawikiwiki to READ NEW for templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788818 (https://phabricator.wikimedia.org/T306673) [22:28:28] jouncebot: nowandnext [22:28:28] No deployments scheduled for the next 8 hour(s) and 31 minute(s) [22:28:29] In 8 hour(s) and 31 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220504T0700) [22:28:42] Amir1: train state currently slightly broken [22:28:59] brennen: ohnoes, is there anything I can help with? 
[22:29:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[22:29:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[22:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:13] https://phabricator.wikimedia.org/T307513
[22:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:59] i think zabe's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TitleBlacklist/+/788816/ is probably correct short term fix for the errors on .9
[22:30:54] (CR) Ladsgroup: [C: +2] wmf.9 HACK: add forward class alias for TitleBlacklist [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788816 (https://phabricator.wikimedia.org/T307513) (owner: Zabe)
[22:31:04] (CR) Dzahn: [C: +1] "looks good to me, you should try it though after merging because capitalization matters and there are different LDAP fields (uid,cn,sn). Y" [puppet] - https://gerrit.wikimedia.org/r/788817 (owner: Andrea Denisse)
[22:31:21] brennen: cool, gonna deploy now, wanna grab popcorn?
[22:31:27] haha, sure
[22:31:56] (PS1) Stang: labswiki: Enable extension SubPageList3 [mediawiki-config] - https://gerrit.wikimedia.org/r/788819 (https://phabricator.wikimedia.org/T304181)
[22:32:02] if that takes care of those errors i think i may park the train where it's at until h.ashar's morning. :)
[22:33:09] (CR) Dzahn: [C: +2] add svc and discovery records for new service image-suggestion [dns] - https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn)
[22:33:21] (Merged) jenkins-bot: wmf.9 HACK: add forward class alias for TitleBlacklist [extensions/TitleBlacklist] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788816 (https://phabricator.wikimedia.org/T307513) (owner: Zabe)
[22:33:32] (PS1) Ebernhardson: Revert "translate: Move ttmserver queries to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/788820 (https://phabricator.wikimedia.org/T306811)
[22:34:30] I am not sure if the design in the code is optimal, it always fetches the cache for the same key and only then tries to compare the class versions. Maybe the class version should rather be used to generate the cache key.
[22:34:44] Amir1: there's a revert commit on deploy1002 i need to push, one sec
[22:35:09] sure
[22:36:26] !log ns0: authdns-update - deploying DNS change,add new svc and discovery records for image-suggestion T304891
[22:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:31] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891
[22:36:44] (PS1) Ebernhardson: cirrus: Move query traffic to codfw for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/788822 (https://phabricator.wikimedia.org/T306811)
[22:37:40] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) ` OK - authdns-update successful on all nodes! [authdns1001:~] $ host image-suggestion.discovery.wmnet...
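Editor's sketch for the 22:34:30 remark about putting the class version into the cache key: the snippet below is a minimal illustration under stated assumptions, not the extension's actual code. The names `cache`, `versioned_key`, and `load` are hypothetical, and a plain dict stands in for whatever shared cache the real code uses.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a shared object cache; not a MediaWiki API.
cache: Dict[str, object] = {}


def versioned_key(base: str, class_version: int) -> str:
    """Build a cache key that embeds the class version (assumed integer)."""
    return f"{base}:v{class_version}"


def load(base: str, class_version: int, build: Callable[[], object]) -> object:
    """Return the cached value for this version, building it on a miss."""
    key = versioned_key(base, class_version)
    if key not in cache:
        cache[key] = build()
    return cache[key]


# Two code versions running side by side never read each other's entries,
# so no version comparison is needed after the fetch.
old = load("blacklist", 1, lambda: ["old-format"])
new = load("blacklist", 2, lambda: {"format": "new"})
```

With the version in the key, a mixed deployment would likely just see a cache miss per version instead of unserializing an entry written by a different class layout; the cost is one cached copy per live version.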
[22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:38:49] James_F: btw, mediawikiwiki is already populated, s3 is on mkwiki atm
[22:39:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:39:01] going to turn it on now
[22:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:23] (PS1) Brennen Bearnes: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - https://gerrit.wikimedia.org/r/788823
[22:39:25] (CR) Brennen Bearnes: [C: +2] Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - https://gerrit.wikimedia.org/r/788823 (owner: Brennen Bearnes)
[22:39:29] (CR) Dzahn: [C: +2] "OK - authdns-update successful on all nodes!" [dns] - https://gerrit.wikimedia.org/r/788814 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn)
[22:40:24] (Merged) jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.10 refs T305216" [mediawiki-config] - https://gerrit.wikimedia.org/r/788823 (owner: Brennen Bearnes)
[22:40:39] ok, should be clean state on deploy box now.
[22:40:47] cool
[22:41:46] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[22:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:52] (CR) Ladsgroup: [C: +2] Set mediawikiwiki to READ NEW for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/788818 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[22:42:00] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.9/extensions/TitleBlacklist: Backport: [[gerrit:788816|wmf.9 HACK: add forward class alias for TitleBlacklist (T307513)]] (duration: 00m 50s)
[22:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:04] T307513: MediaWiki\Extension\TitleBlacklist\TitleBlacklist::load(): The script tried to execute a method or access a property of an incomplete object. - https://phabricator.wikimedia.org/T307513
[22:42:11] brennen: pushed
[22:42:37] (Merged) jenkins-bot: Set mediawikiwiki to READ NEW for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/788818 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[22:43:12] cool, and logs look normal. calling it here on the train for the day, deployment's yours if needed.
[22:43:37] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn)
[22:43:49] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788818|Set mediawikiwiki to READ NEW for templatelinks migration (T306673)]] (duration: 00m 50s)
[22:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:53] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[22:43:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:45:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:54] (Abandoned) Dzahn: add image-suggestion.discovery.wmnet and point to ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn)
[22:46:50] !log train 1.39.0-wmf.10 (T305216): T307513 doesn't seem quite resolved - parking the train at testwikis until european morning
[22:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:55] T305216: 1.39.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T305216
[22:48:12] (CR) Dzahn: "I just added the DNS records." [puppet] - https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[22:48:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:48:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:25] i'm intending to deploy some config patches to shift search traffic around for maintenance, looks like everything is clear?
[22:49:38] clear on my side
[22:49:40] ebernhardson: you should be good from my end.
[22:49:59] ok, thanks!
[22:50:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:50:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:49] (CR) Ebernhardson: [C: +2] Revert "translate: Move ttmserver queries to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/788820 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[22:51:38] (Merged) jenkins-bot: Revert "translate: Move ttmserver queries to codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/788820 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[22:51:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:56] (PS2) Ebernhardson: cirrus: Move query traffic to codfw for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/788822 (https://phabricator.wikimedia.org/T306811)
[22:52:02] (CR) Ebernhardson: [C: +2] cirrus: Move query traffic to codfw for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/788822 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[22:52:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:52:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:50] (Merged) jenkins-bot: cirrus: Move query traffic to codfw for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/788822 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[22:54:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:54:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:47] !log ebernhardson@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:788820|Revert "translate: Move ttmserver queries to codfw" (T306811)]] (duration: 00m 50s)
[22:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:51] T306811: Check for indices that are not compatible with elastic 7.x in production clusters - https://phabricator.wikimedia.org/T306811
[22:55:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:55:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:35] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:57:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:57:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:57:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:58] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788822|cirrus: Move query traffic to codfw for maintenance (T306811)]] (duration: 00m 49s)
[22:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:42] !log added image-suggestion to kube_services.certs.yaml in private repo, generated new certs and git committed them T304891
[22:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:45] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891
[22:59:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:59:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[22:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:04:10] (PS1) Stang: mediawikiwiki: Change wgSitename from "MediaWiki" to "MediaWiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/788825 (https://phabricator.wikimedia.org/T299458)
[23:06:02] (CR) jerkins-bot: [V: -1] mediawikiwiki: Change wgSitename from "MediaWiki" to "MediaWiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/788825 (https://phabricator.wikimedia.org/T299458) (owner: Stang)
[23:06:44] (PS2) Stang: mediawikiwiki: Change wgSitename from "MediaWiki" to "MediaWiki.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/788825 (https://phabricator.wikimedia.org/T299458)
[23:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:15:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:16] (PS1) Ebernhardson: Revert "cirrus: Move query traffic to codfw for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/788869 (https://phabricator.wikimedia.org/T306811)
[23:19:24] (PS1) Clare Ming: Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788849 (https://phabricator.wikimedia.org/T307019)
[23:20:10] (CR) Ebernhardson: [C: +2] Revert "cirrus: Move query traffic to codfw for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/788869 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[23:20:12] (PS1) Clare Ming: Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788850 (https://phabricator.wikimedia.org/T307019)
[23:21:32] (Merged) jenkins-bot: Revert "cirrus: Move query traffic to codfw for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/788869 (https://phabricator.wikimedia.org/T306811) (owner: Ebernhardson)
[23:21:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:54] (CR) Dzahn: "also created certificates for the services_proxy covering these names" [puppet] - https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[23:23:13] (CR) Jdlrobson: "I'd be tempted to skip patching wmf9 since Hebrew Wikipedia will be on wmf10 on Wednesday and Basque on Thursday (and volume of errors on " [extensions/WikimediaEvents] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788850 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:23:16] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788869|Revert "cirrus: Move query traffic to codfw for maintenance" (T306811)]] (duration: 00m 56s)
[23:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:21] T306811: Check for indices that are not compatible with elastic 7.x in production clusters - https://phabricator.wikimedia.org/T306811
[23:23:24] (CR) Jdlrobson: [C: +1] Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788849 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:26:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:27:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:10] (CR) Clare Ming: Fix undefined offset error (1 comment) [extensions/WikimediaEvents] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788850 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:28:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:37] (Abandoned) Clare Ming: Fix undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788850 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:32:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:33:35] (CR) Stang: [C: -1] "wait" [mediawiki-config] - https://gerrit.wikimedia.org/r/788825 (https://phabricator.wikimedia.org/T299458) (owner: Stang)
[23:35:47] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at reading authorization packet, system error: 104 Connection reset by peer https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:36:09] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) @WDoranWMF @hnowlan docs have been updated. https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_...
[23:36:17] (PS1) Clare Ming: Test for undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788851 (https://phabricator.wikimedia.org/T307019)
[23:36:27] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:36:45] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:37:45] (CR) jerkins-bot: [V: -1] Test for undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788851 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:38:03] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:38:43] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:39:01] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:39:03] (CR) Clare Ming: "I'll rebase (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/788849 needs to go first for tests to pass) after dep" [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788851 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:41:54] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:41:57] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:42:00] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:43:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:44:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T307525)', diff saved to https://phabricator.wikimedia.org/P27362 and previous config saved to /var/cache/conftool/dbconfig/20220503-234451-ladsgroup.json
[23:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:55] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525
[23:48:26] (PS1) STran: Enable IPInfo instrumentation on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/788872 (https://phabricator.wikimedia.org/T296480)
[23:48:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:48:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:50:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:50:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:53:33] (Abandoned) Jdlrobson: Test for undefined offset error [extensions/WikimediaEvents] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788851 (https://phabricator.wikimedia.org/T307019) (owner: Clare Ming)
[23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:56:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: T307525', diff saved to https://phabricator.wikimedia.org/P27363 and previous config saved to /var/cache/conftool/dbconfig/20220503-235701-ladsgroup.json
[23:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:05] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525