[00:00:04] brennen: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220107T0000).
[00:00:05] nn1l2: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:10] hi
[00:00:25] nn1l2: hi
[00:00:41] I have a patch with a -1 from jenkins-bot
[00:00:42] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:00:53] I don't know what's wrong with jenkins
[00:01:09] Could you please have a look?
[00:01:30] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/752036
[00:01:50] nn1l2: "Unexpected space in '100' namespace title for viwiktionary, use underscores instead" "Failed asserting that 'Phụ lục' does not contain " ""
[00:02:01] https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-test-docker/15416/console
[00:02:25] give me a sec and I'll fix it
[00:02:50] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:03:40] (PS3) 4nn1l2: viwiktionary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289)
[00:05:17] Good to go
[00:06:17] cdanis: docs say I should depool any servers with lag greater than an hour
[00:06:59] where can I find docs on how to do that?
[00:07:56] Is B&C going on?
[00:08:23] nn1l2: I can deploy your change, looking now
[00:08:37] thanks
[00:09:37] (CR) Thcipriani: [C: +2] "backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) (owner: 4nn1l2)
[00:09:49] I have restarted wdqs-blazegraph.service on all the laggy nodes
[00:10:20] ok, that was just a short time ago, was it?
[00:10:21] (Merged) jenkins-bot: viwiktionary: add namespaces “Appendix” and “Appendix talk” [mediawiki-config] - https://gerrit.wikimedia.org/r/752036 (https://phabricator.wikimedia.org/T298289) (owner: 4nn1l2)
[00:10:30] topranks: yup
[00:10:42] ok let's see how it progresses.
[00:10:42] lag seems to be dropping fast, according to the graphs?
[00:11:17] yeah if it is making good progress I would leave that
[00:11:24] btw 'sudo depool' on each host is the easiest way
[00:11:33] cdanis: noted, thanks
[00:11:49] nn1l2: namespace change is live on mwdebug1002, check please
[00:14:31] thcipriani: I don't see anything on https://vi.wiktionary.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Ti%E1%BB%81n_t%E1%BB%91
[00:14:53] when I open the drop down menu
[00:15:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:04] It's shot down on wdqs1006/1004.
Still relatively high on wdqs1012 but it's not increasing at least
[00:15:26] I expect to see the new namespace "Phụ lục", but I can't see it
[00:15:51] topranks: yeah
[00:16:06] nn1l2: indeed, checking that everything synced
[00:16:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:24] definitely looking a lot healthier
[00:17:34] yeah I think so as well
[00:17:48] anything else we should check before stepping away?
[00:17:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:32] Looking at the OpenSearch page I can see the queries from AWS still there.
[00:18:44] https://logstash.wikimedia.org/app/dashboards#/view/259a4460-8e7e-11e7-9846-4f694cbd6a14?_g=h@a91e569&_a=h@e4186ec
[00:19:00] It's working now
[00:19:05] So maybe that's gonna eventually knock something out of whack again
[00:19:10] https://vi.wiktionary.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Thay_%C4%91%E1%BB%95i_g%E1%BA%A7n_%C4%91%C3%A2y?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2
[00:19:16] nn1l2: great thanks for confirming :)
[00:20:44] topranks: that logstash url doesn't load correctly, it says to use the share button?
[00:21:00] topranks: I think nn1l2 is probably talking about different things :) (also doing a quick backport)
[00:21:13] nn1l2: cool, yeah, seeing it, too, going live
[00:21:19] ok try 2: https://logstash.wikimedia.org/goto/22072eac35d8a1785258521fd2cc27c8
[00:21:52] yup that worked, thanks
[00:22:35] jhathaway: ok thanks, that answers my "how is this Vietnamese wiktionary somehow related to wikidata" question anyway!
[00:23:08] thcipriani: apologies, pasted the wrong nick :)
[00:23:48] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752036|viwiktionary: add namespaces "Appendix" and "Appendix talk" (T298289)]] (duration: 00m 59s)
[00:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:51] T298289: Add namespace “Appendix” and “Appendix talk” to Vietnamese Wiktionary - https://phabricator.wikimedia.org/T298289
[00:24:06] ^ nn1l2 should be live everywhere here shortly
[00:24:30] Yeah, it's live now
[00:24:33] Thanks
[00:24:39] jhathaway: I'm not sure of anything else we should do. Those lag metrics and everything else that shot up have returned to better levels than they were earlier.
[00:25:11] yeah I agree, the bad queries appear to not be coming back, at least at the moment
[00:25:38] If it happens again we might need to see if we could rate-limit those incoming queries based on user-agent or something.
[00:26:02] But let's hope it stays as it is
[00:26:08] yeah I saw that mentioned in the docs, I'm going to step away and cook dinner, but feel free to page me if something pops up again, thanks for your help!
[00:34:22] (PS1) Addshore: planet: add wikidatacon tag to my blog feed [puppet] - https://gerrit.wikimedia.org/r/752040
[00:38:28] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:50:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:30:34] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:34] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:40] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:39:26] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (tstarling) Is the procedure the one documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments ?
[03:41:52] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:22:31] (CR) Ladsgroup: passwords: Add ladsgroup to the cloud root (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/748699 (owner: Ladsgroup)
[05:39:18] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm) Yep, you'll need to create a commit like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/751918 for shellbox-media, +2 it, wait for the cron to...
[05:47:38] !log rename wikishared.wikimedia_editor_tasks_targets_passed on db1120 T264225
[05:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:41] T264225: Drop table wikimedia_editor_tasks_targets_passed on wmf wikis - https://phabricator.wikimedia.org/T264225
[06:08:21] (PS1) Marostegui: Revert "dbproxy200[1,2]: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752011
[06:08:27] (PS1) Marostegui: Revert "dbproxy2003: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752012
[06:11:42] (CR) Marostegui: [C: +2] Revert "dbproxy200[1,2]: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752011 (owner: Marostegui)
[06:11:48] (CR) Marostegui: [C: +2] Revert "dbproxy2003: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752012 (owner: Marostegui)
[06:14:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db[2076,2095].codfw.wmnet with reason: Maintenance
[06:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db[2076,2095].codfw.wmnet with reason: Maintenance
[06:14:53] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2089.codfw.wmnet with reason: Maintenance
[06:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2089.codfw.wmnet with reason: Maintenance
[06:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance
[06:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance
[06:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[06:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[06:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance
[06:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance
[06:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason:
Maintenance
[06:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[06:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:54] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:41:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18409 and previous config saved to /var/cache/conftool/dbconfig/20220107-064119-marostegui.json
[06:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:22] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[06:42:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18410 and previous config saved to /var/cache/conftool/dbconfig/20220107-064228-marostegui.json
[06:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:08] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (odimitrijevic) Approved
[06:57:33] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P18411 and previous config saved to /var/cache/conftool/dbconfig/20220107-065733-marostegui.json
[06:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:46] (CR) Marostegui: [C: +1] "testing looks good" [software] - https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[07:00:45] (CR) Ladsgroup: [C: +2] auto_schema: Automatic detection of active dc [software] - https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[07:01:20] (Merged) jenkins-bot: auto_schema: Automatic detection of active dc [software] - https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[07:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P18412 and previous config saved to /var/cache/conftool/dbconfig/20220107-071237-marostegui.json
[07:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:02] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18413 and previous config saved to /var/cache/conftool/dbconfig/20220107-072742-marostegui.json
[07:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:46] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[07:56:36] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220107T0800)
[08:17:51] (PS1) Gehel: icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130
[08:23:05] jhathaway: thanks for taking care of blazegraph! <3
[08:46:38] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:56:37] (CR) Hashar: Refactor git-daemon use in profile::zuul::merger (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751816 (owner: Ahmon Dancy)
[08:57:40] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:09:00] (CR) JMeybohm: [C: +1] kubernetes: point to new kubestage node [dns] - https://gerrit.wikimedia.org/r/751976 (https://phabricator.wikimedia.org/T293729) (owner: AOkoth)
[09:27:52] (CR) David Caro: [C: +2] c:kafka:broker:jmxtrans: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751085 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:29:21] (CR) David Caro: [C: +2] osm: remove unused profile/role [puppet] - https://gerrit.wikimedia.org/r/751703 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:30:22] (CR) David Caro: {p,r}:gerrit:migration/migration_base: remove unused role/profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:30:31] (Abandoned) David Caro: {p,r}:gerrit:migration/migration_base: remove unused role/profile [puppet] - https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559)
(owner: David Caro)
[09:34:39] (PS6) Jbond: exim: add the ability to silently drop senders [puppet] - https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: JHathaway)
[09:35:16] (CR) jerkins-bot: [V: -1] exim: add the ability to silently drop senders [puppet] - https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: JHathaway)
[09:35:23] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[09:36:05] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: JHathaway)
[09:36:59] (PS7) Jbond: exim: add the ability to silently drop senders [puppet] - https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: JHathaway)
[09:40:02] (CR) Jbond: [C: +1] exim: add the ability to silently drop senders [puppet] - https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: JHathaway)
[09:43:16] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) (owner: Aqu)
[09:47:48] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:48:08] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old.
https://wikitech.wikimedia.org/wiki/Mirrors
[09:52:00] (CR) Btullis: [C: +2] admin: create shell user aqu, add to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/751956 (https://phabricator.wikimedia.org/T298657) (owner: Aqu)
[09:53:19] SRE, Performance-Team, Traffic, User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (ema)
[09:53:26] SRE, Performance-Team, Traffic, User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (ema) p:Triage→Medium
[09:57:36] (CR) David Caro: "Waiting for review from @mpopov, when he's back from paternal leave" [puppet] - https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[09:57:45] (CR) David Caro: "Waiting for review from @mpopov, when he's back from paternal leave" [puppet] - https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[10:04:21] (PS1) Joal: Update AQS druid datasource for new month [puppet] - https://gerrit.wikimedia.org/r/752132
[10:04:36] btullis: Heya - I posted that for when you have a minute --^
[10:05:18] joal: Will do this morning. Thanks.
[10:19:37] (CR) Btullis: [C: +2] Update AQS druid datasource for new month [puppet] - https://gerrit.wikimedia.org/r/752132 (owner: Joal)
[10:20:46] (PS1) Majavah: Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134
[10:33:16] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:03] (CR) Marostegui: "Adding Amir as he is more capable than me to review MW code :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah)
[10:39:52] (CR) Jelto: [C: +2] gitlab_runner: use config template for registering new runners [puppet] - https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:40:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:15] SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis)
[10:50:36] (PS1) Jelto: gitlab_runner: fix missing url in registration command [puppet] - https://gerrit.wikimedia.org/r/752137 (https://phabricator.wikimedia.org/T295481)
[10:56:16] (CR) Jelto: [C: +2] gitlab_runner: fix missing url in registration command [puppet] - https://gerrit.wikimedia.org/r/752137 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[11:03:33] (PS1) Jelto: gitlab_runner: fix missing parameters in registration command [puppet] - https://gerrit.wikimedia.org/r/752138 (https://phabricator.wikimedia.org/T295481)
[11:07:01] (CR) Jelto: [C: +2] gitlab_runner: fix missing parameters in registration command [puppet] - https://gerrit.wikimedia.org/r/752138 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[11:14:15] (PS1) RhinosF1: Revert "Use strict equality when safe to do so" [extensions/Flow] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752014
[11:14:41] taavi, kostajh: ^
[11:16:55] SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen -
https://phabricator.wikimedia.org/T298657 (BTullis) I have added `aqu` to the `wmf` LDAP group as per: https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group ` btullis@mwmaint1002:~$ sudo m...
[11:18:01] (PS2) Kosta Harlan: Revert "Use strict equality when safe to do so" [extensions/Flow] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752014 (https://phabricator.wikimedia.org/T298760) (owner: RhinosF1)
[11:22:00] SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) Maybe I jumped the gun here. I think that perhaps this ought to have been more correctly handled by the person on SRE clinic duty. https://wikitech.w...
[11:30:16] I think T298694 is seeking an emergency deployment (it could have been a train blocker imo)
[11:30:16] T298694: ProofreadPage: zoom/pan not working in side-by-side editing mode - https://phabricator.wikimedia.org/T298694
[11:32:27] zabe: I'm happy to deploy as long as the patch author is available and gets releng+sre approval
[11:34:43] (CR) Kosta Harlan: [C: +1] Revert "Use strict equality when safe to do so" [extensions/Flow] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752014 (https://phabricator.wikimedia.org/T298760) (owner: RhinosF1)
[11:35:30] (CR) Hashar: [C: +2] Revert "Use strict equality when safe to do so" [extensions/Flow] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752014 (https://phabricator.wikimedia.org/T298760) (owner: RhinosF1)
[11:37:06] hello! I'm not the patch author, but I am able to test it
[11:38:58] hashar: could we do an emergency deployment for T298694 as well?
[11:38:58] T298694: ProofreadPage: zoom/pan not working in side-by-side editing mode - https://phabricator.wikimedia.org/T298694
[11:45:01] (CR) Hnowlan: [C: +2] maps: correctly template swift credentials [puppet] - https://gerrit.wikimedia.org/r/751928 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[11:47:39] (PS1) Hnowlan: maps: fix incorrect variable reference [puppet] - https://gerrit.wikimedia.org/r/752140 (https://phabricator.wikimedia.org/T292700)
[11:52:03] (Merged) jenkins-bot: Revert "Use strict equality when safe to do so" [extensions/Flow] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752014 (https://phabricator.wikimedia.org/T298760) (owner: RhinosF1)
[11:52:11] zabe: what is the change?
[11:52:14] I'll sync that Flow patch out
[11:52:24] +1
[11:52:46] hashar: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/751843
[11:52:59] oh it is attached to the task
[11:53:00] ;D
[11:53:16] (CR) Hashar: [C: +2] Makes sure $imgContHorizontal is always initialized [extensions/ProofreadPage] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/751843 (https://phabricator.wikimedia.org/T298694) (owner: Tpt)
[11:53:20] +2ed
[11:53:39] it is, but now if it's the wrong one you can blame someone :-D
[11:54:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:54] tested the flow patch, syncing
[11:56:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:56:00] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/Flow: Backport: [[gerrit:752014|Revert "Use strict equality when safe to do so" (T298760)]] (duration: 01m 00s)
[11:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START
helmfile.d/services/mwdebug: apply on pinkunicorn
[11:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:04] T298760: Flow\Exception\FlowException: A required post has not been loaded: tn9fp3z7fq89497j - https://phabricator.wikimedia.org/T298760
[11:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:24] ;)
[11:56:30] hashar: Thank you!
[11:56:50] I'm around if you need someone to test the change
[11:57:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:14] Tpt: we need someone to test it once Jenkins is happy with it
[11:58:01] great! I have a Firefox instance around with the WikimediaDebug extension
[11:58:49] (CR) Hnowlan: [C: +2] maps: fix incorrect variable reference [puppet] - https://gerrit.wikimedia.org/r/752140 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[11:59:20] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/751843 is still in CI
[12:07:33] (CR) Jbond: [C: +1] elasticsearch:decommission: remove unused module [puppet] - https://gerrit.wikimedia.org/r/751088 (https://phabricator.wikimedia.org/T272559) (owner: David Caro)
[12:11:29] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (mfossati) Does //shell access// mean regular or **production** one? I don't have the latter yet.
[12:12:12] (Merged) jenkins-bot: Makes sure $imgContHorizontal is always initialized [extensions/ProofreadPage] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/751843 (https://phabricator.wikimedia.org/T298694) (owner: Tpt)
[12:12:14] finally
[12:12:55] Tpt: inductiveload: the patch is live on mwdebug1002, could you test please?
[12:13:28] yep that's working
[12:13:42] great, syncing
[12:14:28] \o/
[12:14:37] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/ProofreadPage/modules/page: Backport: [[gerrit:751843|Makes sure $imgContHorizontal is always initialized (T298694)]] (duration: 00m 59s)
[12:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:40] T298694: ProofreadPage: zoom/pan not working in side-by-side editing mode - https://phabricator.wikimedia.org/T298694
[12:14:56] (CR) Jelto: [V: +1 C: +2] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics [puppet] - https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[12:15:04] the patch is now live
[12:15:06] anything else?
[12:15:10] ¡hola! xover, just in time
[12:15:31] Indeed.
[12:15:33] not from me, thank you very much for the backport
[12:15:35] thank you!
[12:15:35] I believe everything else is fine on Wikisource
[12:16:08] great
[12:17:19] (PS3) Jelto: P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics [puppet] - https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481)
[12:17:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:19:53] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:13] inductiveload, Tpt: verified. all the issues I noticed / saw reported appear to be fixed.
[12:20:55] taavi: thank you for the backport deployments!
[12:21:57] SRE, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (jbond) While looking at Open Manage Enterprise i noticed that it appeared to download the informa...
[12:25:58] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 408 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:26:12] (CR) Matthias Mullie: [C: +1] "Other patch has been approved; this is good to go" [mediawiki-config] - https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863) (owner: Matthias Mullie)
[12:27:04] (PS2) Matthias Mullie: Add MediaSearch profiles [mediawiki-config] - https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863)
[12:28:14] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:37:46] (PS1) Jgiannelos: Disable tilerator in all envs maps are deployed [puppet] - https://gerrit.wikimedia.org/r/752145 (https://phabricator.wikimedia.org/T298246)
[12:38:01] (PS1) Ssingh: hieradata: add durum cluster [puppet] - https://gerrit.wikimedia.org/r/752146
[12:47:49] SRE, Move-Files-To-Commons, Wikimedia-Extension-setup, Patch-For-Review, Wikimedia-extension-review-queue: Deploying FileExporter and FileImporter -
https://phabricator.wikimedia.org/T190716 (thiemowmde) Open→Resolved Deployed to all wikis since T213425. Not a Beta feature any more si... [13:11:29] SRE: Add user nmaphophe@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298770 (ntsako) [13:24:30] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (RhinosF1) Just production shell access [13:26:38] SRE: Add user nmaphophe@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298770 (Aklapper) (For the records, this is not a mailing list. It's an alias, see T289807.) [13:26:50] SRE: Add user nmaphophe@wikimedia.org to the analytics-alerts mail alias - https://phabricator.wikimedia.org/T298770 (Aklapper) [13:45:25] (PS1) Ema: Use libunwind for backtraces [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/752151 (https://phabricator.wikimedia.org/T298758) [13:56:14] SRE, Performance-Team, Traffic, Patch-For-Review, User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (ema) >>! In T298758#7604333, @gerritbot wrote: > Change 752151 had a related patch set uploaded (by Ema; author: Ema): > %%%[operations/debs/varnish4@debia... [13:56:39] SRE, Performance-Team, Traffic, Patch-For-Review, User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (ema) [14:04:33] (CR) Ladsgroup: [C: +1] "Looks straightforward enough to me."
[mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [14:05:40] !log upgrade varnish on deployment-cache-text06 to 6.0.9 T298758 [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:42] SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (Ottomata) No I think any SRE can do the work; IIUC clinic duty exists to make sure things like this don't fall through the cracks. Proceed! [14:05:43] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [14:06:16] (PS2) Majavah: Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 [14:06:34] (CR) Majavah: Update wikitech etcd readonly exemption (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [14:07:56] SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) p:Triage→Medium a:BTullis [14:08:14] (CR) Ladsgroup: [C: +1] Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [14:09:07] (CR) Ema: [V: +2 C: +2] Use libunwind for backtraces [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/752151 (https://phabricator.wikimedia.org/T298758) (owner: Ema) [14:10:28] (PS1) Ema: Release 6.0.9-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/752153 (https://phabricator.wikimedia.org/T293879) [14:11:30] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) [14:12:46] (PS2) Ema: Release 6.0.9-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/752153
(https://phabricator.wikimedia.org/T298758) [14:14:17] SRE, Performance-Team, Traffic, Patch-For-Review, User-ema: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 (ema) Smoke testing of 6.0.9 is fine on deployment-prep, I'll start upgrading production nodes next week. [14:38:35] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) I have created a Kerberos principal for Antoine. ` btullis@krb1001:~$ sudo manage_principals.py get aqu get_principal: P... [14:43:39] (CR) jerkins-bot: [V: -1] Release 6.0.9-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/752153 (https://phabricator.wikimedia.org/T298758) (owner: Ema) [14:54:52] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:03:07] SRE, Move-Files-To-Commons, Wikimedia-Extension-setup, Patch-For-Review, Wikimedia-extension-review-queue: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 (thiemowmde) [15:04:36] SRE, Move-Files-To-Commons, Wikimedia-Extension-setup, Patch-For-Review, Wikimedia-extension-review-queue: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (thiemowmde) [15:08:19] !log creating mediainfo-streaming-updater.mutation topics on kafka main-eqiad and main-codfw and setting retention to 30 days - T296470 [15:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:22] T296470: Initialize WCQS production servers - https://phabricator.wikimedia.org/T296470 [15:11:01] SRE: Adding aquhen@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298778 (BTullis) I can verify this request.
Antoine has recently joined our team. [15:18:50] !log reset email address for Ollie Shotton developer account per T298779 [15:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:53] T298779: Account recovery help needed for Developer account Ollie Shotton - https://phabricator.wikimedia.org/T298779 [15:25:42] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:49] ^ that seems to flap once in a while and then self-recovers after an hour or so, but I'm still curious on why it fails occasionally [15:30:07] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) I believe that this is now complete, but feel free to respond on this ticket Antoine if anything doesn't behave as you'd...
[15:31:19] (PS8) Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/736915 [15:35:18] (CR) AOkoth: [C: +2] kubernetes: point to new kubestage node [dns] - https://gerrit.wikimedia.org/r/751976 (https://phabricator.wikimedia.org/T293729) (owner: AOkoth) [15:35:26] (CR) AOkoth: [C: +2] kubernetes: remove kubestage1001 & kubestage1002 [puppet] - https://gerrit.wikimedia.org/r/751752 (https://phabricator.wikimedia.org/T293729) (owner: AOkoth) [15:43:14] SRE, Two-Column-Edit-Conflict-Merge, Patch-For-Review: Deploy TwoColConflict extension to beta - https://phabricator.wikimedia.org/T154927 (thiemowmde) [15:49:26] SRE, Two-Column-Edit-Conflict-Merge, Patch-For-Review, Wikimedia-extension-review-queue: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184 (thiemowmde) [16:00:28] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:04:14] (CR) AOkoth: [C: +2] kubernetes: remove kubestage1001 & kubestage1002 [homer/public] - https://gerrit.wikimedia.org/r/751754 (https://phabricator.wikimedia.org/T293729) (owner: AOkoth) [16:11:04] (CR) Andrew Bogott: [C: +2] Added cookbook to create an nfs server [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/736915 (owner: Andrew Bogott) [16:12:43] (PS2) JHathaway: sodium: change role to insetup, to prep for decom [puppet] - https://gerrit.wikimedia.org/r/751990 [16:14:08] (Merged) jenkins-bot: Added cookbook to create an nfs server [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/736915 (owner: Andrew Bogott) [16:20:44] (PS3) JHathaway: sodium: change role to spare::system, to prep for decom [puppet] - https://gerrit.wikimedia.org/r/751990 [16:21:07] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to the data
engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) @Antoine_Quhen - I notice that you haven't added yourself to the `analytics-admins` group in `data.yaml`, only the `anal... [16:21:10] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:24] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/751990 (owner: JHathaway) [16:23:26] (PS4) JHathaway: sodium: change role to spare::system, to prep for decom [puppet] - https://gerrit.wikimedia.org/r/751990 [16:27:41] (PS1) Urbanecm: Do not delete the suppress group [mediawiki-config] - https://gerrit.wikimedia.org/r/752162 (https://phabricator.wikimedia.org/T112147) [16:27:43] (PS1) Urbanecm: Remove the oversight group hack [mediawiki-config] - https://gerrit.wikimedia.org/r/752163 (https://phabricator.wikimedia.org/T112147) [16:27:51] SRE, SRE-OnFire, Infrastructure-Foundations, Patch-For-Review: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (colewhite) Index curation is affected as well because python's datetime formatter doesn't...
[16:31:31] (CR) JHathaway: [C: +2] sodium: change role to spare::system, to prep for decom [puppet] - https://gerrit.wikimedia.org/r/751990 (owner: JHathaway) [16:32:31] SRE-Access-Requests: Requesting access to analytics cluster for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Snwachukwu) [16:34:10] SRE-Access-Requests: Requesting access to analytics cluster for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Snwachukwu) [16:38:11] SRE-Access-Requests: Requesting access to analytics cluster for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Snwachukwu) [16:40:27] SRE-Access-Requests: Requesting access to analytics cluster for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Snwachukwu) [16:44:49] SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Snwachukwu) [16:46:22] SRE, SRE-Access-Requests, Data-Engineering-Kanban: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (odimitrijevic) Approved [16:46:54] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (mfossati) [16:47:11] (CR) Andrew Bogott: [C: +2] designate sink: fix proxy cleanup when proxy domain == project domain [puppet] - https://gerrit.wikimedia.org/r/751963 (https://phabricator.wikimedia.org/T298681) (owner: Andrew Bogott) [16:47:51] SRE-swift-storage, serviceops, Patch-For-Review: Allow maps2009/maps1009 (master nodes) access thanos-swift - https://phabricator.wikimedia.org/T292700 (Jgiannelos) Open→Resolved a:Jgiannelos [16:57:06] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:58:19] SRE, ops-codfw, DBA:
Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (Papaul) Open→Resolved Disk replaced [16:58:45] (PS1) David Caro: check_haproxy: improve failover output [puppet] - https://gerrit.wikimedia.org/r/752170 [16:59:24] (CR) jerkins-bot: [V: -1] check_haproxy: improve failover output [puppet] - https://gerrit.wikimedia.org/r/752170 (owner: David Caro) [16:59:29] (PS2) David Caro: check_haproxy: improve failover output [puppet] - https://gerrit.wikimedia.org/r/752170 [17:00:06] (CR) jerkins-bot: [V: -1] check_haproxy: improve failover output [puppet] - https://gerrit.wikimedia.org/r/752170 (owner: David Caro) [17:01:36] (PS3) David Caro: check_haproxy: improve failover output [puppet] - https://gerrit.wikimedia.org/r/752170 [17:05:30] (PS1) Elukey: Use a flag to deploy log4j-extras on Hadoop-related nodes [puppet] - https://gerrit.wikimedia.org/r/752171 [17:06:49] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33154/console" [puppet] - https://gerrit.wikimedia.org/r/752171 (owner: Elukey) [17:11:45] SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Aklapper) a:Snwachukwu→None [17:12:14] SRE, ops-codfw, Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Papaul) Open→Resolved Checked the server today no error so far on DIMM B1, closing the task. if we have the problem we can re-open the task.
[17:12:16] (CR) Btullis: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/752171 (owner: Elukey) [17:13:27] (CR) Elukey: [V: +1 C: +2] Use a flag to deploy log4j-extras on Hadoop-related nodes [puppet] - https://gerrit.wikimedia.org/r/752171 (owner: Elukey) [17:18:14] SRE, ops-codfw, DBA: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (Marostegui) Thanks - it is rebuilding: ` Enclosure Device ID: 32 Slot Number: 4 Drive's position: DiskGroup: 0, Span: 0, Arm: 4 Enclosure position: 1 Device Id: 4 WWN: 55cd2e41537dbf9a Sequence Number: 13 Media... [17:31:25] (PS2) Cwhite: logstash: update weekly indexes to use weekyear pattern syntax [puppet] - https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) [17:31:27] (PS2) Cwhite: prometheus: update affected es-exporter configs to use weekyear [puppet] - https://gerrit.wikimedia.org/r/751766 (https://phabricator.wikimedia.org/T298619) [17:32:02] (CR) Cwhite: logstash: update weekly indexes to use weekyear pattern syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) (owner: Cwhite) [17:32:04] (CR) jerkins-bot: [V: -1] logstash: update weekly indexes to use weekyear pattern syntax [puppet] - https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) (owner: Cwhite) [17:35:44] (PS3) Cwhite: logstash: update weekly indexes to use weekyear pattern syntax [puppet] - https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) [17:35:46] (PS3) Cwhite: prometheus: update affected es-exporter configs to use weekyear [puppet] - https://gerrit.wikimedia.org/r/751766 (https://phabricator.wikimedia.org/T298619) [17:41:05] SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (Ottomata) Approved.
[17:41:45] SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (odimitrijevic) Approved [17:47:35] (CR) CDanis: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) (owner: Cwhite) [17:58:17] (PS1) Andrew Bogott: wmf_sink base: fix the calculation of proxy parent zone [puppet] - https://gerrit.wikimedia.org/r/752181 (https://phabricator.wikimedia.org/T298681) [18:00:39] (CR) Andrew Bogott: [C: +2] wmf_sink base: fix the calculation of proxy parent zone [puppet] - https://gerrit.wikimedia.org/r/752181 (https://phabricator.wikimedia.org/T298681) (owner: Andrew Bogott) [18:01:19] (PS1) BryanDavis: wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 [18:08:29] (PS10) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [18:08:48] (CR) jerkins-bot: [V: -1] prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron) [18:10:32] (PS11) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [18:11:09] (CR) jerkins-bot: [V: -1] prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron) [18:12:27] (PS12) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [18:15:14] (PS1) Urbanecm: Growth: Add GEMentorDashboardDeploymentMode [mediawiki-config] -
https://gerrit.wikimedia.org/r/752187 (https://phabricator.wikimedia.org/T298792) [18:15:58] (CR) Reedy: [C: +1] wikitech: Remove password clear on block [mediawiki-config] - https://gerrit.wikimedia.org/r/752185 (owner: BryanDavis) [18:16:35] (CR) Herron: prometheus: add blackbox generic http/s static check support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron) [18:27:24] SRE, vm-requests: eqiad/codfw: 2 VMs requested for apifeatureusage - https://phabricator.wikimedia.org/T298794 (herron) p:Triage→Medium [18:29:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS bullseye [18:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:40] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye [18:51:47] RECOVERY - MegaRAID on db2147 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:10:54] SRE, vm-requests: eqiad/codfw: 2 VMs requested for apifeatureusage - https://phabricator.wikimedia.org/T298794 (herron) a:herron [19:11:01] (CR) Dzahn: [C: +2] planet: add wikidatacon tag to my blog feed [puppet] - https://gerrit.wikimedia.org/r/752040 (owner: Addshore) [19:11:26] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host apifeatureusage1001.eqiad.wmnet [19:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:13] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:23] !log pt1979@cumin2002 END (FAIL) - Cookbook
sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS bullseye [19:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:31] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye executed with errors: - elastic2051... [19:18:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS bullseye [19:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:59] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye [19:21:03] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host apifeatureusage2001.codfw.wmnet [19:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:59] SRE, vm-requests: eqiad/codfw: 2 VMs requested for apifeatureusage - https://phabricator.wikimedia.org/T298794 (herron) [19:25:02] SRE, Observability-Logging, Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (herron) [19:25:14] SRE, ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (Dzahn) [19:26:58] ACKNOWLEDGEMENT - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T298800 [19:29:23] ACKNOWLEDGEMENT - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:29:23] ACKNOWLEDGEMENT - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:30:00] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) db2063 and db2068 were affected today [19:32:30] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) for the record: I have absolutely no idea why contint2001.mgmt disappeared... [19:33:22] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) a:Dzahn→None [19:34:05] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) [19:35:16] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) [19:36:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS bullseye [19:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:18] SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2051.codfw.wmnet with OS bullseye executed with errors: - elastic2051...
[19:36:34] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) @Papaul Do you know about contint2001.mgmt status? [19:37:39] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul) @Dzahn no [19:40:15] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) [19:40:50] Can any deployers purge this change: https://gerrit.wikimedia.org/r/c/751530? The user requesting the task in Phabricator tells me he can't see the logo of this change on the wiki, he says he can still only see the old one. [19:40:58] Can any deployers purge this change: https://gerrit.wikimedia.org/r/c/751530 ? The user requesting the task in Phabricator tells me he can't see the logo of this change on the wiki, he says he can still only see the old one. [19:41:45] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host apifeatureusage1001.eqiad.wmnet [19:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] SRE, SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (Dzahn) Open→In progress [19:46:06] SRE, SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (Dzahn) @bking Is it working?
[19:46:40] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops, observability: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) [19:47:01] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops, observability: contint2001.mgmt disappeared from Icinga (was: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (Dzahn) [19:49:18] Can any deployers purge this change: https://gerrit.wikimedia.org/r/c/751530 ? [19:49:53] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host apifeatureusage2001.codfw.wmnet [19:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:17] SRE: Adding aquhen@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298778 (Dzahn) @BTullis You can do this self-service with your root privileges: It's in the private puppet repo. See: puppetmaster1001:/srv/private/modules/privateexim/files/wikimedia.org This is... [19:51:59] SRE: Add user nmaphophe@wikimedia.org to the analytics-alerts mail alias - https://phabricator.wikimedia.org/T298770 (Dzahn) @BTullis same as T298778#7606120 [19:53:15] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: Jforrester) [19:54:28] SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (razzi) I'm going to go ahead and grant the permissions [19:54:55] Urbanecm: Can any deployers purge this change: https://gerrit.wikimedia.org/r/c/751530 ? [19:55:17] Juan_90264: what do you mean by "purge"?
[20:00:36] urbanecm: purgeList presumably [20:00:40] SRE, Data-Engineering, Generated Data Platform, Platform Engineering: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (Eevans) [20:01:22] Reedy: well, unless you did it, https://en.wikipedia.org/static/images/project-logos/zhwikinews.png returns the new image on my end [20:01:29] I didn't [20:01:39] It's possible not all DCs do though [20:02:41] Mine also returns the new logo, but for others this is not happening [20:02:59] which one doesn't work? [20:03:01] Yes, it's exactly the PurgeList [20:04:41] Urbanecm: The user on Phabricator seems to have reported that this is happening in the zh version [20:05:22] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (Dzahn) verified user via https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Verifying_WMF_developer_accounts @MarkTraceur can you approve?
[20:05:23] (The zh-hans version he hasn't tested yet, but I believe it's the same for him because it's part of the same change) [20:06:03] well, i can do a purgeList, no problem [20:06:12] but keep in mind that /static is cached in browser too [20:06:28] Okay [20:06:46] so changes to files in /static generally take weeks to take effect everywhere (until each and every cache in browsers expires) [20:06:51] Ctrl+Shift+R gets rid of that [20:07:19] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (Dzahn) Open→In progress [20:08:52] !log Purge https://en.wikipedia.org/static/images/project-logos/{zhwikinews,zhwikinews-1.5x,zhwikinews-2x,zhwikinews-hans,zhwikinews-hans-1.5x,zhwikinews-hans-2x}.png via purgeList.php [20:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:26] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (Dzahn) to SRE on clinic duty: could not verify user, no defined manager triggers bug in check script? ` Username: JVargas Verified Email: 20220106184544 Traceback (most recent call last):...
[20:09:35] Juan_90264: done [20:10:25] Perfect thanks [20:10:43] np [20:10:56] (PS1) Ssingh: durum: use the correct directive to disable error_logging [puppet] - https://gerrit.wikimedia.org/r/752197 [20:11:36] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (Dzahn) Open→In progress [20:11:53] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33155/console" [puppet] - https://gerrit.wikimedia.org/r/752197 (owner: Ssingh) [20:14:02] (CR) Ssingh: [V: +1 C: +2] durum: use the correct directive to disable error_logging [puppet] - https://gerrit.wikimedia.org/r/752197 (owner: Ssingh) [20:14:50] SRE: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (Dzahn) [20:15:17] SRE: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (Dzahn) [20:15:49] SRE: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (Dzahn) [20:17:18] SRE, Infrastructure-Foundations: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (Dzahn) [20:18:38] (PS1) Jbond: profile::logstash::gelf_relay: pass correct package name [puppet] - https://gerrit.wikimedia.org/r/752200 [20:18:56] SRE, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (Dzahn) a:Dzahn→jgleeson [20:19:45] SRE, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (Dzahn) @jgleeson We can either resolve this if it works for you or keep using it for the other people that need to...
[20:20:14] (CR) jerkins-bot: [V: -1] profile::logstash::gelf_relay: pass correct package name [puppet] - https://gerrit.wikimedia.org/r/752200 (owner: Jbond) [20:20:36] (PS2) Jbond: profile::logstash::gelf_relay: pass correct package name [puppet] - https://gerrit.wikimedia.org/r/752200 [20:21:22] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33157/console" [puppet] - https://gerrit.wikimedia.org/r/752200 (owner: Jbond) [20:22:56] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33158/console" [puppet] - https://gerrit.wikimedia.org/r/752200 (owner: Jbond) [20:28:25] (CR) Jbond: [V: +1 C: +2] profile::logstash::gelf_relay: pass correct package name [puppet] - https://gerrit.wikimedia.org/r/752200 (owner: Jbond) [20:35:19] (CR) jerkins-bot: [V: -1] check_user: catch manager being None [puppet] - https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) (owner: RhinosF1) [20:35:23] mutante: ^ [20:36:30] (PS3) RhinosF1: check_user: catch manager being None [puppet] - https://gerrit.wikimedia.org/r/752018 (https://phabricator.wikimedia.org/T298808) [20:39:41] RhinosF1: thanks! [20:39:55] mutante: np [20:42:58] (PS1) Herron: install_server: add dhcp/netboot entries for apifeatureusage VMs [puppet] - https://gerrit.wikimedia.org/r/752207 (https://phabricator.wikimedia.org/T298794) [20:44:18] SRE, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (jgleeson) Sorry Dan I forgot to check in on this today and have finished work as I'm working from the UK. I'll tes...
[20:44:56] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10Papaul) 05Open→03Resolved This is ready but according to @jbond, it still has some puppet errors but that looks like it is related to this puppet policy not being read... [20:45:18] (03CR) 10Herron: [C: 03+2] install_server: add dhcp/netboot entries for apifeatureusage VMs [puppet] - 10https://gerrit.wikimedia.org/r/752207 (https://phabricator.wikimedia.org/T298794) (owner: 10Herron) [20:46:02] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) Yes yes, there was no expectation that this happens right now or you work on the weekend. This was for Monda... [20:46:33] 10SRE, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) p:05Triage→03Medium [20:50:19] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10jbond) >>! In T298674#7606290, @Papaul wrote: > This is ready but according to @jbond, it still has some puppet errors but that looks like it is related to this puppet pol... [21:00:46] searching for "Error: error" in Phabricator: = Query Error why? 
unknown search function "Error" [21:01:12] (03PS1) 10Herron: assign role::apifeatureusage::logstash to apifeatureusages[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [21:01:28] (03PS2) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [21:01:48] (03PS1) 10Aaron Schulz: Simplify comments and stubs for etcd-defined DB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752212 [21:03:42] (03PS8) 10Herron: role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [21:04:35] (03PS3) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [21:10:43] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) [21:10:52] 10SRE, 10vm-requests: eqiad/codfw: 2 VMs requested for apifeatureusage - https://phabricator.wikimedia.org/T298794 (10herron) 05Open→03Resolved `apifeatureusage[12]001.(eqiad|codfw).wmnet` are now online [21:25:09] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:36:59] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:45:47] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: 
elasticsearch-disable-readahead.service,nginx.service,prometheus-elasticsearch-exporter-9200.service,prometheus-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:38] (03CR) 10Cwhite: [C: 03+2] logstash: update weekly indexes to use weekyear pattern syntax [puppet] - 10https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) (owner: 10Cwhite) [21:48:01] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [21:50:13] mutante: you need to put it into "'s. Then it should work. [21:51:36] (03CR) 10Cwhite: [C: 03+2] prometheus: update affected es-exporter configs to use weekyear [puppet] - 10https://gerrit.wikimedia.org/r/751766 (https://phabricator.wikimedia.org/T298619) (owner: 10Cwhite) [21:53:59] urbanecm: confirmed. quoting with ' works :) thanks, I was more sharing this as a curiosity [21:54:14] Error: is a magic word [21:54:16] thought you're looking for a fix :) [21:54:18] apparently it is [21:55:09] No, no, I am also going to actually use that [21:55:17] it just wasn't "Error: error" [21:55:28] but "Error: somethingelse" from some other ticket [21:55:58] and then I noticed "Error: " is a different error from "Error: error" [21:58:19] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (10razzi) [22:00:36] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (10razzi) According to the [access requests procedure](https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Production_shell_access), we need... 
[22:05:57] (03CR) 10Dzahn: "the change in data.yaml seems to have started some "Malformed membership for ops user ..., has additional group(s): {'deployment-ci-admins" [puppet] - 10https://gerrit.wikimedia.org/r/751166 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [22:16:07] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:24:57] 10SRE, 10ops-codfw: host ps1-d1-codfw down since a long time but still monitored - https://phabricator.wikimedia.org/T298800 (10Dzahn) p:05Triage→03Medium [22:32:33] 10SRE, 10Discovery-Search (Current work): Get familiar with ES non-prod enviroments - https://phabricator.wikimedia.org/T298817 (10bking) [23:04:32] mutante: easy fix [23:07:58] (03CR) 10RhinosF1: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/752022 (https://phabricator.wikimedia.org/T298815) (owner: 10RhinosF1) [23:08:07] mutante: ^ should do it [23:09:36] RhinosF1: cool, thank you [23:09:53] not merging it right now, but appreciate it [23:10:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-RhinosF1: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (10RhinosF1) [23:11:30] mutante: np, i also have https://gerrit.wikimedia.org/r/c/operations/puppet/+/749875 waiting review [23:16:50] mutante: can I bribe you into being a phab admin and disabling H394? It's far too broad. 
[23:17:05] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:26] cc zabe [23:17:28] I think it is meant to subscribe instead of assigning [23:18:43] https://phabricator.wikimedia.org/T298818 [23:18:50] zabe: title says that [23:19:28] zabe: it's also anytime a rule matches [23:28:06] (03CR) 10Zabe: cross-validate-accounts: add deployment-ci-admins to ops expected list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752022 (https://phabricator.wikimedia.org/T298815) (owner: 10RhinosF1) [23:29:39] (03PS3) 10RhinosF1: cross-validate-accounts: add deployment-ci-admins to ops expected list [puppet] - 10https://gerrit.wikimedia.org/r/752022 (https://phabricator.wikimedia.org/T298815) [23:29:48] zabe: good spot [23:37:11] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) I could debug the gzip encoding issue and at the same time test using my production image in cloud VPS in k8splay project. `dzahn@dzahn:~$ curl --compressed dzah... [23:41:34] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-RhinosF1: check_user - KeyError: 'relations' - https://phabricator.wikimedia.org/T298808 (10Dzahn) p:05Triage→03Medium [23:45:48] RhinosF1: for the dumps change, please ask on https://phabricator.wikimedia.org/T273585 or during Europe time [23:46:00] RhinosF1: I see H349 as already disabled (now at least) [23:46:12] kind of busy on something else [23:47:08] ah, wrong H I was looking at [23:47:31] RhinosF1: I don't have permission to disable that [23:48:03] not without some shell procedure, so I'd like to let other admins do that [23:48:19] unless it's emergency [23:50:52] please start by asking https://phabricator.wikimedia.org/p/Lens0021/ directly. it's their rule [23:51:02] /ac/ac [23:51:36] Lens0021: ^ :)