[00:01:21] (03PS2) 10Krinkle: mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) [00:01:42] !log zabe@deploy1002 Finished scap: T334295 (duration: 06m 53s) [00:01:46] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [00:05:33] (03CR) 10Krinkle: "Bryan and I verified this by locally patching cloudweb in effectively the same way, and confirming that (without the change) Logstash is s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [00:06:54] (03PS2) 10Catrope: beta: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912417 (https://phabricator.wikimedia.org/T319064) [00:18:40] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:32] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10tstarling) [00:26:18] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10tstarling) Will the GitHub mirrors be switched over to replicate from GitLab? This is necessary for libraries like Shellbox that use a GitHub webhook... 
[00:31:36] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:40:55] (03PS3) 10Krinkle: speed-tests: Test selector changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912366 [01:53:22] (03CR) 10Dzahn: [C: 03+2] planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [01:54:22] (03CR) 10Dzahn: [C: 04-2] planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [02:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:00] (PowerSupply) firing: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:15:52] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:17:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2330:9290 - 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2330 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:19:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2331:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2331 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:24] PROBLEM - SSH on cloudbackup2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:55:10] RECOVERY - SSH on cloudbackup2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:01:23] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:14] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.139% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:12:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - 
https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [03:14:07] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [04:36:30] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:52:36] (03PS8) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [05:28:01] (03PS1) 10Ayounsi: Add HE in BBIX [homer/public] - 10https://gerrit.wikimedia.org/r/912440 (https://phabricator.wikimedia.org/T327284) [05:29:19] (03CR) 10Ayounsi: [C: 03+2] Add HE in BBIX [homer/public] - 10https://gerrit.wikimedia.org/r/912440 (https://phabricator.wikimedia.org/T327284) (owner: 10Ayounsi) [05:29:54] (03Merged) 10jenkins-bot: Add HE in BBIX [homer/public] - 10https://gerrit.wikimedia.org/r/912440 (https://phabricator.wikimedia.org/T327284) (owner: 10Ayounsi) [05:42:21] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/912307 (https://phabricator.wikimedia.org/T334180) (owner: 10Cathal Mooney) [05:48:57] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) [05:51:26] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:51:53] !log downgrade SGIX RS BGP sessions to non-primary [05:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:17] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:56:45] !log Configure 1:1 NAT for new fr-tech hosts - T335441 [05:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600) [06:00:04] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0600). 
[06:01:47] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:04:56] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) @ERayfield please sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document. [06:05:07] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) [06:09:22] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) uid: erayfield uidNumber: 34606 ssh key not being used in WMCS Waiting to verify the ssh key [06:09:45] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) [06:10:17] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) p:05Triage→03Medium [06:11:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) uid: andrewtavis-wmde uidNumber: 44010 [06:15:03] (PowerSupply) firing: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:15:34] 10SRE, 10SRE-Access-Requests: 
Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) Also, @AndrewTavis_WMDE you need to provide a NEW ssh key, as the current one you've provided is being used in WMCS. You need an unique one which cannot be shared with an... [06:15:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) p:05Triage→03Medium [06:16:00] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:17:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2330:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2330 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:19:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on mw2331:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw2331 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:29:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [06:30:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) @AndrewTavis L3 isn't signed, can yo do so too?
[06:37:36] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) @Ladsgroup During this operation, replication codfw -> eqiad is still active, so as there are codfw masters involved (even if codfw will be d... [06:50:51] (03PS1) 10Marostegui: install_server: Do not reimage db1222 [puppet] - 10https://gerrit.wikimedia.org/r/912698 [06:51:53] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1222 [puppet] - 10https://gerrit.wikimedia.org/r/912698 (owner: 10Marostegui) [06:55:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/912376 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [07:00:06] Amir1, apergos, and jnuche: Time to snap out of that daydream and deploy UTC morning backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0700). [07:00:06] Thiemo_WMDE: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:17] morning! [07:00:37] we have one trainee signed up for today actually, I'm going to hang out in the google meet and see if they turn up, it ws very short notice [07:00:58] but [07:01:29] why is the bot speaking now, instead of at the usual time (in about 15 minutes)? [07:02:52] I'm here. [07:03:22] hello! 
we're all a bit early today, the window doesn't actually begin until the top of the hour [07:03:23] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.09% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:03:56] well maybe my laptop clock is off somehow [07:03:56] anyways [07:04:32] we have a trainee signed up, let me see whether they are here [07:04:32] are you self-deploying or do you need assistance, Thiemo_WMDE? [07:05:55] I don't understand. [07:05:55] I follow https://wikitech.wikimedia.org/wiki/Backport_windows [07:06:18] don't worry [07:06:18] I think my laptop clock is out of sync with everything else [07:06:56] are you self-deploying or do you need someone to deploy for you during this window? note that if you do self-deploy, I will ask you to share your screen in the google meet, if our trainee shows up, and do things step by step as I explain them. [07:07:30] I wasn't aware this is how the backport windows work. [07:08:08] when there is a trainee signed up, then we do walk through the deployment with explanations, as part of the training [07:08:13] if no one is scheduled for training, then it proceeds normally [07:08:13] What step am I missing here? https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy [07:09:30] apergos: 👋 morning, I'm around if I can help [07:09:37] https://wikitech.wikimedia.org/wiki/Deployments/Training [07:09:47] this is the step you are missing, Thiemo_WMDE ^^ [07:10:18] there are two windows that are marked as training windows each week, this is one of them. most of the time no one is actually signed up to be trained, but today, someone is scheduled. [07:10:27] hey jnuche! [07:10:33] I have a patch. It's a super trivial 1-liner. It would be great to have it backported to wmf.6.
I don't think it makes sense to backport it to wmf.5 as this will be obsolete in just an hour or so. [07:10:52] yep [07:11:10] and in another 7 minutes if our trainee has not arrived, we'll proceed anyways [07:11:22] do you need someone to +2 your backport? [07:11:38] I can +2 it myself if that helps. [07:11:47] yes it does; I doubt I have +2 in that repo. [07:12:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [07:12:23] if you do not have dpeloyment rights, I will be happy to do the deployment on your behalf; that's why we are here. if you do have them and prefer to self-deploy, that's fine too; many people prefer that. [07:13:12] That's confusing. There is no +2 button in this patch. This doesn't make sense to me. I do have +2 rights in this repo. [07:14:31] Thiemo_WMDE: hi, these are the instructions to deploy if you want to go down that road: https://deploy-commands.toolforge.org/bacc/911796 [07:14:38] I wasn't aware certain branches behave different. I have never seen that before. [07:14:40] the backport command would +2 the patch [07:16:05] don't worry, one of us will get it [07:16:20] Sorry, might have been a misunderstanding. I'm not here for training. I have a sick child at home and really not the time to do anything but verifying that the backport works as intended. 
[07:16:26] we know [07:16:32] the trainee did not show up [07:16:38] so we're going to proceed [07:17:01] the trainee board is a separate board, we keep track of that apart from the deployers with patches [07:17:20] so we'll go ahead and +2 the .6 cheery pick [07:17:27] *cherry [07:19:09] (03CR) 10Jaime Nuche: [C: 03+2] Hide wrong "this reference is used 0 times" in citation dialog [extensions/Cite] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911796 (https://phabricator.wikimedia.org/T241885) (owner: 10Thiemo Kreuz (WMDE)) [07:19:19] and we'll leave the .5 cherry pick alone as you have suggested Thiemo_WMDE [07:20:34] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:20:35] (03Abandoned) 10Thiemo Kreuz (WMDE): Hide wrong "this reference is used 0 times" in citation dialog [extensions/Cite] (wmf/1.41.0-wmf.5) - 10https://gerrit.wikimedia.org/r/911797 (https://phabricator.wikimedia.org/T241885) (owner: 10Thiemo Kreuz (WMDE)) [07:21:06] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:22:53] now we watch the grass grow, I mean zuul... [07:23:38] !log uploaded debmonitor-client 0.3.2-1+deb12u1 to bookworm-wikimedia T330495 [07:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:43] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [07:26:21] Yea. I'm one of the (few?) people that try to find and fix bottlenecks in the tests and CI jobs. But no matter what we do it always grows back to 5 to 10 minutes after a short time when the next slow test is added somewhere. We should really start cutting off tests that take to long and make them fail hard. That would make many people unhappy, but [07:26:21] maybe for the good. 
[07:28:23] it would be nice if there was a person or two whose job it was to wrangle these slower tests, either by chasing down the right people on other teams or whatever else [07:28:52] I don't think that's in the cards though, given the refocusing of the org and all [07:29:21] (03CR) 10Slyngshede: [C: 03+2] C:idm Configure social pipeline for MediaWiki auth. [puppet] - 10https://gerrit.wikimedia.org/r/912263 (owner: 10Slyngshede) [07:29:47] The good thing is: it used to be 15 minutes. Now it's 5. [07:30:19] 10 minutes shaved off is a lot [07:31:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15169 [07:35:34] (03Merged) 10jenkins-bot: Hide wrong "this reference is used 0 times" in citation dialog [extensions/Cite] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911796 (https://phabricator.wikimedia.org/T241885) (owner: 10Thiemo Kreuz (WMDE)) [07:35:50] at last [07:36:11] backporting change now [07:37:00] well, deploying more like :) [07:37:06] !log jnuche@deploy1002 Started scap: Backport for [[gerrit:911796|Hide wrong "this reference is used 0 times" in citation dialog (T241885 T335410)]] [07:37:10] T241885: References defined inside a reflist are incorrectly described as "used twice on this page" - https://phabricator.wikimedia.org/T241885 [07:37:10] T335410: Misleading message showing up when creating new citations - https://phabricator.wikimedia.org/T335410 [07:38:34] !log jnuche@deploy1002 thiemowmde and jnuche: Backport for [[gerrit:911796|Hide wrong "this reference is used 0 times" in citation dialog (T241885 T335410)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [07:39:06] please test your change on one of those mwdebug serversm Thiemo_WMDE [07:39:45] Confirmed, works as intended. [07:39:49] awesome! [07:40:01] neat! 
continuing deployment then [07:40:54] (03PS1) 10Muehlenhoff: Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) [07:43:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15169 [07:43:33] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q3): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10MoritzMuehlenhoff) Looks good. Best to use group/row C for eqiad and group/row B for codfw. [07:45:37] !log jnuche@deploy1002 Finished scap: Backport for [[gerrit:911796|Hide wrong "this reference is used 0 times" in citation dialog (T241885 T335410)]] (duration: 08m 33s) [07:45:43] T241885: References defined inside a reflist are incorrectly described as "used twice on this page" - https://phabricator.wikimedia.org/T241885 [07:45:44] T335410: Misleading message showing up when creating new citations - https://phabricator.wikimedia.org/T335410 [07:46:24] please test your change in production now, just to be sure [07:46:27] Thiemo_WMDE: [07:47:59] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi) > The way to write to IRC is already present, see https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.irc_logger I'd be a b... [07:48:50] Ok, confirmed on https://de.wikivoyage.org/wiki/Pirna?veaction=edit which runs wmf.6. [07:48:58] excellent [07:49:09] Thanks for the patience! [07:49:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [07:49:16] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10JMeybohm) >>! In T288629#8808899, @dancy wrote: >>>! 
In T288629#8807158, @JMeybohm wrote: >> I don't see helm defaults bein... [07:49:27] thanks for choosing us for all your deployment needs. see you next time! [07:50:39] !log UTC morning backport and config training window complete [07:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [07:55:16] (03PS2) 10Muehlenhoff: Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) [07:56:28] 10SRE, 10LDAP-Access-Requests: Add user xcollazo to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T335445 (10Marostegui) 05Open→03Resolved a:03Marostegui Thanks for the approval Olja. User added to the requested ldap group. [07:57:16] (03CR) 10CI reject: [V: 04-1] Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [07:57:25] (03PS2) 10Muehlenhoff: Remove remaining obsolete nodejs images only used on Stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911761 (https://phabricator.wikimedia.org/T335282) [08:00:05] jeena and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T0800).
nyaa~ [08:01:00] (03PS3) 10Muehlenhoff: Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) [08:07:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:09:20] (03PS38) 10JMeybohm: Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) [08:09:43] (03CR) 10CI reject: [V: 04-1] Make kubernetes::clusters the central place for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:09:53] (03CR) 10JMeybohm: Make kubernetes::clusters the central place for k8s config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:10:42] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:13:36] (03PS1) 10Ayounsi: sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) [08:14:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 32): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40906/console" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:15:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) @Marostegui, the account you just tagged is my personal Phabricator account that I use for Wikimedia related projects. Is the signature still needed for this account? 
[08:15:50] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [08:16:01] (03CR) 10CI reject: [V: 04-1] sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) (owner: 10Ayounsi) [08:16:53] (03PS2) 10Ayounsi: sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) [08:17:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) Sorry about that @AndrewTavis_WMDE - I still don't see this one having signed L3 either :-). We'd need the account @AndrewTavis_WMDE to have signed it. [08:17:37] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 199524 [08:21:24] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp4038,cp4046 [puppet] - 10https://gerrit.wikimedia.org/r/912782 (https://phabricator.wikimedia.org/T322774) [08:21:46] (03PS1) 10Muehlenhoff: Point active server back to apt1001 [puppet] - 10https://gerrit.wikimedia.org/r/912783 [08:21:53] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: raise metadata fetch error [puppet] - 10https://gerrit.wikimedia.org/r/911842 (https://phabricator.wikimedia.org/T335413) (owner: 10Cwhite) [08:22:08] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4038,cp4046 [puppet] - 10https://gerrit.wikimedia.org/r/912782 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [08:22:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 199524 [08:23:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/912783 (owner: 
10Muehlenhoff) [08:23:57] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4038,cp4046 [puppet] - 10https://gerrit.wikimedia.org/r/912784 (https://phabricator.wikimedia.org/T322774) [08:24:22] !log restarting varnish on cp4038 and cp4046 to drop port 80 - T322774 [08:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:45] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4038,cp4046 [puppet] - 10https://gerrit.wikimedia.org/r/912784 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [08:25:26] (03PS1) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [08:25:52] (03CR) 10CI reject: [V: 04-1] Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:26:39] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40908/console" [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:26:42] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:26:46] (03CR) 10Filippo Giunchedi: [C: 03+1] Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:26:52] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:27:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:17] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::blackbox::check::http: allow passing alert data [puppet] - 10https://gerrit.wikimedia.org/r/912342 (owner: 10David Caro) [08:27:43] (03PS2) 10JMeybohm: Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) [08:27:57] 10SRE, 10Infrastructure-Foundations: Decide on model for serving idm.wikimedia.org - https://phabricator.wikimedia.org/T320604 (10SLyngshede-WMF) 05Openβ†’03Resolved a:03SLyngshede-WMF idm.wikimedia.org will be served by a pair of Ganeti VMs. Failover will be via DNS. [08:27:59] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [08:28:30] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) 05In progressβ†’03Resolved [08:28:35] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [08:28:42] (03CR) 10JMeybohm: [V: 03+1] Remove profile::kubernetes::deployment_server from role::releases [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [08:29:00] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:29:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:31:13] (03CR) 10Filippo Giunchedi: [C: 03+1] kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 
(https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [08:34:30] (Traffic on tunnel link) firing: Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [08:35:28] 10SRE, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF) [08:36:23] 10SRE, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF) [08:36:31] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:36:59] 10SRE, 10Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474 (10SLyngshede-WMF) p:05Triage→03Medium [08:37:19] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, and switchover is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/911847 (https://phabricator.wikimedia.org/T331706) (owner: 10Jbond) [08:38:10] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:38:44] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:39:22] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112 [08:39:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112 [08:41:48] 10SRE, 10Infrastructure-Foundations: User offboarding - https://phabricator.wikimedia.org/T335476 (10SLyngshede-WMF) [08:42:25] 10SRE, 10Infrastructure-Foundations: User offboarding - https://phabricator.wikimedia.org/T335476 (10SLyngshede-WMF) p:05Triage→03Medium a:03SLyngshede-WMF [08:42:34] (03CR) 10Muehlenhoff: kafkamon: add bullseye role and node assignments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [08:43:35] (03CR) 10Ayounsi: [C: 03+1] Point active server back to apt1001 [puppet] - 10https://gerrit.wikimedia.org/r/912783 (owner: 10Muehlenhoff) [08:44:12] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:44:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) @Marostegui, `L3` was just signed. Thanks for the help in all this! @Dzahn, thank you also for checking on this. I need access to `analytics-privatedata-users` and... 
[08:44:52] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:45:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [08:45:52] (03CR) 10Muehlenhoff: [C: 03+2] Point active server back to apt1001 [puppet] - 10https://gerrit.wikimedia.org/r/912783 (owner: 10Muehlenhoff) [08:47:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) Confirmed L3 signed. @odimitrijevic or @Ottomata I need your approval as the request is for analytics-privatedata-users @AndrewTavis we still need your manager to approve... [08:47:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [08:51:28] 10SRE, 10Infrastructure-Foundations: Bitu IDM - Feedback - https://phabricator.wikimedia.org/T335470 (10SLyngshede-WMF) [08:52:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) [08:52:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) On a fresh bookworm installation I'm seeing a few Puppet failures like those: ` Error: Could not set 'link' on ensure: wrong number of argument... 
[08:53:40] (03PS2) 10Noa wmde: Add language codes cal and tpv to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) [08:53:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) [08:53:53] 10SRE, 10Infrastructure-Foundations: Bitu IDM - Feedback - https://phabricator.wikimedia.org/T335470 (10SLyngshede-WMF) @Aklapper are the SRE and Infrastructure-Foundations tags sufficient for now? [08:54:22] (03CR) 10David Caro: [C: 03+2] prometheus::blackbox::check::http: allow passing alert data [puppet] - 10https://gerrit.wikimedia.org/r/912342 (owner: 10David Caro) [08:54:30] (Traffic on tunnel link) resolved: Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [08:57:14] 10SRE, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10SLyngshede-WMF) [09:00:58] !log delete overlapping block 01GY1CQ4EAKRV9BQ8D9JB1VWGJ from thanos - T335406 [09:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:06] T335406: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 [09:02:26] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40910/console" [puppet] - 10https://gerrit.wikimedia.org/r/912342 (owner: 10David Caro) [09:02:26] (03PS1) 10Muehlenhoff: Fix build with pybuild from bookworm [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/912787 [09:03:27] (03CR) 10David Caro: "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40909/console" [puppet] - 10https://gerrit.wikimedia.org/r/912342 (owner: 10David Caro) [09:06:18] !log uploaded debdeploy 0.0.99.13+deb12u1 to 
bookworm-wikimedia T330495 [09:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:23] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [09:06:41] (03CR) 10Elukey: [C: 03+2] amd_gpu: add udev rules to bypass the 'render' group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909968 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:06:43] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp4037,cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/912788 (https://phabricator.wikimedia.org/T322774) [09:09:06] !log restart thanos-compact on thanos-fe2001 - T335406 [09:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:21] T335406: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 [09:09:49] (03CR) 10Cathal Mooney: [C: 03+2] Change Homer template to get license key from custom field [homer/public] - 10https://gerrit.wikimedia.org/r/912307 (https://phabricator.wikimedia.org/T334180) (owner: 10Cathal Mooney) [09:09:49] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp4037,cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/912788 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:10:19] (03Merged) 10jenkins-bot: Change Homer template to get license key from custom field [homer/public] - 10https://gerrit.wikimedia.org/r/912307 (https://phabricator.wikimedia.org/T334180) (owner: 10Cathal Mooney) [09:10:32] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Andy Cooper - https://phabricator.wikimedia.org/T335483 (10acooper) [09:10:45] (03PS1) 10Elukey: Prepare ml-cache nodes for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/912789 (https://phabricator.wikimedia.org/T331712) [09:10:56] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@fb6f0ea] (releasing): (no justification provided) [09:11:34] !log 
jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@fb6f0ea] (releasing): (no justification provided) (duration: 00m 40s) [09:13:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40911/console" [puppet] - 10https://gerrit.wikimedia.org/r/912789 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [09:14:16] !log restarting varnish on cp4037 and cp4045 to drop port 80 - T322774 [09:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:45] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp4037,cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/912790 (https://phabricator.wikimedia.org/T322774) [09:17:31] (03PS1) 10Jelto: gitlab: enable and run partial backups daily [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) [09:18:38] PROBLEM - Varnish HTTP text-frontend - port 80 on cp4037 is CRITICAL: connect to address 10.128.0.19 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [09:19:39] (03CR) 10David Caro: profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [09:19:50] 10SRE, 10Infrastructure-Foundations: Validate managers for permission approval - https://phabricator.wikimedia.org/T335484 (10SLyngshede-WMF) [09:20:00] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp4037,cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/912790 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:20:54] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4045 is CRITICAL: connect to address 10.128.0.14 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [09:21:39] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40912/console" [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [09:21:53] (those two should recover soon) [09:22:18] RECOVERY - Varnish HTTP text-frontend - port 80 on cp4037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:22:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912387 [09:22:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912387 (owner: 10TrainBranchBot) [09:23:18] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) And one more question for @jbond and @jhathaway : We're installing the ruby-safe-yaml package via the monitoring profile, where it seems to be u... 
[09:23:25] 10SRE, 10Infrastructure-Foundations: Automatically deploy documentation to docs.wikimedia.org - https://phabricator.wikimedia.org/T335485 (10SLyngshede-WMF) [09:24:32] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4045 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:25:48] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove remaining obsolete nodejs images only used on Stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/911761 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [09:26:10] RECOVERY - Check systemd state on mw2325 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:11] (03PS4) 10Hnowlan: svg: use rsvg-convert output flag [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) [09:26:33] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10jbond) >>! In T330495#8809860, @MoritzMuehlenhoff wrote: > After some digging I tend to believe this is caused by https://www.ruby-lang.org/en/news/2019/12/12/sepa... [09:26:37] (03CR) 10Jelto: [V: 03+1] "What do you think of doing of doing another partial backup 12h after a full backup? 
With that we can reduce the delta between backups whil" [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [09:26:39] (03CR) 10Hnowlan: svg: use rsvg-convert output flag (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [09:27:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/912789 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [09:27:09] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5024,cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/912793 (https://phabricator.wikimedia.org/T322774) [09:27:24] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: move timeout config to route-level [deployment-charts] - 10https://gerrit.wikimedia.org/r/909712 (owner: 10Hnowlan) [09:29:03] !log imported wmf-certificates to bookworm-wikimedia T330495 [09:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:07] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [09:29:19] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5024,cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/912793 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:29:35] (03CR) 10Elukey: [V: 03+1 C: 03+2] Prepare ml-cache nodes for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/912789 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [09:29:57] !log imported prometheus-rsyslog-exporter to bookworm-wikimedia T330495 [09:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:32] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:48] (03PS10) 10JMeybohm: WIP: deployment_server: Create k8s configs 
with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [09:33:02] (03Merged) 10jenkins-bot: rest-gateway: move timeout config to route-level [deployment-charts] - 10https://gerrit.wikimedia.org/r/909712 (owner: 10Hnowlan) [09:33:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) [09:34:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS bullseye [09:34:30] 10SRE, 10Machine-Learning-Team: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye [09:34:34] (03CR) 10Majavah: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [09:34:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) @Marostegui, I just updated the task with a new public SSH key that I generated only for these purposes :) [09:36:30] !log restarting varnish on cp5024 and cp5032 to drop port 80 - T322774 [09:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:47] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10Clement_Goubert) [09:37:20] !log depooling mw2331.codfw.wmnet for HW troubleshooting - T335486 [09:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:24] T335486: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 [09:37:35] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5024,cp5032 
[puppet] - 10https://gerrit.wikimedia.org/r/912794 (https://phabricator.wikimedia.org/T322774) [09:38:05] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5024,cp5032 [puppet] - 10https://gerrit.wikimedia.org/r/912794 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:38:11] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10Clement_Goubert) ipmi-sel ` 22 | Apr-26-2023 | 17:04:31 | PS Redundancy | Power Supply | Redundancy Lost 23 | Apr-26-2023 | 17:04:36 | PS Redundancy | Power Supply... [09:38:21] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10Clement_Goubert) [09:39:46] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10Clement_Goubert) [09:39:46] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5024 is CRITICAL: connect to address 10.132.0.35 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [09:39:56] !log delete all 2023 replica=unset blocks from thanos - T335406 [09:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:00] T335406: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 [09:40:05] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10Clement_Goubert) ` 14 | Apr-26-2023 | 17:04:01 | Status | Power Supply | Power Supply input lost (AC/DC) 15 | Apr-26-2023 | 17:04:05 | PS Redundancy | Power... 
[09:40:14] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10Clement_Goubert) [09:40:20] !log depooling mw2330.codfw.wmnet for HW troubleshooting - T335487 [09:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:24] T335487: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 [09:41:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2330.codfw.wmnet with reason: PSU failure [09:41:10] (03CR) 10David Caro: [C: 03+1] "LGTM, tested on toolsbeta, have not checked running git-update it as gitpuppet instead of root though, just as the service is currently de" [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [09:41:16] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2330.codfw.wmnet with reason: PSU failure [09:41:22] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a69cd090-91f7-4f78-bb79-8ec372655d1f) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with... 
[09:41:30] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5032 is CRITICAL: connect to address 10.132.0.16 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [09:41:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2331.codfw.wmnet with reason: PSU failure [09:41:46] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 7 days, 0:00:00 on mw2331.codfw.wmnet with reason: PSU failure [09:42:08] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/912387 (owner: 10TrainBranchBot) [09:42:12] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2331.codfw.wmnet with reason: PSU failure [09:42:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2331.codfw.wmnet with reason: PSU failure [09:42:19] 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7bd46a0c-3d49-4be0-8ba9-7c9dcf8c35c4) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with... 
[09:42:42] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [09:43:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:44:20] (03PS1) 10Elukey: install_server: fix ml-cache's partman path in netboot [puppet] - 10https://gerrit.wikimedia.org/r/912795 (https://phabricator.wikimedia.org/T331712) [09:44:44] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:46:12] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.477 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:46:13] (03CR) 10Elukey: [C: 03+2] install_server: fix ml-cache's partman path in netboot [puppet] - 10https://gerrit.wikimedia.org/r/912795 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [09:46:20] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:21] 10SRE, 10Infrastructure-Foundations: Automatically deploy documentation to docs.wikimedia.org - https://phabricator.wikimedia.org/T335485 (10SLyngshede-WMF) p:05Triage→03Low [09:47:14] 10SRE, 10Infrastructure-Foundations: Validate managers for permission approval - https://phabricator.wikimedia.org/T335484 (10SLyngshede-WMF) p:05Triage→03Low [09:47:25] (03CR) 10Jcrespo: "Personally, as it is now on this patch, I am not a fan. 
Gitlab backup storage was planned for daily backups- increasing its storage use wi" [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [09:47:31] 10SRE, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10SLyngshede-WMF) p:05Triage→03Low [09:47:47] 10SRE, 10Infrastructure-Foundations: Determine which sender address to use for email notification - https://phabricator.wikimedia.org/T335091 (10SLyngshede-WMF) p:05Triage→03Low [09:54:28] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1001.eqiad.wmnet with OS bullseye [09:54:37] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye executed with errors: - ml-cache1001 (**FAIL*... [09:54:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [09:55:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS bullseye [09:55:18] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye [09:59:50] (03PS1) 10Jbond: check_puppetrun: Drop safe_yaml [puppet] - 10https://gerrit.wikimedia.org/r/912798 (https://phabricator.wikimedia.org/T330495) [10:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1000). 
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1000) [10:00:44] (03CR) 10CI reject: [V: 04-1] check_puppetrun: Drop safe_yaml [puppet] - 10https://gerrit.wikimedia.org/r/912798 (https://phabricator.wikimedia.org/T330495) (owner: 10Jbond) [10:00:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [10:00:49] (03PS1) 10Vgutierrez: hiera: Set haproxy on port 80 for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/912799 (https://phabricator.wikimedia.org/T322774) [10:01:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [10:03:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40914/console" [puppet] - 10https://gerrit.wikimedia.org/r/912799 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:04:36] (03PS1) 10Elukey: install_server: fix config for ml-cache in netboot (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/912800 (https://phabricator.wikimedia.org/T331712) [10:04:48] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ml-cache1001.eqiad.wmnet with OS bullseye [10:05:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org [10:05:28] (03CR) 10Elukey: [C: 03+2] install_server: fix config for ml-cache in netboot (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/912800 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [10:05:30] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Set haproxy on port 80 for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/912799 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:05:57] elukey: Luca Toscano: install_server: fix config for ml-cache in netboot (part 2) (3addba7303), may I merge this one? 
[10:06:15] +1 :) [10:06:34] (03PS1) 10Hnowlan: thumbor: new image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912801 (https://phabricator.wikimedia.org/T335271) [10:06:39] elukey: done [10:06:57] vgutierrez: molte grazie [10:07:05] (03CR) 10Muehlenhoff: [C: 03+2] Adapt sources.list for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912773 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [10:09:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS bullseye [10:09:36] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye [10:10:55] (03PS1) 10Jbond: vendor_modules: update augeasproviders_core to 3.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/912803 [10:10:57] (03PS1) 10Jbond: vendor_modules: update dnsquery module [puppet] - 10https://gerrit.wikimedia.org/r/912804 [10:11:37] (03CR) 10CI reject: [V: 04-1] vendor_modules: update augeasproviders_core to 3.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/912803 (owner: 10Jbond) [10:12:16] (03PS2) 10Jbond: vendor_modules: update augeasproviders_core to 3.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/912803 [10:12:49] (03PS2) 10Jbond: vendor_modules: update dnsquery module [puppet] - 10https://gerrit.wikimedia.org/r/912804 [10:13:04] (03PS3) 10Jbond: vendor_modules: update dnsquery module [puppet] - 10https://gerrit.wikimedia.org/r/912804 [10:13:08] (03PS1) 10MarcoAurelio: [gawiki] Restrict CX publishing to NS_MAIN to extendedconfirmed only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912805 (https://phabricator.wikimedia.org/T335466) [10:14:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40915/console" [puppet] - 
10https://gerrit.wikimedia.org/r/912803 (owner: 10Jbond) [10:14:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40916/console" [puppet] - 10https://gerrit.wikimedia.org/r/912804 (owner: 10Jbond) [10:15:00] (PowerSupply) firing: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:15:54] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:15:59] (03CR) 10MarcoAurelio: [C: 04-1] "No such user group on ga.wikipedia yet. Requesting clarification." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/912805 (https://phabricator.wikimedia.org/T335466) (owner: 10MarcoAurelio) [10:16:32] (03PS1) 10Alexandros Kosiaris: machinetranslation: Bump to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/912807 (https://phabricator.wikimedia.org/T331505) [10:20:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [10:24:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [10:24:47] (03CR) 10Ayounsi: [C: 03+1] "🚀" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [10:26:26] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5023,cp5031 [puppet] - 10https://gerrit.wikimedia.org/r/912809 (https://phabricator.wikimedia.org/T322774) [10:28:08] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5023,cp5031 [puppet] - 10https://gerrit.wikimedia.org/r/912809 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:33:34] !log restarting varnish on cp5023 and cp5031 to drop port 80 - T322774 [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:52] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5023,cp5031 [puppet] - 10https://gerrit.wikimedia.org/r/912810 (https://phabricator.wikimedia.org/T322774) [10:35:42] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5023,cp5031 [puppet] - 10https://gerrit.wikimedia.org/r/912810 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [10:36:30] (03PS2) 10Alexandros Kosiaris: machinetranslation: Bump to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/912807 (https://phabricator.wikimedia.org/T331505) [10:36:32] (03PS1) 10Alexandros 
Kosiaris: machinetranslation: Switch to 2023-04-27-093807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/912811 (https://phabricator.wikimedia.org/T331505) [10:36:35] (03PS1) 10Alexandros Kosiaris: machinetranslation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/912812 (https://phabricator.wikimedia.org/T331505) [10:46:54] (03CR) 10Hnowlan: [C: 03+2] thumbor: new image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912801 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [10:52:56] (03Merged) 10jenkins-bot: thumbor: new image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912801 (https://phabricator.wikimedia.org/T335271) (owner: 10Hnowlan) [10:55:32] (03PS1) 10Stang: lowiki: Use Western style (0-9) numerals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912815 [10:56:05] (03PS2) 10Stang: lowiki: Use Western style (0-9) numerals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912815 (https://phabricator.wikimedia.org/T335345) [10:56:51] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Andy Cooper - https://phabricator.wikimedia.org/T335483 (10Marostegui) a:03Marostegui [10:58:23] (03PS1) 10Marostegui: data.yaml: Add acooper [puppet] - 10https://gerrit.wikimedia.org/r/912816 (https://phabricator.wikimedia.org/T335483) [10:59:07] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Fix build with pybuild from bookworm [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/912787 (owner: 10Muehlenhoff) [10:59:35] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:59:43] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:00:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) Thanks Andrew, I have contacted you out of band to verify your ssh key. 
[11:00:48] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:03:02] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5022,cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/912817 (https://phabricator.wikimedia.org/T322774) [11:03:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.997% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:03:32] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:04:51] (03CR) 10Jbond: [C: 03+2] git-sync-upstream: add support for gituser and alternate base directories [puppet] - 10https://gerrit.wikimedia.org/r/910059 (owner: 10Jbond) [11:05:19] (03PS11) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [11:06:22] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5022,cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/912817 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [11:07:10] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:08:01] (03PS1) 10Jcrespo: Update indexes for latest queries needed for mediabackups [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912818 (https://phabricator.wikimedia.org/T327157) [11:08:03] (03PS1) 10Jcrespo: Add functionality to detect last uploaded time for backup start [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912819 (https://phabricator.wikimedia.org/T327157) [11:09:23] !log restarting varnish on cp5022 and cp5030 to drop port 80 - T322774 [11:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/912816 (https://phabricator.wikimedia.org/T335483) (owner: 10Marostegui) [11:09:43] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:10:33] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5022,cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/912820 (https://phabricator.wikimedia.org/T322774) [11:11:14] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5022,cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/912820 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [11:11:58] (03PS2) 10Jcrespo: Update indexes for latest queries needed for mediabackups [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912818 (https://phabricator.wikimedia.org/T327157) [11:12:02] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [11:12:06] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5030 is CRITICAL: connect to address 10.132.0.27 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:12:57] (03PS2) 10EoghanGaffney: [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 [11:13:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 54994 [11:14:27] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:15:20] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.460 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:18:59] (03CR) 10Krinkle: [C: 03+2] Remove enabling of Central 
Notice Timing in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910427 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [11:19:23] 10SRE-swift-storage, 10Thumbor, 10Platform Team Workboards (Platform Engineering Reliability), 10SVG: SVG rasterizer renders non Latin text as tofu glyph randomly (as thumbor-k8s lack noto fonts) - https://phabricator.wikimedia.org/T335271 (10hnowlan) Looks like this issue is fixed in new thumbs: * https:... [11:20:21] (03Merged) 10jenkins-bot: Remove enabling of Central Notice Timing in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910427 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [11:22:50] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add acooper [puppet] - 10https://gerrit.wikimedia.org/r/912816 (https://phabricator.wikimedia.org/T335483) (owner: 10Marostegui) [11:23:49] (03PS12) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [11:24:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40917/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [11:24:53] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Andy Cooper - https://phabricator.wikimedia.org/T335483 (10Marostegui) 05Open→03Resolved Added to the ldap group. Allow 30 minutes for the change to spread everywhere.
[11:25:00] (03PS3) 10EoghanGaffney: [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) [11:25:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Bump to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/912807 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:25:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Switch to 2023-04-27-093807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/912811 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:25:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/912812 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:27:27] (03CR) 10LSobanski: [gitlab/failover] Rename host flags (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [11:28:13] * Krinkle staging on mwdebug1002 [11:30:59] (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [11:31:42] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [11:31:48] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: sre.discovery.datacenter should support switching the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Me...
[11:31:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) >>! In T330495#8809997, @jbond wrote: >>>! In T330495#8809860, @MoritzMuehlenhoff wrote: >> After some digging I tend to believe this is caused... [11:32:33] (03Merged) 10jenkins-bot: machinetranslation: Bump to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/912807 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:32:52] (03Merged) 10jenkins-bot: machinetranslation: Switch to 2023-04-27-093807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/912811 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:33:21] (03Merged) 10jenkins-bot: machinetranslation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/912812 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:35:55] (03PS12) 10JMeybohm: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [11:41:02] !log krinkle@deploy1002 Synchronized wmf-config/: I195978cbd61d80 (duration: 06m 29s) [11:44:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 54994 [11:44:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 23951 [11:44:51] (03PS1) 10Stevemunene: Create scap deployment source for product analytics [puppet] - 10https://gerrit.wikimedia.org/r/912834 (https://phabricator.wikimedia.org/T333000) [11:46:18] (03PS1) 10Jbond: package_builder: add hooks for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912835 (https://phabricator.wikimedia.org/T321783) [11:47:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 23951 [11:47:39] (03CR) 
10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40918/console" [puppet] - 10https://gerrit.wikimedia.org/r/912835 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:51:33] (03PS1) 10Cathal Mooney: Only process global vlan list in Juniper config on frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/912836 (https://phabricator.wikimedia.org/T322937) [11:52:30] (Access port speed <= 100Mbps) firing: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [11:53:11] (03CR) 10Slyngshede: Read systems and approval rules from YAML file. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede) [11:54:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/912835 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:54:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] package_builder: add hooks for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/912835 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:56:58] !log upload python3-pypuppetdb_3.1.0-1_all.deb to bookworm [11:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:59] (03PS2) 10Cathal Mooney: Only process global vlan list in Juniper config on frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/912836 (https://phabricator.wikimedia.org/T322937) [12:03:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10karapayneWMDE) EM of Wikidata here. 
As Andrew's manager, I approve this request [12:04:53] (03PS2) 10Jbond: check_puppetrun: Drop safe_yaml [puppet] - 10https://gerrit.wikimedia.org/r/912798 (https://phabricator.wikimedia.org/T330495) [12:05:27] (03CR) 10Ayounsi: [C: 03+1] Only process global vlan list in Juniper config on frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/912836 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:05:39] (03CR) 10Cathal Mooney: [C: 03+2] Only process global vlan list in Juniper config on frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/912836 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:06:07] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) [12:06:15] (03Merged) 10jenkins-bot: Only process global vlan list in Juniper config on frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/912836 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:06:22] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10Marostegui) Pending L3 signature. [12:08:28] (03PS1) 10Ladsgroup: Remove 1024px and 1920px from pre-gen thumbsizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) [12:08:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [12:11:59] (03PS1) 10Cathal Mooney: Add 'default' for tenant.slug in vlans else statement [homer/public] - 10https://gerrit.wikimedia.org/r/912838 (https://phabricator.wikimedia.org/T322937) [12:12:00] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Clement_Goubert) I would like @MoritzMuehlenhoff and other serviceops (@Joe, @akosiaris ?) input on this. 
I think it's sound, but maybe in case of the main deployment server being down, we don't want to... [12:12:15] !log imported puppet 5.5.22-2+deb13u3 to bookworm-wikimedia T330495 [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:20] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [12:12:54] (03CR) 10Cathal Mooney: [C: 03+2] Add 'default' for tenant.slug in vlans else statement [homer/public] - 10https://gerrit.wikimedia.org/r/912838 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:13:19] (03PS1) 10Elukey: cassandra: update 'dev' version to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/912839 (https://phabricator.wikimedia.org/T331712) [12:13:19] (03Merged) 10jenkins-bot: Add 'default' for tenant.slug in vlans else statement [homer/public] - 10https://gerrit.wikimedia.org/r/912838 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:13:59] (03PS1) 10Muehlenhoff: Install 5.5.22-2+deb12u3 in late-setup.sh [puppet] - 10https://gerrit.wikimedia.org/r/912840 (https://phabricator.wikimedia.org/T330495) [12:15:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Ottomata) Approved. 
[12:15:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10Clement_Goubert) a:05Clement_Goubert→03thcipriani [12:15:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [12:17:08] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:17:43] (03PS2) 10Elukey: cassandra: update 'dev' version to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/912839 (https://phabricator.wikimedia.org/T331712) [12:17:45] (03PS1) 10Elukey: role::ml_cache::storage: set cassandra version to 3.x [puppet] - 10https://gerrit.wikimedia.org/r/912841 (https://phabricator.wikimedia.org/T331712) [12:18:29] (03CR) 10Elukey: [C: 03+2] role::ml_cache::storage: set cassandra version to 3.x [puppet] - 10https://gerrit.wikimedia.org/r/912841 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [12:19:15] (03CR) 10Muehlenhoff: [C: 03+2] Install 5.5.22-2+deb12u3 in late-setup.sh [puppet] - 10https://gerrit.wikimedia.org/r/912840 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [12:21:51] (03PS1) 10JMeybohm: profile::imagecatalog migrate from user token to client cert [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) [12:24:15] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40919/console" [puppet] - 10https://gerrit.wikimedia.org/r/912842 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:25:08] PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[12:25:08] PROBLEM - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:25:08] PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:25:32] this is me, node just reimaged, downtime expired [12:26:03] should resolve in a bit [12:26:44] RECOVERY - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is OK: TCP OK - 0.000 second response time on 10.64.130.9 port 9042 https://phabricator.wikimedia.org/T93886 [12:26:44] RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:26:44] RECOVERY - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is OK: SSL OK - Certificate ml-cache1001-a valid until 2024-06-15 08:50:14 +0000 (expires in 414 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [12:27:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1001.eqiad.wmnet with OS bullseye [12:27:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [12:27:21] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye completed: - ml-cache1001 (**PASS**) - Remo... 
[12:27:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] httpd: always use systemd [puppet] - 10https://gerrit.wikimedia.org/r/911847 (https://phabricator.wikimedia.org/T331706) (owner: 10Jbond) [12:27:58] (03PS1) 10Marostegui: data.yaml: Add Andrew McAllister [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) [12:28:21] (03CR) 10Marostegui: [C: 04-2] "Waiting for:" [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [12:28:44] (03CR) 10CI reject: [V: 04-1] data.yaml: Add Andrew McAllister [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [12:29:13] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [12:29:22] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye [12:29:45] (03PS2) 10Marostegui: data.yaml: Add Andrew McAllister [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) [12:30:01] (03CR) 10MVernon: [C: 03+1] "Seems like a good idea, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [12:33:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40920/console" [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn) [12:35:38] (03PS1) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [12:36:02] (03CR) 10CI reject: [V: 04-1] toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [12:36:31] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:37:37] (03CR) 10Jbond: [V: 03+1 C: 04-1] gerrit: move hieradata from role/common to common/profile (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn) [12:37:52] PROBLEM - puppet last run on krb2002 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:38:29] (03CR) 10Jbond: [C: 03+2] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [12:38:31] (03CR) 10Jbond: [C: 03+2] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [12:38:35] (03CR) 10Jbond: [C: 03+2] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [12:39:21] (03CR) 10Majavah: [C: 03+1] tcl86: switch base image to bullseye [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912343 (https://phabricator.wikimedia.org/T335420) (owner: 
10BryanDavis) [12:39:30] (03CR) 10Majavah: [C: 03+1] Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [12:40:40] (03Merged) 10jenkins-bot: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [12:40:49] (03Merged) 10jenkins-bot: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [12:40:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/912798 (https://phabricator.wikimedia.org/T330495) (owner: 10Jbond) [12:40:56] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [12:40:59] (03Merged) 10jenkins-bot: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [12:41:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [12:43:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage [12:43:32] RECOVERY - puppet last run on krb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:45:51] (03PS1) 10Cathal Mooney: Avoid creating EVPN import policy with default accept if no Vlans [homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) [12:46:17] (03CR) 10Muehlenhoff: [C: 03+1] cassandra: update 'dev' version to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/912839 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [12:46:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting 
access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) This user is already part of NDA ldap group. [12:46:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [12:47:54] (03CR) 10Marostegui: "Waiting only for ssh key verification" [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [12:48:37] (03CR) 10Cathal Mooney: "Should have said it's no-diff on existing devices." [homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:50:10] (03PS2) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [12:50:25] (03CR) 10Hokwelum: [C: 03+1] "looks good, thank you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) (owner: 10Hokwelum) [12:50:34] (03CR) 10CI reject: [V: 04-1] toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [12:50:55] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5021,cp5029 [puppet] - 10https://gerrit.wikimedia.org/r/912847 (https://phabricator.wikimedia.org/T322774) [12:51:26] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5021,cp5029 [puppet] - 10https://gerrit.wikimedia.org/r/912847 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [12:52:00] (03CR) 10ArielGlenn: [C: 03+2] WME refresh token api uses a post request [puppet] - 10https://gerrit.wikimedia.org/r/911897 (https://phabricator.wikimedia.org/T335368) (owner: 10Hokwelum) [12:52:22] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [12:53:57] (03CR) 10Muehlenhoff: [C:
03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [12:53:59] (03CR) 10Filippo Giunchedi: prometheus::ops: add demo node exporter job for SONiC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: 10Cwhite) [12:55:17] (03PS1) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [12:55:48] (03CR) 10CI reject: [V: 04-1] package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [12:56:02] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5021,cp5029 [puppet] - 10https://gerrit.wikimedia.org/r/912849 (https://phabricator.wikimedia.org/T322774) [12:56:32] (03CR) 10Jelto: [V: 03+1] gitlab: enable and run partial backups daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [12:56:43] (03PS4) 10ArielGlenn: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [12:56:59] !log restarting varnish on cp5021 and cp5029 to drop port 80 - T322774 [12:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:31] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10fgiunchedi) [12:57:53] (03PS2) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [12:58:08] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5021,cp5029 [puppet] - 10https://gerrit.wikimedia.org/r/912849 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [12:58:26] (03CR) 10CI reject: [V: 04-1] package_builder: use new 
format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [12:59:37] o/ [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1300) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1300). Please do the needful. [13:00:06] cmelo, dcausse, noa_wmde, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:46] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10akosiaris) The motd states: ` While it is perfectly working, this is not the active deployment server. If you want to deploy software, you should /not/ do it from here; it will probably work, but the n... [13:01:01] o/ [13:01:09] o/ [13:01:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) [13:01:49] (03CR) 10Marostegui: "Ssh key verified. 
This is ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [13:02:14] o/ [13:02:18] (03PS3) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [13:02:29] o/ [13:02:29] * TheresNoTime can deploy [13:02:30] o/ [13:02:48] (03CR) 10CI reject: [V: 04-1] package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:02:52] cmelo: will start with yours [13:03:09] o/ I'm around but would prefer someone else will deploy [13:03:09] ok thanks [13:03:15] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10MoritzMuehlenhoff) Yeah, what Alex said. In addition, if we really want to prevent deployers from using the inactive servers, the better fix would be to have scap check/prevent this. [13:03:35] (03PS4) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [13:03:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:03:42] (03CR) 10Jbond: [C: 03+2] check_puppetrun: Drop safe_yaml [puppet] - 10https://gerrit.wikimedia.org/r/912798 (https://phabricator.wikimedia.org/T330495) (owner: 10Jbond) [13:04:04] (03CR) 10CI reject: [V: 04-1] package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:04:20] (03Merged) 10jenkins-bot: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910055 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:04:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ml-cache1002.eqiad.wmnet with OS bullseye [13:04:58] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: - ml-cache1002 (**WARN**) - Down... [13:05:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) Thanks all for the approvals and help with this! [13:05:10] !log samtar@deploy1002 Started scap: Backport for [[gerrit:910055|metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only (T334088)]] [13:05:14] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [13:05:42] (03PS5) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [13:05:49] (03PS9) 10Samtar: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:06:39] !log samtar@deploy1002 samtar and cmelo: Backport for [[gerrit:910055|metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only (T334088)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:06:57] cmelo: live on mwdebug, can you test? 
[13:06:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40927/console" [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:07:06] sure thanks [13:07:12] (03CR) 10Volans: [C: 04-1] sre.network.peering: don't log on "show" command (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) (owner: 10Ayounsi) [13:10:00] (03PS1) 10Elukey: modules: allow istio gateways to have more selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/912850 [13:11:02] cmelo: FYI I've confirmed the change on mwdebug at https://meta.wikimedia.org/wiki/Special:ListGroupRights#campaignevents-beta-tester, am I good to sync and then move to 910056? [13:12:12] (03PS4) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) [13:12:50] (03PS5) 10Krinkle: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) [13:13:15] (03CR) 10Jcrespo: gitlab: enable and run partial backups daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto) [13:14:12] Eurgh, sorry, I was distracted by something else [13:14:25] (03CR) 10Krinkle: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [13:14:49] Yes, the diff on ListGroupRights LGTM, too [13:14:49] syncng [13:15:17] cmelo / HouseOfM all good on your end as well? 
[13:15:26] (03PS6) 10Krinkle: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) [13:15:56] yes, the only thing I could not test Daimona was a not valid organizer, were you able to find one? [13:16:02] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5020,cp5028 [puppet] - 10https://gerrit.wikimedia.org/r/912851 (https://phabricator.wikimedia.org/T322774) [13:16:23] (03CR) 10Krinkle: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [13:16:32] wdym? [13:16:55] I mean, try to add a not valid organizer and ge the error message [13:17:12] (03PS7) 10Krinkle: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) [13:17:49] The feature should still be disabled, if you're seeing it then something's not right [13:17:49] @Daimona, we're talking in slack, please come join us [13:19:03] I am seeing it because I am using wikimediadebug I think [13:19:55] Sorry, didn't see that. I'd rather keep the conversation in a single place, though [13:20:07] fair [13:20:07] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [13:20:18] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:910055|metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only (T334088)]] (duration: 15m 07s) [13:20:22] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [13:20:34] It should not be visible with wikimediadebug either, because the patch which enables the feature wasn't merged yet. 
I can't confirm though, because I'm not an organizer on meta [13:20:35] (just for clarity, `910055: metawiki: Give campaignevents-organize-events to campaignevents-beta-tester only` has been merged and sync'd to production. I intend to move on to `910056: Enable $wgCampaignEventsEnableMultipleOrganizers in production`) [13:20:58] The wikidata patch has been removed from this deployment window as it appears to not be ready to go after all. next time :) [13:21:07] Noa_WMDE: ack, thank you :) [13:21:28] I can see it on meta with wikimediadebug, I think HouseOfM can see it as weel right HouseOfM? [13:21:54] TheresNoTime: thanks! Could you please hold on for a second while we figure this thing out? [13:21:55] No, I can't [13:22:00] Daimona: ack [13:22:01] I am not an organiser [13:22:49] (03CR) 10Michael Große: [C: 03+1] "The respective Wikibase change is I7099f80ed7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) (owner: 10Noa wmde) [13:23:30] Oh I am sorry wrong link, I was looking at the meta but on beta [13:23:32] cmelo: On what parge are you seeing it? [13:23:37] it is all fine then [13:23:38] Ah [13:23:43] Yes, that would explain it [13:24:06] (03CR) 10Krinkle: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [13:24:33] Then I think we can move on? [13:25:10] yep [13:25:13] Moving on to `910056: Enable $wgCampaignEventsEnableMultipleOrganizers in production` [13:25:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:25:21] Yup, ty! 
[13:25:26] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T335403 (10Jclark-ctr) Replaced failed cable new cable id 2013339101906. netbox updated [13:26:00] yup [13:26:05] ty [13:26:15] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableMultipleOrganizers in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910056 (https://phabricator.wikimedia.org/T334088) (owner: 10Cmelo) [13:26:44] !log samtar@deploy1002 Started scap: Backport for [[gerrit:910056|Enable $wgCampaignEventsEnableMultipleOrganizers in production (T334088)]] [13:26:48] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [13:26:58] o/ [13:27:27] anything still left to deploy? *reads up* [13:27:30] (Access port speed <= 100Mbps) resolved: Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [13:27:38] (03PS3) 10Ayounsi: sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) [13:28:05] Lucas_WMDE: 912337, 911308 and 912815 [13:28:17] !log samtar@deploy1002 samtar and cmelo: Backport for [[gerrit:910056|Enable $wgCampaignEventsEnableMultipleOrganizers in production (T334088)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:28:36] cmelo: Daimona: `910056: Enable $wgCampaignEventsEnableMultipleOrganizers in production` is live on mwdebug for testing [13:28:50] HouseOfM: ^ [13:28:53] hm, and no Wikidata language codes? [13:28:59] ty [13:29:23] Lucas_WMDE: not ready to go apparently? [13:29:27] ty [13:29:50] Thanks TheresNoTime. 
I can't test this one as I'm not an organizer on meta, but I can look for errors if cmelo tries it out [13:30:10] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [13:30:25] Oh and actually, I can test on testwiki [13:31:01] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [13:31:24] 10ops-eqiad: InterfaceSpeedError - https://phabricator.wikimedia.org/T335403 (10Jclark-ctr) 05Open→03Resolved Alert cleared [13:31:57] (03PS1) 10Muehlenhoff: Make Scap directories on deployment servers compatible with CVE-2022-24756 fix [puppet] - 10https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) [13:32:01] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [13:32:05] Lucas_WMDE: are you available to deploy? If so, could you possibly take over after I've finished this one? (910056) [13:32:11] (03CR) 10Hnowlan: [C: 03+2] svg: use rsvg-convert output flag [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [13:32:53] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Jdforrester-WMF) >>! In T332953#8809416, @tstarling wrote: > Will the GitHub mirrors be switched over to replicate from GitLab? This is necessary for... 
[13:33:13] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1003.eqiad.wmnet with OS bullseye [13:33:14] (np if not) [13:33:17] TheresNoTime: yeah I can take over [13:33:23] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye [13:33:25] also I think the Wikidata config change should be good to deploy after all [13:33:27] Lucas_WMDE: thanks :) I'll ping you [13:33:28] but we can do the other ones first [13:33:29] ok [13:34:10] (03PS1) 10Legoktm: shellbox-syntaxhighlight: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912854 (https://phabricator.wikimedia.org/T320848) [13:34:40] (03CR) 10Volans: [C: 03+1] "That should work while we get the actual feature into spicerack" [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) (owner: 10Ayounsi) [13:35:16] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [13:35:37] So, this seems to be working on testwiki, but there's a DBPerformance warning I'm looking at [13:35:38] (03PS1) 10Elukey: fast-api: bump the istio ingress module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912855 [13:35:41] (03PS1) 10Elukey: ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/912856 [13:36:22] (03Merged) 10jenkins-bot: svg: use rsvg-convert output flag [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/908861 (https://phabricator.wikimedia.org/T334725) (owner: 10Hnowlan) [13:36:34] (03CR) 10CI reject: [V: 04-1] fast-api: bump the istio ingress module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912855 (owner: 10Elukey) [13:36:38] (03CR) 10CI reject: [V: 04-1] ml-services: deploy 
ores-legacy on a separate istio gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/912856 (owner: 10Elukey) [13:37:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/912853 (https://phabricator.wikimedia.org/T335354) (owner: 10Muehlenhoff) [13:38:05] Daimona: https://w.wiki/6dhQ right? [13:38:26] Yes, and other similar warnings for other queries [13:38:29] (03CR) 10Ayounsi: [C: 03+2] sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) (owner: 10Ayounsi) [13:38:31] This isn't a new issue though, I think [13:38:45] Because the other queries we were already executing before this patch [13:38:50] (03PS8) 10Krinkle: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) [13:39:27] Yet it seems weird, it should be a POST submission, where write queries are supposedly allowed [13:40:00] (03PS6) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [13:40:02] (03PS1) 10Jbond: debian::codename::compare: allow passing explicit codename [puppet] - 10https://gerrit.wikimedia.org/r/912859 [13:41:08] (03Merged) 10jenkins-bot: sre.network.peering: don't log on "show" command [cookbooks] - 10https://gerrit.wikimedia.org/r/912779 (https://phabricator.wikimedia.org/T324655) (owner: 10Ayounsi) [13:41:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40928/console" [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:42:06] Daimona: worth rolling back to resolve? 
I see the warning is raised on the beta cluster too [13:42:21] I don't think a rollback would resolve this [13:42:29] I'm still trying to figure out why it complains [13:42:46] Would you happen to know if there were similar errors before the deployment? [13:42:53] Oh wait you said beta [13:42:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40929/console" [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:42:58] fwiw, the normalized message indicates the source of the restriction is not the MediaWiki.php entrypoint where the GET/POST restrictions are set [13:43:01] > `Expectation (writes <= 0) by MediaWiki\SpecialPage\SpecialPageFactory::executePath not met (actual: {actualSeconds}):` [13:43:14] this suggests SpecialPageFactory is overriding and setting a custom expectation [13:43:30] isnt' executePath() what we use for {{Special:}} inclusions from the parser? [13:43:51] (no, that's capturePath) [13:43:56] (03PS7) 10Jbond: package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 [13:44:10] if ( $context->getRequest()->wasPosted() && !$page->doesWrites() ) { [13:44:10] $trxProfiler->setExpectations( $trxLimits['POST-nonwrite'], __METHOD__ ); [13:44:10] Ahhhh I see [13:44:15] I guess you haven't declared your page to be making writes :) [13:44:24] I think the SpecialPage class is lacking doesWrites(): true [13:45:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40930/console" [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:45:10] And thanks @Krinkle for the pointer! 
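Note: the exchange above traces the DBPerformance warning to SpecialPageFactory::executePath, which applies the 'POST-nonwrite' limits when a request was posted and the page's doesWrites() is not true. A minimal Python sketch of that decision follows. It is a model for illustration, not MediaWiki code: the names Page, special_page_override and violations are hypothetical, and only the 'POST-nonwrite' key and the "writes <= 0" rule come from the log.

```python
from dataclasses import dataclass
from typing import Optional

# Only the "POST-nonwrite" idea and its "writes <= 0" rule are taken from the log;
# everything else here is a hypothetical stand-in.
POST_NONWRITE_LIMITS = {"writes": 0}


@dataclass
class Page:
    name: str
    # Assumption: a page does not declare writes unless it opts in.
    does_writes: bool = False


def special_page_override(was_posted: bool, page: Page) -> Optional[dict]:
    """Return the stricter limits for a POST to a page that has not declared writes."""
    if was_posted and not page.does_writes:
        return POST_NONWRITE_LIMITS
    return None  # no override; limits set elsewhere stay in force


def violations(limits: Optional[dict], writes_done: int) -> list:
    """List the expectations that the observed number of writes breaks."""
    if limits is None:
        return []
    if writes_done > limits["writes"]:
        return [f"Expectation (writes <= {limits['writes']}) not met"]
    return []
```

Under this model, a posted form that performs one write trips the expectation until the page sets does_writes=True, which corresponds to the doesWrites() fix suggested in the chat.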
[13:45:11] (03CR) 10Jbond: [C: 03+2] debian::codename::compare: allow passing explicit codename [puppet] - 10https://gerrit.wikimedia.org/r/912859 (owner: 10Jbond) [13:45:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40931/console" [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:45:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [13:45:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] package_builder: use new format for bookworm and above [puppet] - 10https://gerrit.wikimedia.org/r/912848 (owner: 10Jbond) [13:45:53] I thought FormSpecialPage had that by default [13:45:58] RECOVERY - IPMI Sensor Status on mw2330 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:46:04] But maybe not, after all?! [13:46:45] not afaict [13:46:47] Daimona: yw, - the context here is that special pages and API modules sometimes use post data to return data if the form is too big. For example ApiQuery can be posted with a large query, and e.g. Special:Export supports post as welll afaik. but they don't do writes. [13:47:01] (03Abandoned) 10Legoktm: shellbox: Update to 2022-02-04-153221 [deployment-charts] - 10https://gerrit.wikimedia.org/r/763175 (https://phabricator.wikimedia.org/T298399) (owner: 10Legoktm) [13:47:10] Right, that makes sense [13:47:27] re FormSpecialPage, all the redirecting special pages (PermaLink etc.) 
are probably also forms but don’t write [13:47:29] I was just under this impression that FormSpecialPage defaulted to doesWrites=true for some reason, but I'm probably just misremembering [13:47:38] (03PS2) 10Cathal Mooney: Avoid creating EVPN import policy with default accept if no Vlans [homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) [13:48:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage [13:48:30] are we go or no-go for sync? [13:48:47] At any rate, this is unrelated to the patch being deployed and I can fix it later [13:49:07] (03CR) 10Legoktm: [C: 03+2] shellbox-syntaxhighlight: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912854 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [13:49:09] @Daimona, I think there is another issue. Nobody has organiser user right on beta now? [13:49:18] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [13:49:23] Let me finish looking at the logs and also see if cmelo of HouseOfM found anything else [13:49:31] Oh, as I said :D [13:49:36] (ack) [13:49:47] All users should have the rights on beta [13:50:31] (03CR) 10Cathal Mooney: [C: 03+2] Avoid creating EVPN import policy with default accept if no Vlans [homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [13:50:46] The control says no users do [13:51:03] (03Merged) 10jenkins-bot: Avoid creating EVPN import policy with default accept if no Vlans [homer/public] - 10https://gerrit.wikimedia.org/r/912846 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [13:51:47] https://en.wikipedia.beta.wmflabs.org/wiki/Special:ListGroupRights#user lists the right `campaignevents-organize-events` under `Users` [13:52:37] Uhm, let me see [13:53:04] Also, the patch being deployed should not affect beta at all, so if there's an issue it's likely pre-existing [13:53:12] I think I found the reason why no users have more the permission on beta, I think we need to add this on $wgGroupPermissions['user']['campaignevents-organize-events'] = true; [13:53:24] on CommonSettings-labs.php [13:53:43] is it right Daimona? [13:54:19] Ohhhh right, beta inherits from prod, doesn't it? [13:54:28] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10RobH) [13:54:30] So it's the previous patch that caused this [13:54:40] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10RobH) All the items currently listed are online according to icinga, removed and resolving task. 
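Note: the diagnosis above is that beta starts from the production group permissions, so the production restriction from 910055 also removed the right from ordinary users on beta unless a beta-only override restores it. A rough Python sketch of that layering, under stated assumptions: apply_overrides is a hypothetical helper, and the shape of the production settings (in particular the False entry for 'user') is a guess at how the restriction is expressed, not taken from the patch.

```python
import copy

RIGHT = "campaignevents-organize-events"


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of base with per-group rights from overrides layered on top."""
    merged = copy.deepcopy(base)
    for group, rights in overrides.items():
        merged.setdefault(group, {}).update(rights)
    return merged


# Assumed shape of the production settings after 910055:
# only the beta-tester group holds the right.
prod = {"user": {RIGHT: False}, "campaignevents-beta-tester": {RIGHT: True}}

# Without a beta-specific override, beta ends up with the production restriction.
beta_before = apply_overrides(prod, {})

# Python rendering of the override proposed in the chat for CommonSettings-labs.php.
beta_after = apply_overrides(prod, {"user": {RIGHT: True}})
```

In this sketch beta_before leaves ordinary users without the right, while beta_after restores it and leaves the production dictionary untouched.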
[13:54:51] (03PS2) 10Elukey: modules: allow istio gateways to have more selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/912850 [13:54:54] (03PS2) 10Elukey: fast-api: bump the istio ingress module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912855 [13:54:55] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T335299 (10RobH) 05Open→03Resolved [13:54:56] (03PS2) 10Elukey: ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/912856 [13:55:08] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912854 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [13:55:20] It's already a bit late -- TheresNoTime: would there be enough time to deploy a beta config patch now? [13:55:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10RobH) [13:55:57] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335295 (10RobH) 05Open→03Resolved All items in this task desc were online according to manual checks (on 3 of them) and icinga check (for all of them). removed and resolving. [13:56:00] Daimona: not while holding that other patch [13:56:23] I think that one can go, unless cmelo or HouseOfM found anything in prod [13:56:24] (03PS1) 10Muehlenhoff: Stop installing ruby-safe-yaml [puppet] - 10https://gerrit.wikimedia.org/r/912864 (https://phabricator.wikimedia.org/T330495) [13:56:35] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335294 (10RobH) 05Open→03Resolved All items online via manual check and icinga check, old false positives. 
[13:56:38] RECOVERY - IPMI Sensor Status on mw2331 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:56:38] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T335294 (10RobH) [13:56:39] (03CR) 10Krinkle: "Clean run in brief:" [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) (owner: 10Krinkle) [13:56:47] if it gets sync'd then a beta-only patch can be done whenever really [13:57:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10fgiunchedi) >>! In T335027#8795348, @ayounsi wrote: > Thanks for the quick reply! This now works: > ` > prometheus1006:~$ curl lsw1-e8-eqiad.... [13:57:50] (03PS1) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [13:58:21] Should we be able to see the field in meta? 
[13:58:33] I think that one can go, but I am not able to see the organizer field there yet [13:58:36] on meta [13:59:19] (03CR) 10CI reject: [V: 04-1] ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/912856 (owner: 10Elukey) [13:59:21] (03PS1) 10Daimona Eaytoy: beta: Restore campaignevents-organize-events right for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912867 (https://phabricator.wikimedia.org/T334088) [13:59:23] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [13:59:30] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [13:59:36] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [13:59:44] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [13:59:46] going to sync [14:00:00] (PowerSupply) resolved: Power Supply - Status - issue on aqs2008:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=aqs2008 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:00:03] Daimona: ^ [14:00:19] Yes, the field should be visible on meta. @cmelo: do you still have mwdebug on? [14:00:22] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:00:31] TheresNoTime: sure, thanks [14:00:40] I can see it now with mediawikidebug, so it is fine [14:01:05] (03PS4) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 [14:01:15] Nice. 
[14:01:30] I've also written a patch for beta: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/912867/ [14:01:31] (NodeTextfileStale) resolved: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:02:12] (03CR) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [14:02:57] (03PS1) 10Majavah: Use shell webservice-runner for jdk17, ruby27 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912868 (https://phabricator.wikimedia.org/T293552) [14:03:01] Lucas_WMDE: just syncing 910056 then I've got to go -- there are 3 config patches left in the window, and a beta-only config 912867 [14:03:05] ok [14:03:16] plus another one I wanted to bring back (the Wikidata languages) [14:03:46] (03CR) 10Cmelo: [C: 03+1] beta: Restore campaignevents-organize-events right for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912867 (https://phabricator.wikimedia.org/T334088) (owner: 10Daimona Eaytoy) [14:04:26] TheresNoTime: thanks for your patience [14:04:49] Ty [14:04:49] No problem! :D [14:04:51] I'm sorry that this took longer than expected [14:04:53] (03CR) 10Krinkle: Remove 1024px and 1920px from pre-gen thumbsizes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [14:04:56] yes, thank you!!! 
[14:05:00] * TheresNoTime has the easy job :P [14:05:38] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:910056|Enable $wgCampaignEventsEnableMultipleOrganizers in production (T334088)]] (duration: 38m 35s) [14:05:38] T334088: Enable the multiple organizers feature in production - https://phabricator.wikimedia.org/T334088 [14:05:38] 910056 live on production [14:05:38] Lucas_WMDE: all yours [14:06:12] ok [14:06:12] jouncebot: nowandnext [14:06:12] No deployments scheduled for the next 1 hour(s) and 54 minute(s) [14:06:12] In 1 hour(s) and 54 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1600) [14:06:45] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) sretest1002 @jbond @Volans this server name is already in use can this be named sretest1003? [14:07:49] ok, so only cmelo’s config changes were deployed so far iiuc? [14:07:59] dcausse and koi: still around? [14:08:11] Lucas_WMDE: yes I'm around :) [14:08:23] ok :) [14:08:34] * Lucas_WMDE tries to remember wth labtestwiki is [14:08:37] but happy to move my patch for later if you prefer :) [14:08:47] nah, I think we can keep deploying for a bit [14:08:47] yeah [14:08:54] …a test wikitech? 
[14:08:56] Lucas_WMDE: this patch does not require any testing [14:09:03] TIL labtestwikitech.wikimedia.org [14:09:09] ok [14:09:20] (03PS2) 10Lucas Werkmeister (WMDE): labtestwiki: disable cirrus completion index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912337 (owner: 10DCausse) [14:09:23] yes it has few config issues and that's why we want to disable this maint script there [14:09:36] * Lucas_WMDE frowns at config variable that’s apparently set as 'yes' and 'no' instead of true/false [14:09:43] :) [14:09:49] (03PS1) 10Majavah: prometheus::blackbox: detect ip_families automatically [puppet] - 10https://gerrit.wikimedia.org/r/912872 [14:09:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912337 (owner: 10DCausse) [14:09:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1003.eqiad.wmnet with OS bullseye [14:09:57] heh, `scap backsport` – not the worst typo :D [14:10:01] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: - ml-cache1003 (**WARN**) - Down... [14:10:08] ow my back is hurting. 
I’ve done too much backsport [14:10:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [14:10:45] (03Merged) 10jenkins-bot: labtestwiki: disable cirrus completion index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912337 (owner: 10DCausse) [14:11:14] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:912337|labtestwiki: disable cirrus completion index]] [14:11:18] (03CR) 10CI reject: [V: 04-1] Unbreak WikitechPhabBan mechanism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [14:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:59] (03PS3) 10MarcoAurelio: Unbreak WikitechPhabBan mechanism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) [14:12:04] (03CR) 10CI reject: [V: 04-1] prometheus::blackbox: detect ip_families automatically [puppet] - 10https://gerrit.wikimedia.org/r/912872 (owner: 10Majavah) [14:12:46] !log lucaswerkmeister-wmde@deploy1002 dcausse and lucaswerkmeister-wmde: Backport for [[gerrit:912337|labtestwiki: disable cirrus completion index]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:13:31] !log installing curl security updates on buster [14:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:42] (03PS2) 10Majavah: prometheus::blackbox: detect ip_families automatically [puppet] - 10https://gerrit.wikimedia.org/r/912872 [14:14:22] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add Andrew McAllister [puppet] - 
10https://gerrit.wikimedia.org/r/912843 (https://phabricator.wikimedia.org/T335437) (owner: 10Marostegui) [14:14:32] RECOVERY - IPMI Sensor Status on aqs2008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:15:27] ok, nothing to test I hear, continuing [14:15:53] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:16:06] Lucas_WMDE: thanks for the deploy! :) [14:16:10] np [14:16:17] I’m just a bit scatterbrained today ^^ [14:17:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:17:56] (03PS4) 10MarcoAurelio: Unbreak WikitechPhabBan mechanism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) [14:18:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Marostegui) 05Open→03Resolved a:03Marostegui I have merged the change and created the kerberos principal (the user was already part of NDA ldap group). @A... 
[14:18:35] looks like there haven’t been any changes on cnwikimedia in the past 30 days, so the closure wouldn’t interrupt any ongoing work afaict [14:18:55] yes, actually no changes for years [14:19:05] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 23 NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40933/console" [puppet] - 10https://gerrit.wikimedia.org/r/912872 (owner: 10Majavah) [14:19:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10AndrewTavis_WMDE) Thank you, @Marostegui! πŸ™ I did receive the email to set up my password. Appreciate all the assistance! :) [14:19:45] yeah, last changes according to allrevisions API were in jan 2022 [14:19:59] last logevents at the same time [14:20:02] and that was just massmessage, it seems [14:20:45] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:912337|labtestwiki: disable cirrus completion index]] (duration: 09m 31s) [14:22:38] (03PS2) 10Lucas Werkmeister (WMDE): Close cnwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911308 (https://phabricator.wikimedia.org/T274083) (owner: 10Stang) [14:23:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911308 (https://phabricator.wikimedia.org/T274083) (owner: 10Stang) [14:23:37] Lucas_WMDE: how many more patches do you have? 
[14:23:57] (03Merged) 10jenkins-bot: Close cnwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911308 (https://phabricator.wikimedia.org/T274083) (owner: 10Stang) [14:24:26] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:911308|Close cnwikimedia (T274083)]] [14:24:31] T274083: close cnwikimedia - https://phabricator.wikimedia.org/T274083 [14:24:35] (03PS3) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [14:25:06] I'm not familiar with how to close a wiki, is it needed to run some script? [14:25:17] legoktm: the currently ongoing one, one more for koi, and then one more for myself [14:25:18] !log restarting apache/FPM on mw canaries to pick up curl update [14:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:21] koi: I’m not sure [14:25:33] ok [14:25:39] * Lucas_WMDE looks at T334482 [14:25:42] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [14:25:52] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:911308|Close cnwikimedia (T274083)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:25:55] doesn’t seem like any maint script was needed [14:25:57] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [14:26:07] https://wikitech.wikimedia.org/wiki/Close_a_wiki seems to be the relevant documentation [14:26:09] koi: can you test on mwdebug? 
[14:26:14] taavi: thanks, looking [14:26:16] looking [14:26:17] (03CR) 10David Caro: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [14:26:27] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [14:26:53] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [14:26:59] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [14:27:06] (03PS2) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [14:27:18] (03CR) 10Majavah: [C: 03+1] "Minor nit inline, otherwise LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:27:23] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [14:27:29] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [14:27:33] (03CR) 10Majavah: [C: 04-1] profile:toolforge:harbor: setup blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) [14:27:58] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [14:28:02] groupOverrides isn’t in IS.php anymore, cool cool [14:28:04] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [14:28:06] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:28:12] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [14:28:27] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) 
(owner: 10Andrew Bogott) [14:28:31] but no cnwikimedia in core-Permissions.php [14:28:34] so I think that part is a no-op [14:28:55] I thought Lucas_WMDE looks good, when I trying to edit a page, it said "The action you have requested is limited to users in the group: Stewards" [14:29:01] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [14:29:03] ok [14:29:13] (it used to be "Users" [14:29:46] (03CR) 10Krinkle: Unbreak WikitechPhabBan mechanism (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [14:29:50] seems we need to add a notice in https://cn.wikimedia.org/wiki/MediaWiki:Sitenotice, I'll post a request on steward noticeboard later [14:30:07] alright, syncing [14:30:59] (03PS3) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [14:31:16] updating wikimedia/portals doesn’t seem to be needed either [14:31:17] (03CR) 10Legoktm: [C: 03+1] Use shell webservice-runner for jdk17, ruby27 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912868 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [14:32:12] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5020,cp5028 [puppet] - 10https://gerrit.wikimedia.org/r/912851 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:32:20] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [14:34:15] (03PS1) 10Superpes15: [cawikibooks] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912874 (https://phabricator.wikimedia.org/T331823) [14:35:09] oh right and we switched datacenters again didn’t we [14:35:19] so I should not be SSHed into mwmaint2002 ^^ 
[14:35:23] (03CR) 10BryanDavis: [C: 03+1] mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [14:35:26] * Lucas_WMDE just saw the banner [14:35:32] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:911308|Close cnwikimedia (T274083)]] (duration: 11m 05s) [14:35:37] T274083: close cnwikimedia - https://phabricator.wikimedia.org/T274083 [14:35:50] (03PS3) 10Lucas Werkmeister (WMDE): lowiki: Use Western style (0-9) numerals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912815 (https://phabricator.wikimedia.org/T335345) (owner: 10Stang) [14:36:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912815 (https://phabricator.wikimedia.org/T335345) (owner: 10Stang) [14:36:11] !log restarting varnish on cp5020 and cp5028 to drop port 80 - T322774 [14:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:05] (03Merged) 10jenkins-bot: lowiki: Use Western style (0-9) numerals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912815 (https://phabricator.wikimedia.org/T335345) (owner: 10Stang) [14:37:30] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:912815|lowiki: Use Western style (0-9) numerals (T335345)]] [14:37:34] T335345: Change numerals from lao to arabic for lowikipedia - https://phabricator.wikimedia.org/T335345 [14:37:54] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5020,cp5028 [puppet] - 10https://gerrit.wikimedia.org/r/912875 (https://phabricator.wikimedia.org/T322774) [14:38:33] (03CR) 10MarcoAurelio: Unbreak WikitechPhabBan mechanism (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [14:38:53] !log 
lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:912815|lowiki: Use Western style (0-9) numerals (T335345)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:39:45] Lucas_WMDE, I opened a random page's history and it looks fine [14:39:50] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5020,cp5028 [puppet] - 10https://gerrit.wikimedia.org/r/912875 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:39:54] (03PS4) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [14:39:56] (03CR) 10Eevans: [C: 03+2] "Apologies, I'd have sworn that I did this... 😕" [puppet] - 10https://gerrit.wikimedia.org/r/911934 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [14:40:09] (03PS1) 10Jforrester: Replace references to actionsToolbar [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911804 (https://phabricator.wikimedia.org/T335469) [14:40:15] (03CR) 10MarcoAurelio: Unbreak WikitechPhabBan mechanism (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [14:40:18] koi: ack, thanks [14:40:21] syncing then [14:40:56] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5028 is CRITICAL: connect to address 10.132.0.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [14:40:57] (03CR) 10JHathaway: [C: 03+1] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/912864 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:41:12] (03CR) 10Muehlenhoff: [C: 03+2] Stop installing ruby-safe-yaml [puppet] - 10https://gerrit.wikimedia.org/r/912864 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [14:41:21] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [14:42:00] (03PS4) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [14:42:03] (03CR) 10David Caro: toolforge: add pingthing to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:42:04] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5020 is CRITICAL: connect to address 10.132.0.24 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [14:42:29] (03CR) 10CI reject: [V: 04-1] toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:42:32] (03CR) 10Eevans: [C: 03+1] "Apologies, I'd have sworn that I did this... 
😕" [puppet] - 10https://gerrit.wikimedia.org/r/912839 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [14:42:44] (03CR) 10Elukey: [C: 03+2] cassandra: update 'dev' version to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/912839 (https://phabricator.wikimedia.org/T331712) (owner: 10Elukey) [14:43:00] (03PS5) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [14:43:17] (03PS3) 10Lucas Werkmeister (WMDE): Add language codes cal and tpv to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) (owner: 10Noa wmde) [14:43:21] (03CR) 10David Caro: toolforge: add pingthing to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:44:20] (03PS5) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [14:44:23] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [14:44:32] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:44:38] RECOVERY - IPMI Sensor Status on elastic2050 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:44:44] (03CR) 10CI reject: [V: 04-1] toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:45:11] (03PS2) 10Superpes15: [cawikibooks] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912874 (https://phabricator.wikimedia.org/T331823) [14:45:13] (03PS1) 10Superpes15: [cawikinews] Add a 
wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912877 (https://phabricator.wikimedia.org/T331823) [14:45:40] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.473 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:46:24] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:912815|lowiki: Use Western style (0-9) numerals (T335345)]] (duration: 08m 53s) [14:46:26] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/912844 (owner: 10David Caro) [14:46:28] T335345: Change numerals from lao to arabic for lowikipedia - https://phabricator.wikimedia.org/T335345 [14:46:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) (owner: 10Noa wmde) [14:47:22] (03Merged) 10jenkins-bot: Add language codes cal and tpv to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912290 (https://phabricator.wikimedia.org/T308062) (owner: 10Noa wmde) [14:47:48] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:912290|Add language codes cal and tpv to wmgExtraLanguageNames (T308062)]] [14:47:52] T308062: Carolinian language (ISO code: cal) and Tanapag (ISO code: tpv) label support on Wikidata - https://phabricator.wikimedia.org/T308062 [14:49:08] PROBLEM - Host an-worker1147 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:13] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and noa: Backport for [[gerrit:912290|Add language codes cal and tpv to wmgExtraLanguageNames (T308062)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:49:18] checking [14:50:16] seems to work on https://www.wikidata.org/wiki/Special:SetLabel/Q4115189/cal and 
https://www.wikidata.org/wiki/Special:SetLabel/Q4115189/tpv [14:50:18] syncing [14:50:44] RECOVERY - Host an-worker1147 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:50:52] (PuppetCertificateAboutToExpire) resolved: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:50:58] PROBLEM - IPMI Sensor Status on an-worker1147 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:51:11] (03CR) 10JHathaway: [C: 03+1] vendor_modules: update augeasproviders_core to 3.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/912803 (owner: 10Jbond) [14:51:21] (03CR) 10JHathaway: [C: 03+1] vendor_modules: update dnsquery module [puppet] - 10https://gerrit.wikimedia.org/r/912804 (owner: 10Jbond) [14:52:53] (03PS3) 10Superpes15: [cawikibooks] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912874 (https://phabricator.wikimedia.org/T331823) [14:52:55] (03PS2) 10Superpes15: [cawikinews] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912877 (https://phabricator.wikimedia.org/T331823) [14:52:58] (03PS1) 10Superpes15: [cawikiquote] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912880 (https://phabricator.wikimedia.org/T331823) [14:54:32] (JobUnavailable) firing: Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:44] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:912290|Add language codes cal and 
tpv to wmgExtraLanguageNames (T308062)]] (duration: 07m 55s) [14:55:49] T308062: Carolinian language (ISO code: cal) and Tanapag (ISO code: tpv) label support on Wikidata - https://phabricator.wikimedia.org/T308062 [14:56:29] !log UTC afternoon backport+config window done [14:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:33] legoktm: ^ [14:56:41] ty :D [14:56:43] (03PS1) 10EoghanGaffney: [gitlab/failover] Switch primary from codfw->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/912881 [14:56:52] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10Jhancock.wm) 05Open→03Resolved found loose power cable on redundant. replaced and secured. alert has cleared. [14:57:07] (03CR) 10CI reject: [V: 04-1] [gitlab/failover] Switch primary from codfw->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/912881 (owner: 10EoghanGaffney) [14:57:23] (03CR) 10BryanDavis: [C: 04-1] "I don't think this would fix whatever problem it thinks it is fixing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [14:57:44] (03PS2) 10EoghanGaffney: [gitlab/failover] Switch primary from codfw->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/912881 (https://phabricator.wikimedia.org/T335504) [14:57:46] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10Jhancock.wm) 05Open→03Resolved found loose power cable. might have been tangled up not sure. replaced and secured. alert has cleared. [14:58:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) backup1010 E1. U5. PORT5. CABLIEID 20220246 backup1011 F1. 
U5 PORT5 CABLIEID 20220245 [14:58:30] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2330.codfw.wmnet [14:58:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2330.codfw.wmnet [14:58:58] !log repooling mw2330.codfw.wmnet - T335487 [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] T335487: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 [14:59:58] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [15:00:41] (03PS6) 10David Caro: toolforge: add pingthing to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/912844 [15:01:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:29] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5020,cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/912883 (https://phabricator.wikimedia.org/T322774) [15:01:45] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2330.codfw.wmnet - https://phabricator.wikimedia.org/T335487 (10Clement_Goubert) Thanks, back in service :) [15:02:35] (03PS2) 10Vgutierrez: hiera: Disable http->https in varnish on cp5019,cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/912883 (https://phabricator.wikimedia.org/T322774) [15:02:55] (03PS1) 10Superpes15: [cawikisource] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912884 (https://phabricator.wikimedia.org/T331823) [15:03:02] (03PS1) 10FNegri: d/changelog: Prepare for 0.94 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/912885 [15:03:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.902% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [15:03:19] (03PS1) 10Jforrester: 
[function-orchestrator] Update image reference, now it's from GitLab [deployment-charts] - 10https://gerrit.wikimedia.org/r/912886 [15:03:21] (03PS1) 10Jforrester: [function-evaluator] Update image reference, now it's from GitLab [deployment-charts] - 10https://gerrit.wikimedia.org/r/912887 [15:03:39] (03PS2) 10FNegri: d/changelog: Prepare for 0.94 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/912885 (https://phabricator.wikimedia.org/T331336) [15:03:45] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (10elukey) a:03klausman The eqiad cluster is on bullseye, these are the steps needed after a reimage to make a node work again: ` elukey@ml-cache1003:~$ sudo chown -R cassandra:cass... [15:04:12] I'm live hacking on mwdebug1001 [15:04:26] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:56] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/912785 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [15:05:29] (03CR) 10David Caro: "LGTM, will wait for a +1 from others though" [puppet] - 10https://gerrit.wikimedia.org/r/912872 (owner: 10Majavah) [15:06:14] (03PS1) 10Superpes15: [cawiktionary] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912888 (https://phabricator.wikimedia.org/T331823) [15:06:24] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5019,cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/912883 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [15:07:50] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [15:08:22] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5019,cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/912889 (https://phabricator.wikimedia.org/T322774) [15:09:33] (JobUnavailable) firing: (2) Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:55] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:39] (03PS3) 10Elukey: modules: allow istio gateways to have more selectors [deployment-charts] - 10https://gerrit.wikimedia.org/r/912850 [15:10:41] (03PS3) 10Elukey: fast-api: bump the istio ingress module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/912855 [15:10:43] (03PS3) 10Elukey: ml-services: deploy ores-legacy on a separate istio gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/912856 [15:10:52] !log restarting varnish on cp5019 and cp5027 to drop port 80 - T322774 [15:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5019,cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/912889 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [15:13:33] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:37] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply 
[15:14:14] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:14:35] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5019 is CRITICAL: connect to address 10.132.0.19 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [15:15:43] (03CR) 10Jbond: [C: 04-1] prometheus::blackbox: detect ip_families automatically (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/912872 (owner: 10Majavah) [15:15:55] (03CR) 10Ladsgroup: Remove 1024px and 1920px from pre-gen thumbsizes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) (owner: 10Ladsgroup) [15:15:58] (03PS2) 10Ladsgroup: Remove 1024px and 1920px from pre-gen thumbsizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912837 (https://phabricator.wikimedia.org/T211661) [15:16:03] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:57] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) >>! In T288629#8809671, @JMeybohm wrote: > Can you elaborate/point me to the discussion on how... [15:17:32] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:17:32] (03PS3) 10Herron: kafkamon: add bullseye role and node assignments [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) [15:17:50] (ThanosCompactIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [15:17:53] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 128 bytes in 0.471 second response time https://wikitech.wikimedia.org/wiki/Varnish [15:17:56] (03CR) 10Herron: kafkamon: add bullseye role and node assignments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:18:26] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:19:09] (03PS3) 10Krinkle: mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) [15:19:33] (JobUnavailable) firing: (2) Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:05] (03CR) 10Krinkle: [C: 03+2] mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [15:20:55] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:20:57] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:21:03] (03Merged) 10jenkins-bot: mc: Fix accidental mcrouter prefix $wgWANObjectCache on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912421 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [15:21:03] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:21:32] !log legoktm@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/shellbox-constraints: apply [15:21:38] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:21:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2331.codfw.wmnet [15:21:39] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2331.codfw.wmnet [15:21:58] !log repooled mw2331.codfw.wmnet - T335486 [15:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:01] T335486: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 [15:22:12] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:22:17] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:22:20] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:22:25] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [15:22:27] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [15:22:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:22:55] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add label to prometheus6001 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912407 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [15:23:21] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: PSU failure for mw2331.codfw.wmnet - https://phabricator.wikimedia.org/T335486 (10Clement_Goubert) Thanks, repooled. 
[15:23:40] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:58] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add label to prometheus6001 data blocks to prevent data duplication [puppet] - 10https://gerrit.wikimedia.org/r/912407 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [15:24:02] (03CR) 10David Caro: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/912885 (https://phabricator.wikimedia.org/T331336) (owner: 10FNegri) [15:25:41] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:27:01] 10SRE, 10LDAP-Access-Requests: Add user xcollazo to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T335445 (10xcollazo) Thanks @Marostegui. For my understanding, where do we keep this LDAP group assignment? I know we keep a bunch of group memberships at https://gerrit.wikimedia.org/r/plugins... [15:28:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:29:15] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:57] !log krinkle@deploy1002 Synchronized wmf-config/mc.php: Ia174ea2b0645 (duration: 06m 05s) [15:29:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] vendor_modules: update augeasproviders_core to 3.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/912803 (owner: 10Jbond) [15:30:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] vendor_modules: update dnsquery module [puppet] - 10https://gerrit.wikimedia.org/r/912804 (owner: 10Jbond) [15:30:45] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: 
httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:31:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) [15:31:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) sretest1002 A6 U35 PORT 31 CABLEID 1917 [15:31:37] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) [15:32:37] the rsync-doc stuff is known issue but should be fixed soon-ish [15:32:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:32:44] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [15:32:45] by switching to quickdatacopy class [15:32:48] (03PS1) 10Volans: tox: no bandit request_without_timeout in tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912892 [15:32:50] (03PS1) 10Volans: requests: rename TypeTimeout to TimeoutType [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 [15:32:52] (03PS1) 10Volans: dns: clarify type and adhere to dnspython type [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/912894 [15:33:33] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:41] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [15:33:47] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [15:34:21] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [15:34:27] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [15:34:51] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:56] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [15:35:02] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:35:49] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:35:56] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [15:37:05] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [15:37:22] (03CR) 10CI reject: [V: 04-1] tox: no bandit request_without_timeout in tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912892 (owner: 10Volans) [15:37:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Jclark-ctr) [15:38:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 (owner: 10Volans) [15:39:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:39:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912894 (owner: 10Volans) [15:39:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Jclark-ctr) [15:40:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Jclark-ctr) puppetmaster1006 B3 U27. PORT. 22 CABLEID 3913 [15:41:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:42:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10Jclark-ctr) [15:42:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10Jclark-ctr) pki-root1002 B8 U12 PORT. 9 CABLIEID 1136 [15:42:59] (03PS1) 10Legoktm: Point SyntaxHighlight at /srv/app/pygmentize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912895 (https://phabricator.wikimedia.org/T320848) [15:43:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:44:20] (03CR) 10Herron: "should prometheus[3-6]002 hosts have replica_label 'b' since prometheus[3-6]001 have replica_label 'a'?" [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [15:46:24] (03CR) 10Herron: [C: 03+2] "thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/912341 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:48:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jclark-ctr) frav1003 C1 U15. PORT 24 CABLEID 1870 , 1871 [15:52:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install sretest1002 - https://phabricator.wikimedia.org/T334393 (10Volans) >>! In T334393#8810727, @Jclark-ctr wrote: > sretest1002 @jbond @Volans this server name is already in use can this be named sretest1003? @Jclark-ctr co... [15:52:46] (03CR) 10Volans: [V: 03+2 C: 03+2] "The CI failure is fixed in the next CR in the series" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912892 (owner: 10Volans) [15:52:50] (03CR) 10Volans: [C: 03+2] requests: rename TypeTimeout to TimeoutType [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 (owner: 10Volans) [15:52:58] (03CR) 10Volans: [C: 03+2] dns: clarify type and adhere to dnspython type [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912894 (owner: 10Volans) [15:53:25] (03Abandoned) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [15:53:50] (03PS1) 10Vgutierrez: hiera: Disable http->https in varnish on cp5018,cp5026 [puppet] - 10https://gerrit.wikimedia.org/r/912896 (https://phabricator.wikimedia.org/T322774) [15:54:58] (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable http->https in varnish on cp5018,cp5026 [puppet] - 10https://gerrit.wikimedia.org/r/912896 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [15:55:09] !log upload puppetboard_4.3.0-1_all.deb to bookworm-wikimedia [15:55:10] (03CR) 10Andrea Denisse: prometheus: Add label to prometheus3002 data blocks to prevent data duplication (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [15:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:37] (03CR) 10Jdlrobson: "recheck" [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912378 (owner: 10Jdlrobson) [15:57:09] Hi folks, would any deployer be willing to deploy a beta-only config change remaining from the earlier B&C window? [15:57:14] (03CR) 10CI reject: [V: 04-1] tox: no bandit request_without_timeout in tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912892 (owner: 10Volans) [15:57:16] (03CR) 10CI reject: [V: 04-1] requests: rename TypeTimeout to TimeoutType [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 (owner: 10Volans) [15:57:18] (03CR) 10CI reject: [V: 04-1] dns: clarify type and adhere to dnspython type [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912894 (owner: 10Volans) [15:58:04] !log restarting varnish on cp5018 and cp5026 to drop port 80 - T322774 [15:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:29] (03PS1) 10Vgutierrez: hiera: Enable http->https in haproxy on cp5018,cp5026 [puppet] - 10https://gerrit.wikimedia.org/r/912898 (https://phabricator.wikimedia.org/T322774) [15:59:40] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host kafkamon1003.eqiad.wmnet [15:59:41] !log herron@cumin1001 START - Cookbook sre.dns.netbox [16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1600). [16:00:06] No Gerrit patches in the queue for this window AFAICS. 
[16:00:23] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable http->https in haproxy on cp5018,cp5026 [puppet] - 10https://gerrit.wikimedia.org/r/912898 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [16:01:50] (03CR) 10Volans: [C: 03+2] requests: rename TypeTimeout to TimeoutType [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 (owner: 10Volans) [16:01:52] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kafkamon1003.eqiad.wmnet - herron@cumin1001" [16:01:56] (03CR) 10Volans: [C: 03+2] dns: clarify type and adhere to dnspython type [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912894 (owner: 10Volans) [16:05:31] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kafkamon1003.eqiad.wmnet - herron@cumin1001" [16:05:31] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:31] !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache kafkamon1003.eqiad.wmnet on all recursors [16:05:31] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafkamon1003.eqiad.wmnet on all recursors [16:05:50] (03Merged) 10jenkins-bot: requests: rename TypeTimeout to TimeoutType [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912893 (owner: 10Volans) [16:06:20] (03Merged) 10jenkins-bot: dns: clarify type and adhere to dnspython type [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912894 (owner: 10Volans) [16:10:25] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Papaul) [16:13:34] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.2.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912900 [16:15:44] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 
(10Jhancock.wm) [16:16:20] (03PS6) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:17:42] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:18:33] (03PS7) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:18:50] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.2.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912900 (owner: 10Volans) [16:19:11] (03PS8) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:19:21] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon1003.eqiad.wmnet - herron@cumin1001" [16:20:24] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kafkamon1003.eqiad.wmnet - herron@cumin1001" [16:20:24] !log herron@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kafkamon1003.eqiad.wmnet [16:21:06] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:23:34] (03CR) 10Filippo Giunchedi: prometheus: Add label to prometheus3002 data blocks to prevent data duplication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [16:24:01] (03PS9) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 
10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:24:56] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/912900 (owner: 10Volans) [16:25:20] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:27:20] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Ganeti: consider --no-wait-for-sync as a default option for instance creation - https://phabricator.wikimedia.org/T335522 (10herron) p:05Triage→03Medium [16:28:42] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:43] (03PS1) 10Volans: Upstream release v1.2.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/912903 [16:28:56] (03CR) 10Volans: [C: 03+2] Upstream release v1.2.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/912903 (owner: 10Volans) [16:32:56] (03PS10) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:34:18] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:34:24] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Papaul) [16:36:05] (03PS11) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:36:30] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Papaul) [16:36:46] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - 
degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:27] (03PS12) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:40:02] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:20] (03CR) 10David Caro: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:40:52] (03CR) 10CI reject: [V: 04-1] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:42:24] (03CR) 10DCausse: [C: 03+1] search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [16:42:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:42:50] (03CR) 10CI reject: [V: 04-1] data.yaml: Add Tricia Burmeister [puppet] - 10https://gerrit.wikimedia.org/r/912906 (https://phabricator.wikimedia.org/T334628) (owner: 10Triciaburmeister) [16:48:44] (03CR) 10Dzahn: "unfortunately a one-liner won't be enough here. 
You are currently in the "ldap_only_admins" section but the new group is an actual shell (" [puppet] - 10https://gerrit.wikimedia.org/r/912906 (https://phabricator.wikimedia.org/T334628) (owner: 10Triciaburmeister) [16:50:02] (03PS13) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:50:05] (03PS1) 10Jcrespo: Set both eqiad and codfw mediabackups to by default, backup Commons [puppet] - 10https://gerrit.wikimedia.org/r/912926 (https://phabricator.wikimedia.org/T327157) [16:50:51] (03CR) 10Triciaburmeister: data.yaml: Add Tricia Burmeister (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912906 (https://phabricator.wikimedia.org/T334628) (owner: 10Triciaburmeister) [16:51:09] (03PS2) 10Jcrespo: mediabackups: Set both dcs to backup Commons by default [puppet] - 10https://gerrit.wikimedia.org/r/912926 (https://phabricator.wikimedia.org/T327157) [16:51:15] (03Abandoned) 10Triciaburmeister: data.yaml: Add Tricia Burmeister [puppet] - 10https://gerrit.wikimedia.org/r/912906 (https://phabricator.wikimedia.org/T334628) (owner: 10Triciaburmeister) [16:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [16:52:46] (03PS14) 10Andrew Bogott: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) [16:53:17] (03CR) 10ArielGlenn: [C: 03+1] "There are other ways that this can fail of course, such as intermittent DNS service failures on the upstream side, which we've seen before" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) 
(owner: 10Jbond) [16:54:27] (03CR) 10Dzahn: "you don't have to necessarily abandon it, we could also amend to it but either way is fine" [puppet] - 10https://gerrit.wikimedia.org/r/912906 (https://phabricator.wikimedia.org/T334628) (owner: 10Triciaburmeister) [16:54:29] (03CR) 10Andrew Bogott: [C: 03+2] Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:54:50] (03CR) 10Krinkle: "Applied in beta cluster on deployment host and on mw host, mwscript works on both as expected (deploy host no-op, still succeed without pr" [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [16:55:49] (03Merged) 10jenkins-bot: Add rabbitmq_network_partition alert [alerts] - 10https://gerrit.wikimedia.org/r/912865 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:56:20] !log uploaded python3-wmflib_1.2.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [16:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:25] cc moritzm ^^^ [16:56:48] (03Abandoned) 10Jdlrobson: Don't mutate given schema in mapSchema() [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/912378 (owner: 10Jdlrobson) [16:56:59] (03Abandoned) 10Jdlrobson: Map schema should not have side effects and map marks field [extensions/Graph] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911802 (https://phabricator.wikimedia.org/T335335) (owner: 10Jdlrobson) [16:57:07] (03PS1) 10Herron: ganeti: enable --no-wait-for-sync by default [software/spicerack] - 10https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) [16:59:11] (03PS3) 10Jcrespo: mediabackups: Set both dcs to backup Commons by default [puppet] - 10https://gerrit.wikimedia.org/r/912926 (https://phabricator.wikimedia.org/T327157) [17:00:07] bd808: Time to snap out of 
that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1700). [17:00:07] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1700) [17:00:40] * bd808 peeks into the TE deploy possibilities pile [17:01:10] !log herron@cumin1001 START - Cookbook sre.ganeti.reimage for host kafkamon1003.eqiad.wmnet with OS bullseye [17:01:10] (03CR) 10CI reject: [V: 04-1] ganeti: enable --no-wait-for-sync by default [software/spicerack] - 10https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: 10Herron) [17:01:22] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q3): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by herron@cumin1001 for host kafkamon1003.eqiad.wmnet with OS bullseye [17:03:06] (03PS2) 10Herron: ganeti: enable --no-wait-for-sync by default [software/spicerack] - 10https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) [17:06:54] !log deploy2002 - armed the keyholder (sudo keyholder arm and enter passphrase from deployment-key-passphrase in pwstore) - monitoring alert should resolve - T335435 [17:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:57] T335435: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 [17:07:53] (03CR) 10CI reject: [V: 04-1] ganeti: enable --no-wait-for-sync by default [software/spicerack] - 10https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) (owner: 10Herron) [17:08:01] I don't have any services to deploy this week. And it looks like Krinkle deployed the config patch for Wikitech, so nothing for me to deploy today. 
[17:08:36] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) Thanks all for the input. In that case.. I think all that was left to do here was to arm the keyholder again, after the server reboot. And I just did that above. Monitoring can stay as it is th... [17:08:37] bd808: btw, did any graphs come down? I didn't look.. [17:08:49] (03PS1) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) [17:08:59] I was just going to check myself Krinkle :) [17:09:05] k [17:09:48] effie: does this look ok to you? --^^ [17:09:48] 10SRE, 10serviceops: keyholder on inactive deployment server - https://phabricator.wikimedia.org/T335435 (10Dzahn) 05Open→03Resolved a:03Dzahn https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:10:46] (03Abandoned) 10MarcoAurelio: Unbreak WikitechPhabBan mechanism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/911803 (https://phabricator.wikimedia.org/T335510) (owner: 10MarcoAurelio) [17:11:10] duesen: I am here, what is up ? [17:11:11] !log hnowlan@deploy1002 Started deploy [restbase/deploy@a08f56d]: Deploying new wikis: T333272 T334460 T334741 T335020 [17:11:19] T333272: Add kbdwiktionary to RESTBase - https://phabricator.wikimedia.org/T333272 [17:11:19] T335020: Add fatwiki to RESTBase - https://phabricator.wikimedia.org/T335020 [17:11:20] T334741: Add kcgwiktionary to RESTBase - https://phabricator.wikimedia.org/T334741 [17:11:20] T334460: Add guwwikinews to RESTBase - https://phabricator.wikimedia.org/T334460 [17:11:44] effie: I made a config patch for enabling parsoid cache warming jobs for all of group0: https://phabricator.wikimedia.org/T329366) [17:11:58] Is that too bold? 
[17:12:08] Krinkle: It looks very quiet since the change -- https://logstash.wikimedia.org/goto/2b9d02613fd59924adb5558d10cff70a -- I will try some of the reproduction cases I guess now. [17:12:22] (03CR) 10Dzahn: [C: 03+2] "We will have to work on it anyways, it's currently down, we want to create a replacement VM on bullseye and changing the name now will mea" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [17:13:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:11] duesen: I am afraid I missed your ping on the task, very sorry about that [17:14:30] np, that's what we have irc for ;) [17:14:41] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@a08f56d]: Deploying new wikis: T333272 T334460 T334741 T335020 (duration: 03m 29s) [17:15:03] (03CR) 10Dzahn: [C: 03+2] "also deleted the "labs" name from DNS zones in devtools - to prevent confusion and since it should never come back as this" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn) [17:15:24] my suggestion is we have a go on Tuesday afternoon, we have some folks already out for the long weekend [17:16:02] while we can discuss internally on Tuesday morning, and do the first step [17:16:15] how does that sound? [17:17:58] effie: I'm out myself, I'll only be fully back on Thursday. We can do it on Wednesday morning, or Thursday, or push it by a week. 
[17:18:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:36] (03PS3) 10Herron: ganeti: enable --no-wait-for-sync by default [software/spicerack] - 10https://gerrit.wikimedia.org/r/912928 (https://phabricator.wikimedia.org/T335522) [17:18:43] effie: do you think group0 is ok though? Do we need to try with a smaller set first? [17:20:04] I will have a better answer for you after Tue, my personal opinion is that it is worth the risk. If we merge it ourselves, who else we could ping to help us out eg. ensure that everything is working as inteded ? [17:20:08] intended* [17:20:09] (KeyholderUnarmed) resolved: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:20:31] (03CR) 10Jcrespo: [C: 03+2] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40939/console" [puppet] - 10https://gerrit.wikimedia.org/r/912926 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [17:23:21] (03CR) 10Herron: [C: 03+1] prometheus: Add label to prometheus3002 data blocks to prevent data duplication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: 10Andrea Denisse) [17:27:06] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@5a46db1] (releasing): (no justification provided) [17:27:44] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@5a46db1] (releasing): (no justification provided) (duration: 00m 40s) [17:28:30] (03PS1) 10Herron: kafkamon: add monitoring bullseye yaml [puppet] - 10https://gerrit.wikimedia.org/r/912930 (https://phabricator.wikimedia.org/T335424) [17:30:11] 
(03PS1) 10Volans: distros: add bookworm-wikimedia to known distros [puppet] - 10https://gerrit.wikimedia.org/r/912931 [17:30:11] (03CR) 10Herron: [C: 03+2] kafkamon: add monitoring bullseye yaml [puppet] - 10https://gerrit.wikimedia.org/r/912930 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [17:32:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10KFrancis) >>! In T335437#8809837, @Marostegui wrote: > Confirmed L3 signed. > @odimitrijevic or @Ottomata I need your approval as the request is for analytics-privatedata-users > @An... [17:33:17] (03PS1) 10Herron: kafkamon: add monitoring bullseye clusters [puppet] - 10https://gerrit.wikimedia.org/r/912932 (https://phabricator.wikimedia.org/T335424) [17:33:36] (03PS2) 10Herron: kafkamon: add monitoring bullseye clusters [puppet] - 10https://gerrit.wikimedia.org/r/912932 (https://phabricator.wikimedia.org/T335424) [17:33:42] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:36] (03CR) 10Herron: [C: 03+2] kafkamon: add monitoring bullseye clusters [puppet] - 10https://gerrit.wikimedia.org/r/912932 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [17:35:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:35:53] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafkamon1003.eqiad.wmnet with reason: host reimage [17:39:03] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafkamon1003.eqiad.wmnet with reason: host reimage [17:42:57] (03CR) 10Ladsgroup: "do you mean group1? 
group0 is basically testwiki and mediawiki (and a whole bunch of closed wikis) so this will be effectively noop" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912929 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [17:46:40] (03PS5) 10Ebernhardson: search: Report age of titlesuggest indices to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) [17:49:32] (JobUnavailable) firing: (2) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:24] (03PS1) 10Jcrespo: recentuploads: Set custom headers for querying the mediawiki api [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912935 (https://phabricator.wikimedia.org/T327157) [17:51:25] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=1) for host kafkamon1003.eqiad.wmnet with OS bullseye [17:51:35] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q3): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by herron@cumin1001 for host kafkamon1003.eqiad.wmnet with OS bullseye completed: - kafkamon1003 (**FAIL**) - Remov... [17:51:40] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q3): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by herron@cumin1001 for host kafkamon1003.eqiad.wmnet with OS bullseye executed with errors: - kafkamon1003 (**FAIL**... 
[17:51:59] (03CR) 10CI reject: [V: 04-1] recentuploads: Set custom headers for querying the mediawiki api [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912935 (https://phabricator.wikimedia.org/T327157) (owner: 10Jcrespo) [17:52:50] (03PS2) 10Jcrespo: recentuploads: Set custom headers for querying the mediawiki api [software/mediabackups] - 10https://gerrit.wikimedia.org/r/912935 (https://phabricator.wikimedia.org/T327157) [17:53:58] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service,burrow-logging-eqiad.service,burrow-main-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:32] (JobUnavailable) firing: (3) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:59:33] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:05] jeena and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T1800). 
[18:00:51] backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/911804 before train [18:02:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911804 (https://phabricator.wikimedia.org/T335469) (owner: 10Jforrester) [18:04:39] jeena: thanks [18:05:08] 🙂 [18:05:51] MatmaRex: would you be able to verify once it's on debug servers? [18:06:05] sure [18:06:15] thanks! [18:13:08] (03PS1) 10Andrea Denisse: prometheus: Show transfer progress when migrating data [puppet] - 10https://gerrit.wikimedia.org/r/912937 (https://phabricator.wikimedia.org/T309979) [18:14:13] (03CR) 10Andrea Denisse: "Hi, this is a small patch to get more visibility during Prometheus migrations." [puppet] - 10https://gerrit.wikimedia.org/r/912937 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [18:14:32] (JobUnavailable) firing: (4) Reduced availability for job thanos-sidecar in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:15:20] ^ That's m, I'm working with Prometheus. [18:20:35] (03Merged) 10jenkins-bot: Replace references to actionsToolbar [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/911804 (https://phabricator.wikimedia.org/T335469) (owner: 10Jforrester) [18:21:00] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:911804|Replace references to actionsToolbar (T335469)]] [18:21:05] T335469: Uncaught TypeError: can't access property "items", this.actionsToolbar.getToolGroupByName(...) 
is null - https://phabricator.wikimedia.org/T335469 [18:22:49] !log jhuneidi@deploy1002 jhuneidi and jforrester: Backport for [[gerrit:911804|Replace references to actionsToolbar (T335469)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [18:23:09] MatmaRex: ready ^ [18:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [18:30:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) [18:31:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) NAT mappings in place. Tested frbast1002 usage via a host file change on my local host with my current config and the stock config we provi... [18:31:18] I haven't reproduced the error so continuing with backport [18:32:52] jeena: very sorry, i'm back now [18:33:02] np, I was able to verify :) [18:34:25] (hmm, on an unrelated note, i just got logged out from all wikis. i guess that's normal, but noting just in case) [18:35:00] oh hmm I don't really know when that should happen [18:37:17] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:911804|Replace references to actionsToolbar (T335469)]] (duration: 16m 10s) [18:37:28] T335469: Uncaught TypeError: can't access property "items", this.actionsToolbar.getToolGroupByName(...) 
is null - https://phabricator.wikimedia.org/T335469 [18:37:46] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:23] rolling wmf.6 to all wikis now [18:39:06] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912941 (https://phabricator.wikimedia.org/T330212) [18:39:08] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912941 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:39:27] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (10JKieserman) [18:40:10] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912941 (https://phabricator.wikimedia.org/T330212) (owner: 10TrainBranchBot) [18:42:22] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:47:02] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.6 refs T330212 [18:47:10] T330212: 1.41.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T330212 [18:53:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/912931 (owner: 10Volans) [18:54:42] (03PS1) 10Ryan Kemper: wdqs: make uptime sli a % [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/912944 [18:57:16] (03CR) 10Gehel: [C: 03+1] "LGTM" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/912944 (owner: 10Ryan Kemper) [18:57:21] (03PS2) 10Ryan Kemper: wdqs: make uptime sli a % [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/912944 (https://phabricator.wikimedia.org/T323064) [19:03:16] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.808% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:03:59] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@f162f4d]: Deploying T333001 on platform_eng Airflow instance. [19:04:04] T333001: Setup for allowing Airflow deployment via Git Repository - https://phabricator.wikimedia.org/T333001 [19:04:32] (03PS3) 10Ryan Kemper: wdqs: make uptime sli a % [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/912944 (https://phabricator.wikimedia.org/T323064) [19:04:35] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: make uptime sli a % [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/912944 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [19:08:32] (03PS1) 10Catrope: tests: Don't fail "composer buildConfig" when there are changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912947 [19:10:40] (03PS2) 10Catrope: tests: Don't fail "composer diffConfig" when there are changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912947 [19:14:15] (03CR) 10Dzahn: [C: 03+1] "though it won't show overall progress, only progress per each file, I think" [puppet] - 10https://gerrit.wikimedia.org/r/912937 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [19:14:30] (03PS68) 10Jbond: puppetserver: add puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/895356 (https://phabricator.wikimedia.org/T330490) [19:14:32] (03PS1) 10Jbond: get_config: add specific get_config script for puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/912949 [19:16:00] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@f162f4d]: Deploying T333001 on platform_eng Airflow instance. 
(duration: 12m 01s) [19:16:05] T333001: Setup for allowing Airflow deployment via Git Repository - https://phabricator.wikimedia.org/T333001 [19:16:53] (03CR) 10Jbond: get_config: add specific get_config script for puppet7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912949 (owner: 10Jbond) [19:17:03] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10BCornwall) @ayounsi Thanks for the report! I have a naive question: Would it be possible/more correct to interface confctl/etcd rather than a cookbook? That (by my observa... [19:17:47] 10SRE, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Papaul) [19:18:23] 10SRE: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Papaul) a:03Jgreen [19:18:36] 10SRE: Q3:rack/setup/install X - https://phabricator.wikimedia.org/T334505 (10Papaul) @Jgreen all yours [19:19:19] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Papaul) a:03Jgreen [19:19:47] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Papaul) @Jgreen all yours [19:27:00] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@bc37201]: (no justification provided) [19:27:11] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@bc37201]: (no justification provided) (duration: 00m 10s) [19:30:17] (03PS13) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [19:31:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T335437 (10Dzahn) @Marostegui It's called "NDA and MOU: Volunteer accounts with Server and LDAP-level access...". 
Members of "sre" should be able to see it. [19:34:14] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:41:38] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10BBlack) I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the p... [19:48:52] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:49:06] PROBLEM - Query Service HTTP Port on wdqs2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:49:16] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:49:26] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:49:32] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:50:04] PROBLEM - WDQS SPARQL on wdqs2012 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 414 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:50:08] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-categories.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:13] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:31] ^Anyone messing with this? 
[19:51:16] OOM [19:52:20] RECOVERY - Query Service HTTP Port on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [19:52:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:52:35] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2012 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:52:40] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2012 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:53:16] RECOVERY - WDQS SPARQL on wdqs2012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.276 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:53:38] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:54:28] (03CR) 10BryanDavis: [C: 03+1] "Per legoktm's comment on T320848, this should probably sit until at least 2023-04-28 to increase our confidence in the the new shellbox co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912895 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [19:54:45] ^I restarted the failed systemd services. 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service and wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service are being difficult [19:55:13] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:55:23] ryankemper: ^ [19:55:51] gehel: thanks [19:55:57] Not sure why yet but they're just saying "nope, the exporters aren't running" and just exit [19:56:13] brett: the exporter service will fail to start if blazegraph (which produces the metrics the exporter wants to export) is failing [19:56:16] and now that they're running they don't exit :/ [19:56:20] taking a look now, thanks for jumping in [19:56:28] ryankemper: blazegraph was running, AFAIK [19:56:59] > Service prometheus-blazegraph-exporter-wdqs-categories not present or not running [19:57:04] > > wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service: Main process exited, code=exited, status=1/FAILURE [19:57:51] systemctl start prometheus-blazegraph-exporter-wdqs-categories && systemctl start wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service works [19:58:02] RECOVERY - Check systemd state on wdqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:59:08] (03PS1) 10Kosta Harlan: GrowthExperiments: Undeploy topic match mode [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/912959 (https://phabricator.wikimedia.org/T335205) [19:59:39] (03CR) 10Kosta Harlan: [C: 04-2] "Wait for Idebb03746b538eb5340c85aa582797dd4af3b4ad to be in group2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912959 (https://phabricator.wikimedia.org/T335205) (owner: 10Kosta Harlan) [20:00:04] brennen and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230427T2000). [20:00:04] superpes, cmelo, and Daimona: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:13] (SystemdUnitFailed) resolved: (5) wdqs-blazegraph.service Failed on wdqs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:23] brett: ah yes, specifically the blazegraph exporter needs `wdqs-blazegraph` and the categories one `wdqs-categories`. Looks like you got it figured out, I see all the units up and running [20:00:58] o/ [20:01:37] this host wdqs2012 was used to transfer a blazegraph journal to a not-yet-in-service-host `wdqs2022` (that transfer has finished), I think probably the downtime expired thus the alerts. anyway services are running now, will leave `wdqs2012` depooled while it catches up on lag [20:01:55] I see. 
Thanks for the context [20:02:31] * TheresNoTime can deploy [20:03:07] cmelo: will do your beta-only one first [20:03:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912867 (https://phabricator.wikimedia.org/T334088) (owner: 10Daimona Eaytoy) [20:03:36] PROBLEM - Check systemd state on cloudbackup2001 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-tools-project.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:20] (03Merged) 10jenkins-bot: beta: Restore campaignevents-organize-events right for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912867 (https://phabricator.wikimedia.org/T334088) (owner: 10Daimona Eaytoy) [20:05:40] cmelo: Daimona: that'll be live on beta in the next 5 mins or so [20:05:48] ty [20:05:48] Superpes: ready? [20:06:03] Thanks! [20:06:04] HI TheresNoTime YEP [20:06:14] Ops sorry for the caps lmao [20:06:49] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912874 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:07:43] (03Merged) 10jenkins-bot: [cawikibooks] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912874 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:08:05] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912877 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:08:11] Superpes: your *so* ready ;) [20:08:27] Lol [20:08:53] (03Merged) 10jenkins-bot: [cawikinews] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912877 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:09:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912880 
(https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:10:05] (03Merged) 10jenkins-bot: [cawikiquote] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912880 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:10:26] !log samtar@deploy1002 Started scap: Backport for [[gerrit:912874|[cawikibooks] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912877|[cawikinews] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912880|[cawikiquote] Add a wordmark (Vector 2022) (T331823)]] [20:10:31] T331823: Add the Catalan Wikiquote, Wikibooks, Wiktionary, Wikisource, and Wikinews correct wordmark in the new vector skin - https://phabricator.wikimedia.org/T331823 [20:11:32] Superpes: doing `cawikibooks`, `cawikinews` and `cawikiquote` together [20:11:45] (03PS2) 10Samtar: [cawikisource] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912884 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:11:49] !log samtar@deploy1002 samtar and superpes: Backport for [[gerrit:912874|[cawikibooks] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912877|[cawikinews] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912880|[cawikiquote] Add a wordmark (Vector 2022) (T331823)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:12:00] Ok will test these [20:12:04] :) [20:12:22] (03PS2) 10Samtar: [cawiktionary] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912888 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:12:43] (FTR: confirming that the beta config change is now live and working) [20:12:50] Daimona: \o/ [20:13:13] Now we're done for real, thanks again :) [20:13:21] Ciao Daimona :) [20:13:48] Everything is fine TheresNoTime thanks :) [20:13:54] syncing those [20:14:12] Ciao Superpes, buon deployment ;) [20:14:37] (03CR) 10Samtar: [C: 
03+2] "prep for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912884 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:15:23] (03Merged) 10jenkins-bot: [cawikisource] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912884 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:15:42] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912888 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:16:31] (03Merged) 10jenkins-bot: [cawiktionary] Add a wordmark (Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912888 (https://phabricator.wikimedia.org/T331823) (owner: 10Superpes15) [20:20:09] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:912874|[cawikibooks] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912877|[cawikinews] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912880|[cawikiquote] Add a wordmark (Vector 2022) (T331823)]] (duration: 09m 43s) [20:20:14] T331823: Add the Catalan Wikiquote, Wikibooks, Wiktionary, Wikisource, and Wikinews correct wordmark in the new vector skin - https://phabricator.wikimedia.org/T331823 [20:20:28] !log samtar@deploy1002 Started scap: Backport for [[gerrit:912884|[cawikisource] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912888|[cawiktionary] Add a wordmark (Vector 2022) (T331823)]] [20:20:30] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:14] thank you! 
[20:21:47] !log samtar@deploy1002 superpes and samtar: Backport for [[gerrit:912884|[cawikisource] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912888|[cawiktionary] Add a wordmark (Vector 2022) (T331823)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:21:51] Superpes: `cawikisource` and `cawiktionary` now :) [20:21:55] cmelo: you're welcome :) [20:22:17] TheresNoTime They're good too! :D [20:22:23] syncing [20:23:18] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:52] (03CR) 10BryanDavis: [C: 03+2] tcl86: switch base image to bullseye [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912343 (https://phabricator.wikimedia.org/T335420) (owner: 10BryanDavis) [20:26:02] (03CR) 10BryanDavis: [C: 03+2] Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [20:26:39] * TheresNoTime should create a playlist for their "deployment music" [20:26:44] (03Merged) 10jenkins-bot: tcl86: switch base image to bullseye [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912343 (https://phabricator.wikimedia.org/T335420) (owner: 10BryanDavis) [20:26:46] (03Merged) 10jenkins-bot: Remove jessie and stretch image configuration [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/911953 (owner: 10BryanDavis) [20:26:54] (03PS2) 10BryanDavis: Use shell webservice-runner for jdk17, ruby27 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912868 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [20:27:01] (03CR) 10BryanDavis: [C: 03+2] Use shell webservice-runner for jdk17, ruby27 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912868 (https://phabricator.wikimedia.org/T293552) (owner: 
10Majavah) [20:27:41] (03Merged) 10jenkins-bot: Use shell webservice-runner for jdk17, ruby27 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/912868 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [20:27:43] TheresNoTime: Highway to Hell? [20:27:43] TheresNoTime: deployment music would be fun [20:27:43] (03PS1) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [20:27:47] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:912884|[cawikisource] Add a wordmark (Vector 2022) (T331823)]], [[gerrit:912888|[cawiktionary] Add a wordmark (Vector 2022) (T331823)]] (duration: 07m 19s) [20:27:55] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service,burrow-logging-eqiad.service,burrow-main-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:55] T331823: Add the Catalan Wikiquote, Wikibooks, Wiktionary, Wikisource, and Wikinews correct wordmark in the new vector skin - https://phabricator.wikimedia.org/T331823 [20:27:55] Superpes: all live [20:28:03] (03CR) 10CI reject: [V: 04-1] OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:28:17] Many thanks TheresNoTime :) [20:29:23] (03PS2) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [20:29:45] (03CR) 10CI reject: [V: 04-1] OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:29:53] !log close UTC late backport window [20:29:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:03] (03PS1) 10EoghanGaffney: [gitlab/failover] Swap DNS entries for gitlab [dns] - 10https://gerrit.wikimedia.org/r/912972 (https://phabricator.wikimedia.org/T335504) [20:41:03] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) [20:44:10] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) [20:44:33] (03PS6) 10Ryan Kemper: search: Report age of titlesuggest indices to prom [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [20:48:00] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/912979 [20:48:14] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) [20:48:51] (03PS2) 10CDanis: add tunnelencabulator [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) [20:50:29] (03PS3) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [20:50:44] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jgreen) [20:52:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [20:52:30] (03PS7) 10Ryan Kemper: search: Report age of titlesuggest indices to prom [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 
10Ebernhardson) [20:53:43] (03PS8) 10Ryan Kemper: search: Report age of titlesuggest indices to prom [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [20:55:42] (03CR) 10Ebernhardson: [C: 03+1] search: Report age of titlesuggest indices to prom [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [20:56:19] (03PS4) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [20:58:27] (03CR) 10Ryan Kemper: [C: 03+2] search: Report age of titlesuggest indices to prom [puppet] - 10https://gerrit.wikimedia.org/r/911940 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [21:01:09] (03PS5) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [21:02:45] (03CR) 10Jforrester: "Without this how will we signal to Jenkins that there is (or is not) a diff? The only signal we have to emit is the exit code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912947 (owner: 10Catrope) [21:04:21] (03PS1) 10Jgreen: Add frack hosts frauth2002,frmon2002,frbast2002, remove host frbackup2001. 
[dns] - 10https://gerrit.wikimedia.org/r/912985 (https://phabricator.wikimedia.org/T334505) [21:07:15] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/output/912979/40942/" [puppet] - 10https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [21:08:40] (03PS5) 10Herron: kafkamon: transition to firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) [21:08:45] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Show transfer progress when migrating data [puppet] - 10https://gerrit.wikimedia.org/r/912937 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [21:16:01] (03CR) 10Cwhite: [C: 03+2] grafana: raise metadata fetch error [puppet] - 10https://gerrit.wikimedia.org/r/911842 (https://phabricator.wikimedia.org/T335413) (owner: 10Cwhite) [21:16:03] (03CR) 10Dwisehaupt: [C: 03+2] Add frack hosts frauth2002,frmon2002,frbast2002, remove host frbackup2001. [dns] - 10https://gerrit.wikimedia.org/r/912985 (https://phabricator.wikimedia.org/T334505) (owner: 10Jgreen) [21:19:59] 10SRE, 10fundraising-tech-ops, 10Patch-For-Review: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) [21:20:18] 10SRE, 10fundraising-tech-ops, 10Patch-For-Review: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jgreen) [21:20:31] (03PS2) 10Cwhite: prometheus::ops: add demo node exporter job for SONiC [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) [21:20:52] (03CR) 10Cwhite: prometheus::ops: add demo node exporter job for SONiC (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: 10Cwhite) [21:25:34] (03PS6) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [21:27:24] 
(03PS7) 10Andrew Bogott: OpenStack: add a clouds.yaml file for environment setup [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) [21:29:18] (03CR) 10Andrew Bogott: "This change will means anyone on a host with this file (e.g. cloudcontrols) will be able to access APIs as novaadmin without sudo." [puppet] - 10https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [21:58:37] (03CR) 10Cwhite: [C: 03+2] prometheus::ops: add demo node exporter job for SONiC [puppet] - 10https://gerrit.wikimedia.org/r/910083 (https://phabricator.wikimedia.org/T335027) (owner: 10Cwhite) [22:07:23] (03PS1) 10Zabe: Start writing to af_actor/afh_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912994 (https://phabricator.wikimedia.org/T334295) [22:08:52] jouncebot: nowandnext [22:08:52] No deployments scheduled for the next 7 hour(s) and 51 minute(s) [22:08:52] In 7 hour(s) and 51 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230428T0600) [22:09:23] (03CR) 10Zabe: [C: 03+2] Start writing to af_actor/afh_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912994 (https://phabricator.wikimedia.org/T334295) (owner: 10Zabe) [22:10:10] (03Merged) 10jenkins-bot: Start writing to af_actor/afh_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912994 (https://phabricator.wikimedia.org/T334295) (owner: 10Zabe) [22:10:46] !log zabe@deploy1002 Started scap: T334295 [22:10:54] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [22:17:45] !log zabe@deploy1002 Finished scap: T334295 (duration: 06m 58s) [22:17:50] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295 [22:18:02] (03PS1) 10Ebernhardson: Update extra plugin to 7.10.2-wmf8 [software/elasticsearch/plugins] - 
https://gerrit.wikimedia.org/r/912995 (https://phabricator.wikimedia.org/T332355) [22:23:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [22:44:02] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:50:33] (PS14) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [22:52:24] (PS15) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [22:52:52] (PS1) BryanDavis: Remove jessie-sssd & stretch-sssd image configuration [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/912997 [22:56:30] (PS16) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [22:57:55] (CR) BryanDavis: [C: +2] "Self +2 should be uncontroversial. 
This was meant to have been part of 7fc6331 but none of us noticed that I forgot to add the base images" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/912997 (owner: BryanDavis) [22:58:31] (Merged) jenkins-bot: Remove jessie-sssd & stretch-sssd image configuration [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/912997 (owner: BryanDavis) [23:03:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 4.712% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:06:08] (PS1) Cwhite: opensearch: add disable_security_plugin option [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T335027) [23:06:58] (PS2) Cwhite: opensearch: add disable_security_plugin option [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) [23:08:08] SRE, Infrastructure-Foundations, netops, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (colewhite) [23:08:22] (PS3) Cwhite: opensearch: add disable_security_plugin option [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) [23:13:03] (PS1) Cwhite: hiera: disable security plugin on beta-logs [puppet] - https://gerrit.wikimedia.org/r/912391 (https://phabricator.wikimedia.org/T333732) [23:14:39] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/912390/40945/" [puppet] - https://gerrit.wikimedia.org/r/912390 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite) [23:24:08] (CR) Catrope: tests: Don't fail "composer diffConfig" when there are changes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/912947 (owner: Catrope) [23:24:13] (Abandoned) Catrope: tests: Don't fail "composer 
diffConfig" when there are changes [mediawiki-config] - https://gerrit.wikimedia.org/r/912947 (owner: Catrope) [23:43:57] (CR) Cwhite: [C: +1] "Nice work!" [puppet] - https://gerrit.wikimedia.org/r/912979 (https://phabricator.wikimedia.org/T335424) (owner: Herron) [23:45:58] (CR) Cwhite: [C: +1] prometheus: Add label to prometheus5002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912385 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse) [23:46:11] (CR) Cwhite: [C: +1] prometheus: Add label to prometheus4002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912383 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse) [23:46:21] (CR) Cwhite: [C: +1] prometheus: Add label to prometheus3002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912381 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse) [23:46:30] (CR) Cwhite: [C: +1] prometheus: Add label to prometheus6002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912409 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse) [23:51:27] (PS1) Catrope: labs: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/913002 (https://phabricator.wikimedia.org/T319064)