[00:00:13] (SystemdUnitFailed) resolved: puppet-agent-timer.service Failed on apifeatureusage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:17] (PS1) Superpes15: [slwiki] Enable VisualEditor on Draft and Project namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/913687 (https://phabricator.wikimedia.org/T335208)
[00:09:00] (PS2) Superpes15: [slwiki] Enable VisualEditor on Draft and Project namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/913687 (https://phabricator.wikimedia.org/T335208)
[00:23:29] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:34] (PS1) Superpes15: [frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/913689 (https://phabricator.wikimedia.org/T335642)
[00:30:03] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:47] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:39:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/912403
[00:39:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/912403 (owner: TrainBranchBot)
[00:56:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) -
https://gerrit.wikimedia.org/r/912403 (owner: TrainBranchBot)
[00:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[01:45:26] (CR) Tim Starling: [C: +1] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[02:09:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[03:00:05] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:15] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:14] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.273% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space -
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:13:05] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:09] (PS1) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:18:03] (PS2) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:21:44] (PS3) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:25:19] (PS4) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:44:47] (PS5) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:45:36] (PS6) RLazarus: Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083)
[03:50:01] (CR) RLazarus: "Preview: https://grafana.wikimedia.org/dashboard/snapshot/7l2nIPrmQiNbRfCptITXYZsBfL0FpJL4" [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083) (owner: RLazarus)
[04:22:26] SRE, SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (SGupta-WMF)
[04:24:54] SRE, SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (SGupta-WMF)
[04:25:45] SRE, SRE-Access-Requests: Requesting
Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (SGupta-WMF)
[04:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[06:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[06:46:37] (PS1) Elukey: amd_rocm: fix package declaration on bullseye [puppet] - https://gerrit.wikimedia.org/r/913695
[06:48:19] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40968/console" [puppet] - https://gerrit.wikimedia.org/r/913695 (owner: Elukey)
[06:49:44] (CR) Elukey: [V: +1 C: +2] amd_rocm: fix package declaration on bullseye [puppet] - https://gerrit.wikimedia.org/r/913695 (owner: Elukey)
[07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:08:14] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.23% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:23:13] (DiskSpace) resolved: Disk space an-airflow1001:9100:/ 5.226% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:57:53] (PS10) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[08:15:40] SRE-swift-storage, Discovery-Search: Ensure swiftly access for non-SREs - https://phabricator.wikimedia.org/T335144 (Gehel)
[08:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[09:51:49] (CR) Majavah: [C: -1] "The formatting changes make this a bit hard to read but in general this seems good, except that you also need to install python3-click in " [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[09:52:02] (CR) Majavah: [C: -1] maintain_dbusers: add prometheus stats (3 comments) [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[09:52:59] (CR) David Caro: "Did not test the cli, but looks ok to me."
[puppet] - https://gerrit.wikimedia.org/r/913684 (owner: Andrew Bogott)
[09:57:16] (CR) Majavah: [C: -1] Use signed-by to include the Wikimedia repo starting with Bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913121 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1000)
[10:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[10:38:40] (PS1) David Caro: p:toolforge::prometheus: add rewrite module [puppet] - https://gerrit.wikimedia.org/r/913912
[10:39:22] (CR) Majavah: "too slow, I sent out https://gerrit.wikimedia.org/r/c/operations/puppet/+/913664 yesterday :-P" [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[10:40:17] (CR) David Caro: "Specially I0149a5b0bc9eeaf34a8bc6c8dbe7b339d2aa53ac will not work when enabled" [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[10:40:59] (CR) David Caro: p:toolforge::prometheus: add rewrite module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[10:41:18] (CR) David Caro: [C: +2] p:toolforge::prometheus: add rewrite module [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[10:41:33] (PS1) Superpes15: Close nawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674)
[10:42:13] (CR) CI reject: [V: -1] Close nawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674) (owner: Superpes15)
[10:46:28] (CR) Superpes15: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674) (owner: Superpes15)
[10:46:58] (CR) David Caro: [C: +2] p:toolforge::prometheus: add rewrite module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[10:57:04] (PS2) Majavah: P:toolforge::prometheus: reformat http definition [puppet] - https://gerrit.wikimedia.org/r/913664
[10:57:29] (CR) Majavah: p:toolforge::prometheus: add rewrite module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/913912 (owner: David Caro)
[11:12:56] (PS1) Zabe: Start writing to af_actor/afh_actor in group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/913915 (https://phabricator.wikimedia.org/T334295)
[11:15:49] jouncebot: nowandnext
[11:15:50] No deployments scheduled for the next 1 hour(s) and 44 minute(s)
[11:15:50] In 1 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1300)
[11:16:14] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913915 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[11:16:32] (PS3) Slyngshede: Requisition approval functionality. [software/bitu] - https://gerrit.wikimedia.org/r/911249
[11:16:49] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net.
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/913916 (owner: L10n-bot)
[11:17:00] (Merged) jenkins-bot: Start writing to af_actor/afh_actor in group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/913915 (https://phabricator.wikimedia.org/T334295) (owner: Zabe)
[11:17:47] !log zabe@deploy1002 Started scap: Backport for [[gerrit:913915|Start writing to af_actor/afh_actor in group0 wikis (T334295)]]
[11:17:51] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[11:20:21] !log zabe@deploy1002 sync-world aborted: Backport for [[gerrit:913915|Start writing to af_actor/afh_actor in group0 wikis (T334295)]] (duration: 02m 33s)
[11:20:40] !log zabe@deploy1002 Started scap: Backport for [[gerrit:913915|Start writing to af_actor/afh_actor in group0 wikis (T334295)]]
[11:24:54] 483 languages rebuilt out of 483
[11:24:56] hmm
[11:31:12] !log zabe@deploy1002 zabe: Backport for [[gerrit:913915|Start writing to af_actor/afh_actor in group0 wikis (T334295)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[11:31:15] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[11:38:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:59] (CR) David Caro: [C: +2] P:toolforge::prometheus: reformat http definition [puppet] - https://gerrit.wikimedia.org/r/913664 (owner: Majavah)
[11:41:48] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:913915|Start writing to af_actor/afh_actor in group0 wikis (T334295)]] (duration: 21m 08s)
[11:41:53] T334295: Write to af_actor/afh_actor in production - https://phabricator.wikimedia.org/T334295
[11:43:53] PROBLEM - Check unit status of
httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:48:16] (CR) David Caro: "Does this mean that now if dns fails, it will remove all the exports from the host? (thus shutting down NFS I guess)" [puppet] - https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336) (owner: Andrew Bogott)
[12:07:42] (CR) David Caro: [V: +1 C: +2] openstack: encapi: open up write access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: Majavah)
[12:07:58] (CR) David Caro: OpenStack: add a clouds.yaml file for environment setup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[12:08:13] (CR) David Caro: "hmpf... I failed to hit send xd, nm" [puppet] - https://gerrit.wikimedia.org/r/912965 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[12:10:10] (CR) David Caro: [C: +2] P:wmcs::metricsinfra: install blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/913122 (https://phabricator.wikimedia.org/T288067) (owner: Majavah)
[12:21:10] (PS4) Slyngshede: Requisition approval functionality.
[software/bitu] - https://gerrit.wikimedia.org/r/911249
[12:22:41] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (RobH)
[12:28:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:30:07] (CR) David Caro: "+1 from me if the questions are "yes, I'm sure it will never be empty" and "No, we did not get any of these logs" :)" [puppet] - https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: Majavah)
[12:30:47] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: Majavah)
[12:30:49] (CR) David Caro: [C: +2] openstack: puppet-enc: add endpoint for deleting entire projects [puppet] - https://gerrit.wikimedia.org/r/906086 (https://phabricator.wikimedia.org/T334127) (owner: Majavah)
[12:35:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:24] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/913660 (https://phabricator.wikimedia.org/T330759) (owner: Majavah)
[12:35:56] (CR) David Caro: [C: +2] openstack: envscript: update default port [puppet] - https://gerrit.wikimedia.org/r/913661 (owner: Majavah)
[12:35:59] (CR) David Caro: [C: +2] openstack: envscript: do not set a default for clouds_file [puppet] - https://gerrit.wikimedia.org/r/913660 (https://phabricator.wikimedia.org/T330759) (owner: Majavah)
[12:36:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:50:48] (CR) Majavah: openstack: admin_scripts: properly remove old projects from enc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: Majavah)
[12:52:04] (CR) David Caro: [C: +2] openstack: admin_scripts: properly remove old projects from enc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/906087 (https://phabricator.wikimedia.org/T334127) (owner: Majavah)
[12:57:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1300).
[13:00:05] Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] o/ I can deploy
[13:01:22] Superpes: around?
[13:06:18] Hi Taavi yep
[13:07:58] cool. is it fine to do all three at the same time or would you prefer to do them one at a time?
[13:07:58] I’ve only a doubt for the last patch! Should the wiki be removed from the desktop-improvements db?
[13:08:18] I saw another closed project still in
[13:08:32] Yep you can also deploy all three :)
[13:08:49] (Together)
[13:09:12] I think it's fine to leave it in, the readers/web people can adjust if needed
[13:09:16] (PS3) Majavah: [slwiki] Enable VisualEditor on Draft and Project namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/913687 (https://phabricator.wikimedia.org/T335208) (owner: Superpes15)
[13:09:21] (PS2) Majavah: [frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/913689 (https://phabricator.wikimedia.org/T335642) (owner: Superpes15)
[13:09:26] (PS3) Majavah: Close nawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674) (owner: Superpes15)
[13:09:28] Perfect! Thanks :)
[13:09:41] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913687 (https://phabricator.wikimedia.org/T335208) (owner: Superpes15)
[13:09:44] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913689 (https://phabricator.wikimedia.org/T335642) (owner: Superpes15)
[13:09:45] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674) (owner: Superpes15)
[13:10:54] (Merged) jenkins-bot: [slwiki] Enable VisualEditor on Draft and Project namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/913687 (https://phabricator.wikimedia.org/T335208) (owner: Superpes15)
[13:10:57] (Merged) jenkins-bot: [frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/913689 (https://phabricator.wikimedia.org/T335642) (owner:
Superpes15)
[13:11:00] (Merged) jenkins-bot: Close nawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/913913 (https://phabricator.wikimedia.org/T335674) (owner: Superpes15)
[13:11:16] !log taavi@deploy1002 Started scap: Backport for [[gerrit:913687|[slwiki] Enable VisualEditor on Draft and Project namespaces (T335208)]], [[gerrit:913689|[frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 (T335642)]], [[gerrit:913913|Close nawiki (T335674)]]
[13:11:22] T335674: Close na.wikipedia - https://phabricator.wikimedia.org/T335674
[13:11:22] T335642: Change the French Wikibooks logo - https://phabricator.wikimedia.org/T335642
[13:11:22] T335208: VisualEditor in the Draft and Wikipedia namespaces on the Slovenian Wikipedia - https://phabricator.wikimedia.org/T335208
[13:12:39] !log taavi@deploy1002 superpes and taavi: Backport for [[gerrit:913687|[slwiki] Enable VisualEditor on Draft and Project namespaces (T335208)]], [[gerrit:913689|[frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 (T335642)]], [[gerrit:913913|Close nawiki (T335674)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:12:45] Testing
[13:13:36] Everything is fine taavi :)
[13:13:40] awesome, syncing
[13:13:44] Thanks :D
[13:19:15] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:913687|[slwiki] Enable VisualEditor on Draft and Project namespaces (T335208)]], [[gerrit:913689|[frwikibooks] Change the logo for Vector legacy and add a wordmark for Vector 2022 (T335642)]], [[gerrit:913913|Close nawiki (T335674)]] (duration: 07m 59s)
[13:19:20] T335674: Close na.wikipedia - https://phabricator.wikimedia.org/T335674
[13:19:21] T335642: Change the French Wikibooks logo - https://phabricator.wikimedia.org/T335642
[13:19:21] T335208: VisualEditor in the Draft and Wikipedia namespaces on the Slovenian Wikipedia -
https://phabricator.wikimedia.org/T335208
[13:19:29] done!
[13:20:58] Perfect! Thanks for your time taavi :)
[13:31:03] (PS2) Ssingh: pybal/lvs: remove backward compatibility for buster [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309)
[13:32:37] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40969/console" [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:33:42] (CR) Ssingh: [V: +1] pybal/lvs: remove backward compatibility for buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:43:51] (PS2) Raymond Ndibe: webservice: add tool- prefix [software/tools-webservice] - https://gerrit.wikimedia.org/r/913681 (https://phabricator.wikimedia.org/T334657)
[13:55:27] (CR) Herron: [C: +1] Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083) (owner: RLazarus)
[13:58:35] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (phaultfinder)
[14:04:52] !log move ns1 from dns2001 to dns2002: T334049
[14:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:55] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049
[14:09:32] !log move backup routes for ns0 from dns2001 to dns2002: T334049
[14:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:12] SRE, ops-codfw, DBA, Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (Jhancock.wm) Create Dispatch: Success You have successfully submitted request SR167238531.
[14:26:56] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ssingh)
[14:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[14:30:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:35:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:39:49] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:46:21] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (Jclark-ctr) @BBlack dns1003 name is already in use. Should this be changed to dns100{4..6}
[14:48:23] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (ssingh) @Jclark-ctr: yes please, dns100[1-3] are currently in use, so we should do dns100[4-6].
[14:54:27] SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ssingh) Hi @Papaul: We are ready to start working on this, sorry for the delay!
The above plan sounds fine so let's coordinate when you plan to go in so that I ca...
[14:58:20] !log restart haproxy on cp1077: T334448
[14:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:25] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448
[15:05:48] (CR) David Caro: [C: +2] cloudlib: support https for fetching data [puppet] - https://gerrit.wikimedia.org/r/875896 (owner: Majavah)
[15:10:38] (PS4) Majavah: hieradata: use port 443 for enc access on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/874894
[15:10:40] (PS7) Majavah: openstack: encapi: drop legacy ports [puppet] - https://gerrit.wikimedia.org/r/874814
[15:10:42] (PS1) Majavah: hieradata: use port 443 for enc access on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/913947
[15:13:17] SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Papaul) @ssingh no need to be sorry and welcome back. You can decom the server you want first and once it's done just let me know which one. Thanks
[15:19:34] (CR) David Caro: [C: +2] hieradata: use port 443 for enc access on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/913947 (owner: Majavah)
[15:20:33] (CR) RLazarus: [C: +2] "Thanks!"
[grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083) (owner: RLazarus)
[15:21:07] (CR) RLazarus: [V: +2 C: +2] Add linkrecommendation SLO dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/913691 (https://phabricator.wikimedia.org/T278083) (owner: RLazarus)
[15:22:07] SRE, Discovery-Search, Traffic, API Platform (API Platform Roadmap): Generic strategy to deal with high volume / expensive traffic from cloud providers - https://phabricator.wikimedia.org/T326782 (Gehel)
[15:22:25] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (Jclark-ctr)
[15:23:23] jouncebot: next
[15:23:23] In 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1530)
[15:24:58] jouncebot: nowandnext
[15:24:58] No deployments scheduled for the next 0 hour(s) and 5 minute(s)
[15:24:58] In 0 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1530)
[15:24:58] Puppet, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (Gehel)
[15:28:22] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (BCornwall) > As a different approach, what if the reimaging cookbook printed out the role information from th...
[15:29:01] PROBLEM - Host mw2270 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1530).
[15:30:46] SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (BCornwall) Amir provided this on the ops mailing list: On 2023-04-29 14:12, Amir Sarabadani wrote: >Did we h...
[15:30:57] RECOVERY - Host mw2270 is UP: PING OK - Packet loss = 0%, RTA = 32.00 ms
[15:32:11] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (dancy)
[15:32:37] SRE-swift-storage, Discovery-Search: Ensure swiftly access for non-SREs - https://phabricator.wikimedia.org/T335144 (MPhamWMF) p:Triage→Medium
[15:33:52] (PS1) Ahmon Dancy: Update kask container image path [deployment-charts] - https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691)
[15:37:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:38:38] (PS1) Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383)
[15:39:52] (PS2) Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383)
[15:40:22] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) (owner: Eevans)
[15:41:17] (CR) Ahmon Dancy: "Don't merge yet!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: 10Ahmon Dancy) [15:42:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:42:22] urandom: hey i was told you the person to contact for sessionstore2001 is that right? [15:43:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10MPhamWMF) [15:43:16] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10aaron) Headers and body could be logged for 5XXs easily enough in the Swift backend. 
[15:43:26] (03PS3) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) [15:43:37] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:44:11] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [15:46:28] (03PS4) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) [15:47:30] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [15:49:43] (03PS5) 10Eevans: sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) [15:50:45] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [15:51:35] (03CR) 10Andrew Bogott: nfs-exportd: Don't crash out if a dns lookup fails (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336) (owner: 10Andrew Bogott) [15:51:44] (03PS3) 10Andrew Bogott: nfs-exportd: Don't crash out if a dns lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336) [15:54:05] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:56:43] PROBLEM - Host mw2268 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:05] RECOVERY - Host mw2268 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [15:59:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: maintenance [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:25] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore2001 to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913951 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [16:01:17] (03PS1) 10Ssingh: hiera: temporarily remove dns2001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/913952 (https://phabricator.wikimedia.org/T334049) [16:03:55] !log upgrading sessionstore2001 to Cassandra 3.11.14 — T335383 [16:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:59] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [16:13:41] (03PS1) 10Eevans: sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913954 (https://phabricator.wikimedia.org/T335383) [16:15:21] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913954 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [16:17:05] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore200[2-3] to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913954 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [16:19:02] (03PS1) 10Andrew Bogott: rabbitmq_network_partition: move the rabbitmq alert from 'cloud' to 
'eqiad' [alerts] - 10https://gerrit.wikimedia.org/r/913957 (https://phabricator.wikimedia.org/T335304) [16:19:55] !log upgrading sessionstore2002 to Cassandra 3.11.14 — T335383 [16:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:58] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [16:21:00] (03CR) 10David Caro: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/913957 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:22:01] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq_network_partition: move the rabbitmq alert from 'cloud' to 'eqiad' [alerts] - 10https://gerrit.wikimedia.org/r/913957 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:22:30] !log upgrading sessionstore2003 to Cassandra 3.11.14 — T335383 [16:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:28] (03Merged) 10jenkins-bot: rabbitmq_network_partition: move the rabbitmq alert from 'cloud' to 'eqiad' [alerts] - 10https://gerrit.wikimedia.org/r/913957 (https://phabricator.wikimedia.org/T335304) (owner: 10Andrew Bogott) [16:24:01] PROBLEM - IPMI Sensor Status on mw1466 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:26:58] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [16:27:39] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:28:00] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10FJoseph-WMF) Approved. 
[16:29:09] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:30:52] (03PS2) 10Urbanecm: dewiki: Deploy Growth features to 100% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) [16:30:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912233 (https://phabricator.wikimedia.org/T335385) (owner: 10Urbanecm) [16:33:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [16:37:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:912233|dewiki: Deploy Growth features to 100% of newcomers (T335385)]] [16:37:14] T335385: Increase Growth feature rollout at German Wikipedia to 100% - https://phabricator.wikimedia.org/T335385 [16:38:41] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:912233|dewiki: Deploy Growth features to 100% of newcomers (T335385)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [16:39:38] (03PS1) 10Ebernhardson: search: Fix collection of *_titlesuggest metric on small clusters [puppet] - 10https://gerrit.wikimedia.org/r/913959 (https://phabricator.wikimedia.org/T327199) [16:44:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:912233|dewiki: Deploy Growth features to 100% of newcomers (T335385)]] (duration: 07m 22s) [16:44:37] T335385: Increase Growth feature rollout at German Wikipedia to 100% - https://phabricator.wikimedia.org/T335385 [16:45:27] (03PS2) 10Urbanecm: [Growth] Remove config variables provided by extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 [16:45:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jclark-ctr) 
rdb1014. A6. U.9 PORT. 3 CABLEID 1030 rdb1013. B6. U.6 PORT. 9 CABLEID 1278 [16:49:13] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance [16:50:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10Jclark-ctr) dns1004. A6. U.8 PORT. 11 CABLEID 1038 dns1005. B6 U.5 PORT. 0 CABLEID 1969 dns1006. C6 U27. PORT.27 CABLEID 3249 [16:54:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance [16:54:20] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10BCornwall) I'm hesitant to the idea of creating an abstraction over an abstraction - I may be an outlier but my experience with depooling has been with confctl rather than... [16:57:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [16:58:53] (03PS1) 10Andrew Bogott: nfs-exportd.py: remove some dead code [puppet] - 10https://gerrit.wikimedia.org/r/913962 [16:59:48] 10SRE, 10Traffic, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10BCornwall) 05Open→03Resolved Since the larger network issues have been fixed, I'm going to close this as resolved. Further improvements suggested... [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1700) [17:00:05] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. 
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T1700). [17:02:05] jouncebot: rehash [17:02:10] jouncebot: refresh [17:02:10] I refreshed my knowledge about deployments. [17:04:26] (03CR) 10Andrew Bogott: mwopenstackclients3.py: add the ability to load auth creds from clouds.yaml (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/913684 (owner: 10Andrew Bogott) [17:04:36] (03PS2) 10Andrew Bogott: mwopenstackclients3.py: add the ability to load auth creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913684 [17:09:16] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3.py: add the ability to load auth creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913684 (owner: 10Andrew Bogott) [17:12:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) dbproxy1022. A6. U10. PORT.9 CABLEID 1036 dbproxy1023. B6. U7. PORT.4 CABLEID 1273 dbproxy1024. C6. U28. PORT. 28 CABLEID 3250 dbproxy1025... 
[17:14:45] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (10Jclark-ctr) a:03Jclark-ctr [17:22:12] (03PS1) 10Andrew Bogott: OpenStack observerenv: Add global clouds.yaml file with observer creds [puppet] - 10https://gerrit.wikimedia.org/r/913964 (https://phabricator.wikimedia.org/T330759) [17:22:45] (03PS1) 10Gergő Tisza: [noop] Disable section image recommendations in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913965 (https://phabricator.wikimedia.org/T329276) [17:28:06] (03CR) 10Ryan Kemper: [C: 03+1] search: Fix collection of *_titlesuggest metric on small clusters [puppet] - 10https://gerrit.wikimedia.org/r/913959 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [17:28:13] (03CR) 10Ryan Kemper: [C: 03+2] search: Fix collection of *_titlesuggest metric on small clusters [puppet] - 10https://gerrit.wikimedia.org/r/913959 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [17:30:58] (03PS1) 10Ssingh: ntp/codfw: point to dns2002 temporarily [dns] - 10https://gerrit.wikimedia.org/r/913966 (https://phabricator.wikimedia.org/T334049) [17:31:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) @Papaul Cables where connected to correct ports. i did swap cables while verifying Replaced Cable new cableid23030450... 
[17:31:50] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack observerenv: Add global clouds.yaml file with observer creds [puppet] - 10https://gerrit.wikimedia.org/r/913964 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [17:32:38] (03CR) 10Ssingh: [C: 03+2] ntp/codfw: point to dns2002 temporarily [dns] - 10https://gerrit.wikimedia.org/r/913966 (https://phabricator.wikimedia.org/T334049) (owner: 10Ssingh) [17:32:52] !log run authdns-update for CR 913966 [17:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:02] (03CR) 10Raymond Ndibe: [C: 03+2] webservice: add tool- prefix [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/913681 (https://phabricator.wikimedia.org/T334657) (owner: 10Raymond Ndibe) [17:35:50] (03Merged) 10jenkins-bot: webservice: add tool- prefix [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/913681 (https://phabricator.wikimedia.org/T334657) (owner: 10Raymond Ndibe) [17:42:08] (03CR) 10Gergő Tisza: [C: 03+1] "Would be nice to open a task about weeding some of these out of the code entirely." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912310 (owner: 10Urbanecm) [17:44:39] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:46:11] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:46:18] (03PS4) 10Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) [17:49:51] (03CR) 10Thcipriani: [C: 03+1] gitlab runner: allow node:* images [puppet] - 10https://gerrit.wikimedia.org/r/911407 (https://phabricator.wikimedia.org/T335320) (owner: 10Mhurd) [17:50:00] (PowerSupply) firing: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:28:19] robh: thanks [18:28:55] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [18:33:26] (03CR) 10Superpes15: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913225 (https://phabricator.wikimedia.org/T335705) (owner: 10MdsShakil) [18:47:25] (03PS1) 10Ottomata: page_content_change - bump image to v0.13.0 for bugfix [deployment-charts] - 10https://gerrit.wikimedia.org/r/913973 (https://phabricator.wikimedia.org/T332948) [18:50:11] (03PS2) 10Ottomata: page_content_change - bump image to v0.13.0 and disable jemalloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/913973 (https://phabricator.wikimedia.org/T332948) [18:55:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change - bump image to v0.13.0 and disable jemalloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/913973 (https://phabricator.wikimedia.org/T332948) (owner: 
10Ottomata) [18:56:10] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:56:16] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:56:56] (03PS1) 10Ottomata: page_content_change - env value should be string [deployment-charts] - 10https://gerrit.wikimedia.org/r/913974 [18:57:04] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_change - env value should be string [deployment-charts] - 10https://gerrit.wikimedia.org/r/913974 (owner: 10Ottomata) [18:58:08] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:58:12] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:03:13] (03PS1) 10Andrew Bogott: mwopenstackclients: make the new 'oscloud' arg the last option [puppet] - 10https://gerrit.wikimedia.org/r/913975 [19:03:23] (03PS1) 10Ottomata: page_content_size - set http session pool_maxsize to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/913976 [19:04:10] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: make the new 'oscloud' arg the last option [puppet] - 10https://gerrit.wikimedia.org/r/913975 (owner: 10Andrew Bogott) [19:04:14] (03PS2) 10Ottomata: page_content_size - set http session pool_maxsize to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/913976 [19:05:16] (03CR) 10Ottomata: [V: 03+2 C: 03+2] page_content_size - set http session pool_maxsize to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/913976 (owner: 10Ottomata) [19:06:05] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:06:10] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:20:37] 
(03PS1) 10Andrew Bogott: envscript: mirror any global cloud definitions into root's clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913978 (https://phabricator.wikimedia.org/T330759) [19:41:59] !log dancy@deploy1002 Installing scap version "4.52.0" for 593 hosts [19:42:57] !log dancy@deploy1002 Installation of scap version "4.52.0" completed for 593 hosts [19:49:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10Jclark-ctr) [19:52:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10Jclark-ctr) [19:56:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T2000). [20:00:05] bd808, tgr, and MdsShakil: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:30] o/ I can deploy but would prefer if someone else would [20:01:32] o/ It's been so long since I deployed code for others I'm afraid I would struggle [20:04:12] that's fine, no worries! is it possible to test your patch on an mwdebug server? [20:04:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912895 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [20:05:08] taavi: yes. 
We can test by rendering any block [20:05:17] (03Merged) 10jenkins-bot: Point SyntaxHighlight at /srv/app/pygmentize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/912895 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [20:05:34] !log taavi@deploy1002 Started scap: Backport for [[gerrit:912895|Point SyntaxHighlight at /srv/app/pygmentize (T320848)]] [20:05:38] T320848: Install pygments in Shellbox container with pip, not a Debian package - https://phabricator.wikimedia.org/T320848 [20:06:53] !log taavi@deploy1002 legoktm and taavi: Backport for [[gerrit:912895|Point SyntaxHighlight at /srv/app/pygmentize (T320848)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:07:09] bd808: pulled to mwdebug servers, please test [20:08:12] taavi: seems to be working as expected [20:08:18] cool, syncing [20:11:18] (03PS1) 10Andrew Bogott: observerenv: include in both global and root-local clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913981 (https://phabricator.wikimedia.org/T330759) [20:11:44] (03Abandoned) 10Andrew Bogott: envscript: mirror any global cloud definitions into root's clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913978 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:13:47] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:912895|Point SyntaxHighlight at /srv/app/pygmentize (T320848)]] (duration: 08m 12s) [20:13:50] T320848: Install pygments in Shellbox container with pip, not a Debian package - https://phabricator.wikimedia.org/T320848 [20:13:57] aand it's live [20:14:45] MdsShakil: ping [20:14:55] (03PS2) 10Andrew Bogott: observerenv: include in both global and root-local clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913981 (https://phabricator.wikimedia.org/T330759) [20:16:22] thanks taavi. seems to still work. The fun bits will be switching the shellbox container later today. 
:D [20:16:43] oooh, exciting [20:20:15] (03CR) 10Andrew Bogott: [C: 03+2] observerenv: include in both global and root-local clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913981 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:28:47] (03PS1) 10Eevans: sessionstore: upgrade eqiad servers to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913983 (https://phabricator.wikimedia.org/T335383) [20:30:44] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913983 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [20:33:31] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in eqiad: maintenance [20:34:54] (03PS1) 10Andrew Bogott: Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 [20:36:56] (03CR) 10CI reject: [V: 04-1] Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 (owner: 10Andrew Bogott) [20:38:01] (03PS2) 10Andrew Bogott: Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 [20:38:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in eqiad: maintenance [20:39:05] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade eqiad servers to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/913983 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [20:39:47] (03CR) 10Andrew Bogott: [C: 03+2] Remove unused role and profile for wmcs project- and home- nfs servers [puppet] - 10https://gerrit.wikimedia.org/r/911424 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [20:40:05] (03CR) 10CI reject: [V: 04-1] Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 (owner: 10Andrew Bogott) 
[20:41:05] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: Don't crash out if a dns lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/913200 (https://phabricator.wikimedia.org/T335336) (owner: 10Andrew Bogott) [20:42:04] 10SRE, 10Traffic, 10Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (10BCornwall) 05Open→03Stalled @bblack, @Vgutierrez is this patch by ema still something we'd like incorporated? [20:42:41] !log upgrading sessionstore1001 to Cassandra 3.11.14 — T335383 [20:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:45] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [20:45:06] !log upgrading sessionstore1002 to Cassandra 3.11.14 — T335383 [20:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:07] (03PS3) 10Andrew Bogott: Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 [20:46:09] (03PS1) 10Andrew Bogott: Create /root/.config/openstack before putting clouds.yaml in it [puppet] - 10https://gerrit.wikimedia.org/r/914007 [20:47:05] !log upgrading sessionstore1003 to Cassandra 3.11.14 — T335383 [20:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:47] (03CR) 10CI reject: [V: 04-1] Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 (owner: 10Andrew Bogott) [20:49:53] (03PS4) 10Andrew Bogott: Update a lot of mwopenstackclients uses to get creds from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/913985 [20:49:59] (03CR) 10Andrew Bogott: [C: 03+2] Create /root/.config/openstack before putting clouds.yaml in it [puppet] - 10https://gerrit.wikimedia.org/r/914007 (owner: 10Andrew Bogott) [20:57:01] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on 
an-worker1147:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=an-worker1147 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [20:57:34] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in eqiad: maintenance [20:58:53] Hi taavi [21:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230501T2100). [21:01:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10thcipriani) >>! In T335101#8795639, @Clement_Goubert wrote: > @thcipriani As approver for the `restricted` group, can you approve this request? Approved, sorry... [21:01:38] 10SRE, 10Traffic: Clean up Traffic Grafana dashboards to reflect HA-Proxy metrics - https://phabricator.wikimedia.org/T304153 (10BCornwall) 05In progress→03Invalid Marking as invalid as this is too vague to be actionable. Considering that we've been running haproxy for some time now and appear to have usef... 
[21:01:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10BCornwall) [21:02:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in eqiad: maintenance [21:02:45] (03PS1) 10Andrew Bogott: Revert "Remove unused role and profile for wmcs project- and home- nfs servers" [puppet] - 10https://gerrit.wikimedia.org/r/913987 [21:04:50] (03CR) 10CI reject: [V: 04-1] Revert "Remove unused role and profile for wmcs project- and home- nfs servers" [puppet] - 10https://gerrit.wikimedia.org/r/913987 (owner: 10Andrew Bogott) [21:06:37] (03PS2) 10Andrew Bogott: Partially revert "Remove unused role and profile for wmcs project-..." [puppet] - 10https://gerrit.wikimedia.org/r/913987 [21:06:39] (03PS2) 10Andrew Bogott: nfs-exportd.py: remove some dead code [puppet] - 10https://gerrit.wikimedia.org/r/913962 [21:08:31] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083] for row C switch upgrade - bking@cumin1001 - T334049 [21:08:31] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083] for row C switch upgrade - bking@cumin1001 - T334049 [21:08:35] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [21:08:49] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083]* for row C switch upgrade - bking@cumin1001 - T334049 [21:08:52] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083]* for row C switch upgrade - bking@cumin1001 - T334049 [21:08:57] (03CR) 10Andrew Bogott: [C: 03+2] Partially revert "Remove unused role and profile for wmcs project-..." 
[puppet] - 10https://gerrit.wikimedia.org/r/913987 (owner: 10Andrew Bogott) [21:10:05] 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10BCornwall) [21:15:12] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 17 hosts with reason: T334049 maint [21:15:16] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 [21:15:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 17 hosts with reason: T334049 maint [21:16:50] 10SRE, 10Traffic: Create CI for latency-measurement - https://phabricator.wikimedia.org/T318288 (10BCornwall) 05Open→03Invalid Closing as invalid since this utility isn't used very often. Perhaps, at a later date, we can re-explore this. [21:18:46] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10bking) [21:19:04] 10SRE, 10Traffic: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (10BCornwall) Hi, @Vgutierrez, does this ticket still need any work done or can it be closed? Thanks! [21:23:11] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 53999 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [21:25:59] 10SRE, 10Traffic: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10BCornwall) @Vgutierrez Would you consider this completed and ready to be closed? 
[21:27:21] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BCornwall) @Vgutierrez Would you consider this completed and ready... [21:29:44] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10BCornwall) Hi, @CDanis! Would you be so kind as to provide a description that helps describe the work to be done in this ticket? Thanks! [21:30:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:46:21] (03PS1) 10Legoktm: shellbox: Bump to 2023-05-01-213815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914014 (https://phabricator.wikimedia.org/T320848) [21:47:16] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [21:47:19] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [21:47:25] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [21:47:27] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [21:47:52] oops, I need to merge the change first [21:48:04] (03CR) 10Legoktm: [C: 03+2] shellbox: Bump to 2023-05-01-213815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/914014 (https://phabricator.wikimedia.org/T320848) (owner: 10Legoktm) [21:50:00] (PowerSupply) firing: Power Supply - Status - issue on mw1466:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=mw1466 - 
https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [21:51:59] legoktm: heh. I wish I could say I've not done that same step skip. [21:54:07] (Merged) jenkins-bot: shellbox: Bump to 2023-05-01-213815 [deployment-charts] - https://gerrit.wikimedia.org/r/914014 (https://phabricator.wikimedia.org/T320848) (owner: Legoktm) [21:55:15] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [21:55:41] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [21:55:47] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [21:55:58] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [21:56:05] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [21:56:18] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [21:56:24] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [21:56:39] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [21:56:45] !log legoktm@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [21:57:06] !log legoktm@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [21:57:14] ok, on staging now [21:57:21] let me live hack mwdebug1001 to point to staging [21:58:58] bd808: okay, mwdebug1001 should be pointing to the new image on staging (haven't done any cache clear yet) [21:59:45] the version is just cached in apcu, so should we just restart php-fpm on mwdebug1001 to clear it? 
[22:00:06] seems like that would work, yeah [22:01:05] done, but still seeing 2.11.2 on special:version [22:02:27] > MediaWiki\SyntaxHighlight\Pygmentize::getVersion(); [22:02:27] = "2.11.2" [22:02:27] > MediaWiki\SyntaxHighlight\Pygmentize::fetchVersion(); [22:02:27] = "2.15.1" [22:03:51] oh it's also in wan cache, hmm [22:04:33] previews of highlighting (on mw.o using mwdebug1001 via browser extension) don't seem to have the new upstream wrapping whitespace yet. [22:05:00] that was stuff I had to update the tests for -- https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SyntaxHighlight_GeSHi/+/906127/5/tests/parser/parserTests.txt#103 [22:07:04] (PS1) Bking: wdqs: use transferpy lib for data-transfer.py [cookbooks] - https://gerrit.wikimedia.org/r/914018 (https://phabricator.wikimedia.org/T321605) [22:07:27] I'm not sure how to deal with the fact that the version is cached in both apcu + wan cache [22:07:59] I think WAN means another server will fill it with 2.11.2 before mwdebug1001 can set 2.15.1 (plus we don't want that on other hosts yet...) [22:08:27] how about, I just live hack the cache key to be different on mwdebug1001 [22:08:50] legoktm: https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Sandbox is showing the json block with the comment formatted for me via mwdebug1001. That block is also showing the tags. Both are new with the new version. [22:09:25] oh cool [22:09:36] because those aren't cached and hit shellbox always :D [22:10:38] do you want to do any other testing or should we push it live now? [22:11:23] bd808: ^ [22:11:29] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2072608&oldid=2072583 [22:11:30] Hello everyone :) Tell me, I added a patch here, should I do something else? First time doing this :) [22:11:36] Let me try a lang=wikitext test page too just for grins. hang on [22:13:27] legoktm: :shipit: https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Sandbox/Wikitext looks right to me too. 
Compare to https://meta.wikimedia.beta.wmflabs.org/wiki/User:Bd808/Pygments2.15.0/Wikitext. [22:13:30] Iniquity: lgtm, just be around on IRC when the deployment window is scheduled [22:13:52] yes, taavi said it for me:)  thx [22:14:06] :) [22:14:09] ok! [22:14:38] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [22:15:29] https://grafana.wikimedia.org/d/3SiE86Nnz/mediawiki-shellouts?orgId=1&refresh=30s&viewPanel=12 [22:15:44] https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&refresh=30s [22:15:56] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [22:16:02] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [22:16:46] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [22:16:52] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [22:17:15] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [22:17:21] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [22:17:57] here we go! 
[22:18:13] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [22:18:17] (hopefully this is much less exciting and more routine in the future :D) [22:18:19] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [22:19:03] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [22:19:19] updating all of them is kind of noisy, but I guess for now that's how the service is designed [22:20:25] if one of them is going to be updated more frequently we can version it independently, but IMO this ends up being less work overall since they all benefit from shared deployments [22:21:21] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [22:22:09] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [22:22:15] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [22:22:39] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [22:22:45] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [22:23:12] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [22:23:18] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [22:23:48] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [22:23:54] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [22:24:37] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [22:28:56] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [22:34:50] legoktm: I guess now we wait for caches to age out? 
[22:34:56] pretty much [22:36:47] I'll be around for the next hour-ish, have the dashboards live updating in the background [22:36:54] the good news is that there's no cache stampede so far [22:40:26] (PS1) Dzahn: gerrit: add gerrit1003 to gerrit ssh_allowed hosts [puppet] - https://gerrit.wikimedia.org/r/914021 (https://phabricator.wikimedia.org/T326368) [22:45:07] (PS1) Jdlrobson: Pixel: Patches for latest release [skins/Vector] (wmf/1.41.0-wmf.6) - https://gerrit.wikimedia.org/r/914023 [22:49:22] (PS2) Dzahn: gerrit: add gerrit1003 to gerrit ssh_allowed hosts [puppet] - https://gerrit.wikimedia.org/r/914021 (https://phabricator.wikimedia.org/T326368) [22:54:26] (CR) Dzahn: [C: +2] gerrit: add gerrit1003 to gerrit ssh_allowed hosts [puppet] - https://gerrit.wikimedia.org/r/914021 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [22:54:33] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/914021/40974/gerrit1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/914021 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [23:11:30] legoktm: https://www.mediawiki.org/wiki/Special:Version is showing the new version! [23:12:22] :D [23:12:51] bunch of fetch_lexers shell outs too, so they should have the new list now too [23:13:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:13:32] https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&refresh=30s&viewPanel=36&from=now-6h&to=now [23:13:43] the new version might also be faster? 
[23:13:57] (might also just be a case of needing more data) [23:14:42] I did the .1 bump for some specific speed regression fixes. I didn't look to see what other perf changes upstream had made in the versions we skipped. [23:18:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:31:13] (DiskSpace) firing: Disk space an-airflow1001:9100:/ 5.997% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-airflow1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace