[00:02:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "fixed puppet run after follow-up change above." [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [00:05:03] (03CR) 10Dzahn: [C: 03+2] "after this puppet run works and was noop on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/875449 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [00:08:19] (03PS1) 10Dzahn: phabricator: remove vcs_addresses parameter and warnings [puppet] - 10https://gerrit.wikimedia.org/r/875450 (https://phabricator.wikimedia.org/T296022) [00:10:20] (03CR) 10Dzahn: [V: 03+1] phabricator: add systemd::tmpfile snippet for phd run dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [00:10:45] (03Abandoned) 10Dzahn: phabricator: add systemd::tmpfile snippet for phd run dir [puppet] - 10https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [00:12:21] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/875450/38973/" [puppet] - 10https://gerrit.wikimedia.org/r/875450 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [00:13:52] (03PS3) 10Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [00:14:11] (03CR) 10Dzahn: "manual rebase on top of my changes to previous patch in relation chain" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [00:16:05] (03PS4) 10Dzahn: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [00:18:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid 
[00:19:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:03:08] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10BBlack) Summarizing some of the lengthy IRC discussion and investigation on this topic (most of which was @Vgutierrez !): We seem to have a likely candidate mechanism for how this is happening, and it has to do with... [01:22:34] (03PS1) 10BBlack: Add transit_buffer patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) [01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:37:46] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:38:40] (03CR) 10BryanDavis: [C: 04-1] Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:16] (03CR) 10CI reject: [V: 04-1] Add transit_buffer patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 
10BBlack) [01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:46] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 
198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:52:32] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:52:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:54:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:54:12] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:59:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:00:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:31:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:31:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:37:26] (03PS1) 10Marostegui: db2151: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/875457 (https://phabricator.wikimedia.org/T326206) [06:38:55] (03CR) 10Marostegui: [C: 03+2] db2151: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/875457 (https://phabricator.wikimedia.org/T326206) (owner: 10Marostegui) [06:39:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2151 for the first time in s6 T326206', diff saved to https://phabricator.wikimedia.org/P42832 and previous config saved to /var/cache/conftool/dbconfig/20230105-063937-marostegui.json [06:39:41] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [06:41:47] (03PS1) 10Marostegui: db1176: Productionize db1176 [puppet] - 10https://gerrit.wikimedia.org/r/875458 (https://phabricator.wikimedia.org/T326211) [06:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 to clone db1176 T326211', diff saved to https://phabricator.wikimedia.org/P42833 and previous config saved to /var/cache/conftool/dbconfig/20230105-064153-marostegui.json [06:41:57] T326211: Install MariaDB 11 on db1176 - https://phabricator.wikimedia.org/T326211 [06:44:24] (03PS2) 10Marostegui: mariadb: Productionize db1176 into s1 [puppet] - 10https://gerrit.wikimedia.org/r/875458 (https://phabricator.wikimedia.org/T326211) [06:44:47] (03PS1) 10Marostegui: db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/875459 (https://phabricator.wikimedia.org/T326211) [06:45:50] (03CR) 
10Marostegui: [C: 03+2] db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/875459 (https://phabricator.wikimedia.org/T326211) (owner: 10Marostegui) [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0700) [07:00:04] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0700). [07:02:12] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875460 (https://phabricator.wikimedia.org/T320613) (owner: 10Awight) [07:08:44] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [07:08:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1176 into s1 [puppet] - 10https://gerrit.wikimedia.org/r/875458 (https://phabricator.wikimedia.org/T326211) (owner: 10Marostegui) [07:10:23] (03PS2) 10Awight: Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) [07:19:05] <_joe_> jouncebot: now [07:19:05] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0700) [07:19:06] For the next 0 hour(s) and 10 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0700) [07:19:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow rsyslog to process the apache logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [07:24:13] (03Merged) 10jenkins-bot: mediawiki: allow rsyslog to process the apache logs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [07:25:59] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:26:50] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:27:56] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:28:30] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:42:09] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/875353 (owner: 10Muehlenhoff) [07:50:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db2151 in s6 T326206', diff saved to https://phabricator.wikimedia.org/P42836 and previous config saved to /var/cache/conftool/dbconfig/20230105-075046-marostegui.json [07:50:51] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [07:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:58:00] (03PS1) 10Muehlenhoff: sre.swift.roll-restart-reboot-proxies: Also restart Envoy [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 [07:58:41] 10SRE, 10MediaWiki-libs-Rdbms, 10Performance-Team: Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10aaron) 
[07:58:43] !log installing glibc security updates on bullseye [07:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:08] 10SRE, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10larissagaulia) [08:00:04] Amir1, apergos, and jnuche: Time to snap out of that daydream and deploy UTC morning backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0800). [08:00:04] matthiasmullie: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:23] o/ [08:00:30] I'm not well enough to run the deployment window today, migraines for day 3 in a row [08:00:44] 10SRE, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10aaron) p:05Triage→03Lowest [08:00:49] I hope either Amir1 or jnuche is here [08:04:42] Ah, I can also self-service [08:04:58] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Prometheus Redis exporters [puppet] - 10https://gerrit.wikimedia.org/r/875470 (https://phabricator.wikimedia.org/T135991) [08:05:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875470 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:06:16] it would be good if someone managing the window were here in case things go awry and additional coordination is needed [08:07:42] certainly :) [08:07:58] (03PS4) 10Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 [08:11:30] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10ayounsi) 
Thanks for those great pictures! As a first step (and I think that's what you suggested previously!) it might be worth asking Interxion's remote hands if that'... [08:17:49] (03CR) 10JMeybohm: [C: 03+1] service::catalog: Add aux-k8s-ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [08:18:59] (03CR) 10WMDE-Fisch: [C: 03+1] [beta] Enable Kartographer "nearby" on mobile skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875460 (https://phabricator.wikimedia.org/T320613) (owner: 10Awight) [08:19:29] I'll reschedule for this afternoon [08:20:31] (03CR) 10Awight: [C: 03+2] "Deploying to beta." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875460 (https://phabricator.wikimedia.org/T320613) (owner: 10Awight) [08:21:20] (03Merged) 10jenkins-bot: [beta] Enable Kartographer "nearby" on mobile skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875460 (https://phabricator.wikimedia.org/T320613) (owner: 10Awight) [08:22:03] thanks matthiasmullie, sorry for the delay [08:22:15] oh no worries [08:22:21] hope you feel better soon! 
[08:22:52] (03CR) 10JMeybohm: Add a spark-operator chart and helmfile configuration (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [08:24:00] thanks :-) [08:27:33] (03PS1) 10Ayounsi: drmrs offload Vodafone from Tata [homer/public] - 10https://gerrit.wikimedia.org/r/875804 (https://phabricator.wikimedia.org/T324955) [08:29:58] (03PS1) 10Muehlenhoff: people: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875805 (https://phabricator.wikimedia.org/T135991) [08:30:00] (03PS1) 10Muehlenhoff: peopleweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/875806 (https://phabricator.wikimedia.org/T135991) [08:31:46] (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [08:34:11] (03CR) 10DCausse: [C: 03+1] cirrus: Disable incoming link counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862343 (https://phabricator.wikimedia.org/T317023) (owner: 10Ebernhardson) [08:36:40] (03CR) 10DCausse: "I think the purpose of this was to ultimately upload the bz2 files in hdfs but since you wrote the spark job that directly imports from el" [puppet] - 10https://gerrit.wikimedia.org/r/835705 (owner: 10Ebernhardson) [08:41:01] (03CR) 10Ayounsi: [C: 03+2] drmrs offload Vodafone from Tata [homer/public] - 10https://gerrit.wikimedia.org/r/875804 (https://phabricator.wikimedia.org/T324955) (owner: 10Ayounsi) [08:41:24] matthiasmullie: sorry I was late this morning. 
I can do the backport if you want [08:41:28] err config change [08:41:44] (03Merged) 10jenkins-bot: drmrs offload Vodafone from Tata [homer/public] - 10https://gerrit.wikimedia.org/r/875804 (https://phabricator.wikimedia.org/T324955) (owner: 10Ayounsi) [08:41:58] ah, yeah, that'd be great [08:42:18] (03PS5) 10Hashar: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [08:42:30] I have fixed the Bug header ;) [08:43:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [08:43:40] which nowadays is all about heading to the deployment server and issuing `scap backport 830877` ;) [08:44:17] (03Merged) 10jenkins-bot: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [08:44:41] !log hashar@deploy1002 Started scap: Backport for [[gerrit:830877|[SearchVue] Enable extension on ptwiki, ruwiki & idwiki (T310367)]] [08:44:44] T310367: [L] Deploy SearchVue - https://phabricator.wikimedia.org/T310367 [08:44:46] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [08:46:05] hashar: what was up with that bug header? was it some non-space character? [08:46:31] (03CR) 10David Caro: "Just a small change, but looks good!" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [08:46:36] !log hashar@deploy1002 hashar and mlitn: Backport for [[gerrit:830877|[SearchVue] Enable extension on ptwiki, ruwiki & idwiki (T310367)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:46:52] matthiasmullie: it used `Bug: 1234` which is for the legacy Bugzilla bugs [08:47:01] (03CR) 10MVernon: "I don't think the swift frontends (currently) run envoy, so I'm not sure why this change is necessary?" [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff) [08:47:09] it missed a `T` prefix in front of the number: `T1234` [08:47:10] T1234: Restrict Bugzilla access to read-only - https://phabricator.wikimedia.org/T1234 [08:47:10] ) [08:47:16] Ooh! I hadn't noticed the lack of T :p [08:47:19] the patch is on mwdebug servers [08:48:19] checking [08:49:10] (03CR) 10David Caro: [C: 03+2] alertmanager: format a bit nicer the default args [puppet] - 10https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [08:49:32] (03CR) 10David Caro: karma: add metrcsinfra alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [08:49:49] hashar: LGTM [08:50:09] lets fly ;) [08:53:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:56:19] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:830877|[SearchVue] Enable extension on ptwiki, ruwiki & idwiki (T310367)]] (duration: 11m 38s) [08:56:23] T310367: [L] Deploy 
SearchVue - https://phabricator.wikimedia.org/T310367 [08:57:11] matthiasmullie: done ,) [08:57:15] hashar: Thanks! [08:58:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:00:05] dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0900). [09:06:12] (03CR) 10David Caro: [C: 03+2] karma: add metrcsinfra alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [09:07:15] (03PS10) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) [09:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42837 and previous config saved to /var/cache/conftool/dbconfig/20230105-090831-root.json [09:11:24] (03PS1) 10Ayounsi: Revert "drmrs offload Vodafone from Tata" [homer/public] - 10https://gerrit.wikimedia.org/r/875384 [09:12:05] (03CR) 10Ayounsi: [C: 03+2] BGP for NTT in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/870904 (https://phabricator.wikimedia.org/T314929) (owner: 10Ayounsi) [09:12:14] (03CR) 10Hashar: [C: 03+1] phabricator: remove vcs_addresses parameter and warnings [puppet] - 10https://gerrit.wikimedia.org/r/875450 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [09:12:16] (03CR) 10Ayounsi: [C: 03+2] Revert "drmrs offload Vodafone from Tata" [homer/public] - 10https://gerrit.wikimedia.org/r/875384 (owner: 10Ayounsi) [09:12:54] (03Merged) 10jenkins-bot: Revert 
"drmrs offload Vodafone from Tata" [homer/public] - 10https://gerrit.wikimedia.org/r/875384 (owner: 10Ayounsi) [09:14:48] !log turn up BGP to NTT in drmrs - T314929 [09:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:58] (03PS1) 10Marostegui: Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/875385 [09:16:57] (03CR) 10Clément Goubert: [V: 03+1] service::catalog: Add aux-k8s-ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [09:19:47] (03CR) 10Clément Goubert: [V: 03+1] service::catalog: Add aux-k8s-ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [09:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42838 and previous config saved to /var/cache/conftool/dbconfig/20230105-092336-root.json [09:23:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:26:21] (03CR) 10Marostegui: [C: 03+2] Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/875385 (owner: 10Marostegui) [09:27:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: 
After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42839 and previous config saved to /var/cache/conftool/dbconfig/20230105-092738-root.json [09:28:44] (03PS1) 10Marostegui: db1176: Install MariaDB 11 [puppet] - 10https://gerrit.wikimedia.org/r/875808 (https://phabricator.wikimedia.org/T326211) [09:28:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:30:12] (03CR) 10Marostegui: [C: 03+2] db1176: Install MariaDB 11 [puppet] - 10https://gerrit.wikimedia.org/r/875808 (https://phabricator.wikimedia.org/T326211) (owner: 10Marostegui) [09:34:55] (03CR) 10Volans: sre.swift.roll-restart-reboot-proxies: Also restart Envoy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff) [09:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42840 and previous config saved to /var/cache/conftool/dbconfig/20230105-093842-root.json [09:41:11] (03CR) 10Muehlenhoff: sre.swift.roll-restart-reboot-proxies: Also restart Envoy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff) [09:42:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42841 and previous config saved to /var/cache/conftool/dbconfig/20230105-094243-root.json [09:53:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42843 and previous config saved to /var/cache/conftool/dbconfig/20230105-095347-root.json [09:56:19] (03PS1) 10MVernon: swift: Remove ms-be2050 from 
the rings [puppet] - 10https://gerrit.wikimedia.org/r/875811 (https://phabricator.wikimedia.org/T308677) [09:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42844 and previous config saved to /var/cache/conftool/dbconfig/20230105-095748-root.json [09:58:23] (03CR) 10MVernon: sre.swift.roll-restart-reboot-proxies: Also restart Envoy (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/875469 (owner: 10Muehlenhoff) [09:58:34] (03CR) 10MVernon: [C: 03+1] swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:02:25] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: Remove ms-be2050 from the rings [puppet] - 10https://gerrit.wikimedia.org/r/875811 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [10:03:01] (03CR) 10MVernon: [C: 03+2] swift: Remove ms-be2050 from the rings [puppet] - 10https://gerrit.wikimedia.org/r/875811 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [10:05:30] (03PS5) 10MVernon: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [10:05:49] jouncebot nowandnext [10:05:49] For the next 0 hour(s) and 54 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T0900) [10:05:49] In 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100) [10:05:49] In 0 hour(s) and 54 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100) [10:06:52] !log Restarting rolling reboot of api_appserver hosts in codfw [10:06:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:01] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:08:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42845 and previous config saved to /var/cache/conftool/dbconfig/20230105-100852-root.json [10:12:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42846 and previous config saved to /var/cache/conftool/dbconfig/20230105-101253-root.json [10:15:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38974/console" [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [10:17:16] (03CR) 10Clément Goubert: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/875812 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [10:20:39] (03PS6) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) [10:20:44] (03CR) 10Jbond: [C: 04-1] admin: add data type for UIDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [10:22:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:23:15] (03PS2) 10Giuseppe Lavagetto: [WIP] modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252) (owner: 10Ottomata) [10:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Pooling in s6', diff saved to https://phabricator.wikimedia.org/P42847 and previous config saved to 
/var/cache/conftool/dbconfig/20230105-102357-root.json [10:25:55] (03CR) 10Stevemunene: [C: 03+1] Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 (owner: 10Mforns) [10:26:26] !log Rolling reboot of api_appserver hosts in eqiad [10:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:33] (03PS6) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) [10:26:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:27:43] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [10:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42848 and previous config saved to /var/cache/conftool/dbconfig/20230105-102758-root.json [10:32:46] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:41] (03PS2) 10Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/875812 (https://phabricator.wikimedia.org/T288375) [10:42:46] (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:43:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42849 and previous config saved to 
/var/cache/conftool/dbconfig/20230105-104303-root.json [10:46:01] (03PS35) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [10:47:09] (03CR) 10Effie Mouzeli: [C: 03+1] redis: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868706 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:54:48] (03CR) 10Hashar: "Maybe instead of using systemd::override I should create the override file directly to workaround the circular dependency?" [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [10:56:10] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/875812 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [10:56:33] (03PS3) 10Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/870903 [10:56:35] (03PS2) 10Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/874908 [10:57:36] (03CR) 10CI reject: [V: 04-1] mediawiki: use the mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/874908 (owner: 10Giuseppe Lavagetto) [10:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After cloning db1176', diff saved to https://phabricator.wikimedia.org/P42850 and previous config saved to /var/cache/conftool/dbconfig/20230105-105808-root.json [11:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100). 
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100) [11:00:28] jouncebot: now [11:00:29] For the next 0 hour(s) and 59 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100) [11:00:29] For the next 0 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1100) [11:00:50] hashar: I'm rebooting api_appservers in eqiad, if you have a mw deployment to do [11:01:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:01:26] yeah I will hold a bit, I plan to deploy a Gerrit plugin update which would cause Gerrit to be unavailable for a minute or so [11:02:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:04:46] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/875812 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [11:07:01] (03PS1) 10Muehlenhoff: postgresql::user: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/875817 [11:10:42] (03Merged) 10jenkins-bot: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/875812 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [11:12:37] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:12:59] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:13:07] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:13:25] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:13:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:13:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:13:57] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:14:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:17:36] (03CR) 10Hashar: [C: 03+2] wm-checks-api: add support for Puppet Catalogue Compiler [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424 (owner: 10Hashar) [11:18:24] (03Merged) 10jenkins-bot: wm-checks-api: add support for Puppet Catalogue Compiler [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424 (owner: 10Hashar) [11:19:41] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=mwdebug,name=codfw [11:19:48] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:20:24] !log hashar@deploy1002 Started deploy [gerrit/gerrit@32f984a]: wm-checks-api: add support for Puppet Catalogue Compiler [11:20:26] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:20:35] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@32f984a]: wm-checks-api: add support for Puppet Catalogue Compiler (duration: 00m 10s) [11:22:43] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mwdebug,name=codfw [11:23:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:23:31] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=mwdebug,name=eqiad [11:23:53] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply 
[11:24:24] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=mwdebug,name=eqiad [11:24:35] .23 [11:25:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:12] (03PS1) 10Muehlenhoff: Extend memcached alias [puppet] - 10https://gerrit.wikimedia.org/r/875823 [11:27:46] (03CR) 10Effie Mouzeli: [C: 03+2] Puppet: Remove nutcracker and multi-dc redis [puppet] - 10https://gerrit.wikimedia.org/r/869183 (owner: 10Effie Mouzeli) [11:29:57] (03CR) 10Effie Mouzeli: [C: 03+2] Enable profile::auto_restarts::service for Prometheus Redis exporters [puppet] - 10https://gerrit.wikimedia.org/r/875470 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:30:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:22] (03PS3) 10David Caro: puppet-enc: added some tests for the api [puppet] - 10https://gerrit.wikimedia.org/r/875398 [11:33:24] (03PS1) 10David Caro: puppet-enc.py: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 [11:33:26] (03PS1) 10David Caro: puppet-enc: add tests to check if add_git_commit is called [puppet] - 10https://gerrit.wikimedia.org/r/875825 [11:34:55] (03CR) 10CI reject: [V: 04-1] puppet-enc.py: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 (owner: 10David Caro) [11:35:48] (03CR) 10CI reject: [V: 04-1] puppet-enc: add tests to check if add_git_commit is called [puppet] - 
10https://gerrit.wikimedia.org/r/875825 (owner: 10David Caro) [11:36:06] (03CR) 10CI reject: [V: 04-1] puppet-enc: added some tests for the api [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro) [11:40:57] !log disabling puppet on all hosts running mcrouter to merge 860102 [11:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:43:11] effie: you want me to stop rebooting appservers? [11:44:32] claime: I am sorry, I hadn't realised you were in the middle of such a thing! [11:44:39] No worries [11:44:49] I'll stop it for now [11:45:08] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [11:45:38] great thank you, I just need to stop puppet [11:45:57] gimme a sec [11:46:00] and then you can continue, I will enable it a bit later again [11:46:01] sure sure [11:47:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:39] effie: you can go ahead [11:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:52:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed
probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:03] claime: I am going to restart Gerrit; I am waiting for a few changes being processed by CI. [11:56:27] 10SRE, 10Infrastructure-Foundations, 10LDAP, 10Security, 10User-MoritzMuehlenhoff: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10LSobanski) [11:56:49] (03PS3) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 [11:56:50] hashar: ack, np, I don't rely on it for the reboots [11:57:08] It's more scap deployments that are disturbed by the reboots [11:57:24] !log Stopping Gerrit for plugin deployment [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:29] !log hashar@deploy1002 Started deploy [gerrit/gerrit@32f984a]: wm-checks-api: add support for Puppet Catalogue Compiler [11:57:39] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@32f984a]: wm-checks-api: add support for Puppet Catalogue Compiler (duration: 00m 09s) [11:58:35] !log Restarting Gerrit [11:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:22] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:59:41] done ;) [12:01:52] (03PS36) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [12:02:06] (03PS2) 10Effie Mouzeli: mediawiki: adapt releases to the changes upstream in puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/868005 (owner: 10Giuseppe Lavagetto) [12:02:34] (03PS1) 10David Caro: puppet-enc: add tests for add_git_commit [puppet] - 10https://gerrit.wikimedia.org/r/875866 [12:02:36] (03PS5) 10Slyngshede: 
Access Requests, allow users to request more permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/870747 [12:02:39] !log gerrit: running `copy-approvals` script to prepare for Gerrit 3.6 upgrade (T309870): `ssh -p 29418 gerrit.wikimedia.org gerrit copy-approvals --verbose` [12:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:43] T309870: Upgrade to Gerrit 3.6 - https://phabricator.wikimedia.org/T309870 [12:04:31] (03CR) 10CI reject: [V: 04-1] puppet-enc: add tests for add_git_commit [puppet] - 10https://gerrit.wikimedia.org/r/875866 (owner: 10David Caro) [12:04:45] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [12:06:16] (03PS4) 10David Caro: puppet-enc: added some tests for the api [puppet] - 10https://gerrit.wikimedia.org/r/875398 [12:06:18] (03PS2) 10David Caro: puppet-enc: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 [12:06:20] (03PS2) 10David Caro: puppet-enc: add tests to check if add_git_commit is called [puppet] - 10https://gerrit.wikimedia.org/r/875825 [12:06:22] (03PS2) 10David Caro: puppet-enc: add tests for add_git_commit [puppet] - 10https://gerrit.wikimedia.org/r/875866 [12:08:07] (03CR) 10CI reject: [V: 04-1] puppet-enc: rename so it can be imported and mocked [puppet] - 10https://gerrit.wikimedia.org/r/875824 (owner: 10David Caro) [12:08:52] (03CR) 10CI reject: [V: 04-1] puppet-enc: add tests to check if add_git_commit is called [puppet] - 10https://gerrit.wikimedia.org/r/875825 (owner: 10David Caro) [12:09:05] (03CR) 10CI reject: [V: 04-1] puppet-enc: added some tests for the api [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro) [12:09:54] (03CR) 10CI reject: [V: 04-1] puppet-enc: add tests for add_git_commit [puppet] - 10https://gerrit.wikimedia.org/r/875866 (owner: 10David Caro) 
[12:13:47] (03PS7) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [12:13:49] (03PS10) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [12:15:52] (03PS8) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [12:16:02] (03PS11) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [12:16:56] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [12:17:43] (03CR) 10Effie Mouzeli: [C: 03+2] P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [12:18:41] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [12:19:13] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [12:24:11] (03PS9) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [12:26:30] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [12:28:07] (03CR) 10Vgutierrez: "builds as expected in build2001, please add a new changelog entry" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [12:29:54] 10SRE, 10Data-Engineering, 
10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [12:30:41] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [12:31:00] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [12:31:56] !log ladsgroup: Deployed security patch for T233004 T326293 [12:31:59] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [12:32:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: adapt releases to the changes upstream in puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/868005 (owner: 10Giuseppe Lavagetto) [12:35:31] (03PS1) 10Ladsgroup: Add fix_creditssource_drifts.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) [12:37:30] (03Merged) 10jenkins-bot: mediawiki: adapt releases to the changes upstream in puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/868005 (owner: 10Giuseppe Lavagetto) [12:38:02] RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:16] (03PS1) 10Hashar: wm-checks-api: fix PCC handling of empty messages [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/875877 [12:41:05] (03CR) 10Marostegui: Add fix_creditssource_drifts.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) (owner: 10Ladsgroup) [12:41:44] (03CR) 10Hashar: [C: 03+2] wm-checks-api: fix PCC handling of empty messages [software/gerrit] 
(deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/875877 (owner: 10Hashar) [12:42:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:42:24] (03CR) 10Ladsgroup: Add fix_creditssource_drifts.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) (owner: 10Ladsgroup) [12:42:31] (03Merged) 10jenkins-bot: wm-checks-api: fix PCC handling of empty messages [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/875877 (owner: 10Hashar) [12:42:39] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:43:19] (03CR) 10Marostegui: [C: 03+1] Add fix_creditssource_drifts.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) (owner: 10Ladsgroup) [12:43:44] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:01] (03CR) 10Ladsgroup: [C: 03+2] Add fix_creditssource_drifts.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) (owner: 10Ladsgroup) [12:44:06] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:44:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:44:24] (03Merged) 10jenkins-bot: Add fix_creditssource_drifts.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875876 (https://phabricator.wikimedia.org/T326156) (owner: 10Ladsgroup) [12:44:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T326156)', diff 
saved to https://phabricator.wikimedia.org/P42851 and previous config saved to /var/cache/conftool/dbconfig/20230105-124437-ladsgroup.json [12:44:40] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [12:45:04] !log hashar@deploy1002 Started deploy [gerrit/gerrit@b1ae5b4]: wm-checks-api: fix PCC handling of empty messages [12:45:14] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b1ae5b4]: wm-checks-api: fix PCC handling of empty messages (duration: 00m 10s) [12:45:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:46:40] (03PS1) 10Giuseppe Lavagetto: mwdebug: change the mcrouter remote pools too [deployment-charts] - 10https://gerrit.wikimedia.org/r/875879 [12:46:43] <_joe_> effie: ^^ [12:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42852 and previous config saved to /var/cache/conftool/dbconfig/20230105-124651-ladsgroup.json [12:47:07] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mwdebug: change the mcrouter remote pools too [deployment-charts] - 10https://gerrit.wikimedia.org/r/875879 (owner: 10Giuseppe Lavagetto) [12:48:25] k [12:48:44] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:49:35] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:49:51] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:52:33] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:58:36] I am restarting Gerrit again for a plugin update [12:58:48] !log hashar@deploy1002 Started deploy [gerrit/gerrit@b1ae5b4]: wm-checks-api: fix PCC handling of empty messages [12:58:56] !log hashar@deploy1002 Finished deploy 
[gerrit/gerrit@b1ae5b4]: wm-checks-api: fix PCC handling of empty messages (duration: 00m 08s) [12:59:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:00:38] !log Restarted Gerrit for a plugin update [13:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:47] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:01:06] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:01:07] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:01:22] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 18 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:01:27] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:01:43] !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [13:01:43] !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [13:01:49] !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [13:01:49] !log oblivian@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [13:01:59] !log oblivian@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [13:01:59] !log oblivian@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [13:01:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to 
https://phabricator.wikimedia.org/P42853 and previous config saved to /var/cache/conftool/dbconfig/20230105-130158-ladsgroup.json [13:02:02] !log oblivian@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [13:02:03] !log oblivian@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [13:02:04] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:02:13] (03PS1) 10Phedenskog: prometheus: recording rules for webperf metrics. [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) [13:02:48] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:02:49] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:03:03] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:03:04] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:03:15] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:03:16] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:03:24] (03PS10) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [13:03:26] (03PS1) 10Jbond: P:cache::envoy: drop profile as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/875888 [13:03:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:06:59] (03PS2) 10Jbond: P:cache::envoy: drop profile as its no longer used [puppet] - 10https://gerrit.wikimedia.org/r/875888 [13:07:54] (03PS11) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [13:08:05] (03PS12) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 
(https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [13:08:42] <_joe_> jouncebot: now [13:08:42] No deployments scheduled for the next 0 hour(s) and 51 minute(s) [13:08:51] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 [13:09:14] (03CR) 10Muehlenhoff: C:ldap::client::utils remove ldapsupportlib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870524 (owner: 10Slyngshede) [13:09:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 (owner: 10Giuseppe Lavagetto) [13:09:54] 10SRE, 10SRE-swift-storage: All thumbnails on arywiki broken. Giving "unauthorized" error - https://phabricator.wikimedia.org/T326309 (10Bawolff) [13:12:43] (03PS2) 10Slyngshede: C:ldap::client::utils remove ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/870524 [13:13:20] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) [13:16:19] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) [13:16:25] (03PS2) 10Giuseppe Lavagetto: mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 [13:16:29] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) I had a look at the eqiad counters and updated the task description with what has increased. We should replace the optic on the 5 interfaces that stand out. 
[13:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42854 and previous config saved to /var/cache/conftool/dbconfig/20230105-131705-ladsgroup.json [13:19:30] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 (owner: 10Giuseppe Lavagetto) [13:20:00] (03CR) 10Muehlenhoff: C:ldap::client::utils remove ldapsupportlib (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/870524 (owner: 10Slyngshede) [13:21:37] !log enable puppet on all mw servers [13:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 (owner: 10Giuseppe Lavagetto) [13:22:45] (03Abandoned) 10Effie Mouzeli: modules::conftool add safe-service-restart scap option [puppet] - 10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: 10Effie Mouzeli) [13:24:24] (03PS1) 10Clément Goubert: mediawiki: enable geoip by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375) [13:25:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875817 (owner: 10Muehlenhoff) [13:25:34] (03CR) 10Muehlenhoff: [C: 03+2] postgresql::user: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/875817 (owner: 10Muehlenhoff) [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:26:45] 10SRE, 10Security: Disable agent 
forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (10LSobanski) @MoritzMuehlenhoff to see if I understand your most recent comment correctly, is production done and the remaining work is within WMCS only? [13:27:06] (03Merged) 10jenkins-bot: mediawiki: fix typo in rsyslog configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/875890 (owner: 10Giuseppe Lavagetto) [13:27:58] (03PS2) 10Clément Goubert: mediawiki: enable geoip by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375) [13:28:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869228 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:28:26] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:28:46] (03PS1) 10Matthias Mullie: Also get central description [extensions/SearchVue] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875906 (https://phabricator.wikimedia.org/T325831) [13:29:24] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:29:42] (03CR) 10Muehlenhoff: [C: 03+2] IDP: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/869228 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:01] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:30:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:30:11] (PS1) Ladsgroup: Enable write both for externallinks in ten largest s3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/875892 (https://phabricator.wikimedia.org/T321662)
[13:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42855 and previous config saved to /var/cache/conftool/dbconfig/20230105-133211-ladsgroup.json
[13:32:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[13:32:17] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156
[13:32:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[13:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T326156)', diff saved to https://phabricator.wikimedia.org/P42856 and previous config saved to /var/cache/conftool/dbconfig/20230105-133234-ladsgroup.json
[13:33:20] (PS1) Muehlenhoff: Turnilo: Enable profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/875893 (https://phabricator.wikimedia.org/T135991)
[13:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T326156)', diff saved to https://phabricator.wikimedia.org/P42857 and previous config saved to /var/cache/conftool/dbconfig/20230105-133448-ladsgroup.json
[13:36:03] jouncebot: nowandnext
[13:36:04] No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[13:36:04] In 0 hour(s) and 23 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1400)
[13:36:04] In 0 hour(s) and 23 minute(s): UTC afternoon backport
window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1400)
[13:36:10] (CR) Ladsgroup: [C: +2] Enable write both for externallinks in ten largest s3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/875892 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[13:36:33] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/875892 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[13:36:53] (Merged) jenkins-bot: Enable write both for externallinks in ten largest s3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/875892 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[13:37:21] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:875892|Enable write both for externallinks in ten largest s3 wikis (T321662)]]
[13:37:24] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[13:38:00] !log start [eqiad] faulty VC optics maintenance - T325803
[13:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:03] T325803: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803
[13:39:14] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:875892|Enable write both for externallinks in ten largest s3 wikis (T321662)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:40:35] (Abandoned) Effie Mouzeli: P:mediawiki::mcrouter_wancache: add gutter pools for /*/mw-wan keys [puppet] - https://gerrit.wikimedia.org/r/864853 (https://phabricator.wikimedia.org/T258779) (owner: Effie Mouzeli)
[13:40:42] (CR) Slyngshede: C:ldap::client::utils remove ldapsupportlib (3 comments) [puppet] - https://gerrit.wikimedia.org/r/870524 (owner: Slyngshede)
[13:42:29] !log aswikiquote: Run
importDump.php to import a XML dump (per new wiki importers request, running into issues with a largish page)
[13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:22] PROBLEM - Juniper virtual chassis ports on asw2-a-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[13:45:44] (PS1) Effie Mouzeli: P:mediawiki::mcrouter_wancache: add gutter pools for /*/mw-wan keys [puppet] - https://gerrit.wikimedia.org/r/875894 (https://phabricator.wikimedia.org/T258779)
[13:46:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:875892|Enable write both for externallinks in ten largest s3 wikis (T321662)]] (duration: 08m 54s)
[13:46:18] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[13:46:19] (CR) FNegri: [C: +1] "I have a slight preference for using the '??' in the hostname, but I'm also happy for this to be merged as it is."
[puppet] - https://gerrit.wikimedia.org/r/869816 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[13:49:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42858 and previous config saved to /var/cache/conftool/dbconfig/20230105-134955-ladsgroup.json
[13:51:22] RECOVERY - Juniper virtual chassis ports on asw2-a-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[13:51:34] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond)
[13:53:09] (CR) Jbond: [C: +2] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond)
[13:53:21] (PS17) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224
[13:53:51] (CR) Effie Mouzeli: [C: +1] Extend memcached alias [puppet] - https://gerrit.wikimedia.org/r/875823 (owner: Muehlenhoff)
[13:58:00] !log start of externallinks migration in elwiki (and rest of large wikis in s3) (T326314)
[13:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:04] T326314: Run maint script backfilling new fields - https://phabricator.wikimedia.org/T326314
[13:58:24] (PS5) Majavah: openstack: encapi: open up write access [puppet] - https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478)
[13:58:26] (PS2) Majavah: hieradata: use port 443 for enc access [puppet] - https://gerrit.wikimedia.org/r/874894
[13:58:28] (PS5) Majavah: openstack: encapi: drop legacy ports [puppet] - https://gerrit.wikimedia.org/r/874814
[13:58:30] (PS1) Majavah: cloudlib: support https for fetching data [puppet] - https://gerrit.wikimedia.org/r/875896
[13:58:41] (PS1) Effie Mouzeli: ipsec: remove ipsec role and the strongswan module [puppet] -
https://gerrit.wikimedia.org/r/875897
[13:59:00] (PS1) Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875907
[13:59:12] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[13:59:47] SRE, Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (MoritzMuehlenhoff) >>! In T198138#8501425, @LSobanski wrote: > @MoritzMuehlenhoff to see if I understand your most recent comment correctly, is production done and the remaining work is within WMCS onl...
[14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1400)
[14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1400)
[14:00:05] matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:23] o/
[14:01:18] (CR) Muehlenhoff: [C: +2] Extend memcached alias [puppet] - https://gerrit.wikimedia.org/r/875823 (owner: Muehlenhoff)
[14:01:29] (CR) Hashar: [C: +1] "That is awesome thank you so much :)" [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond)
[14:02:16] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 18 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:02:56] (PS3) Slyngshede: C:ldap::client::utils remove ldapsupportlib [puppet] - https://gerrit.wikimedia.org/r/870524
[14:03:45] (PS1) Majavah: hieradata: disable agent forwarding in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138)
[14:05:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42859 and previous config saved to /var/cache/conftool/dbconfig/20230105-140501-ladsgroup.json
[14:05:43] (PS2) Jelto: gitlab: stop using "latest" backup name [puppet] - https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463)
[14:06:02] (PS2) Effie Mouzeli: ipsec: remove ipsec role and the strongswan module [puppet] - https://gerrit.wikimedia.org/r/875897
[14:06:25] marostegui: you can self-serve. Correct?
[14:06:51] sorry, wrong ping
[14:06:55] matthiasmullie:
[14:06:58] Yes, happy to
[14:07:02] Starting
[14:07:06] (CR) Jbond: [C: +2] "recived configmation from Valentin via irc" [puppet] - https://gerrit.wikimedia.org/r/875888 (owner: Jbond)
[14:07:28] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [extensions/SearchVue] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875906 (https://phabricator.wikimedia.org/T325831) (owner: Matthias Mullie)
[14:07:51] (PS12) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365
[14:07:58] (PS1) Matthias Mullie: Also get central description [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875908 (https://phabricator.wikimedia.org/T325831)
[14:08:10] (PS1) Matthias Mullie: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875909
[14:09:28] (Merged) jenkins-bot: Also get central description [extensions/SearchVue] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875906 (https://phabricator.wikimedia.org/T325831) (owner: Matthias Mullie)
[14:09:30] (CR) Muehlenhoff: [C: +1] "Good riddance :-)" [puppet] - https://gerrit.wikimedia.org/r/870524 (owner: Slyngshede)
[14:09:53] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:875906|Also get central description (T325831)]]
[14:09:56] T325831: [QA task] Special:Search quick preview - local vs central description - https://phabricator.wikimedia.org/T325831
[14:10:06] (CR) Majavah: [C: -1] "likely need to update documentation like https://wikitech.wikimedia.org/wiki/Help:Putty first" [puppet] - https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: Majavah)
[14:11:43] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:875906|Also get central description (T325831)]] synced to the testservers:
mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:13:24] PROBLEM - Juniper virtual chassis ports on asw2-c-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:15:41] (PS13) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365
[14:15:46] SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (ayounsi)
[14:16:48] (CR) Muehlenhoff: [C: +1] "Fantastic!" [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[14:17:26] (PS14) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365
[14:17:34] (Abandoned) Hashar: systemd::unit: support multiple overrides [puppet] - https://gerrit.wikimedia.org/r/875347 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[14:17:50] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:875906|Also get central description (T325831)]] (duration: 07m 57s)
[14:17:53] T325831: [QA task] Special:Search quick preview - local vs central description - https://phabricator.wikimedia.org/T325831
[14:17:59] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875908 (https://phabricator.wikimedia.org/T325831) (owner: Matthias Mullie)
[14:18:13] (PS15) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365
[14:18:42] (CR) Majavah: [C: -1] ipsec: remove ipsec role and the strongswan module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[14:20:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1100 (T326156)', diff saved to https://phabricator.wikimedia.org/P42860 and previous config saved to /var/cache/conftool/dbconfig/20230105-142008-ladsgroup.json
[14:20:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[14:20:12] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156
[14:20:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[14:20:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T326156)', diff saved to https://phabricator.wikimedia.org/P42861 and previous config saved to /var/cache/conftool/dbconfig/20230105-142029-ladsgroup.json
[14:21:02] (Merged) jenkins-bot: Also get central description [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875908 (https://phabricator.wikimedia.org/T325831) (owner: Matthias Mullie)
[14:21:24] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:21:28] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:875908|Also get central description (T325831)]]
[14:22:07] (CR) Filippo Giunchedi: "Very good start!
See inline, LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: Phedenskog)
[14:22:13] (CR) Ottomata: [C: +1] [WIP] modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252) (owner: Ottomata)
[14:22:44] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38982/console" [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond)
[14:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T326156)', diff saved to https://phabricator.wikimedia.org/P42862 and previous config saved to /var/cache/conftool/dbconfig/20230105-142244-ladsgroup.json
[14:23:17] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:875908|Also get central description (T325831)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[14:23:19] T325831: [QA task] Special:Search quick preview - local vs central description - https://phabricator.wikimedia.org/T325831
[14:24:20] SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) Everything has been replaced, thanks @Jclark-ctr! I'll check it on Monday to see if there are any ongoing errors and close it if it's good!
[14:25:31] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38983/console" [puppet] - https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[14:29:00] (PS22) Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[14:29:20] (PS3) Ottomata: modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252)
[14:29:54] (CR) Filippo Giunchedi: "LGTM (for the alerting/metrics bits)" [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[14:29:56] (CR) Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[14:30:01] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:875908|Also get central description (T325831)]] (duration: 08m 32s)
[14:30:04] T325831: [QA task] Special:Search quick preview - local vs central description - https://phabricator.wikimedia.org/T325831
[14:30:34] During scap, I just had 2 php-fpm-restarts failures:
[14:30:41] 14:29:20 /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 (ran as mwdeploy@mw1485.eqiad.wmnet) returned [255]: ssh: connect to host mw1485.eqiad.wmnet port 22: No route to host
[14:30:41] 14:29:21 /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 (ran as mwdeploy@mw1483.eqiad.wmnet) returned [255]: ssh: connect to host mw1483.eqiad.wmnet port 22: No route to host
[14:30:41] 14:30:00 php-fpm-restart: 100% (in-flight: 0; ok: 293; fail: 2; left: 0)
[14:30:42] 14:30:00 2 hosts had
failures restarting php-fpm
[14:31:02] (CR) Hashar: [C: +1] systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond)
[14:31:16] (PS6) Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576)
[14:31:17] (I have 2 more patches to scap - should I continue, or wait?)
[14:31:24] (PS23) Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[14:31:30] ping Amir1 ^
[14:31:32] (PS15) Ottomata: flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[14:32:34] (CR) Ottomata: [C: +2] modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252) (owner: Ottomata)
[14:32:39] (CR) CI reject: [V: -1] flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[14:32:50] matthiasmullie: Probably my reboots
[14:32:52] Sorry
[14:33:06] Give me a second to stop them
[14:33:06] Ah, ok, will resume then!
[14:33:12] ah, no worries
[14:33:19] go ahead, LMK when you're done
[14:33:57] (PS3) Effie Mouzeli: ipsec: remove ipsec role and the strongswan module [puppet] - https://gerrit.wikimedia.org/r/875897
[14:37:32] (Merged) jenkins-bot: modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252) (owner: Ottomata)
[14:37:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42864 and previous config saved to /var/cache/conftool/dbconfig/20230105-143751-ladsgroup.json
[14:38:24] matthiasmullie: thanks, should be done relatively shortly
[14:39:46] (CR) JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[14:39:52] (PS7) Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576)
[14:40:08] (PS24) Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[14:40:17] (PS16) Ottomata: flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[14:40:33] SRE, Infrastructure-Foundations, netops: Add per-output queue graphing for Juniper network devices in LibreNMS - https://phabricator.wikimedia.org/T326322 (cmooney) p:Triage→Medium
[14:41:28] (CR) CI reject: [V: -1] flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[14:43:59] (CR) MVernon: [C: +2]
swift: move ms-be2050 to new naming schema [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:44:21] (PS4) Hashar: gerrit: make Apache wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125)
[14:44:27] (PS1) Giuseppe Lavagetto: kafka-logging: we now support ipv6 [puppet] - https://gerrit.wikimedia.org/r/875902 (https://phabricator.wikimedia.org/T271138)
[14:46:00] SRE: Updates of passwords of users created with postgresql::user / PostgreSQL change to scram-sha256 - https://phabricator.wikimedia.org/T326325 (MoritzMuehlenhoff)
[14:46:07] SRE: Updates of passwords of users created with postgresql::user / PostgreSQL change to scram-sha256 - https://phabricator.wikimedia.org/T326325 (MoritzMuehlenhoff) p:Triage→Medium
[14:47:26] SRE, Infrastructure-Foundations, CAS-SSO, User-jbond: CLI tools for CAS administration - https://phabricator.wikimedia.org/T233940 (jbond) also possibly useful: https://apereo.github.io/cas/6.5.x/installation/Configuring-Commandline-Shell.html
[14:47:37] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[14:47:56] SRE, SRE-swift-storage: All thumbnails on arywiki broken. Giving "unauthorized" error - https://phabricator.wikimedia.org/T326309 (Ladsgroup) Open→Resolved a:Ladsgroup I remember this wiki was not very cooperative when we were creating it. Something must have messed up there. Running this fi...
[14:48:04] (PS1) Giuseppe Lavagetto: mw-debug: run rsyslog in debug mode [deployment-charts] - https://gerrit.wikimedia.org/r/875903
[14:49:55] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Zabe)
[14:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42865 and previous config saved to /var/cache/conftool/dbconfig/20230105-145257-ladsgroup.json
[14:55:14] Looks like I'll overrun the backport window - I have 2 more patches to scap. AFAICT, nothing's scheduled after this, so I can probably just continue once claime is done? Any objections?
[14:55:29] matthiasmullie: I am so sorry, there's a server not coming back
[14:55:52] No worries, go ahead :)
[14:56:02] PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:17] There it is
[14:56:30] !log hard resetting mw1486
[14:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:23] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
[14:58:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[14:58:53] If you can do with one server being down, it should be ok in a bit
[14:59:02] But this particular one seems quite broken
[14:59:09] UEFI0060: Power required by the system exceeds the power supplied by the Power
[14:59:11] Supply Units (PSUs).
[14:59:14] :(
[14:59:29] is it inactive in conftool?
[14:59:51] claime: Sure, just ping me when it's safe to resume!
[14:59:56] (PS1) JMeybohm: Skip empty fixture files [deployment-charts] - https://gerrit.wikimedia.org/r/875947
[15:00:08] taavi: It isn't, I managed to make it boot
[15:00:30] RECOVERY - Host mw1486 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[15:01:18] I have 3 currently rebooting, 3 next, and then it's done
[15:02:09] (CR) Muehlenhoff: [C: +1] ipsec: remove ipsec role and the strongswan module [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[15:03:59] (CR) Giuseppe Lavagetto: [C: +2] mw-debug: run rsyslog in debug mode [deployment-charts] - https://gerrit.wikimedia.org/r/875903 (owner: Giuseppe Lavagetto)
[15:04:45] (CR) Muehlenhoff: "I'm totally in favour, but in T198138 it was mentioned Toolforge needs it, but that was five year ago and is likely obsolete?" [puppet] - https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: Majavah)
[15:05:36] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 163 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:06:13] SRE, SRE-swift-storage: All thumbnails on arywiki broken. Giving "unauthorized" error - https://phabricator.wikimedia.org/T326309 (Maurusian) Thank you very much!
[15:07:10] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T326156)', diff saved to https://phabricator.wikimedia.org/P42866 and previous config saved to /var/cache/conftool/dbconfig/20230105-150804-ladsgroup.json
[15:08:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[15:08:08] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156
[15:08:18] (Merged) jenkins-bot: mw-debug: run rsyslog in debug mode [deployment-charts] - https://gerrit.wikimedia.org/r/875903 (owner: Giuseppe Lavagetto)
[15:08:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[15:08:22] (PS17) Ottomata: flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[15:08:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42867 and previous config saved to /var/cache/conftool/dbconfig/20230105-150825-ladsgroup.json
[15:09:14] (CR) CI reject: [V: -1] flink-app chart [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[15:09:23] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:09:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42868 and previous config saved to
/var/cache/conftool/dbconfig/20230105-150939-ladsgroup.json
[15:09:46] (PS5) David Caro: puppet-enc: added some tests for the api [puppet] - https://gerrit.wikimedia.org/r/875398
[15:09:48] (PS3) David Caro: puppet-enc: rename so it can be imported and mocked [puppet] - https://gerrit.wikimedia.org/r/875824
[15:09:50] (PS3) David Caro: puppet-enc: add tests to check if add_git_commit is called [puppet] - https://gerrit.wikimedia.org/r/875825
[15:09:52] (PS3) David Caro: puppet-enc: add tests for add_git_commit [puppet] - https://gerrit.wikimedia.org/r/875866
[15:10:02] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:10:11] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:10:38] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:11:02] (CR) Ottomata: flink-app chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[15:12:41] (CR) CI reject: [V: -1] puppet-enc: added some tests for the api [puppet] - https://gerrit.wikimedia.org/r/875398 (owner: David Caro)
[15:13:47] (CR) Hashar: phabricator: change phd home dir to /var/lib/phd (2 comments) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[15:14:00] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
[15:14:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[15:19:12] (CR) Hashar: "The WMCS project has a single instance https://openstack-browser.toolforge.org/project/extdist:" [puppet] - https://gerrit.wikimedia.org/r/852906 (https://phabricator.wikimedia.org/T293055) (owner: Hashar)
[15:19:23] (CR) Hashar: [C: +1] extdist: remove integration/composer.git [puppet] - https://gerrit.wikimedia.org/r/852906
(https://phabricator.wikimedia.org/T293055) (owner: Hashar)
[15:20:14] (PS1) David Caro: grafana: remove home test [puppet] - https://gerrit.wikimedia.org/r/875957
[15:21:21] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/875902 (https://phabricator.wikimedia.org/T271138) (owner: Giuseppe Lavagetto)
[15:22:03] (CR) Herron: [C: +1] kafka-logging: we now support ipv6 [puppet] - https://gerrit.wikimedia.org/r/875902 (https://phabricator.wikimedia.org/T271138) (owner: Giuseppe Lavagetto)
[15:22:05] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/875957 (owner: David Caro)
[15:22:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[15:22:43] matthiasmullie: I'm done!
[15:22:46] (CR) Giuseppe Lavagetto: [C: +2] kafka-logging: we now support ipv6 [puppet] - https://gerrit.wikimedia.org/r/875902 (https://phabricator.wikimedia.org/T271138) (owner: Giuseppe Lavagetto)
[15:22:56] matthiasmullie: sorry for the disturbance again
[15:23:20] (PS3) Clément Goubert: mediawiki: enable geoip by default [deployment-charts] - https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375)
[15:23:25] (PS6) David Caro: puppet-enc: added some tests for the api [puppet] - https://gerrit.wikimedia.org/r/875398
[15:23:27] (PS4) David Caro: puppet-enc: rename so it can be imported and mocked [puppet] - https://gerrit.wikimedia.org/r/875824
[15:23:29] (PS4) David Caro: puppet-enc: add tests to check if add_git_commit is called [puppet] - https://gerrit.wikimedia.org/r/875825
[15:23:31] (PS4) David Caro: puppet-enc: add tests for add_git_commit [puppet] - https://gerrit.wikimedia.org/r/875866
[15:23:46] claime: Thanks, and don't worry about it :) I'll resume backports!
[15:23:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/SearchVue] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875907 (owner: 10Matthias Mullie) [15:24:01] (03CR) 10Majavah: [C: 03+1] "looks good, sorry I missed this" [puppet] - 10https://gerrit.wikimedia.org/r/875957 (owner: 10David Caro) [15:24:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42869 and previous config saved to /var/cache/conftool/dbconfig/20230105-152447-ladsgroup.json [15:26:29] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:26:48] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:28:35] (03CR) 10MVernon: hiera: move swift accounts_keys into common (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:29:16] (03Merged) 10jenkins-bot: Fix URL construction [extensions/SearchVue] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875907 (owner: 10Matthias Mullie) [15:29:42] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:875907|Fix URL construction]] [15:31:01] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:31:07] (03PS25) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [15:31:29] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:875907|Fix URL construction]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [15:32:46] (JobUnavailable) firing: Reduced 
availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:00] (03PS14) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [15:34:21] (03PS4) 10Clément Goubert: mediawiki: enable geoip by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375) [15:34:43] (03PS2) 10Ssingh: Release 6.0.10-1wm3 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [15:37:47] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:875907|Fix URL construction]] (duration: 08m 04s) [15:38:03] (03PS2) 10Phedenskog: prometheus: recording rules for webperf metrics. [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) [15:38:51] (03CR) 10CI reject: [V: 04-1] prometheus: recording rules for webperf metrics. 
[puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [15:39:38] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:875907|Fix URL construction]] [15:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42870 and previous config saved to /var/cache/conftool/dbconfig/20230105-153956-ladsgroup.json [15:41:26] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:875907|Fix URL construction]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [15:42:46] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:43:28] (03PS3) 10Phedenskog: prometheus: recording rules for webperf metrics. [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) [15:44:45] (03CR) 10Majavah: [C: 04-1] hieradata: disable agent forwarding in eqiad1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [15:45:50] (03CR) 10Phedenskog: prometheus: recording rules for webperf metrics. 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [15:45:54] (03PS18) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:46:52] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:51:29] (03CR) 10David Caro: [C: 03+2] cloudweb: fix typo for labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/869235 (owner: 10David Caro) [15:51:45] (03CR) 10Muehlenhoff: [C: 03+1] hieradata: disable agent forwarding in eqiad1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [15:51:59] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:875907|Fix URL construction]] (duration: 12m 21s) [15:52:03] (03PS19) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:52:08] !log UTC afternoon backports done [15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:54] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:53:52] (03CR) 10Stevemunene: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/875893 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42871 and previous config saved to /var/cache/conftool/dbconfig/20230105-155503-ladsgroup.json [15:55:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance 
[15:55:07] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [15:55:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42872 and previous config saved to /var/cache/conftool/dbconfig/20230105-155524-ladsgroup.json [15:55:52] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: enable geoip by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [15:56:51] (03CR) 10Vgutierrez: [C: 03+1] Release 6.0.10-1wm3 (031 comment) [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [15:57:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42873 and previous config saved to /var/cache/conftool/dbconfig/20230105-155738-ladsgroup.json [15:58:55] (03CR) 10CI reject: [V: 04-1] Release 6.0.10-1wm3 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [15:58:57] (03CR) 10BryanDavis: [C: 03+1] "+1 assuming someone takes on the community announce and documentation update. 
There probably will be a few folks who find that this breaks" [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [16:01:19] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/875315/1558/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [16:01:40] (03Merged) 10jenkins-bot: mediawiki: enable geoip by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/875891 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [16:03:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:04:34] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:04:43] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:05:18] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:09:02] (03PS4) 10Ghybu: Add "commons_wordmark" for kuwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875914 (https://phabricator.wikimedia.org/T326067) [16:10:43] (03CR) 10Muehlenhoff: [C: 03+2] Turnilo: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875893 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42874 and previous config saved to /var/cache/conftool/dbconfig/20230105-161245-ladsgroup.json [16:13:18] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Release 6.0.10-1wm3 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/875452 (https://phabricator.wikimedia.org/T325797) (owner: 10BBlack) [16:14:14] (03PS1) 10Muehlenhoff: Also include the Turnilo staging host in the analytics-tools alias [puppet] - 10https://gerrit.wikimedia.org/r/875970 [16:17:08] 
10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Clement_Goubert) [16:21:32] (03PS1) 10Muehlenhoff: Rename alias [puppet] - 10https://gerrit.wikimedia.org/r/875971 [16:27:15] (03CR) 10JMeybohm: "Just some naming nits left I suppose" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:27:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42875 and previous config saved to /var/cache/conftool/dbconfig/20230105-162751-ladsgroup.json [16:42:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [16:42:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42876 and previous config saved to /var/cache/conftool/dbconfig/20230105-164258-ladsgroup.json [16:43:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:43:02] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [16:43:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:43:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:43:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:43:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 
clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:43:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:43:53] (03CR) 10Clément Goubert: [C: 03+2] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [16:43:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T326156)', diff saved to https://phabricator.wikimedia.org/P42877 and previous config saved to /var/cache/conftool/dbconfig/20230105-164358-ladsgroup.json [16:46:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T326156)', diff saved to https://phabricator.wikimedia.org/P42878 and previous config saved to /var/cache/conftool/dbconfig/20230105-164612-ladsgroup.json [16:46:13] (03Merged) 10jenkins-bot: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [16:46:38] 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10Clement_Goubert) [16:49:12] (03PS1) 10Jgiannelos: maps: Bootstrap tile storage based on prod objects [puppet] - 10https://gerrit.wikimedia.org/r/875973 [16:54:09] (03CR) 10Volans: sre.mediawiki.restart-appservers: Fix clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [16:55:47] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) 
(owner: 10Ottomata) [16:55:51] (03PS20) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [16:56:44] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:58:15] (03PS26) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [16:59:06] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:59:11] (03PS21) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [17:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:10] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:01:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42880 and previous config saved to /var/cache/conftool/dbconfig/20230105-170119-ladsgroup.json [17:02:38] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10VirginiaPoundstone) >>! In T216815#8472047, @Andrew wrote: > Huh, is anyone tasked with this? This is one of the few cases that's keeping Stretch alive in cloud-v... 
[17:16:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42882 and previous config saved to /var/cache/conftool/dbconfig/20230105-171626-ladsgroup.json [17:20:14] (03CR) 10JMeybohm: "one rename missing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:27:48] (03CR) 10Dzahn: [C: 03+2] peopleweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/875806 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:27:56] (03CR) 10Dzahn: [C: 03+2] people: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875805 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:29:32] (03PS2) 10Dzahn: peopleweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/875806 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T326156)', diff saved to https://phabricator.wikimedia.org/P42883 and previous config saved to /var/cache/conftool/dbconfig/20230105-173133-ladsgroup.json [17:31:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [17:31:38] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [17:31:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db1185.eqiad.wmnet with reason: Maintenance [17:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T326156)', diff saved to https://phabricator.wikimedia.org/P42884 and previous config saved to /var/cache/conftool/dbconfig/20230105-173154-ladsgroup.json [17:34:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T326156)', diff saved to https://phabricator.wikimedia.org/P42885 and previous config saved to /var/cache/conftool/dbconfig/20230105-173408-ladsgroup.json [17:35:20] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) >>! In T324555#8500823, @ayounsi wrote: > Thanks for those great pictures! > > As a first step (and I think that's what you suggested previously!) it might be wor... [17:35:47] (03CR) 10Dzahn: "tested :)" [puppet] - 10https://gerrit.wikimedia.org/r/875806 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:36:46] (03CR) 10Dzahn: admin: add data type for UIDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [17:47:18] (03PS2) 10Dzahn: admin: add data types to validate UIDs [puppet] - 10https://gerrit.wikimedia.org/r/875446 [17:48:03] (03CR) 10Dzahn: admin: add data types to validate UIDs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [17:48:08] (03CR) 10CI reject: [V: 04-1] admin: add data types to validate UIDs [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [17:49:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42886 and previous config saved to /var/cache/conftool/dbconfig/20230105-174915-ladsgroup.json [17:49:33] (03PS3) 10Dzahn: admin: add data types to validate UIDs [puppet] - 10https://gerrit.wikimedia.org/r/875446 [17:51:48] (03PS3) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 
10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) [17:55:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/875450/38984/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/875450 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:57:15] 10Puppet, 10SRE, 10Infrastructure-Foundations: Knock down puppet 4 deprecation warnings - https://phabricator.wikimedia.org/T193664 (10herron) [17:57:28] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10herron) [17:57:32] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171 (10herron) [17:57:38] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (10herron) [17:57:51] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10herron) [17:57:58] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10herron) [17:58:20] 10SRE, 10Growth-Team, 10Growth-Team-Filtering, 10Infrastructure-Foundations, and 3 others: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? 
- https://phabricator.wikimedia.org/T202329 (10herron) [17:59:20] (03PS27) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [17:59:46] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:59:48] (03PS22) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [18:00:04] bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1800). [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1800) [18:00:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this patch was fine but there is additionally another warning left."
[puppet] - 10https://gerrit.wikimedia.org/r/875450 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:01:09] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) Filed a bug report with Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994 [18:01:46] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:02:25] 10SRE, 10Traffic, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10Vgutierrez) [18:04:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42887 and previous config saved to /var/cache/conftool/dbconfig/20230105-180421-ladsgroup.json [18:04:46] (03PS1) 10Volans: grammars: remove usage of leaveWhitespace [software/cumin] - 10https://gerrit.wikimedia.org/r/875985 [18:04:48] (03PS1) 10Volans: setup.py: support Python 3.10 and Pyparsing 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/875986 [18:05:09] (03PS1) 10Dzahn: phabricator::vcs: comment out warning about empty listen address [puppet] - 10https://gerrit.wikimedia.org/r/875987 (https://phabricator.wikimedia.org/T296022) [18:07:00] (03CR) 10JMeybohm: [C: 03+1] "🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:13:45] (03CR) 10Dzahn: [C: 03+2] phabricator::vcs: comment out warning about empty listen address [puppet] - 10https://gerrit.wikimedia.org/r/875987 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:16:10] (03CR) 10Dzahn: [C: 03+2] "after previous 2 patches a puppet run on prod phab is now free of warnings" [puppet] - 10https://gerrit.wikimedia.org/r/875987 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:16:23] (03PS1) 10Zabe: actions: 
Actually store CommentFormatter in McrUndoAction [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) [18:19:08] i just got randomly logged out while doing some cross-wiki edits using the API. is that normal? [18:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T326156)', diff saved to https://phabricator.wikimedia.org/P42888 and previous config saved to /var/cache/conftool/dbconfig/20230105-181928-ladsgroup.json [18:19:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:19:33] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [18:19:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:19:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T326156)', diff saved to https://phabricator.wikimedia.org/P42889 and previous config saved to /var/cache/conftool/dbconfig/20230105-181949-ladsgroup.json [18:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T326156)', diff saved to https://phabricator.wikimedia.org/P42890 and previous config saved to /var/cache/conftool/dbconfig/20230105-182204-ladsgroup.json [18:22:32] !log delete some nostalgiawiki pages using maintenance/deleteBatch.php for T326334 [18:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:35] T326334: Maintenance scripts create pages on the Nostalgia Wikipedia - https://phabricator.wikimedia.org/T326334 [18:29:44] (03PS5) 10Krinkle: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) [18:31:30] (03CR) 10CI reject: [V: 04-1] actions: Actually store CommentFormatter in McrUndoAction [core] (wmf/1.40.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) (owner: 10Zabe) [18:31:50] (03CR) 10Krinkle: "Scheduled for later tonight https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T2100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: 10Krinkle) [18:31:59] (03CR) 10Zabe: "recheck" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) (owner: 10Zabe) [18:32:14] (03CR) 10Dzahn: [C: 04-1] "This removes Sudo::User[www-data] and other things that might break stuff even without VCS being active." [puppet] - 10https://gerrit.wikimedia.org/r/864852 (owner: 10Dzahn) [18:33:38] 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10RobH) [18:35:16] (03PS2) 10Dzahn: phabricator/cloud: remove vcs related IP settings [puppet] - 10https://gerrit.wikimedia.org/r/865181 [18:37:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42891 and previous config saved to /var/cache/conftool/dbconfig/20230105-183711-ladsgroup.json [18:38:54] 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10RobH) [18:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42892 and previous config saved to /var/cache/conftool/dbconfig/20230105-185217-ladsgroup.json [18:56:01] 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) [18:56:10] 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10RobH) [19:00:05] dduvall and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T1900). [19:07:04] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/863050/38987/" [puppet] - 10https://gerrit.wikimedia.org/r/863050 (owner: 10BCornwall) [19:07:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T326156)', diff saved to https://phabricator.wikimedia.org/P42893 and previous config saved to /var/cache/conftool/dbconfig/20230105-190724-ladsgroup.json [19:07:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:07:28] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [19:07:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:07:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2101.codfw.wmnet with reason: Maintenance [19:08:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2101.codfw.wmnet with reason: Maintenance [19:08:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2111.codfw.wmnet with reason: Maintenance [19:08:18] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove since-deleted dstat plugin dir [puppet] - 10https://gerrit.wikimedia.org/r/863050 (owner: 10BCornwall) [19:08:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2111.codfw.wmnet with reason: Maintenance [19:08:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T326156)', diff saved to https://phabricator.wikimedia.org/P42894 and previous config saved to 
/var/cache/conftool/dbconfig/20230105-190830-ladsgroup.json [19:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T326156)', diff saved to https://phabricator.wikimedia.org/P42895 and previous config saved to /var/cache/conftool/dbconfig/20230105-191046-ladsgroup.json [19:14:55] 10SRE, 10Maps, 10Observability-Metrics, 10observability, and 3 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10herron) [19:22:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875379 (https://phabricator.wikimedia.org/T326275) (owner: 10Zabe) [19:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42896 and previous config saved to /var/cache/conftool/dbconfig/20230105-192553-ladsgroup.json [19:29:16] (03CR) 10Volans: "This should make cumin work with pyparsing 3 (see next patch) and unblock the debianization work." [software/cumin] - 10https://gerrit.wikimedia.org/r/875985 (owner: 10Volans) [19:29:51] (03CR) 10Volans: "This should unblock the debianization work once I'll make a release with it." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/875986 (owner: 10Volans) [19:31:44] !log creating new cu tables [19:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:33] (03Merged) 10jenkins-bot: actions: Pass CommentFormatter to McrRestoreAction [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875379 (https://phabricator.wikimedia.org/T326275) (owner: 10Zabe) [19:37:36] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10RobH) [19:37:57] !log taavi@deploy1002 Started scap: Backport for [[gerrit:875379|actions: Pass CommentFormatter to McrRestoreAction (T326275)]] [19:38:00] T326275: TypeError: Argument 6 passed to McrUndoAction::__construct() must be an instance of MediaWiki\CommentFormatter\CommentFormatter, instance of GlobalVarConfig given, called in /srv/mediawiki/php-1.40.0-wmf.17/vendor/wikimedia/obj - https://phabricator.wikimedia.org/T326275 [19:38:10] !log reprepro -C main include bullseye-wikimedia varnish_6.0.10-1wm3_amd64.changes: T325797 [19:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:13] T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 [19:38:25] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10RobH) [19:39:54] (03CR) 10Effie Mouzeli: "Do you think it would make sense if we'd have a function to do a dry run (obviously limit the results) and have the ability to do a test r" [puppet] - 10https://gerrit.wikimedia.org/r/875973 (owner: 10Jgiannelos) [19:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42897 and previous config saved to /var/cache/conftool/dbconfig/20230105-194059-ladsgroup.json [19:41:09] !log taavi@deploy1002 taavi and zabe: Backport for [[gerrit:875379|actions: Pass 
CommentFormatter to McrRestoreAction (T326275)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [19:41:21] ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (RobH) [19:42:49] ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (RobH) [19:48:08] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:875379|actions: Pass CommentFormatter to McrRestoreAction (T326275)]] (duration: 10m 11s) [19:48:11] T326275: TypeError: Argument 6 passed to McrUndoAction::__construct() must be an instance of MediaWiki\CommentFormatter\CommentFormatter, instance of GlobalVarConfig given, called in /srv/mediawiki/php-1.40.0-wmf.17/vendor/wikimedia/obj - https://phabricator.wikimedia.org/T326275 [19:51:07] (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/875998 (https://phabricator.wikimedia.org/T325580) [19:51:09] (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/875998 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [19:51:58] (Merged) jenkins-bot: all wikis to 1.40.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/875998 (https://phabricator.wikimedia.org/T325580) (owner: TrainBranchBot) [19:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T326156)', diff saved to https://phabricator.wikimedia.org/P42898 and previous config saved to /var/cache/conftool/dbconfig/20230105-195606-ladsgroup.json [19:56:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:56:10] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [19:56:21] !log ladsgroup@cumin1001 END (PASS)
- Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T326156)', diff saved to https://phabricator.wikimedia.org/P42899 and previous config saved to /var/cache/conftool/dbconfig/20230105-195627-ladsgroup.json [19:58:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T326156)', diff saved to https://phabricator.wikimedia.org/P42900 and previous config saved to /var/cache/conftool/dbconfig/20230105-195843-ladsgroup.json [19:59:43] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.17 refs T325580 [19:59:46] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [20:02:49] SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (Jclark-ctr) Thanks Ayounsi! if any issues come back i still do have more spare optics. would Monday be a good day to schedule line card moves also if no errors return? [20:04:30] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (Ahecht) @VirginiaPoundstone, as far as I'm aware, it's not so much that Thumbor is dependent on features of Debian 9 (Stretch) and is incompatible with Debian 10...
[20:06:06] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.732e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [20:12:49] (PS1) Andrew Bogott: OpenStack nova: install genisoimage [puppet] - https://gerrit.wikimedia.org/r/876003 [20:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42901 and previous config saved to /var/cache/conftool/dbconfig/20230105-201350-ladsgroup.json [20:15:20] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (Jclark-ctr) @Andrew we are just without management on this server at this time [20:16:34] (CR) Andrew Bogott: [C: +2] "https://puppet-compiler.wmflabs.org/output/876003/38988/" [puppet] - https://gerrit.wikimedia.org/r/876003 (owner: Andrew Bogott) [20:17:30] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@9568478]: Bumping platform_eng airflow instance to latest [20:17:39] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@9568478]: Bumping platform_eng airflow instance to latest (duration: 00m 09s) [20:23:10] (CR) Dzahn: [C: +2] phabricator/cloud: remove vcs related IP settings [puppet] - https://gerrit.wikimedia.org/r/865181 (owner: Dzahn) [20:28:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42902 and previous config saved to /var/cache/conftool/dbconfig/20230105-202856-ladsgroup.json [20:44:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T326156)', diff saved to https://phabricator.wikimedia.org/P42903 and previous config saved to /var/cache/conftool/dbconfig/20230105-204403-ladsgroup.json [20:44:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on
db2128.codfw.wmnet with reason: Maintenance [20:44:07] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [20:44:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2128.codfw.wmnet with reason: Maintenance [20:44:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:44:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:44:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T326156)', diff saved to https://phabricator.wikimedia.org/P42904 and previous config saved to /var/cache/conftool/dbconfig/20230105-204438-ladsgroup.json [20:46:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T326156)', diff saved to https://phabricator.wikimedia.org/P42905 and previous config saved to /var/cache/conftool/dbconfig/20230105-204654-ladsgroup.json [21:00:04] brennen and TheresNoTime: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230105T2100). [21:00:04] zabe and Krinkle: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:10] hey o/ [21:00:29] I can deploy in about 5 minutes [21:00:32] o/ [21:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42906 and previous config saved to /var/cache/conftool/dbconfig/20230105-210201-ladsgroup.json [21:02:38] zabe: going to start with the backport, 875915 [21:03:06] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) (owner: Zabe) [21:04:11] TheresNoTime: since that's a core patch, maybe do some config changes in the meantime while that's in CI? [21:04:14] eh scrap that, I'll do the config while thats merging [21:04:18] taavi: :D [21:04:24] !log samtar@deploy1002 backport aborted: (duration: 01m 22s) [21:04:39] (PS2) Samtar: Start writing to cuc_comment_id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/875438 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [21:05:41] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/875438 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [21:06:26] (Merged) jenkins-bot: Start writing to cuc_comment_id everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/875438 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [21:06:40] !log samtar@deploy1002 Started scap: Backport for [[gerrit:875438|Start writing to cuc_comment_id everywhere (T233004)]] [21:06:43] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:08:16] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:875438|Start writing to cuc_comment_id everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:08:32] zabe: live on mwdebug, can you test? [21:09:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 403 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:09:11] hm [21:10:05] TheresNoTime, lgtm [21:10:38] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:40] syncing [21:11:06] not seeing any immediately suspicious errors [21:11:26] I suspect a batch of "A database query timeout has occurred." [21:11:34] (791 in the last 15 minutes) [21:12:02] ah, that's filtered out from the mediawiki-errors dashboard [21:12:29] and is filtered as junk on `logspam-watch` :p [21:12:41] there's a toggle for that, fwiw [21:12:55] (yeah just did that to see ^^) [21:14:23] (PS1) FNegri: ToolsDB: stop replicating a big problematic table [puppet] - https://gerrit.wikimedia.org/r/876011 (https://phabricator.wikimedia.org/T326261) [21:16:48] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:875438|Start writing to cuc_comment_id everywhere (T233004)]] (duration: 10m 07s) [21:16:51] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:17:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42907 and previous config saved to /var/cache/conftool/dbconfig/20230105-211707-ladsgroup.json [21:18:01] zabe: that's live, and the core patch is almost merged [21:19:58] (PS1) Ahmon Dancy: Remove .pipeline directory [mediawiki-config] - https://gerrit.wikimedia.org/r/876013 [21:21:04] (Merged)
jenkins-bot: actions: Actually store CommentFormatter in McrUndoAction [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) (owner: Zabe) [21:21:10] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875915 (https://phabricator.wikimedia.org/T326336) (owner: Zabe) [21:21:24] !log samtar@deploy1002 Started scap: Backport for [[gerrit:875915|actions: Actually store CommentFormatter in McrUndoAction (T326336)]] [21:21:27] T326336: Error: Call to a member function formatBlock() on null - https://phabricator.wikimedia.org/T326336 [21:23:04] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:875915|actions: Actually store CommentFormatter in McrUndoAction (T326336)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:23:14] zabe: ^ live on mwdebug, can you test?
[21:23:49] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:24:51] lemme see [21:25:37] TheresNoTime, lgtm [21:25:46] ack, syncing :) [21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:29:51] (PS6) Samtar: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: Krinkle) [21:31:55] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:875915|actions: Actually store CommentFormatter in McrUndoAction (T326336)]] (duration: 10m 31s) [21:31:59] T326336: Error: Call to a member function formatBlock() on null - https://phabricator.wikimedia.org/T326336 [21:32:09] zabe: that's live :) [21:32:13] thanks :) [21:32:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T326156)', diff saved to https://phabricator.wikimedia.org/P42908 and previous config saved to /var/cache/conftool/dbconfig/20230105-213214-ladsgroup.json [21:32:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:32:17] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [21:32:26] Krinkle: ready for 865097 ?
[21:32:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42909 and previous config saved to /var/cache/conftool/dbconfig/20230105-213235-ladsgroup.json [21:32:58] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: Krinkle) [21:32:59] TheresNoTime: ack [21:33:41] (Merged) jenkins-bot: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) (owner: Krinkle) [21:33:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42910 and previous config saved to /var/cache/conftool/dbconfig/20230105-213351-ladsgroup.json [21:33:56] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865097|Turn off wgNavigationTimingOversampleFactor campaigns (T286703)]] [21:33:59] T286703: Navigation Timing cleanup - https://phabricator.wikimedia.org/T286703 [21:35:35] !log samtar@deploy1002 samtar and krinkle: Backport for [[gerrit:865097|Turn off wgNavigationTimingOversampleFactor campaigns (T286703)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:35:39] Krinkle: live on mwdebug, can you test?
[21:35:55] checking [21:36:25] confirmed `mw.loader.moduleRegistry['ext.navigationTiming'] ` version changed, and packageExports.config.json is now showing oversampleFactor=false [21:36:26] Go ahead [21:36:35] syncin' [21:42:42] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865097|Turn off wgNavigationTimingOversampleFactor campaigns (T286703)]] (duration: 08m 45s) [21:42:46] and live :) [21:42:46] T286703: Navigation Timing cleanup - https://phabricator.wikimedia.org/T286703 [21:43:37] !log closing UTC late backport window [21:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:36] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (JoKalliauer) @Ahecht: If supporting the latest version of librsvg is difficult, some Users prefer (i) [[ https://packages.debian.org/sid/resvg | resvg ]] (faster... [21:48:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42911 and previous config saved to /var/cache/conftool/dbconfig/20230105-214858-ladsgroup.json [21:50:34] SRE: rsyslog::conf puppet define types inserts an extraneous newline in the content param - https://phabricator.wikimedia.org/T320569 (jhathaway) Open→Resolved [21:54:25] (PS1) Andrew Bogott: Codfw1dev Openestack to version 'zed' [puppet] - https://gerrit.wikimedia.org/r/876020 (https://phabricator.wikimedia.org/T323086) [22:04:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42912 and previous config saved to /var/cache/conftool/dbconfig/20230105-220404-ladsgroup.json [22:06:48] (PS2) Krinkle: Remove .pipeline directory [mediawiki-config] - https://gerrit.wikimedia.org/r/876013 (owner: Ahmon Dancy) [22:06:51] (CR) Krinkle: [C: +2] Remove .pipeline directory
[mediawiki-config] - https://gerrit.wikimedia.org/r/876013 (owner: Ahmon Dancy) [22:07:41] (Merged) jenkins-bot: Remove .pipeline directory [mediawiki-config] - https://gerrit.wikimedia.org/r/876013 (owner: Ahmon Dancy) [22:09:37] (PS2) Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - https://gerrit.wikimedia.org/r/865207 [22:13:07] SRE, MediaWiki-libs-Rdbms, Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (aaron) [22:14:15] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH) [22:14:21] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (RobH) [22:16:27] SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (RobH) [22:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42913 and previous config saved to /var/cache/conftool/dbconfig/20230105-221911-ladsgroup.json [22:19:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [22:19:15] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [22:19:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [22:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T326156)', diff saved to https://phabricator.wikimedia.org/P42914 and previous config saved to /var/cache/conftool/dbconfig/20230105-221932-ladsgroup.json [22:19:41] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/876022 [22:20:49] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T326156)', diff saved to https://phabricator.wikimedia.org/P42915 and previous config saved to /var/cache/conftool/dbconfig/20230105-222048-ladsgroup.json [22:21:19] (PS3) Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - https://gerrit.wikimedia.org/r/865207 [22:21:25] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/876022 (owner: Ahmon Dancy) [22:22:28] (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/876022 (owner: Ahmon Dancy) [22:24:44] (CR) Andrew Bogott: [C: +2] Codfw1dev Openestack to version 'zed' [puppet] - https://gerrit.wikimedia.org/r/876020 (https://phabricator.wikimedia.org/T323086) (owner: Andrew Bogott) [22:30:07] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (Dzahn) resvg only exists in bullseye https://packages.debian.org/search?suite=all&searchon=names&keywords=resvg and even long-term support has ended for stretch... [22:32:02] PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:26] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (JoKalliauer) @Dzahn: Thanks and packing resvg from https://github.com/RazrFalcon/resvg/ might be out of scope?
[22:35:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42916 and previous config saved to /var/cache/conftool/dbconfig/20230105-223554-ladsgroup.json [22:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42917 and previous config saved to /var/cache/conftool/dbconfig/20230105-225101-ladsgroup.json [22:54:27] (PS1) Andrew Bogott: Trove: remove a hack that's no longer needed [puppet] - https://gerrit.wikimedia.org/r/876025 [22:54:27] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (RobH) [22:54:38] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (RobH) [22:56:53] SRE, serviceops: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (RobH) [22:57:00] (Abandoned) Ebernhardson: dumpcirrussearch.sh: Replace gzip with lbzip2 [puppet] - https://gerrit.wikimedia.org/r/835705 (owner: Ebernhardson) [22:57:04] (CR) Andrew Bogott: [C: +2] Trove: remove a hack that's no longer needed [puppet] - https://gerrit.wikimedia.org/r/876025 (owner: Andrew Bogott) [23:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T326156)', diff saved to https://phabricator.wikimedia.org/P42918 and previous config saved to /var/cache/conftool/dbconfig/20230105-230607-ladsgroup.json [23:06:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [23:06:12] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [23:06:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [23:06:29] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42919 and previous config saved to /var/cache/conftool/dbconfig/20230105-230629-ladsgroup.json [23:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42920 and previous config saved to /var/cache/conftool/dbconfig/20230105-230745-ladsgroup.json [23:18:53] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Ladsgroup) Zabe has done tremendous work in general maintenance of wikimedia infrastructure and specially in refactorings of our database schema (actor migration in checkuser and so on) which f... [23:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42921 and previous config saved to /var/cache/conftool/dbconfig/20230105-232251-ladsgroup.json [23:28:09] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Dzahn) I confirm Zabe has done a lot of helpful work in different areas, including operations/puppet and mwcore. here is a list of patches that have been merged: https://gerrit.wikimedia.org...
[23:37:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42922 and previous config saved to /var/cache/conftool/dbconfig/20230105-233758-ladsgroup.json [23:53:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T326156)', diff saved to https://phabricator.wikimedia.org/P42923 and previous config saved to /var/cache/conftool/dbconfig/20230105-235304-ladsgroup.json [23:53:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [23:53:08] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156 [23:53:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [23:53:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T326156)', diff saved to https://phabricator.wikimedia.org/P42924 and previous config saved to /var/cache/conftool/dbconfig/20230105-235325-ladsgroup.json [23:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T326156)', diff saved to https://phabricator.wikimedia.org/P42925 and previous config saved to /var/cache/conftool/dbconfig/20230105-235543-ladsgroup.json