[00:16:26] !log mwmaint2002: Stop T315510#9312431 instances of extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php (T315510) [00:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:30] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [00:39:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/972506 [00:39:12] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/972506 (owner: 10TrainBranchBot) [00:57:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/972506 (owner: 10TrainBranchBot) [01:57:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:58:09] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:11] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:28:53] (03CR) 10Pppery: "Note for reviewers: This repository is not set up with Jenkins (except for L10n-bot commits), so any patches will need to be manually give" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969515 (https://phabricator.wikimedia.org/T294754) (owner: 10Pppery) [02:34:45] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:37:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad 
IPv6 is CRITICAL: CRITICAL - failed 195 probes of 720 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:38:12] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:47] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.39 ms [02:47:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 60 probes of 720 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:56:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:59:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:08:12] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:01:58] (03CR) 10DannyS712: wm-checks-api: add PCC build outcome (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 
10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [04:07:25] (03PS1) 10Samwilson: planet: Add Wikimedia Australia feed [puppet] - 10https://gerrit.wikimedia.org/r/972534 [04:21:51] (03PS1) 10RLazarus: k8s-controller-sidecars: Initial release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/972535 [05:10:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:11:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:11:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:14:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:27:23] (03PS2) 10RLazarus: k8s-controller-sidecars: Initial release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/972535 [05:31:00] (03PS3) 10RLazarus: k8s-controller-sidecars: Initial release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/972535 [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0700) [07:08:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:40:42] (03CR) 10Muehlenhoff: [C: 03+2] Add Puppet aliases for hosts running Puppet 5 and Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972411 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:41:18] (03CR) 
10DCausse: [C: 04-1] staging-eqiad: raise rdf-streaming-updater quota (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [07:42:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: zookeeper::test [07:43:20] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [07:47:00] (03CR) 10Elukey: changeprop: set num_workers to zero (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [07:47:14] (03PS1) 10Muehlenhoff: Switch zookeeper::test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972690 (https://phabricator.wikimedia.org/T349619) [07:51:37] (03CR) 10Muehlenhoff: [C: 03+2] Switch zookeeper::test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972690 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:56:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: zookeeper::test [07:58:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::turnilo::staging [08:00:05] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0800) [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:05] (03PS1) 10Muehlenhoff: Switch analytics_cluster::turnilo::staging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972691 (https://phabricator.wikimedia.org/T349619) [08:00:09] (03PS1) 10Slyngshede: Alert on degraded MD RAID devices. [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) [08:02:51] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::turnilo::staging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972691 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:02:56] (03PS2) 10Slyngshede: Alert on degraded MD RAID devices. [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) [08:06:13] (03CR) 10Slyngshede: "Severity might be a little high, but we can adjust that." [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:08:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::turnilo::staging [08:11:54] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: increase heap to 4g [puppet] - 10https://gerrit.wikimedia.org/r/972456 (https://phabricator.wikimedia.org/T350434) (owner: 10Herron) [08:16:06] (03CR) 10Filippo Giunchedi: "LGTM as a starting point, after tuning/trial we can even switch to per-team alerts if desired" [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:16:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: druid::test_analytics::worker [08:17:24] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:19:03] (03CR) 10Filippo Giunchedi: prometheus-puppet-agent-stats: this timer sometime fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [08:20:32] 
(03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: 10EoghanGaffney) [08:22:31] (03PS1) 10Muehlenhoff: Switch druid::test_analytics::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972693 (https://phabricator.wikimedia.org/T349619) [08:23:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:23:41] (03PS3) 10Filippo Giunchedi: alertmanager: add alerts-triage on /triage [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) [08:23:43] (03CR) 10Filippo Giunchedi: alertmanager: add alerts-triage on /triage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) (owner: 10Filippo Giunchedi) [08:24:07] (03CR) 10Muehlenhoff: [C: 03+2] Switch druid::test_analytics::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:26:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 15%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53164 and previous config saved to /var/cache/conftool/dbconfig/20231108-082631-arnaudb.json [08:28:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: druid::test_analytics::worker [08:33:25] PROBLEM - Check systemd state on ganeti3007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:38] <_joe_> jouncebot: nowandnext [08:36:38] For the next 0 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0800) [08:36:38] In 0 hour(s) and 23 minute(s): MediaWiki train - Utc-0+Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0900) [08:37:01] <_joe_> urbanecm: can I commandeer the remaining time to make a mw on k8s change? [08:37:12] <_joe_> given there were no backport patches [08:37:17] _joe_: sure thing [08:37:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-jobrunner: add virtualhost explicitly for jobrunning [deployment-charts] - 10https://gerrit.wikimedia.org/r/968955 (https://phabricator.wikimedia.org/T349796) (owner: 10Giuseppe Lavagetto) [08:37:23] <_joe_> thanks :) [08:38:07] (03Merged) 10jenkins-bot: mw-jobrunner: add virtualhost explicitly for jobrunning [deployment-charts] - 10https://gerrit.wikimedia.org/r/968955 (https://phabricator.wikimedia.org/T349796) (owner: 10Giuseppe Lavagetto) [08:41:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 30%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53165 and previous config saved to /var/cache/conftool/dbconfig/20231108-084136-arnaudb.json [08:42:41] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds fine to me!" [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [08:49:09] (03CR) 10Brouberol: [C: 03+1] "The code LGTM. I can't speak for the feature."
[puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [08:49:30] !log oblivian@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [08:49:30] !log oblivian@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [08:49:51] !log oblivian@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [08:49:53] !log oblivian@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [08:51:05] !log installing openjdk-8 security updates [08:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:23] (03PS3) 10Slyngshede: Alert on degraded MD RAID devices. [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) [08:53:13] (03CR) 10Slyngshede: Alert on degraded MD RAID devices. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:54:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45899 [08:54:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:55:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45899 [08:55:25] !log restarting archiva to pick up Java security updates [08:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 45%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53166 and previous config saved to /var/cache/conftool/dbconfig/20231108-085641-arnaudb.json [08:59:57] 10SRE, 10Data-Engineering, 10Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - 
https://phabricator.wikimedia.org/T334733 (10brouberol) I saw that neither `kafka-logging` nor `kafka-test` have ACLs at all: ` # codfw brouberol@kafka-logging2003:~$ kafka acls... [09:00:05] jnuche and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T0900). [09:01:57] (03PS1) 10Ayounsi: sre.network.peering: Add Auto-Submitted email header [cookbooks] - 10https://gerrit.wikimedia.org/r/972696 (https://phabricator.wikimedia.org/T347835) [09:02:39] !log oblivian@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [09:02:39] !log oblivian@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [09:02:51] !log oblivian@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [09:02:51] !log oblivian@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [09:03:23] <_joe_> jnuche: when you're done with the train, please ping me; I have further fixes to make to mw-on-k8s [09:03:25] _joe_: morning, are you done with that mw on k8s change? 
[09:03:33] <_joe_> yes yes I am [09:03:36] sure thing, will do [09:03:38] thx [09:03:46] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [09:03:47] <_joe_> but I noticed another minor bug [09:04:30] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972697 (https://phabricator.wikimedia.org/T350080) [09:04:32] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972697 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:05:19] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972697 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:05:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:11:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 60%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53167 and previous config saved to /var/cache/conftool/dbconfig/20231108-091146-arnaudb.json [09:11:52] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.4 refs T350080 [09:13:55] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [09:16:52] (03CR) 10Gehel: [C: 03+1] "Let's try again!" [puppet] - 10https://gerrit.wikimedia.org/r/972250 (owner: 10Stevemunene) [09:17:29] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.4 refs T350080 (duration: 05m 36s) [09:18:01] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10Arnoldokoth) Hey. 
I'm having a hard time interpreting whether this is still stalled (maybe I'm misinterpreting the discussions or getting mixed mess... [09:20:30] (03CR) 10Nikerabbit: [C: 03+1] Avoid trailing newline in qqq.json [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969515 (https://phabricator.wikimedia.org/T294754) (owner: 10Pppery) [09:22:46] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972699 (https://phabricator.wikimedia.org/T350080) [09:22:48] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972699 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:23:27] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Revert "Revert "Revert "Revert "airflow-wmde: Create scap deployment source for wmde"""" [puppet] - 10https://gerrit.wikimedia.org/r/972250 (owner: 10Stevemunene) [09:23:29] (03PS1) 10Arnaudb: mariadb: add db1238 and prepare db1138 retirement [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) [09:23:35] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972699 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [09:26:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 75%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53168 and previous config saved to /var/cache/conftool/dbconfig/20231108-092651-arnaudb.json [09:29:46] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.4 refs T350080 [09:30:24] I had to roll back the train [09:30:25] RECOVERY - Check systemd state on ganeti3007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:31] _joe_: I'm done for now [09:30:46] <_joe_> jnuche: thanks [09:41:56] !log arnaudb@cumin1001 
dbctl commit (dc=all): 'db1236 (re)pooling @ 90%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53169 and previous config saved to /var/cache/conftool/dbconfig/20231108-094156-arnaudb.json [09:56:48] (03PS4) 10Arnaudb: debug: printing results when return object count > 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) [09:57:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1236 (re)pooling @ 100%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53170 and previous config saved to /var/cache/conftool/dbconfig/20231108-095701-arnaudb.json [09:58:34] (03PS1) 10Filippo Giunchedi: hieradata: idp_test entry for thanos OIDC [puppet] - 10https://gerrit.wikimedia.org/r/972701 (https://phabricator.wikimedia.org/T331512) [10:00:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:05:04] (03CR) 10Volans: debug: printing results when return object count > 1 (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656) (owner: 10Arnaudb) [10:05:37] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/972696 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [10:06:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-worker1111.eqiad.wmnet [10:07:02] (03PS1) 10Ayounsi: Add support for non EVPN switches on spines [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) [10:08:17] (03PS1) 10Muehlenhoff: Switch an-worker1111 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972703 (https://phabricator.wikimedia.org/T349619) [10:09:17] (03CR) 10Muehlenhoff: [C: 03+2] Switch an-worker1111 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972703 (https://phabricator.wikimedia.org/T349619) (owner: 
10Muehlenhoff) [10:15:44] (03PS1) 10Hnowlan: rest-gateway: correct paths incorrectly specified in spreadsheet [deployment-charts] - 10https://gerrit.wikimedia.org/r/972704 (https://phabricator.wikimedia.org/T350747) [10:16:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-worker1111.eqiad.wmnet [10:17:41] (03PS1) 10Effie Mouzeli: profile:k8s::deployment_server::mediawiki: switch php catchall [puppet] - 10https://gerrit.wikimedia.org/r/972705 (https://phabricator.wikimedia.org/T350770) [10:18:34] (03CR) 10Ayounsi: "Example diffs: https://phabricator.wikimedia.org/P53171" [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [10:21:13] (03CR) 10Santiago Faci: [C: 03+1] "It looks good! Thank you very much!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972704 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [10:21:28] (03CR) 10Ayounsi: [C: 03+2] sre.network.peering: Add Auto-Submitted email header [cookbooks] - 10https://gerrit.wikimedia.org/r/972696 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [10:21:55] (03CR) 10Sg912: [C: 03+1] rest-gateway: correct paths incorrectly specified in spreadsheet [deployment-charts] - 10https://gerrit.wikimedia.org/r/972704 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [10:22:25] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: correct paths incorrectly specified in spreadsheet [deployment-charts] - 10https://gerrit.wikimedia.org/r/972704 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [10:23:14] (03Merged) 10jenkins-bot: rest-gateway: correct paths incorrectly specified in spreadsheet [deployment-charts] - 10https://gerrit.wikimedia.org/r/972704 (https://phabricator.wikimedia.org/T350747) (owner: 10Hnowlan) [10:24:01] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-esams and A:cp [10:24:26] !log 
jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::web::htmldumps [10:24:33] !log brouberol@deploy2002 Started deploy [airflow-dags/analytics@af7f4e5]: (no justification provided) [10:24:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972701 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [10:25:04] !log brouberol@deploy2002 Finished deploy [airflow-dags/analytics@af7f4e5]: (no justification provided) (duration: 00m 31s) [10:25:19] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: idp_test entry for thanos OIDC [puppet] - 10https://gerrit.wikimedia.org/r/972701 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [10:25:53] (03Merged) 10jenkins-bot: sre.network.peering: Add Auto-Submitted email header [cookbooks] - 10https://gerrit.wikimedia.org/r/972696 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [10:26:14] (03PS1) 10Muehlenhoff: Switch dumps::web::htmldumps to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972726 (https://phabricator.wikimedia.org/T349619) [10:26:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Brown paper bag +1 😊" [puppet] - 10https://gerrit.wikimedia.org/r/972705 (https://phabricator.wikimedia.org/T350770) (owner: 10Effie Mouzeli) [10:29:44] (03CR) 10Muehlenhoff: [C: 03+2] Switch dumps::web::htmldumps to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972726 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:30:53] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [10:33:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::web::htmldumps [10:38:27] (03CR) 10Effie Mouzeli: "PCC ok" [puppet] - 10https://gerrit.wikimedia.org/r/972705 (https://phabricator.wikimedia.org/T350770) (owner: 10Effie Mouzeli) [10:38:31] (03CR) 10Effie Mouzeli: [C: 03+2] profile:k8s::deployment_server::mediawiki: switch php catchall [puppet] - 
10https://gerrit.wikimedia.org/r/972705 (https://phabricator.wikimedia.org/T350770) (owner: 10Effie Mouzeli) [10:40:05] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:40:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:40:16] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:40:17] (03CR) 10Ayounsi: Change 'anycast_gw' var in int config to represent type of IRB needed (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [10:40:41] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:40:49] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:43:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::hadoop::worker [10:44:04] (03PS8) 10Hashar: wm-checks-api: add PCC build outcome [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 [10:47:05] (03CR) 10Ayounsi: [C: 03+1] hieradata: cloudgw: drop nfs-maps [puppet] - 10https://gerrit.wikimedia.org/r/971401 (https://phabricator.wikimedia.org/T350259) (owner: 10Majavah) [10:47:07] (03PS1) 10Muehlenhoff: Switch analytics_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972728 (https://phabricator.wikimedia.org/T349619) [10:47:12] (03CR) 10Hashar: [C: 04-1] "With `plugin.registerCustomComponent( 'check-result-expanded', PCCBuildResultElement.is )` the `` element ends up being" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [10:48:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972728 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:48:25] (03CR) 10Ayounsi: [C: 03+1] "Change lgtm, a couple non-blockers comments." [homer/public] - 10https://gerrit.wikimedia.org/r/970767 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [10:50:31] (03PS1) 10Jcrespo: RemoteExecution: Remove errors from cumin logging from basic tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) [10:51:31] (03PS2) 10Majavah: hieradata: cloudgw: drop nfs-maps [puppet] - 10https://gerrit.wikimedia.org/r/971401 (https://phabricator.wikimedia.org/T350259) [10:52:46] (03CR) 10Majavah: [C: 03+2] hieradata: cloudgw: drop nfs-maps [puppet] - 10https://gerrit.wikimedia.org/r/971401 (https://phabricator.wikimedia.org/T350259) (owner: 10Majavah) [10:54:11] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10Cparle) 1. Initial access for a one-time study is really all we need for now, and if the data was ready for us to begin work on... [10:54:39] (03CR) 10Hashar: [C: 04-1] "From the gr-endpoint-decorator.ts:" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [10:57:09] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/972433 (https://phabricator.wikimedia.org/T284150) (owner: 10Jcrespo) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1100) [11:01:13] (03PS1) 10Jbond: utils/setup_workspace.sh: update setup options [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972731 [11:01:36] oh cool right on time for MediaWiki infrastructure deploy window [11:01:37] (03PS1) 10Majavah: nftables: notify correct service resource [puppet] - 10https://gerrit.wikimedia.org/r/972732 [11:01:49] jouncebot: next [11:01:49] In 2 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1400) [11:02:30] (03CR) 10Jbond: [C: 03+2] utils/setup_workspace.sh: update setup options [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972731 (owner: 10Jbond) [11:04:49] !log btullis@cumin1001 Added views for new wiki: fonwiki T347938 [11:04:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:04:54] T347938: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 [11:05:12] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:05:34] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:05:35] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:06:43] (03Merged) 10jenkins-bot: utils/setup_workspace.sh: update setup options [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972731 (owner: 10Jbond) [11:06:52] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:06:53] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:07:19] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:07:20] !log jiji@deploy2002 helmfile [eqiad] 
START helmfile.d/services/mw-web: apply [11:07:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-esams and A:cp [11:08:13] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:08:29] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:08:30] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:09:04] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:09:05] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:09:06] (03CR) 10MVernon: "Just noticed this is still sitting in the review queue at +1. Are you planning on merging it, or did you want me or Filippo to?" 
[puppet] - 10https://gerrit.wikimedia.org/r/945752 (owner: 10Muehlenhoff) [11:09:30] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:09:31] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:10:19] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:10:20] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:10:50] (03CR) 10Muehlenhoff: thanos: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945752 (owner: 10Muehlenhoff) [11:11:12] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:11:13] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:11:15] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:11:16] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:11:19] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:11:20] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [11:12:04] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [11:12:05] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [11:12:28] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [11:15:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: analytics_cluster::hadoop::worker [11:16:45] (03PS1) 10Jbond: Revert "Switch analytics_cluster::hadoop::worker to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/972708 [11:16:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Switch analytics_cluster::hadoop::worker to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/972708 (owner: 10Jbond) [11:18:12] (03PS1) 10Majavah: P:openstack: keystone: sync fernet keys over cloud-private [puppet] - 
10https://gerrit.wikimedia.org/r/972737 [11:18:14] (03PS1) 10Majavah: P:openstack: trove: use cloud-private for memcached in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/972738 [11:18:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/352/con" [puppet] - 10https://gerrit.wikimedia.org/r/972738 (owner: 10Majavah) [11:19:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:22:59] (03PS1) 10Ladsgroup: Only take one field in fetchFieldValues [extensions/PageImages] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972709 (https://phabricator.wikimedia.org/T350726) [11:23:08] (03CR) 10Ladsgroup: [C: 03+2] Only take one field in fetchFieldValues [extensions/PageImages] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972709 (https://phabricator.wikimedia.org/T350726) (owner: 10Ladsgroup) [11:23:16] jouncebot: nowandnext [11:23:16] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1100) [11:23:16] In 2 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1400) [11:23:20] coolio [11:23:39] jnuche: I'm about to push the fix, wanna roll the train again afterwards? 
[11:26:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/PageImages] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972709 (https://phabricator.wikimedia.org/T350726) (owner: 10Ladsgroup) [11:30:14] jayme: just to mention, I've gotten the go-ahead to deploy that wikifeeds change, so your cert manager stuff will be going out with it [11:30:19] (03PS1) 10Majavah: cr-cloud: drop nfs-maps [homer/public] - 10https://gerrit.wikimedia.org/r/972740 (https://phabricator.wikimedia.org/T350259) [11:30:36] hnowlan: cool, thanks! [11:32:16] !log stopping puppet from mc2038 [11:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:22] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [11:33:56] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [11:34:10] (03CR) 10MVernon: [C: 03+1] "Hi," [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:35:36] (03CR) 10Ladsgroup: [C: 03+1] Tranferrer: Enable transfers other than misc, core or x1 sections [software/transferpy] - 10https://gerrit.wikimedia.org/r/972433 (https://phabricator.wikimedia.org/T284150) (owner: 10Jcrespo) [11:36:29] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [11:36:53] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [11:38:07] !log installing logrotate bugfix updates on Bullseye [11:38:11] !log restarting memcached on mc2038 [11:40:58] (03Merged) 10jenkins-bot: Only take one field in fetchFieldValues [extensions/PageImages] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972709 (https://phabricator.wikimedia.org/T350726) (owner: 10Ladsgroup) [11:42:02] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:972709|Only take one field in fetchFieldValues (T350726)]] [11:42:06] T350726: 
[PageImages] SelectQueryBuilder::fetchFieldValues expects the query to have only one field - https://phabricator.wikimedia.org/T350726 [11:43:23] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:972709|Only take one field in fetchFieldValues (T350726)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:43:46] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:43:53] Amir1: sry, just came back from lunch [11:44:01] yeah, will roll forward once it's backported [11:44:04] no worries [11:44:06] thanks a lot for the fix [11:44:40] I broke it so sorry for the noise :D [11:45:34] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:47:05] ouch [11:47:17] (03PS1) 10Jbond: puppet: move puppet run to files [puppet] - 10https://gerrit.wikimedia.org/r/972742 [11:47:19] (03PS1) 10Jbond: puppet: add check for /run/puppet/disabled [puppet] - 10https://gerrit.wikimedia.org/r/972743 [11:47:21] (03PS1) 10Jbond: puppet: puppet-run fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/972744 [11:48:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/353/con" [puppet] - 10https://gerrit.wikimedia.org/r/972744 (owner: 10Jbond) [11:49:01] (03PS1) 10Hnowlan: rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 [11:49:02] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:972709|Only take one field in fetchFieldValues (T350726)]] (duration: 07m 00s) [11:49:09] T350726: [PageImages] SelectQueryBuilder::fetchFieldValues expects the query to have only one field - https://phabricator.wikimedia.org/T350726 [11:49:16] jnuche: deployed ^ [11:50:22] ack, rolling train to group1 [11:51:12] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/972747 (https://phabricator.wikimedia.org/T350080) [11:51:14] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972747 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [11:51:57] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972747 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [11:52:43] (03PS2) 10Jbond: puppet: puppet-run fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/972744 [11:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:54:33] (03CR) 10Peter Fischer: [C: 03+1] cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - 10https://gerrit.wikimedia.org/r/969209 (owner: 10Ebernhardson) [11:56:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972742 (owner: 10Jbond) [11:57:39] (03CR) 10Muehlenhoff: puppet: add check for /run/puppet/disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972743 (owner: 10Jbond) [11:58:24] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.4 refs T350080 [11:59:46] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:59:55] (03CR) 10Jbond: [C: 03+1] "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/972743 (owner: 10Jbond) [12:00:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972743 (owner: 10Jbond) [12:00:21] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:03:00] (03CR) 10Jbond: [C: 03+2] puppet: move puppet run to files [puppet] - 10https://gerrit.wikimedia.org/r/972742 (owner: 10Jbond) [12:03:05] 
(03CR) 10Jbond: [C: 03+2] puppet: add check for /run/puppet/disabled [puppet] - 10https://gerrit.wikimedia.org/r/972743 (owner: 10Jbond) [12:03:30] (03CR) 10Filippo Giunchedi: prometheus-puppet-agent-stats: this timer sometime fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [12:04:05] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.4 refs T350080 (duration: 05m 40s) [12:04:28] (03CR) 10Btullis: [C: 03+1] "This looks good as far as I am concerned." [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:05:51] (03PS1) 10Slyngshede: CI: realpath is in /bin on macOS [puppet] - 10https://gerrit.wikimedia.org/r/972749 [12:06:49] 1.42.0-wmf.4 is in group1, logs look clean [12:06:54] Amir1: thx again :) [12:07:23] !log stevemunene@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [12:07:28] awesome. Sorry for the mess [12:08:17] (03CR) 10Btullis: [C: 03+1] mariadb - analytics: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:09:11] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/972732 (owner: 10Majavah) [12:09:23] (03CR) 10Majavah: [C: 03+2] nftables: notify correct service resource [puppet] - 10https://gerrit.wikimedia.org/r/972732 (owner: 10Majavah) [12:09:53] (03CR) 10Btullis: [C: 03+2] "In fact, I think I'll be bold and +2 this, then deploy it myself." 
[puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:10:22] (03CR) 10Btullis: [C: 03+1] mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:11:43] (03CR) 10Btullis: [C: 03+1] mariadb - wikireplicas: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:14:06] (03CR) 10Btullis: [C: 03+1] aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [12:14:12] (03PS2) 10Jbond: realm.pp: drop $other_site global [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) [12:14:22] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:14:52] (03CR) 10Btullis: [C: 03+1] varnishkafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:15:02] (03PS1) 10Hnowlan: edit-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972812 [12:15:12] (03PS2) 10Jbond: realm.pp: drop namservers global as it is no longer used [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) [12:16:17] (03CR) 10Jbond: realm.pp: drop namservers global as it is no longer used (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:16:22] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:16:26] (03CR) 10Santiago Faci: [C: 03+1] "It looks good! 
Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972812 (owner: 10Hnowlan) [12:16:39] (03CR) 10Hnowlan: [C: 03+2] edit-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972812 (owner: 10Hnowlan) [12:17:37] (03Merged) 10jenkins-bot: edit-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972812 (owner: 10Hnowlan) [12:18:09] (03CR) 10Volans: [C: 04-1] "Setting ok_codes=[] will tell cumin to consider as succesful any exit code of the underlying executed commands, as a result the return cod" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [12:18:37] (03PS1) 10Arnaudb: haproxy: remove dbproxy1017 from production [puppet] - 10https://gerrit.wikimedia.org/r/972509 (https://phabricator.wikimedia.org/T348956) [12:20:28] (03PS2) 10Jbond: realm.pp: drop use_puppetdb global [puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) [12:20:30] (03PS2) 10Jbond: realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008) [12:20:32] (03PS2) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [12:20:34] (03PS2) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [12:20:36] (03PS2) 10Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:20:38] (03PS2) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:20:40] (03PS2) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 
(https://phabricator.wikimedia.org/T350008) [12:20:46] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:21:15] (03CR) 10CI reject: [V: 04-1] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:21:17] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:21:36] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [12:21:50] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [12:22:28] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [12:22:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [12:22:49] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [12:23:06] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [12:24:15] (03PS3) 10Jbond: realm.pp: drop use_puppetdb global [puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) [12:24:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::hadoop::worker [12:24:17] (03PS3) 10Jbond: realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008) [12:24:19] (03PS3) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [12:24:21] (03PS3) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [12:24:23] 
(03PS3) 10Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [12:24:25] (03PS3) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [12:24:27] (03PS3) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [12:24:33] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:25:17] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:25:38] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:26:42] (03CR) 10Jcrespo: RemoteExecution: Remove errors from cumin logging from basic tests (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [12:28:21] (03PS1) 10Muehlenhoff: Switch analytics_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972815 (https://phabricator.wikimedia.org/T349619) [12:30:20] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::hadoop::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972815 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:32:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/972749 (owner: 10Slyngshede) [12:33:35] (03CR) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:33:59] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 
point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:34:03] (03CR) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:35:17] (03CR) 10Majavah: puppet.puppet.get_puppet_ca_hostname: return hardcoded start (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:35:36] (03CR) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:37:44] Hi, I'm trying to unprotect pages on Serbian Wikipedia. [12:38:19] There is one page which I can't unprotect, because it exists and actually doesn't have protection, but when I follow a given link by user, it's showing me that it is protected. [12:38:32] When I click on option to open article, it opens article. [12:38:45] Can someone remove row from protected_titles? 
[12:38:52] SELECT * FROM protected_titles WHERE pt_title='Феjшл'; on Serbian Wikipedia shows this: [12:38:59] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:39:00] | pt_namespace | pt_title | pt_user | pt_reason_id | pt_timestamp | pt_expiry | pt_create_perm | [12:39:00] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:39:01] | 0 | Феjшл | 133 | 2383 | 20111215021350 | infinity | sysop | [12:39:01] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:40:07] (03CR) 10Hashar: [C: 03+2] Make serve:plugins emit a 404 for missing files [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972460 (owner: 10Hashar) [12:40:39] (03Merged) 10jenkins-bot: Make serve:plugins emit a 404 for missing files [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972460 (owner: 10Hashar) [12:40:43] (03CR) 10Hashar: [C: 03+2] Remap serving plugins under /r/plugins/ [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972396 (owner: 10Hashar) [12:41:13] (03Merged) 10jenkins-bot: Remap serving plugins under /r/plugins/ [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972396 (owner: 10Hashar) [12:41:41] (03CR) 10Jbond: [C: 03+2] prometheus: update ssl CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:42:45] https://sr.wikipedia.org/w/index.php?action=edit&preload=&editintro=&title=%D0%A4%D0%B5j%D1%88%D0%BB%E2%80%8F%E2%80%8E&create=%D0%9D%D0%B0%D0%BF%D1%80%D0%B0%D0%B2%D0%B8+%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D1%83 is showing me that I can change protection. When I click on that, it actually loads existing page which doesn't have protection. 
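Editorial note on the URL pasted at 12:42:45: a minimal sketch, assuming only the `title=` value exactly as it appears in that link, of how to list the code points in the title. It uses the Python standard library; the helper names `describe` and `strip_format_chars` are hypothetical and are not part of MediaWiki. The log does not establish whether the stored `pt_title` row contains the same characters, so this inspects the link only.

```python
import unicodedata
from urllib.parse import unquote

# The percent-encoded title= parameter copied from the link in the log.
ENCODED_TITLE = "%D0%A4%D0%B5j%D1%88%D0%BB%E2%80%8F%E2%80%8E"


def describe(title):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in title
    ]


def strip_format_chars(title):
    """Drop invisible format characters (category Cf), such as directional marks."""
    return "".join(ch for ch in title if unicodedata.category(ch) != "Cf")


title = unquote(ENCODED_TITLE)  # percent-decoding, UTF-8 by default

for code_point, name, category in describe(title):
    print(code_point, category, name)

print(len(title), len(strip_format_chars(title)))
```

Decoded this way, the title has seven code points: four Cyrillic letters, a Latin `j` (U+006A) in third position, and a trailing U+200F (right-to-left mark) followed by U+200E (left-to-right mark). The two trailing marks are invisible when rendered, which would be consistent with the "bug with encoding" suspicion raised next, though the log itself does not confirm the cause.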
[12:44:31] I think that there is some bug with encoding, so you can run DELETE FROM protected_titles WHERE pt_timestamp='20111215021350'; [12:44:58] There is only one page so it won't cause a chaos with replication. [12:44:59] MariaDB [srwiki_p]> SELECT * FROM protected_titles WHERE pt_timestamp='20111215021350'; [12:45:00] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:45:00] | pt_namespace | pt_title | pt_user | pt_reason_id | pt_timestamp | pt_expiry | pt_create_perm | [12:45:01] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:45:01] | 0 | Феjшл | 133 | 2383 | 20111215021350 | infinity | sysop | [12:45:02] +--------------+-----------+---------+--------------+----------------+-----------+----------------+ [12:45:02] 1 row in set (0.002 sec) [12:45:32] (03CR) 10Jbond: [C: 03+2] "deployed and tested by running" [puppet] - 10https://gerrit.wikimedia.org/r/972368 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:45:56] (03CR) 10Jbond: [C: 03+2] varnishkafka: update SSL CA to use shared CA [puppet] - 10https://gerrit.wikimedia.org/r/972369 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:50:57] (03CR) 10Ayounsi: [C: 03+1] cr-cloud: drop nfs-maps [homer/public] - 10https://gerrit.wikimedia.org/r/972740 (https://phabricator.wikimedia.org/T350259) (owner: 10Majavah) [12:51:32] (03CR) 10Majavah: [C: 03+2] cr-cloud: drop nfs-maps [homer/public] - 10https://gerrit.wikimedia.org/r/972740 (https://phabricator.wikimedia.org/T350259) (owner: 10Majavah) [12:52:19] (03Merged) 10jenkins-bot: cr-cloud: drop nfs-maps [homer/public] - 10https://gerrit.wikimedia.org/r/972740 (https://phabricator.wikimedia.org/T350259) (owner: 10Majavah) [12:53:18] pt_timestamp 20150912171151 is bugged as well, needs removal via DB as well. 
[12:54:04] And 20190908114554 [12:54:09] (03CR) 10Slyngshede: [C: 03+2] CI: realpath is in /bin on macOS [puppet] - 10https://gerrit.wikimedia.org/r/972749 (owner: 10Slyngshede) [12:55:57] (03PS3) 10Jbond: prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 [12:56:15] (03CR) 10Jbond: prometheus-puppet-agent-stats: this timer sometime fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [12:58:07] (03PS1) 10Stevemunene: Revert "Revert "airflow-wmde: configure wmde airflow instance"" [puppet] - 10https://gerrit.wikimedia.org/r/972712 [13:00:09] (03CR) 10Majavah: [C: 03+2] users: add network device access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/970850 (https://phabricator.wikimedia.org/T350267) (owner: 10Majavah) [13:00:20] hi, I'm going to roll back the train again due to https://phabricator.wikimedia.org/T350777 [13:00:28] will do that in the next few minutes [13:00:46] (03Merged) 10jenkins-bot: users: add network device access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/970850 (https://phabricator.wikimedia.org/T350267) (owner: 10Majavah) [13:03:36] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972816 (https://phabricator.wikimedia.org/T350080) [13:03:38] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972816 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [13:04:22] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972816 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [13:04:42] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: openldap::replica [13:05:13] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki 
tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10ngkountas) [13:06:15] (03PS1) 10Jbond: openldap::replica: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972820 (https://phabricator.wikimedia.org/T349619) [13:06:41] (03CR) 10Jbond: [C: 03+2] openldap::replica: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/972820 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:08:57] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid public cluster: Roll restart of Druid jvm daemons. [13:10:39] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [13:10:50] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.4 refs T350080 [13:10:58] T350080: 1.42.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T350080 [13:12:39] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: free up nfs-maps IPs T350259 - taavi@cumin1001" [13:12:43] T350259: Check if nfs-maps.wikimedia.org is still in use - https://phabricator.wikimedia.org/T350259 [13:13:34] Disregard my previous messages, I've created task: https://phabricator.wikimedia.org/T350780 [13:13:57] PROBLEM - Check systemd state on ganeti6001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:04] mutante: the sync-netbox-hiera cookbook is showing me a diff adding stewards1001, ok to merge that? 
[13:14:52] !log taavi@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: free up nfs-maps IPs T350259 - taavi@cumin1001" [13:14:52] !log taavi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:15:10] nevermind, I'll leave it for you to run the cookbook and merge [13:15:18] since I'm only doing a DNS change anyway [13:19:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: openldap::replica [13:19:09] (03PS1) 10Hnowlan: trafficserver: revert to aqs1 for editor and pageview metrics [puppet] - 10https://gerrit.wikimedia.org/r/972822 (https://phabricator.wikimedia.org/T350708) [13:20:05] (03PS3) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [13:21:35] (03PS2) 10Arnaudb: mariadb: add db1238 and prepare db1138 retirement [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) [13:25:46] (03PS1) 10Btullis: Switch datahub to use the new an-mariadb servers instead of an-coord [deployment-charts] - 10https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150) [13:31:57] (03CR) 10Jcrespo: RemoteExecution: Remove errors from cumin logging from basic tests (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [13:34:12] !log installing libxpm security updates [13:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:45] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [13:49:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::hadoop::worker [13:51:16] (03PS1) 10JMeybohm: Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 
10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) [13:51:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:55:17] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [13:55:40] (03CR) 10Ottomata: [C: 03+1] Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) (owner: 10JMeybohm) [13:55:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::test::broker [13:57:26] (03PS1) 10Muehlenhoff: Switch kafka::test::broker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972826 (https://phabricator.wikimedia.org/T349619) [13:59:12] <_joe_> jouncebot: nowandnext [13:59:12] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [13:59:12] In 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1400) [13:59:19] <_joe_> lol [13:59:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [13:59:53] <_joe_> actually, it would be great to get the replicas bump out with a deployment [13:59:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1400). 
[14:00:05] Daimona, Kizule, and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:26] \o [14:00:34] (03PS1) 10Effie Mouzeli: ipoid: staging cron fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/972827 [14:01:03] Daimona: I think that your backported patch for wmf.4 needs to be cherry-picked into wmf.3 as well, since wmf.4 was reverted to wmf.3. [14:01:05] (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka::test::broker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972826 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:01:33] Kizule: ah, yes, you're right! Thanks! [14:01:43] <_joe_> who's deploying today? [14:01:47] I can I guess [14:01:58] I completely forgot about wmf.3 still being around [14:02:13] No problem, always happy to help. :) [14:02:14] <_joe_> taavi: ok, can I do a small resource bump for mw on k8s before you deploy the first patch? [14:02:19] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: staging cron fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/972827 (owner: 10Effie Mouzeli) [14:02:29] <_joe_> taavi: that will imply a small risk of failing deployment [14:02:34] _joe_: sure! 
lmk when I start deploying [14:02:36] (03PS1) 10Daimona Eaytoy: Remove feature flag for email [extensions/CampaignEvents] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972713 (https://phabricator.wikimedia.org/T347067) [14:02:37] <_joe_> but we can just safely roll it back [14:02:38] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/972738 (owner: 10Majavah) [14:02:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-api-ext, mw-web: Raise replicas 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/964457 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [14:02:55] <_joe_> taavi: as soon as ^^ is merged I guess [14:03:00] ack [14:03:08] (03Merged) 10jenkins-bot: ipoid: staging cron fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/972827 (owner: 10Effie Mouzeli) [14:03:47] (03Merged) 10jenkins-bot: mw-api-ext, mw-web: Raise replicas 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/964457 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [14:03:57] Kizule: I'm not executing random SQL commands in production on rows where I don't know where they came from, please get a proper maintenance script merged into master or get someone who knows about why those rows are there and why they need to be removed to execute the SQL by themselves [14:03:58] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:04:02] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:04:39] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [14:04:50] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [14:05:00] Daimona: why does your patch need a backport? 
[14:05:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972428 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:05:43] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10Ottomata) > So is the way forward to add @AndrewTavis_WMDE to the analytics-search-users group? This would accomplish the goal of the task, but is... [14:05:44] Because the feature flag removal didn't make it in time for wmf.4 [14:05:46] taavi: "Random SQL command" is same as in https://gerrit.wikimedia.org/g/mediawiki/core/+/e009f9acb556a56340b83eeeb05d3eedc9131bda/includes/Permissions/RestrictionStore.php#200 [14:06:01] (03PS2) 10JMeybohm: Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) [14:06:10] https://gerrit.wikimedia.org/g/mediawiki/core/+/e009f9acb556a56340b83eeeb05d3eedc9131bda/includes/page/WikiPage.php#2303 [14:06:12] (03Merged) 10jenkins-bot: beta: Stop setting $wgCampaignEventsEnableEmail, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972428 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:06:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::test::broker [14:06:19] I backported it so that the flag can be unset in the prog config too, without behaviour changes [14:06:27] https://gerrit.wikimedia.org/g/mediawiki/core/+/e009f9acb556a56340b83eeeb05d3eedc9131bda/includes/title/Title.php#2459 [14:06:38] If Amir1 is around, that would be helpful to me. 
:) [14:07:06] (03PS2) 10Majavah: prod: Stop setting $wgCampaignEventsEnableEmail, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972429 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:07:07] I'm around, is it urgent? [14:07:21] Not really urgent, but it would be worth to take a look. [14:07:32] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/972737 (owner: 10Majavah) [14:07:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/CampaignEvents] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972713 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:07:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/CampaignEvents] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972260 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:07:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972429 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:07:38] Daimona: I don't really see the point of that as compared to removing the flag in a week, but whatever :P [14:08:20] Here is the task so you don't have to search for it. 
https://phabricator.wikimedia.org/T350780 [14:08:30] (03Merged) 10jenkins-bot: prod: Stop setting $wgCampaignEventsEnableEmail, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972429 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:08:37] I just wanted to get rid of that task ASAP :D [14:09:15] (03PS3) 10JMeybohm: Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) [14:09:39] (Also, I was under the impression that we missed the train by just a few hours, but then I didn't think about wmf.3...) [14:10:24] (03Merged) 10jenkins-bot: Remove feature flag for email [extensions/CampaignEvents] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/972713 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:10:30] (03Merged) 10jenkins-bot: Remove feature flag for email [extensions/CampaignEvents] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972260 (https://phabricator.wikimedia.org/T347067) (owner: 10Daimona Eaytoy) [14:10:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:10:42] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS bookworm [14:10:57] !log taavi@deploy2002 Started scap: Backport for [[gerrit:972713|Remove feature flag for email (T347067)]], [[gerrit:972260|Remove feature flag for email (T347067)]], [[gerrit:972429|prod: Stop setting $wgCampaignEventsEnableEmail, unused (T347067)]] [14:11:00] T347067: Remove feature flag for email participants - https://phabricator.wikimedia.org/T347067 [14:11:02] (03CR) 10Marostegui: [C: 03+1] mariadb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972363 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:11:09] jouncebot: next 
[14:11:09] In 0 hour(s) and 48 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1500) [14:11:11] <_joe_> frankly I agree that executing SQL by hand in production if it's not reviewed is a bad policy, *especially if repeatedly* [14:11:22] marostegui: hi, it's a backport window, I'm deploying [14:11:41] (03CR) 10Brouberol: [C: 03+2] Renew skein certificate every month via systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971196 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [14:12:02] taavi: Yeah I am aware [14:12:08] taavi: ETA? [14:12:11] _joe_: That's understandable, and that's why I just provided resources in order to make review easier, and task done. [14:12:11] RECOVERY - Check systemd state on ganeti6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:22] !log taavi@deploy2002 taavi and daimona: Backport for [[gerrit:972713|Remove feature flag for email (T347067)]], [[gerrit:972260|Remove feature flag for email (T347067)]], [[gerrit:972429|prod: Stop setting $wgCampaignEventsEnableEmail, unused (T347067)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:27] Daimona: please test [14:12:46] Yup, testing now [14:12:47] marostegui: a very rough guess would be about half an hour or less [14:13:22] I'll ping you when I'm done, ok? [14:14:11] cheers [14:14:31] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:15] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:15:26] <_joe_> marostegui: we might be slowed down by me though :) [14:15:37] that's ok [14:17:20] taavi: LGTM! 
[14:17:38] Daimona: thanks, syncing [14:17:43] !log taavi@deploy2002 taavi and daimona: Continuing with sync [14:18:14] * _joe_ crossing fingers [14:18:20] (03CR) 10Volans: [C: 04-1] "reply inline, I'll comment directly on the task" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [14:18:59] it's starting sync-prod-k8s, let's see what happens [14:19:36] mw-web finished [14:19:42] so did mw-api-ext [14:19:54] _joe_: everything ok on your side? [14:20:19] <_joe_> taavi: yes [14:20:19] Superpes: hi, around? yours are up next [14:20:30] Yep I'm here taavi :D [14:21:13] Ah, sorry, forgot to ping earlier lol [14:22:00] (03PS4) 10JMeybohm: Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) [14:22:02] Superpes: I'm a bit confused by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/971518/, https://www.mediawiki.org/wiki/Extension:AbuseFilter shows -log-private being assigned to sysops by default [14:22:24] Uhm Looking [14:23:08] I also thought it was by default but didn't check (I trusted the request) [14:23:16] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:972713|Remove feature flag for email (T347067)]], [[gerrit:972260|Remove feature flag for email (T347067)]], [[gerrit:972429|prod: Stop setting $wgCampaignEventsEnableEmail, unused (T347067)]] (duration: 12m 19s) [14:23:20] T347067: Remove feature flag for email participants - https://phabricator.wikimedia.org/T347067 [14:23:25] Daimona: yours is live [14:23:31] although https://pl.wikipedia.org/wiki/Specjalna:Grupy_u%C5%BCytkownik%C3%B3w doesn't show it. odd [14:23:45] taavi, Superpes: It's because it's set to false in wmf-config/abusefilter.php. [14:24:12] Revert changes in core-Permissions.php and apply them in wmf-config/abusefilter.php. 
[14:24:24] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10Nikerabbit) I approve. Let us know if further details are needed. [14:24:41] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972831 [14:24:45] Uhm looking if there's any particular reason [14:25:00] aha, seems to be from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/7e95f6d842994e82708a127b7c9ab6c08964462f%5E%21/wmf-config/abusefilter.php [14:25:16] (03PS46) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:25:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:25:37] (03CR) 10JMeybohm: [C: 03+2] Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) (owner: 10JMeybohm) [14:25:47] so you should just drop the current `= false` line in abusefilter.php and revert the changes from core-Permissions.php [14:25:55] Yep doing [14:25:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:26:00] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:26:08] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:26:22] Or maybe is it better to create another task? 
[14:26:24] taavi: Thanks :) [14:26:30] Ciao Daimona :P [14:26:37] (03CR) 10Marostegui: "Remember this needs to be removed from zarcillo too" [puppet] - 10https://gerrit.wikimedia.org/r/972509 (https://phabricator.wikimedia.org/T348956) (owner: 10Arnaudb) [14:26:43] (03CR) 10Marostegui: [C: 03+1] haproxy: remove dbproxy1017 from production [puppet] - 10https://gerrit.wikimedia.org/r/972509 (https://phabricator.wikimedia.org/T348956) (owner: 10Arnaudb) [14:26:49] :D [14:27:05] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [14:27:59] (03Merged) 10jenkins-bot: Increase CPU limit quota of eventgate-analytics by 10 cpus [deployment-charts] - 10https://gerrit.wikimedia.org/r/972825 (https://phabricator.wikimedia.org/T350707) (owner: 10JMeybohm) [14:28:05] what would you need another task for? this is doing the exact thing they're asking for, just in a different way than what you initially went for [14:28:16] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:28:18] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:28:37] Ahhh yep I thought you were asking to change settings for all wikis! Doing immediately :D [14:28:46] (03CR) 10Herron: [V: 03+1 C: 03+2] logstash: increase heap to 4g [puppet] - 10https://gerrit.wikimedia.org/r/972456 (https://phabricator.wikimedia.org/T350434) (owner: 10Herron) [14:28:49] (03PS1) 10Slyngshede: Improve installation and setup procedures for running locally. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972834 [14:30:28] (03PS4) 10Superpes15: [plwiki] Add 'abusefilter-log-private' flag to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) [14:30:53] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[14:30:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: move 15% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964447 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [14:30:57] (03PS2) 10Slyngshede: Improve installation and setup procedures for running locally. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972834 [14:31:06] (03CR) 10Fabfur: [C: 03+1] "ok" [puppet] - 10https://gerrit.wikimedia.org/r/972822 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [14:31:08] (03PS5) 10Majavah: [plwiki] Add 'abusefilter-log-private' flag to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) (owner: 10Superpes15) [14:31:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482) (owner: 10Superpes15) [14:31:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:31:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) (owner: 10Superpes15) [14:31:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:31:58] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [14:32:11] (03Merged) 10jenkins-bot: [bnwikisource] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482) (owner: 10Superpes15) [14:32:14] (03Merged) 10jenkins-bot: [plwiki] Add 'abusefilter-log-private' flag to sysops [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) (owner: 10Superpes15) [14:32:24] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:32:27] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:32:28] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:32:34] (03PS1) 10Marostegui: production-m5.sql: Add DROP [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) [14:32:38] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:32:38] !log taavi@deploy2002 Started scap: Backport for [[gerrit:971517|[bnwikisource] Change the wordmark (T350482)]], [[gerrit:971518|[plwiki] Add 'abusefilter-log-private' flag to sysops (T350509)]] [14:32:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:32:45] T350509: Add (abusefilter-log-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T350509 [14:32:45] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:32:45] T350482: Replace the wordmark for Bengali Wikisource - https://phabricator.wikimedia.org/T350482 [14:32:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:33:39] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[14:33:55] (03CR) 10Marostegui: "This is a noop and requires a change in the DB" [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [14:34:02] !log taavi@deploy2002 taavi and superpes: Backport for [[gerrit:971517|[bnwikisource] Change the wordmark (T350482)]], [[gerrit:971518|[plwiki] Add 'abusefilter-log-private' flag to sysops (T350509)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:34:18] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:34:19] Superpes: both of your patches are available for testing [14:34:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::zookeeper [14:34:50] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add DROP [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [14:34:57] And both looks fine on WMdebug taavi [14:35:05] ok, syncing [14:35:07] !log taavi@deploy2002 taavi and superpes: Continuing with sync [14:35:14] And thanks for the fixing [14:35:18] :) [14:35:55] <_joe_> !log Running puppet on cp-text to pick up the increase in traffic to mw on k8s [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:08] (03PS1) 10Jforrester: Revert "Fix remaining uses of 'parent'->'super'" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972715 (https://phabricator.wikimedia.org/T350777) [14:37:39] (03CR) 10Bking: rdf-streaming-updater: update values for application mode (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:38:02] (03PS1) 10Muehlenhoff: Switch analytics_cluster::zookeeper to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972837 (https://phabricator.wikimedia.org/T349619) [14:38:12] (JobUnavailable) firing: (3) Reduced availability for job ldap 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:42] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::zookeeper to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972837 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:40:24] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:971517|[bnwikisource] Change the wordmark (T350482)]], [[gerrit:971518|[plwiki] Add 'abusefilter-log-private' flag to sysops (T350509)]] (duration: 07m 45s) [14:40:30] aaand it's live [14:40:31] T350509: Add (abusefilter-log-private) right to sysop group on plwiki - https://phabricator.wikimedia.org/T350509 [14:40:31] T350482: Replace the wordmark for Bengali Wikisource - https://phabricator.wikimedia.org/T350482 [14:40:48] marostegui: I'm done deploying [14:40:55] taavi: thanks!!! [14:41:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2012,2014].codfw.wmnet,pc1012.eqiad.wmnet with reason: Upgrade [14:41:33] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972831 (owner: 10Marostegui) [14:41:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2012,2014].codfw.wmnet,pc1012.eqiad.wmnet with reason: Upgrade [14:42:14] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972831 (owner: 10Marostegui) [14:42:38] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972716 [14:42:41] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:972831|ProductionServices.php: Promote pc2014 to pc2 master]] [14:42:44] (03CR) 10Marostegui: [C: 04-2] "Not 
ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972716 (owner: 10Marostegui) [14:44:04] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:972831|ProductionServices.php: Promote pc2014 to pc2 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::zookeeper [14:46:05] !log marostegui@deploy2002 marostegui: Continuing with sync [14:48:42] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:49:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:51:22] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:972831|ProductionServices.php: Promote pc2014 to pc2 master]] (duration: 08m 41s) [14:51:34] (03PS47) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:51:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::staging::worker [14:52:51] (03CR) 10CI reject: [V: 04-1] Revert "Fix remaining uses of 'parent'->'super'" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972715 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [14:53:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:53:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:14] (03PS1) 10Muehlenhoff: 
Switch kubernetes::staging::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972838 (https://phabricator.wikimedia.org/T349619) [14:53:27] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:54:16] (03CR) 10Muehlenhoff: [C: 03+2] Switch kubernetes::staging::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972838 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:54:22] (03CR) 10Stevemunene: [C: 03+2] Revert "Revert "airflow-wmde: configure wmde airflow instance"" [puppet] - 10https://gerrit.wikimedia.org/r/972712 (owner: 10Stevemunene) [14:55:07] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:55:41] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:55] (03CR) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972716 (owner: 10Marostegui) [14:57:01] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972716 (owner: 10Marostegui) [14:57:41] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972716 (owner: 10Marostegui) [14:58:05] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:972716|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] [14:59:00] (03PS1) 10Marostegui: pc2014: Move it to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/972841 [14:59:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::staging::worker [14:59:29] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:972716|Revert 
"ProductionServices.php: Promote pc2014 to pc2 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:37] !log marostegui@deploy2002 marostegui: Continuing with sync [15:00:06] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1500) [15:01:00] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/972841 (owner: 10Marostegui) [15:03:13] (03PS1) 10Arnaudb: auto_schema: add T348183.py [software] - 10https://gerrit.wikimedia.org/r/972510 (https://phabricator.wikimedia.org/T348183) [15:03:26] (03CR) 10CI reject: [V: 04-1] auto_schema: add T348183.py [software] - 10https://gerrit.wikimedia.org/r/972510 (https://phabricator.wikimedia.org/T348183) (owner: 10Arnaudb) [15:04:29] (03CR) 10Arnaudb: "I'm not sure this is the right way to go for https://phabricator.wikimedia.org/T348183 as I run all alter table in a single command" [software] - 10https://gerrit.wikimedia.org/r/972510 (https://phabricator.wikimedia.org/T348183) (owner: 10Arnaudb) [15:04:56] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:972716|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] (duration: 06m 51s) [15:07:08] (03CR) 10Ladsgroup: "Wrong repo: That's auto_schema itself, you need to make a MR in https://gitlab.wikimedia.org/repos/sre/schema-changes" [software] - 10https://gerrit.wikimedia.org/r/972510 (https://phabricator.wikimedia.org/T348183) (owner: 10Arnaudb) [15:07:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::staging::master [15:08:20] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972744 (owner: 10Jbond) [15:08:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2014.codfw.wmnet with OS bookworm [15:09:38] (03CR) 10JHathaway: "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971461 
(https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:09:43] (03PS1) 10Muehlenhoff: Switch kubernetes::staging::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972843 (https://phabricator.wikimedia.org/T349619) [15:11:03] (03PS1) 10Arnaudb: mariadb: re-enable notifications for db1236 [puppet] - 10https://gerrit.wikimedia.org/r/972511 (https://phabricator.wikimedia.org/T344036) [15:11:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch kubernetes::staging::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972843 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:11:32] (03CR) 10Marostegui: "The commit message says prepare db1138 retirement, but there's nothing related to db1138 here, is that expected?" [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:11:50] (03CR) 10Marostegui: [C: 03+1] mariadb: re-enable notifications for db1236 [puppet] - 10https://gerrit.wikimedia.org/r/972511 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:12:01] (03CR) 10Arnaudb: [C: 03+2] mariadb: re-enable notifications for db1236 [puppet] - 10https://gerrit.wikimedia.org/r/972511 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:12:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/972834 (owner: 10Slyngshede) [15:14:39] (03PS3) 10Jbond: puppet: puppet-run fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/972744 [15:14:57] (03PS1) 10Arnaudb: mariadb: removing master candidacy info for db1136 [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) [15:16:05] (03CR) 10Marostegui: mariadb: removing master candidacy info for db1136 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:16:25] (03CR) 10Volans: puppet.puppet.get_puppet_ca_hostname: return hardcoded 
start (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:17:51] (03CR) 10Jbond: [C: 03+2] puppet: puppet-run fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/972744 (owner: 10Jbond) [15:17:54] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) (owner: 10Filippo Giunchedi) [15:18:23] (03PS2) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [15:18:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::staging::master [15:19:40] (03CR) 10CI reject: [V: 04-1] Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:19:44] (03PS1) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [15:20:35] (03PS1) 10Effie Mouzeli: ipoid: staging testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/972845 [15:20:58] (03PS2) 10Arnaudb: mariadb: removing master candidacy info for db1136 [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) [15:21:13] (03CR) 10Arnaudb: mariadb: removing master candidacy info for db1136 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:21:14] (03CR) 10Jbond: [V: 03+1 C: 03+2] mariadb: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972363 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:22:01] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: staging testing [deployment-charts] - 
10https://gerrit.wikimedia.org/r/972845 (owner: 10Effie Mouzeli) [15:22:56] (03Merged) 10jenkins-bot: ipoid: staging testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/972845 (owner: 10Effie Mouzeli) [15:23:40] (03PS1) 10Muehlenhoff: Extend MOU for west1 [puppet] - 10https://gerrit.wikimedia.org/r/972848 [15:24:44] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for west1 [puppet] - 10https://gerrit.wikimedia.org/r/972848 (owner: 10Muehlenhoff) [15:24:59] (PuppetFailure) firing: Puppet has failed on netflow2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:25:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2014.codfw.wmnet with reason: host reimage [15:25:14] (03CR) 10Marostegui: [C: 03+1] mariadb: removing master candidacy info for db1136 [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:25:32] (03CR) 10Arnaudb: [C: 03+2] mariadb: removing master candidacy info for db1136 [puppet] - 10https://gerrit.wikimedia.org/r/972512 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:25:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:26:10] (03PS48) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:26:31] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:26:52] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:26:58] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [15:27:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and 
rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:27:26] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:27:33] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:27:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2014.codfw.wmnet with reason: host reimage [15:28:51] (03CR) 10Volans: "Taking the liberty to abandon this as it was superseeded by I5e9fd661f2be03099bd4b0c234c972093dd7cb85" [software/spicerack] - 10https://gerrit.wikimedia.org/r/497764 (owner: 10Gehel) [15:29:02] (03Abandoned) 10Volans: Expose failed results as part of RemoteExecutionError. [software/spicerack] - 10https://gerrit.wikimedia.org/r/497764 (owner: 10Gehel) [15:29:29] volans: thanks for the cleanup! [15:29:38] :D [15:29:47] yw, I came across it by chance [15:30:11] (03Abandoned) 10Arnaudb: auto_schema: add T348183.py [software] - 10https://gerrit.wikimedia.org/r/972510 (https://phabricator.wikimedia.org/T348183) (owner: 10Arnaudb) [15:30:26] (03CR) 10Hnowlan: [C: 03+2] trafficserver: revert to aqs1 for editor and pageview metrics [puppet] - 10https://gerrit.wikimedia.org/r/972822 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [15:31:10] (03CR) 10Andrew Bogott: [C: 03+1] mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:32:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [15:33:34] !log brion running requeueTranscodes.php to batch-remove old low-res VP9 WebM transcodes (should be low impact) [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:07] (03PS1) 10Hnowlan: editor-analytics: bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972849 [15:34:16] 
(03CR) 10Majavah: [C: 04-1] mariadb - wmcs: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:34:59] (PuppetFailure) resolved: Puppet has failed on netflow2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:36:17] (03CR) 10Santiago Faci: [C: 03+1] "It looks good! Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972849 (owner: 10Hnowlan) [15:37:51] (03CR) 10Hnowlan: [C: 03+2] editor-analytics: bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972849 (owner: 10Hnowlan) [15:38:37] (03Merged) 10jenkins-bot: editor-analytics: bump docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972849 (owner: 10Hnowlan) [15:39:37] (03PS7) 10Jbond: mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [15:39:39] (03PS6) 10Jbond: mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) [15:39:49] (03PS6) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [15:39:53] (03PS6) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [15:39:57] (03PS6) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [15:40:01] (03PS6) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 
(https://phabricator.wikimedia.org/T340741) [15:40:35] (03CR) 10Jbond: "fixed cheers" [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:41:27] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [15:41:39] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:42:02] (03CR) 10Majavah: [C: 03+1] mariadb - wmcs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968665 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:42:22] (03Abandoned) 10Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:43:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [15:43:24] (03PS1) 10Stevemunene: Revert "Revert "airflow-wmde: Add wmde service user to the Yarn production queue"" [puppet] - 10https://gerrit.wikimedia.org/r/972718 [15:43:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet [15:44:22] (03PS1) 10Stevemunene: Revert "Revert "airflow-wmde: Place airflow1007 in airflow-wmde role"" [puppet] - 10https://gerrit.wikimedia.org/r/972719 [15:44:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:45:54] !log brion running requeueTranscodes.php on mwmaint2002 to continue backfill for iOS-compatible low-res video (throttled) [15:46:05] (03Abandoned) 10Jcrespo: miniloader: Draft small utility to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [15:48:41] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host pc2014.codfw.wmnet with OS bookworm [15:49:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1014.eqiad.wmnet with OS bookworm [15:50:46] (03PS1) 10Brouberol: Setup partman reuse recipe for an-druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/972851 (https://phabricator.wikimedia.org/T332604) [15:51:19] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:51:31] (03PS1) 10Jforrester: Modify regex to reflect updated DOM [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) [15:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet [15:52:07] (03CR) 10Stevemunene: [C: 03+2] Revert "Revert "airflow-wmde: Add wmde service user to the Yarn production queue"" [puppet] - 10https://gerrit.wikimedia.org/r/972718 (owner: 10Stevemunene) [15:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:53:33] (03CR) 10Jbond: [C: 03+2] mariadb - wikireplicas: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:53:59] (PuppetFailure) firing: Puppet has failed on ganeti1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:17] (03CR) 10Brouberol: Setup partman reuse recipe for 
an-druid hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972851 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [15:56:18] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:57:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [15:57:58] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dse_k8s::worker [15:58:48] (03CR) 10Jcrespo: "Thanks, I will amend this as a comment only/linting change only being noop to keep Matthew's useful comments. I will open a new one with V" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [15:59:06] (03Abandoned) 10Jforrester: Revert "Fix remaining uses of 'parent'->'super'" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972715 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [15:59:18] (03PS1) 10Muehlenhoff: Switch dse_k8s::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972853 (https://phabricator.wikimedia.org/T349619) [16:01:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch dse_k8s::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972853 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:01:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1014.eqiad.wmnet with reason: host reimage [16:03:38] (03CR) 10Stevemunene: [C: 03+2] Revert "Revert "airflow-wmde: Place airflow1007 in airflow-wmde role"" [puppet] - 10https://gerrit.wikimedia.org/r/972719 (owner: 10Stevemunene) [16:03:59] (PuppetFailure) resolved: Puppet has failed on ganeti1014:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:04:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1014.eqiad.wmnet with reason: host reimage [16:04:57] jouncebot: nowandnext [16:04:57] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [16:04:58] In 1 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1800) [16:05:30] jnuche: OK for me to deploy that wmf.4 WBMI fix? Did you want to do so? [16:06:39] (03PS6) 10Kamila Součková: [WIP] add kube-state-metrics helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) [16:06:56] James_F: happy with you going ahead if you don't mind doing it [16:07:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:07:07] Sure. 
[16:09:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dse_k8s::worker [16:09:09] (03CR) 10CI reject: [V: 04-1] Modify regex to reflect updated DOM [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:09:31] Dear Vector… [16:09:34] PROBLEM - MariaDB Replica Lag: s3 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 570.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:42] PROBLEM - MariaDB Replica Lag: s7 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 577.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:11] (03PS1) 10Jforrester: Skip PerformanceBudgetTest::testTotalModulesSize [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972721 (https://phabricator.wikimedia.org/T350338) [16:10:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972721 (https://phabricator.wikimedia.org/T350338) (owner: 10Jforrester) [16:10:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:10:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972721 (https://phabricator.wikimedia.org/T350338) (owner: 10Jforrester) [16:10:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 
10Jforrester) [16:11:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dse_k8s::master [16:11:18] PROBLEM - MariaDB Replica Lag: s6 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 674.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:15:04] PROBLEM - MariaDB Replica Lag: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 901.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:15:54] PROBLEM - MariaDB Replica SQL: s6 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:15:58] PROBLEM - MariaDB Replica SQL: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:04] PROBLEM - MariaDB Replica IO: s3 on clouddb1021 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:06] PROBLEM - MariaDB Replica IO: s7 on clouddb1021 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:06] PROBLEM - MariaDB Replica IO: s4 on clouddb1021 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:10] PROBLEM - MariaDB Replica IO: s5 on clouddb1021 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:32] PROBLEM - MariaDB Replica SQL: s1 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:32] PROBLEM - MariaDB Replica IO: s6 on clouddb1021 is CRITICAL: CRITICAL 
slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:16:52] PROBLEM - MariaDB Replica IO: s8 on clouddb1021 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:17:27] (03PS1) 10Muehlenhoff: Switch dse_k8s::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972856 (https://phabricator.wikimedia.org/T349619) [16:17:28] PROBLEM - MariaDB Replica Lag: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:18:04] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1021 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:22] PROBLEM - MariaDB read only s3 on clouddb1021 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:26] PROBLEM - MariaDB read only s2 on clouddb1021 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:28] PROBLEM - MariaDB read only s7 on clouddb1021 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:28] PROBLEM - MariaDB read only s4 on clouddb1021 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:32] PROBLEM - MariaDB read only wikireplica-s7 on clouddb1021 is CRITICAL: Could not connect to localhost:3317 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:32] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1021 is CRITICAL: Could not connect to 
localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:34] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1021 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:40] PROBLEM - MariaDB read only s5 on clouddb1021 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:46] PROBLEM - mysqld processes on clouddb1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:18:50] PROBLEM - MariaDB read only s6 on clouddb1021 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:54] PROBLEM - MariaDB read only wikireplica-s2 on clouddb1021 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:54] PROBLEM - MariaDB read only s1 on clouddb1021 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:18:56] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1021 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:19:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1014.eqiad.wmnet with OS bookworm [16:19:29] (03CR) 10Muehlenhoff: [C: 03+2] Switch dse_k8s::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/972856 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:19:36] PROBLEM - MariaDB read only s8 on clouddb1021 is CRITICAL: Could not connect to localhost:3318 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:19:56] PROBLEM - MariaDB Replica Lag: s4 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:21:12] PROBLEM - MariaDB Replica SQL: s3 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:21:54] PROBLEM - MariaDB read only wikireplica-s4 on clouddb1021 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:22:03] (03CR) 10CI reject: [V: 04-1] Modify regex to reflect updated DOM [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:22:05] * James_F sighs so very much at Vector. [16:22:13] (03CR) 10Jforrester: [C: 03+2] Modify regex to reflect updated DOM [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:22:30] PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:22:34] RECOVERY - MariaDB read only wikireplica-s2 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 1s, read_only: True, event_scheduler: False, 11.56 QPS, connection latency: 0.019665s, query latency: 0.017255s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:22:36] RECOVERY - MariaDB read only s1 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 2s, read_only: True, event_scheduler: False, 11.68 QPS, connection latency: 0.405751s, query latency: 0.009362s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:22:36] RECOVERY - MariaDB read only wikireplica-s1 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 3s, read_only: True, event_scheduler: False, 11.77 QPS, connection latency: 0.005181s, query latency: 0.000601s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:22:56] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 23s, read_only: True, event_scheduler: False, 11.75 QPS, connection latency: 0.005656s, query latency: 0.000409s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:08] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 35s, read_only: True, event_scheduler: False, 21.62 QPS, connection latency: 0.004534s, query latency: 0.000499s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:16] RECOVERY - MariaDB read only s3 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 9s, read_only: True, event_scheduler: False, 11.48 QPS, connection latency: 0.006813s, query latency: 0.017737s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:18] RECOVERY - MariaDB read only s8 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 44s, read_only: True, event_scheduler: False, 20.60 QPS, connection latency: 0.005525s, query latency: 0.000557s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:20] RECOVERY - MariaDB read only s2 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 15s, read_only: True, event_scheduler: False, 11.78 QPS, connection latency: 0.004924s, query latency: 0.000509s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:22] RECOVERY - MariaDB read only s7 on 
clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 48s, read_only: True, event_scheduler: False, 11.76 QPS, connection latency: 0.004756s, query latency: 0.000520s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:22] RECOVERY - MariaDB read only s4 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 4s, read_only: True, event_scheduler: False, 11.77 QPS, connection latency: 0.005265s, query latency: 0.000471s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:23] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [16:23:26] RECOVERY - MariaDB read only wikireplica-s7 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 11.80 QPS, connection latency: 0.005719s, query latency: 0.000480s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:26] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 18s, read_only: True, event_scheduler: False, 11.78 QPS, connection latency: 0.005648s, query latency: 0.000498s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:30] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 56s, read_only: True, event_scheduler: False, 20.60 QPS, connection latency: 0.004390s, query latency: 0.000482s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:36] RECOVERY - MariaDB read only s5 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 63s, read_only: True, event_scheduler: False, 20.55 QPS, connection latency: 0.004686s, query latency: 0.000417s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:40] RECOVERY - mysqld processes on clouddb1021 is 
OK: PROCS OK: 8 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:23:46] RECOVERY - MariaDB read only s6 on clouddb1021 is OK: Version 10.6.14-MariaDB, Uptime 72s, read_only: True, event_scheduler: False, 11.61 QPS, connection latency: 0.010574s, query latency: 0.000761s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:23:59] (PuppetFailure) firing: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:24:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dse_k8s::master [16:25:06] PROBLEM - MariaDB Replica Lag: s8 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1502.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:25:21] Vector tests be moody today... 
[16:26:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:27:04] (03PS1) 10Krinkle: noc: fix indentation in base.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 [16:28:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:29:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972721 (https://phabricator.wikimedia.org/T350338) (owner: 10Jforrester) [16:29:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:29:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] modules: add base.statsd:1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972339 (owner: 10Giuseppe Lavagetto) [16:29:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:30:24] (03Merged) 10jenkins-bot: modules: add base.statsd:1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/972339 (owner: 10Giuseppe Lavagetto) [16:31:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] base.statsd: add prestop sleep helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 (owner: 10Giuseppe Lavagetto) [16:31:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:31:44] (03CR) 10CI reject: [V: 04-1] base.statsd: add prestop sleep helper [deployment-charts] - 
10https://gerrit.wikimedia.org/r/972340 (owner: 10Giuseppe Lavagetto) [16:31:57] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [16:33:00] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) Ok. Then it seems we are ok for the most part- I will start working then on access, as this is the first time such acc... [16:33:38] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@869cca4]: Set group ownership of processed sparql queries [16:33:59] (PuppetFailure) resolved: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:34:05] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@869cca4]: Set group ownership of processed sparql queries (duration: 00m 27s) [16:35:56] (03PS3) 10Giuseppe Lavagetto: base.statsd: add prestop sleep helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/972340 [16:36:20] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) One last thing- legal rarely comments on stuff here in public on Phab- you may want to reach them directly. [16:37:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:38:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[16:39:04] (03Merged) 10jenkins-bot: Skip PerformanceBudgetTest::testTotalModulesSize [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972721 (https://phabricator.wikimedia.org/T350338) (owner: 10Jforrester) [16:39:07] (03Merged) 10jenkins-bot: Modify regex to reflect updated DOM [extensions/WikibaseMediaInfo] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/972720 (https://phabricator.wikimedia.org/T350777) (owner: 10Jforrester) [16:39:15] Finally. [16:39:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:39:33] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:972721|Skip PerformanceBudgetTest::testTotalModulesSize (T350338)]], [[gerrit:972720|Modify regex to reflect updated DOM (T350777)]] [16:39:45] (03PS3) 10Giuseppe Lavagetto: mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 [16:39:49] T350338: Vector PerformanceBudgetTest::testTotalModulesSize CI break - https://phabricator.wikimedia.org/T350338 [16:39:49] T350777: 1.42.0-wmf.4: Structured Data on Wikimedia Commons not longer available - https://phabricator.wikimedia.org/T350777 [16:40:56] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:972721|Skip PerformanceBudgetTest::testTotalModulesSize (T350338)]], [[gerrit:972720|Modify regex to reflect updated DOM (T350777)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:41:26] !log jforrester@deploy2002 jforrester: Continuing with sync [16:43:44] (03PS7) 10Kamila Součková: Add WIP kube-state-metrics deployment to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) [16:44:07] (03CR) 10Kamila Součková: [C: 03+2] Add WIP kube-state-metrics deployment to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 
(https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:44:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 (owner: 10Giuseppe Lavagetto) [16:45:21] (03Merged) 10jenkins-bot: mediawiki: update statsd module [deployment-charts] - 10https://gerrit.wikimedia.org/r/972341 (owner: 10Giuseppe Lavagetto) [16:47:03] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:972721|Skip PerformanceBudgetTest::testTotalModulesSize (T350338)]], [[gerrit:972720|Modify regex to reflect updated DOM (T350777)]] (duration: 07m 29s) [16:47:11] T350338: Vector PerformanceBudgetTest::testTotalModulesSize CI break - https://phabricator.wikimedia.org/T350338 [16:47:11] T350777: 1.42.0-wmf.4: Structured Data on Wikimedia Commons not longer available - https://phabricator.wikimedia.org/T350777 [16:47:38] jnuche: Train should be unblocked [16:48:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [16:48:21] (03CR) 10Kamila Součková: [C: 03+2] Initial commit of kube-state-metrics chart from prometheus-community (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:49:00] (03Merged) 10jenkins-bot: Initial commit of kube-state-metrics chart from prometheus-community [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:49:49] James_F: thanks a lot, that patch was painful [16:49:55] jouncebot: nowandnext [16:49:55] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [16:49:55] In 1 hour(s) and 10 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1800) [16:50:02] ok, rolling forward [16:50:47] (03PS1) 10TrainBranchBot: group1 wikis 
to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972861 (https://phabricator.wikimedia.org/T350080) [16:50:49] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972861 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [16:50:56] (03Merged) 10jenkins-bot: Add WIP kube-state-metrics deployment to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:51:35] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972861 (https://phabricator.wikimedia.org/T350080) (owner: 10TrainBranchBot) [16:51:48] <_joe_> jnuche: please lmk when you're done [16:51:51] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump to v1.29.0 to pick up retry logic change [deployment-charts] - 10https://gerrit.wikimedia.org/r/972463 (https://phabricator.wikimedia.org/T347884) (owner: 10Ottomata) [16:51:59] will do [16:52:14] <_joe_> I was about to make a change to mw on k8s, I'll wait for you to be done [16:52:59] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - bump to v1.29.0 to pick up retry logic change [deployment-charts] - 10https://gerrit.wikimedia.org/r/972463 (https://phabricator.wikimedia.org/T347884) (owner: 10Ottomata) [16:54:13] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:54:23] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:54:35] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) While checking the things I need to apply the change, I need 2 additional data points- * The list of ips where the fi... 
[16:56:30] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:56:41] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:56:55] (03CR) 10VolkerE: [C: 03+1] noc: fix indentation in base.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 (owner: 10Krinkle) [16:57:08] (03CR) 10VolkerE: [C: 03+1] "Can't +2 here" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 (owner: 10Krinkle) [16:57:10] (03PS1) 10Hnowlan: editor-analytics: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972864 [16:58:03] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.4 refs T350080 [17:02:03] (03PS3) 10Giuseppe Lavagetto: mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) [17:02:05] (03PS3) 10Giuseppe Lavagetto: mediawiki: add statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) [17:03:56] !log jnuche@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.4 refs T350080 (duration: 05m 52s) [17:05:26] logs look clean and commons is showing structured data again, it's looking good [17:05:30] _joe_: please go ahead [17:05:38] <_joe_> jnuche: perfect, thanks [17:05:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [17:06:17] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:06:18] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:07:05] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:07:06] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:07:13] (03Merged) 10jenkins-bot: mw-debug: add statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/972342 (https://phabricator.wikimedia.org/T240685) (owner: 10Giuseppe Lavagetto) [17:08:20] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:08:57] (03PS1) 10Effie Mouzeli: Dummy commit to bump image [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/972866 (https://phabricator.wikimedia.org/T348647) [17:09:06] (03PS4) 10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) [17:09:09] (03CR) 10Krinkle: noc: fix indentation in base.css (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 (owner: 10Krinkle) [17:09:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [17:10:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] k8s-controller-sidecars: Initial release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/972535 (owner: 10RLazarus) [17:10:46] (03CR) 10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [17:18:56] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:19:41] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:20:17] !log depool cp3079.esams.wmnet for BIOS settings update [17:20:48] (03Abandoned) 10Effie Mouzeli: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [17:22:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 
cp3079.esams.wmnet with reason: BIOS settings change [17:22:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp3079.esams.wmnet with reason: BIOS settings change [17:23:02] (03CR) 10Effie Mouzeli: [C: 03+2] Dummy commit to bump image [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/972866 (https://phabricator.wikimedia.org/T348647) (owner: 10Effie Mouzeli) [17:23:04] (03CR) 10Kamila Součková: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972400 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [17:23:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3079.esams.wmnet [17:23:51] (03Merged) 10jenkins-bot: Dummy commit to bump image [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/972866 (https://phabricator.wikimedia.org/T348647) (owner: 10Effie Mouzeli) [17:23:54] (03PS3) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [17:23:56] (03PS2) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) [17:27:54] (03PS2) 10Hnowlan: wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) [17:28:55] (03CR) 10CI reject: [V: 04-1] wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:30:36] (03PS4) 10JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [17:30:38] (03PS3) 10JMeybohm: api-gateway,rest-gateway: Switch to cert-manager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 
(https://phabricator.wikimedia.org/T300033) [17:30:46] (03PS3) 10Hnowlan: wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) [17:31:48] (03CR) 10CI reject: [V: 04-1] wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:43:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3079.esams.wmnet [17:43:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp3079.esams.wmnet [17:43:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3079.esams.wmnet [17:46:22] !log pool cp3079.esams.wmnet [17:51:03] (03PS1) 10Kamila Součková: kube-state-metrics: bump chart version, add upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972869 [17:52:10] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [17:52:21] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:53:00] (03PS2) 10Kamila Součková: kube-state-metrics: bump chart version, add upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972869 (https://phabricator.wikimedia.org/T264625) [17:56:30] (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: bump chart version, add upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972869 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [17:57:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: elastic relforge cluster restart - bking@cumin2002 - T350703 [17:59:04] (03Merged) 10jenkins-bot: kube-state-metrics: bump chart version, add upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/972869 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [18:00:05] 
Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1800) [18:01:56] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: elastic relforge cluster restart - bking@cumin2002 - T350703 [18:05:51] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad elastic cluster restart - bking@cumin2002 - T350703 [18:12:24] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [18:14:00] 10SRE, 10LDAP-Access-Requests: Grant Access to WMF for ecarg - https://phabricator.wikimedia.org/T350818 (10ecarg) [18:15:38] (03CR) 10Dzahn: [C: 03+2] "thanks, Sam!" [puppet] - 10https://gerrit.wikimedia.org/r/972534 (owner: 10Samwilson) [18:18:07] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:18:32] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:20:44] taavi: it's new to me that I would have to run a sync-netbox-hiera cookbook in addition to using the makevm cookbook. when I run it with -c to check it tells me I must run --dry-run too and when I do that it says --dry-run is an unrecognized argument, so .. a bit confused why it showed that to you [18:21:19] mutante: did you have makevm crash or something similar? that would explain why the last part would not have been ran [18:21:39] it will prompt you before merging anything, so you can safely run it without --dry-run [18:21:54] taavi: there was a bug in the cookbook that got fixed.. but that happened with stewards2001, not 1001 .. 
[18:22:21] error: unrecognized arguments: --dry-run [18:22:32] trying [18:22:38] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test - dzahn@cumin1001" [18:24:00] yes, it does show a diff to add that host to netbox data.. for the other host that did not happen.. [18:24:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test - dzahn@cumin1001" [18:24:15] anyways it should be cleaned up now [18:26:57] (CR) BryanDavis: "Will this help with my problem of sidecars in the Toolhub job pods at T292861?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/972535 (owner: RLazarus) [18:30:03] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1006.eqiad.wmnet with OS bookworm [18:30:26] (Restored) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [18:32:13] SRE, Infrastructure-Foundations, User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (herron) I've had success mounting non-root filesystems that were unreliable (networked fs, external arrays, these kinds of things) using autofs, which these days can be done in... [18:32:51] (CR) Bking: staging-eqiad: raise rdf-streaming-updater quota (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [18:50:41] Krinkle can you join us in #wikimedia-cloud-admin for help with a mcrounter/memcached/wikitech mystery?
[18:54:42] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:05] jnuche and dduvall: OwO what's this, a deployment window?? Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1900). nyaa~ [19:00:05] jnuche and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T1900). [19:00:44] (PS1) Dzahn: admin: create group for stewards VMs [puppet] - https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) [19:01:43] (CR) CI reject: [V: -1] admin: create group for stewards VMs [puppet] - https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [19:06:44] SRE, observability: Add monitoring for nutcracker - https://phabricator.wikimedia.org/T95231 (Krinkle) Open→Declined Nutcracker for MW (apart from cloudweb, T202431) has been replaced with mcrouter.
[19:07:11] (03PS2) 10Dzahn: admin: create group for stewards VMs [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) [19:08:19] (03CR) 10Muehlenhoff: admin: create group for stewards VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:11:02] (03PS3) 10Dzahn: admin: create group for stewards VMs [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) [19:11:11] (03CR) 10Dzahn: admin: create group for stewards VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:12:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:16:19] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) >>! In T344164#9314186, @Urbanecm wrote: > Thanks @dzahn for making the VM! Following our IRC conversations, I'm putti... [19:16:28] (03CR) 10Dzahn: [C: 03+2] admin: create group for stewards VMs [puppet] - 10https://gerrit.wikimedia.org/r/972874 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:18:51] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) >>! In T344164#9317517, @Dzahn wrote: >> ** A managed clone of a Git repository (https://gitlab.wikimedia.org/repos... [19:20:14] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) >>! In T344164#9317520, @Urbanecm wrote: > Will do. 
Is there some sort of standard/preferred location? Not really, de... [19:22:57] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) >>! In T344164#9317522, @Dzahn wrote: >>>! In T344164#9317520, @Urbanecm wrote: >> Will do. Is there some sort of s... [19:24:06] (03CR) 10Volans: [C: 04-1] "Gentle reminder for Serviceops to provide feedback on this one" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [19:24:18] (03CR) 10Volans: "Gentle reminder to provide feedback on this one" [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [19:24:33] (03CR) 10Volans: "Gentle reminder for Serviceops to provide feedback on this one" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [19:25:47] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad elastic cluster restart - bking@cumin2002 - T350703 [19:26:20] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10EBernhardson) This should now be resolved, existing partitions are owned by `analytics-privatedata-users` and new datasets going forward should also... 
[19:27:37] (03PS1) 10Milimetric: Update mediawiki_history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/972880 [19:32:57] 10SRE, 10Observability-Logging: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10herron) [19:37:09] (03PS1) 10Dzahn: stewards: add git::clone of stewards/onboarding-system from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) [19:37:35] (03CR) 10CI reject: [V: 04-1] stewards: add git::clone of stewards/onboarding-system from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:39:08] (03CR) 10Dzahn: "what does jenkins dislike..." [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:40:12] mutante: it says syntax error [19:40:31] Could not parse for environment *root*: Syntax error at 'ensure' (file: /srv/workspace/puppet/modules/profile/manifests/stewards.pp, line: 12, column: 9) [19:40:33] (03CR) 10Eevans: [C: 03+2] aqs: add .../aqs/deploy/src/ to Environment [puppet] - 10https://gerrit.wikimedia.org/r/972461 (https://phabricator.wikimedia.org/T349228) (owner: 10Eevans) [19:41:06] !log Manually pinned wikitech to 1.42.0-wmf.3 via local hacks on cloudweb100{3,4} [19:41:50] RhinosF1: ah, missing a : :) [19:42:35] (03PS2) 10Dzahn: stewards: add git::clone of stewards/onboarding-system from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) [19:42:39] (03PS1) 10Ssingh: hiera: update authdns_servers (PCC test commit, DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/972883 (https://phabricator.wikimedia.org/T347054) [19:42:50] (03CR) 10Ssingh: [C: 04-2] hiera: update authdns_servers (PCC test commit, DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/972883 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:43:23] (03PS2) 
10Ssingh: hiera: update authdns_servers (PCC test commit, DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/972883 (https://phabricator.wikimedia.org/T347054) [19:44:19] The wikitech rollback was to find out if 1.42.0-wmf.4 is what broken OAuth owner-only auth on wikitech. And that seems to have been confirmed. [19:44:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/972883 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:45:11] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) >>! In T344164#9317524, @Urbanecm wrote: > Sorry for the confusion, I meant for the "where you want them to be written... [19:45:14] mutante: looks like that fixed it [19:45:58] RhinosF1: where did the compiler output go ? heh [19:46:52] (03Abandoned) 10Ssingh: hiera: update authdns_servers (PCC test commit, DO NOT MERGE) [puppet] - 10https://gerrit.wikimedia.org/r/972883 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:47:49] mutante: PCC still works [19:48:27] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972884 [19:49:15] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host acmechief-test1001.eqiad.wmnet with OS bookworm [19:49:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host acmechief-test1001.eqiad.wmnet with OS bookworm [19:51:56] RhinosF1: as in "you HAVE to run it locally now" ? 
[19:52:40] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [19:52:54] no more output on https://puppet-compiler.wmflabs.org/ ? [19:53:12] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:53:59] mutante: no still in wmcloud [19:54:35] (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [19:54:49] mutante: did you run it [19:55:04] RhinosF1: yes, it tells me that it ran succesfully but there is no more link to the output? [19:56:00] mutante: should be [19:56:01] (03PS1) 10BCornwall: acme_chief: Set acmechief-test1001 as active host [puppet] - 10https://gerrit.wikimedia.org/r/972886 (https://phabricator.wikimedia.org/T342154) [19:56:04] I ran it again [19:56:05] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972884 (owner: 10Ebernhardson) [19:57:12] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Applying JVM security upgrade - eevans@cumin1001 [19:58:50] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/972884 (owner: 10Ebernhardson) [20:01:26] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:01:32] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/972882/356/stewards1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [20:01:38] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:02:12] 
RhinosF1: thanks, i see it now :) [20:02:15] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief-test1001.eqiad.wmnet with reason: host reimage [20:02:38] (03CR) 10Dzahn: [V: 03+1 C: 03+2] stewards: add git::clone of stewards/onboarding-system from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/972882 (https://phabricator.wikimedia.org/T344164) (owner: 10Dzahn) [20:03:23] I think what confused me was that there is now puppet 5 and puppet 7.. ack, saw that mail [20:05:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief-test1001.eqiad.wmnet with reason: host reimage [20:06:00] puppet git clones, just getting the usual "dubious ownership" warning. I kind of want to fix that instead of adding the exception for the dir. [20:06:09] happened before [20:07:54] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, and 2 others: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Dzahn) >>! In T344164#9314186, @Urbanecm wrote: > ** A managed clone of a Git repository (https://gitlab.wikimedia.org/repos/... [20:08:03] mutante: now that i think about it...i guess a system account would be helpful, to run the code under? especially a regular check to verify everything's in order. [20:08:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Applying JVM security upgrade - eevans@cumin1001 [20:09:45] urbanecm: not sure yet, could be solved through group ownership. 
all shell users are in "wikidev" group [20:10:03] i see [20:10:15] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*.eqiad.wmnet: Applying JVM security upgrade - eevans@cumin1001 [20:10:32] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) p:05Triage→03High [20:12:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Problems encountered while checking and current situation (just a note for bookeeping): * Virtualization is enabled on cp1102 (`egrep -q "vmx|svm" /proc/cpuinf... [20:13:30] (03CR) 10Ottomata: [C: 03+2] Update mediawiki_history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/972880 (owner: 10Milimetric) [20:19:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*.eqiad.wmnet: Applying JVM security upgrade - eevans@cumin1001 [20:21:26] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [20:21:35] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:23:41] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*.codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [20:25:21] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:26:09] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:28:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief-test1001.eqiad.wmnet with OS bookworm [20:28:49] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host acmechief-test1001.eqiad.wmnet with OS bookworm completed: - acmechief-test1001 (**WARN**) - Do... [20:30:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*.codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [20:33:34] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief-test1001.eqiad.wmnet [20:33:35] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief-test1001.eqiad.wmnet [20:37:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:39:01] 10SRE, 10Traffic, 10GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [20:39:59] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[19-21,28,31].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [20:45:30] (03PS7) 10BPirkle: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) [20:49:32] 10SRE, 10Traffic, 10GitLab (Project Migration): Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) Yes, indeed! Thanks for pointing that out. 
[21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T2100). [21:00:06] bpirkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:23] I'm here [21:00:31] I can deploy [21:03:26] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: BPirkle) [21:04:14] (Merged) jenkins-bot: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway [mediawiki-config] - https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: BPirkle) [21:04:41] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:968384|Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (T348731)]] [21:06:06] !log kindrobot@deploy2002 bpirkle and kindrobot: Backport for [[gerrit:968384|Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (T348731)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:20] bpirkle: can you confirm the changes? [21:06:43] Confirmed, looks as expected [21:06:55] Very good, syncing.
[21:07:12] !log kindrobot@deploy2002 bpirkle and kindrobot: Continuing with sync [21:09:24] (PS4) RLazarus: k8s-controller-sidecars: Initial release [docker-images/production-images] - https://gerrit.wikimedia.org/r/972535 (https://phabricator.wikimedia.org/T348284) [21:10:05] (CR) RLazarus: [C: +2] k8s-controller-sidecars: Initial release (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/972535 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus) [21:10:19] (CR) RLazarus: [V: +2 C: +2] k8s-controller-sidecars: Initial release [docker-images/production-images] - https://gerrit.wikimedia.org/r/972535 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus) [21:12:59] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:968384|Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (T348731)]] (duration: 08m 17s) [21:13:12] !log otto@deploy2002 Started deploy [analytics/refinery@25ef91f] (hadoop-test): deploying refinery with refinery-source 0.2.25 jars for T321854 - hadoop-test [analytics/refinery@25ef91f2] [21:13:12] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[19-21,28,31].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [21:13:24] !log finish UTC late backport window [21:13:31] Thank you!
[21:13:54] No problem :) [21:15:13] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[22-24,29,32].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [21:15:47] (PS2) Andrew Bogott: openstack::glance::service: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881887 (owner: Muehlenhoff) [21:16:13] (CR) CI reject: [V: -1] openstack::glance::service: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881887 (owner: Muehlenhoff) [21:16:21] !log otto@deploy2002 Finished deploy [analytics/refinery@25ef91f] (hadoop-test): deploying refinery with refinery-source 0.2.25 jars for T321854 - hadoop-test [analytics/refinery@25ef91f2] (duration: 03m 10s) [21:20:54] (PS3) Andrew Bogott: openstack::glance::service: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881887 (owner: Muehlenhoff) [21:24:01] (PS1) Ottomata: test/refine - update refinery jar version for analytics test cluster refine job [puppet] - https://gerrit.wikimedia.org/r/972894 (https://phabricator.wikimedia.org/T321854) [21:43:08] (CR) Andrew Bogott: [C: +2] openstack::glance::service: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881887 (owner: Muehlenhoff) [21:48:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[22-24,29,32].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [21:50:43] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[25-27,30,33].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [21:51:22] (PS1) Dzahn: stewards: use git:clone parameters to shared repo among several users [puppet] - https://gerrit.wikimedia.org/r/972896 (https://phabricator.wikimedia.org/T344164) [21:54:24] (CR) Dzahn: [C: +2] stewards: use git:clone parameters to shared repo among
several users [puppet] - https://gerrit.wikimedia.org/r/972896 (https://phabricator.wikimedia.org/T344164) (owner: Dzahn) [21:54:30] (PS2) Dzahn: stewards: use git:clone parameters to shared repo among several users [puppet] - https://gerrit.wikimedia.org/r/972896 (https://phabricator.wikimedia.org/T344164) [21:54:49] (PS3) Dzahn: stewards: use git:clone parameters to share repo among several users [puppet] - https://gerrit.wikimedia.org/r/972896 (https://phabricator.wikimedia.org/T344164) [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231108T2200) [22:02:51] (CR) Btullis: [C: +1] "Looks great, thanks." [puppet] - https://gerrit.wikimedia.org/r/972851 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol) [22:04:05] SRE, SRE-Access-Requests: Requesting access to stewards-users for urbanecm - https://phabricator.wikimedia.org/T350834 (Urbanecm) [22:06:54] (PS1) Milimetric: Update to latest snapshot [deployment-charts] - https://gerrit.wikimedia.org/r/972897 [22:07:09] (CR) Milimetric: [C: +2] Update to latest snapshot [deployment-charts] - https://gerrit.wikimedia.org/r/972897 (owner: Milimetric) [22:08:17] (Merged) jenkins-bot: Update to latest snapshot [deployment-charts] - https://gerrit.wikimedia.org/r/972897 (owner: Milimetric) [22:08:27] (CR) Btullis: [C: +1] Update to latest snapshot [deployment-charts] - https://gerrit.wikimedia.org/r/972897 (owner: Milimetric) [22:08:56] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703 [22:09:10] T350703: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 [22:12:26] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics:
apply [22:12:41] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [22:12:55] !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [22:13:05] !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [22:13:08] !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [22:13:10] !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [22:13:16] !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [22:13:23] !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [22:13:33] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [22:14:04] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [22:17:39] PROBLEM - thanos.wikimedia.org tls expiry on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:17:41] (PS1) Milimetric: Copy version from production service [deployment-charts] - https://gerrit.wikimedia.org/r/972904 [22:17:50] (CR) Milimetric: [C: +2] Copy version from production service [deployment-charts] - https://gerrit.wikimedia.org/r/972904 (owner: Milimetric) [22:18:07] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:18:23] PROBLEM - thanos.wikimedia.org requires authentication on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:18:40] (Merged) jenkins-bot: Copy version from production service [deployment-charts] - https://gerrit.wikimedia.org/r/972904 (owner: Milimetric) [22:19:23] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol
2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:19:39] RECOVERY - thanos.wikimedia.org requires authentication on titan1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:19:47] (PS1) Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/972906 [22:20:19] RECOVERY - thanos.wikimedia.org tls expiry on titan1002 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 20 Nov 2023 08:56:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:23:28] !log milimetric@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [22:23:40] !log milimetric@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [22:23:46] !log milimetric@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [22:24:14] !log milimetric@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [22:24:22] !log milimetric@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [22:24:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[25-27,30,33].eqiad.wmnet: Applying JVM security upgrade (row A) - eevans@cumin1001 [22:24:41] !log milimetric@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [22:25:20] (CR) Milimetric: [C: +2] "I see now that version was only deployed to staging, so it makes me think I should revert.
For reference later, the version deployed to p" [deployment-charts] - https://gerrit.wikimedia.org/r/972904 (owner: Milimetric) [22:27:02] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Dzahn) With the last puppet change above the git repo is now shared between users and there are no more warnings about the... [22:27:37] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Dzahn) Open→Resolved [22:28:32] SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (Dzahn) [22:39:10] SRE, SRE-Access-Requests: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) [22:40:49] SRE, SRE-Access-Requests: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) In addition to adding urbanecm to the "stewards-users" group this is also a request to add him as the group approver for future group additions. I approve...
[22:44:19] (PS1) Dzahn: admin: add Martin Urbanec as group approver for stewards-users [puppet] - https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) [22:44:40] (CR) CI reject: [V: -1] admin: add Martin Urbanec as group approver for stewards-users [puppet] - https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) (owner: Dzahn) [22:45:54] (PS2) Dzahn: admin: add Martin Urbanec as group approver for stewards-users [puppet] - https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) [22:51:52] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) requestor has existing shell access so no need to worry about L3, NDA, keys.. just a group addition. [22:57:20] (PS1) Dzahn: admin: add urbanecm to stewards-users [puppet] - https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) [22:58:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:59] (PS1) Zoranzoki21: DNM Test [mediawiki-config] - https://gerrit.wikimedia.org/r/972912 [22:59:51] (Abandoned) Zoranzoki21: DNM Test [mediawiki-config] - https://gerrit.wikimedia.org/r/972912 (owner: Zoranzoki21) [23:00:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (Dzahn) @DMburugu Namely tells me you are Martin's manager. Would you approve this access please?
[23:28:18] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (java 11 sec updates) - ryankemper@cumin1001 - T350703 [23:28:22] T350703: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703 [23:51:18] (PS1) Andrew Bogott: wmcs-backup: also remove expired backups via delete-expired [puppet] - https://gerrit.wikimedia.org/r/972915 [23:53:13] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:59:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.549560901581813s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded