Wikimedia IRC logs browser - #wikimedia-operations

Displaying 1048 items:

2023-03-27 02:06:45 <jinxer-wm> (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-03-27 02:09:28 <wikibugs> 'ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (''Andrew) @Jclark-ctr it'll be another week or two before we have workloads moved off of this.'
2023-03-27 02:26:45 <jinxer-wm> (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-03-27 02:29:39 <jinxer-wm> (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2023-03-27 05:10:09 <wikibugs> ('PS2) ''KartikMistry: Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379)'
2023-03-27 05:13:01 <wikibugs> ('PS3) ''Marostegui: mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)'
2023-03-27 05:14:21 <logmsgbot> !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
2023-03-27 05:14:27 <stashbot> T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
2023-03-27 05:14:37 <logmsgbot> !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
2023-03-27 05:16:56 <kart_> Updating cxserver, minor changes.
2023-03-27 05:18:10 <wikibugs> ('CR) ''KartikMistry: [C: ''+2] Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: ''KartikMistry)'
2023-03-27 05:19:04 <wikibugs> ('PS1) ''Marostegui: db1179: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292)'
2023-03-27 05:19:35 <wikibugs> ('PS15) ''KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - ''https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)'
2023-03-27 05:19:41 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T332292', diff saved to https://phabricator.wikimedia.org/P45942 and previous config saved to /var/cache/conftool/dbconfig/20230327-051941-root.json
2023-03-27 05:19:46 <stashbot> T332292: Move db1179 to x1 - https://phabricator.wikimedia.org/T332292
2023-03-27 05:19:53 <wikibugs> ('CR) ''Marostegui: [C: ''+2] db1179: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292) (owner: ''Marostegui)'
2023-03-27 05:22:56 <wikibugs> ('Merged) ''jenkins-bot: Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: ''KartikMistry)'
2023-03-27 05:23:42 <wikibugs> ('PS1) ''Marostegui: mariadb: Move db1179 to x1 [puppet] - ''https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292)'
2023-03-27 05:23:47 <logmsgbot> !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
2023-03-27 05:24:14 <wikibugs> ('CR) ''Marostegui: [C: ''+2] mariadb: Move db1179 to x1 [puppet] - ''https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292) (owner: ''Marostegui)'
2023-03-27 05:24:27 <logmsgbot> !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
2023-03-27 05:28:00 <logmsgbot> !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
2023-03-27 05:28:52 <logmsgbot> !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
2023-03-27 05:37:57 <logmsgbot> !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
2023-03-27 05:38:42 <logmsgbot> !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
2023-03-27 05:40:49 <kart_> !log Updated cxserver to 2023-03-17-133444-production (T332379 + build changes)
2023-03-27 05:40:53 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 05:40:54 <stashbot> T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379
2023-03-27 05:57:34 <wikibugs> ('PS1) ''KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834)'
2023-03-27 06:19:47 <wikibugs> ('CR) ''Krinkle: Fix PHP string interpolation (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: ''Reedy)'
2023-03-27 06:29:39 <jinxer-wm> (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2023-03-27 06:36:42 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45944 and previous config saved to /var/cache/conftool/dbconfig/20230327-063642-root.json
2023-03-27 06:40:20 <marostegui> !log Rename flaggedrevs tables on db1123 ptwikisource T332594
2023-03-27 06:40:24 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 06:40:25 <stashbot> T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
2023-03-27 06:51:47 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45945 and previous config saved to /var/cache/conftool/dbconfig/20230327-065147-root.json
2023-03-27 06:51:53 <marostegui> !log dbmaint s3 eqiad Rename flaggedrevs tables on db1123 ptwikisource T332594
2023-03-27 06:51:57 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 06:51:58 <stashbot> T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
2023-03-27 06:54:22 <wikibugs> ('PS1) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
2023-03-27 07:00:05 <jouncebot> Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T0700).
2023-03-27 07:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2023-03-27 07:06:49 <icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 07:06:52 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45946 and previous config saved to /var/cache/conftool/dbconfig/20230327-070651-root.json
2023-03-27 07:07:57 <icinga-wm> PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 07:09:15 <wikibugs> ('PS1) ''Marostegui: backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)'
2023-03-27 07:12:08 <marostegui> jynus: also that one ^ :)
2023-03-27 07:12:19 <jynus> oh
2023-03-27 07:12:29 <jynus> I forgot
2023-03-27 07:12:46 <jynus> needs 2 changes actually
2023-03-27 07:12:59 <marostegui> ah yes
2023-03-27 07:13:00 <marostegui> I see it
2023-03-27 07:13:02 <marostegui> let me fix it
2023-03-27 07:13:27 <wikibugs> ('PS2) ''Marostegui: backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)'
2023-03-27 07:13:29 <marostegui> jynus: ^
2023-03-27 07:13:41 <wikibugs> ('PS4) ''Marostegui: mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)'
2023-03-27 07:13:47 <wikibugs> ('CR) ''Jcrespo: [C: ''+1] backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
2023-03-27 07:14:06 <jynus> one sec because I was looking and there are backups still running
2023-03-27 07:14:27 <marostegui> sure no problem
2023-03-27 07:21:57 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45947 and previous config saved to /var/cache/conftool/dbconfig/20230327-072156-root.json
2023-03-27 07:30:28 <wikibugs> ('CR) ''Jcrespo: [C: ''+1] mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
2023-03-27 07:32:29 <wikibugs> ('CR) ''Marostegui: [C: ''+2] mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
2023-03-27 07:33:11 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Joe)'
2023-03-27 07:34:11 <urbanecm> goes to do some MW deployment, since B&C is empty
2023-03-27 07:34:16 <wikibugs> ('CR) ''Urbanecm: [C: ''+2] SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: ''Urbanecm)'
2023-03-27 07:34:32 <wikibugs> ('CR) ''Urbanecm: [C: ''+2] GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: ''Urbanecm)'
2023-03-27 07:36:40 <wikibugs> ('Merged) ''jenkins-bot: SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: ''Urbanecm)'
2023-03-27 07:37:01 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45948 and previous config saved to /var/cache/conftool/dbconfig/20230327-073701-root.json
2023-03-27 07:38:50 <logmsgbot> !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]]
2023-03-27 07:38:58 <stashbot> T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
2023-03-27 07:39:50 <jynus> !log disabling puppet and shutding down bacula at backup1001 T331510
2023-03-27 07:39:55 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 07:39:55 <stashbot> T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
2023-03-27 07:41:52 <jynus> a prometheus availability job will alert because of the above log, as the job only monitors that 1 host
2023-03-27 07:44:25 <wikibugs> ('PS1) ''Jcrespo: bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896)'
2023-03-27 07:46:45 <jinxer-wm> (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-03-27 07:48:21 <logmsgbot> !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
2023-03-27 07:48:26 <stashbot> T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
2023-03-27 07:48:39 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''fundraising-tech-ops, ''netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (''ayounsi) ''Open''Resolved a:''ayounsi Done!'
2023-03-27 07:51:39 <wikibugs> ('CR) ''Jcrespo: [C: ''+2] bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: ''Jcrespo)'
2023-03-27 07:52:06 <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45949 and previous config saved to /var/cache/conftool/dbconfig/20230327-075206-root.json
2023-03-27 07:52:55 <wikibugs> ('Merged) ''jenkins-bot: GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: ''Urbanecm)'
2023-03-27 07:55:13 <icinga-wm> RECOVERY - PHP7 rendering on parse2017 is OK: HTTP OK: HTTP/1.1 302 Found - 519 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2023-03-27 07:55:36 <logmsgbot> !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] (duration: 16m 45s)
2023-03-27 07:55:41 <stashbot> T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
2023-03-27 07:58:36 <logmsgbot> !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]]
2023-03-27 07:58:41 <stashbot> T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
2023-03-27 07:59:58 <logmsgbot> !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
2023-03-27 08:00:57 <icinga-wm> RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 08:01:12 <wikibugs> ('CR) ''Tacsipacsi: [huwiki] Add Draft and Draft_talk namespaces (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
2023-03-27 08:02:04 <wikibugs> ('PS1) ''Ladsgroup: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941)'
2023-03-27 08:02:27 <wikibugs> ('CR) ''Ladsgroup: [C: ''+2] EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: ''Ladsgroup)'
2023-03-27 08:02:53 <wikibugs> ('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40331/console"; [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
2023-03-27 08:03:43 <marostegui> !log Failover m1 from db1164 to db1101 - T331510
2023-03-27 08:03:48 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 08:03:49 <stashbot> T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
2023-03-27 08:03:54 <urbanecm> Amir1: fyi my scap backport's just about to finish
2023-03-27 08:04:14 <Amir1> mine takes twenty minutes to merge, don't worry
2023-03-27 08:04:14 <marostegui> all done jynus
2023-03-27 08:04:21 <urbanecm> ok
2023-03-27 08:04:28 <jynus> ok to merge the backup patches?
2023-03-27 08:04:47 <wikibugs> ('CR) ''Marostegui: [C: ''+2] backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
2023-03-27 08:05:02 <marostegui> Etherpad looks fie
2023-03-27 08:05:03 <marostegui> fine
2023-03-27 08:05:35 <jynus> it is a bit slow for me
2023-03-27 08:05:53 <marostegui> I guess it's warming up
2023-03-27 08:06:04 <marostegui> I can open the test pad fine
2023-03-27 08:06:29 <logmsgbot> !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] (duration: 07m 52s)
2023-03-27 08:06:34 <stashbot> T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
2023-03-27 08:06:51 <jynus> it is ok for me now
2023-03-27 08:06:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 08:07:11 <jynus> what else to test?
2023-03-27 08:07:30 <marostegui> jynus: librenms, which also works fine for me
2023-03-27 08:07:31 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Marostegui)'
2023-03-27 08:07:48 <jynus> orch is complaining about lag, I guess not real?
2023-03-27 08:07:53 <marostegui> reload :)
2023-03-27 08:08:18 <jynus> still happening
2023-03-27 08:08:29 <urbanecm> done
2023-03-27 08:08:31 <marostegui> ah I know why
2023-03-27 08:09:14 <jynus> cleanup of the table maybe?
2023-03-27 08:09:32 <wikibugs> ('PS1) ''Marostegui: db1101: Make it master [puppet] - ''https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510)'
2023-03-27 08:09:33 <marostegui> jynus: nope, this ^
2023-03-27 08:09:37 <jynus> I see
2023-03-27 08:09:44 <wikibugs> ('CR) ''Marostegui: [V: ''+2 C: ''+2] db1101: Make it master [puppet] - ''https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
2023-03-27 08:10:40 <marostegui> jynus: fixed!
2023-03-27 08:11:25 <jynus> looking at the original path to see why I didn't see that
2023-03-27 08:11:29 <jynus> *patch
2023-03-27 08:11:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 08:12:50 <jynus> let me run puppet on backup hosts
2023-03-27 08:12:52 <wikibugs> ('CR) ''Jelto: [V: ''+1 C: ''+1] "lgtm, left one little question in-line" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
2023-03-27 08:12:53 <jynus> to apply the change
2023-03-27 08:16:40 <wikibugs> 'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''Marostegui)'
2023-03-27 08:17:27 <wikibugs> ('CR) ''Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: ''Ladsgroup)'
2023-03-27 08:17:29 <wikibugs> ('PS1) ''Marostegui: Revert "backups: Replace db1164 with db1101" [puppet] - ''https://gerrit.wikimedia.org/r/903188'
2023-03-27 08:17:32 <urbanecm> rollouts one more change
2023-03-27 08:17:35 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: ''Urbanecm)'
2023-03-27 08:17:39 <wikibugs> ('CR) ''Marostegui: [C: ''-2] "Wait for the failover date" [puppet] - ''https://gerrit.wikimedia.org/r/903188 (owner: ''Marostegui)'
2023-03-27 08:17:59 <wikibugs> ('Merged) ''jenkins-bot: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: ''Ladsgroup)'
2023-03-27 08:18:18 <wikibugs> ('Merged) ''jenkins-bot: [Growth] eswiki: Enable mentorship for 50% of newcomers [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: ''Urbanecm)'
2023-03-27 08:18:25 <wikibugs> ('PS2) ''Filippo Giunchedi: prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:18:33 <logmsgbot> !log urbanecm@deploy2002 Backport cancelled.
2023-03-27 08:19:45 <wikibugs> ('PS1) ''Marostegui: mariadb: Promote db1164 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123)'
2023-03-27 08:19:58 <wikibugs> ('CR) ''Marostegui: [C: ''-2] "Wait for the failover date" [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: ''Marostegui)'
2023-03-27 08:20:54 <wikibugs> ('CR) ''Jcrespo: [C: ''+1] Revert "backups: Replace db1164 with db1101" [puppet] - ''https://gerrit.wikimedia.org/r/903188 (owner: ''Marostegui)'
2023-03-27 08:20:59 <logmsgbot> !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]]
2023-03-27 08:21:01 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Marostegui)'
2023-03-27 08:21:05 <stashbot> T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
2023-03-27 08:23:23 <wikibugs> ('PS1) ''Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - ''https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:24:59 <wikibugs> ('PS1) ''Filippo Giunchedi: graphite: check graphite2004 [puppet] - ''https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:25:23 <wikibugs> ('CR) ''Marostegui: [C: ''-2] mariadb: Promote db1164 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: ''Marostegui)'
2023-03-27 08:25:47 <logmsgbot> !log urbanecm@deploy2002 Synchronized wmf-config/InitialiseSettings.php: 63dd23b5ceaba35c8d9682493dd21d99a20fc8f7: [Growth] eswiki: Enable mentorship for 50% of newcomers (T332737, T285235) (duration: 06m 09s)
2023-03-27 08:25:54 <stashbot> T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235
2023-03-27 08:25:54 <stashbot> T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737
2023-03-27 08:26:40 <wikibugs> ('PS1) ''Filippo Giunchedi: statsd: move writes to graphite2004 [puppet] - ''https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:26:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 08:28:10 <wikibugs> ('PS1) ''Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - ''https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:28:14 <jynus> !log restarting bacula at backup1001 T331510
2023-03-27 08:28:19 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 08:28:20 <stashbot> T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
2023-03-27 08:30:09 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''Ladsgroup)'
2023-03-27 08:30:29 <logmsgbot> !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
2023-03-27 08:30:34 <stashbot> T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
2023-03-27 08:31:25 <wikibugs> ('PS1) ''Filippo Giunchedi: Failover statsd to graphite2004 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 08:31:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 08:32:29 <wikibugs> ('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40332/console"; (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
2023-03-27 08:32:31 <wikibugs> ('CR) ''JMeybohm: [C: ''+1] "SGTM" [puppet] - ''https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: ''Elukey)'
2023-03-27 08:34:48 <icinga-wm> RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
2023-03-27 08:35:43 <wikibugs> ('CR) ''Elukey: [C: ''+1] k8s: Force to be explicit about k8s and calico versions [puppet] - ''https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 08:36:04 <wikibugs> ('CR) ''Elukey: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - ''https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 08:36:11 <wikibugs> ('PS2) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
2023-03-27 08:36:44 <wikibugs> ('CR) ''Jelto: [V: ''+1 C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
2023-03-27 08:36:45 <jinxer-wm> (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-03-27 08:38:42 <wikibugs> ('CR) ''CI reject: [V: ''-1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
2023-03-27 08:39:15 <logmsgbot> !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] (duration: 18m 15s)
2023-03-27 08:39:22 <stashbot> T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
2023-03-27 08:40:24 <wikibugs> ('CR) ''Elukey: [C: ''+1] k8s: Remove 1.16 related code (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 08:40:53 <wikibugs> ('CR) ''Elukey: [C: ''+2] role::kafka::jumbo::broker: enable PKI migration settings [puppet] - ''https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: ''Elukey)'
2023-03-27 08:43:54 <wikibugs> 'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''fgiunchedi)'
2023-03-27 08:45:18 <wikibugs> ('PS2) ''Hashar: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - ''https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068)'
2023-03-27 08:46:46 <wikibugs> ('CR) ''Clément Goubert: [C: ''+1] prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: ''Filippo Giunchedi)'
2023-03-27 08:47:02 <logmsgbot> !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
2023-03-27 08:50:14 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+2] prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: ''Filippo Giunchedi)'
2023-03-27 08:51:17 <jinxer-wm> (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 08:52:14 <godog> hah that was me, false alarm
2023-03-27 08:52:37 <godog> prometheus1005 was also depooled, I've repooled it now
2023-03-27 08:53:39 <wikibugs> ('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40333/console"; (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
2023-03-27 08:55:02 <logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 08:56:17 <jinxer-wm> (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 08:57:02 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''fgiunchedi)'
2023-03-27 08:57:19 <logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
2023-03-27 08:58:24 <logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
2023-03-27 08:58:24 <logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 09:00:45 <wikibugs> ('PS1) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120)'
2023-03-27 09:02:18 <wikibugs> ('CR) ''Jelto: [V: ''+1] "looks mostly good, one question in-line" [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
2023-03-27 09:02:52 <icinga-wm> RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 09:03:06 <icinga-wm> RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 09:03:10 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+1] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:03:33 <wikibugs> ('CR) ''Clément Goubert: [C: ''+2] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:04:08 <wikibugs> 'SRE, ''Commons, ''Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (''Aklapper)'
2023-03-27 09:06:20 <wikibugs> ('PS1) ''Clément Goubert: Revert "mw-api-int: Add records" [dns] - ''https://gerrit.wikimedia.org/r/903190'
2023-03-27 09:08:25 <wikibugs> ('CR) ''Clément Goubert: [C: ''+2] Revert "mw-api-int: Add records" [dns] - ''https://gerrit.wikimedia.org/r/903190 (owner: ''Clément Goubert)'
2023-03-27 09:12:17 <jinxer-wm> (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 09:12:59 <wikibugs> ('PS1) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)'
2023-03-27 09:13:20 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''ayounsi) I pondered multiple options for the Netbox `server_bgp` custom field, feedback from ServiceOps welcome ba...'
2023-03-27 09:15:23 <wikibugs> ('PS2) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)'
2023-03-27 09:17:17 <jinxer-wm> (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 09:17:18 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+1] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:17:20 <wikibugs> ('CR) ''Thiemo Kreuz (WMDE): [C: ''+1] mediawiki: Reduce the frequency of flaggedrevs updates (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: ''Ladsgroup)'
2023-03-27 09:18:04 <wikibugs> ('CR) ''Clément Goubert: [C: ''+2] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:24:56 <wikibugs> ('PS3) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
2023-03-27 09:25:43 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1] "This change is ready for review." [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:27:00 <wikibugs> ('CR) ''CI reject: [V: ''-1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
2023-03-27 09:33:54 <wikibugs> ('PS4) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
2023-03-27 09:39:55 <logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 09:40:10 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+1] "LGTM, optional nits inline." [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:40:59 <wikibugs> ('PS1) ''Jbond: Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)'
2023-03-27 09:41:05 <logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 09:43:37 <wikibugs> ('PS2) ''Jbond: Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)'
2023-03-27 09:44:49 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45295
2023-03-27 09:45:25 <wikibugs> ('CR) ''Jbond: [C: ''+2] Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135) (owner: ''Jbond)'
2023-03-27 09:45:41 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45295
2023-03-27 09:46:49 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1] service_catalog: Add mw-api-int k8s service (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:47:07 <wikibugs> ('PS2) ''Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)'
2023-03-27 09:47:09 <wikibugs> ('CR) ''Effie Mouzeli: [C: ''+1] P:kubernetes::node: Use performance governor [puppet] - ''https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: ''Clément Goubert)'
2023-03-27 09:47:13 <logmsgbot> !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
2023-03-27 09:47:26 <logmsgbot> !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
2023-03-27 09:50:02 <wikibugs> ('PS3) ''Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)'
2023-03-27 09:50:53 <wikibugs> ('CR) ''Clément Goubert: service_catalog: Add mw-api-int k8s service (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:51:57 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40336/console" [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
2023-03-27 09:54:32 <wikibugs> ('PS7) ''Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
2023-03-27 09:54:34 <wikibugs> ('PS3) ''Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - ''https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
2023-03-27 09:57:46 <wikibugs> ('PS2) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
2023-03-27 09:58:10 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 09:59:29 <wikibugs> ('CR) ''LSobanski: [C: ''-1] "The change has not been confirmed yet so let's not jump the gun on this." [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 09:59:32 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V: ''+1 C: ''+2] mediawiki::errorpage: rationalize usage [puppet] - ''https://gerrit.wikimedia.org/r/902446 (owner: ''Giuseppe Lavagetto)'
2023-03-27 10:00:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
2023-03-27 10:02:10 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''hnowlan)'
2023-03-27 10:02:30 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''ArielGlenn)'
2023-03-27 10:02:49 <wikibugs> ('CR) ''Jelto: monitoring/alerting: globally replace serviceops-collab with sre-collab (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:03:13 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''hnowlan)'
2023-03-27 10:03:49 <wikibugs> ('PS3) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
2023-03-27 10:03:51 <Emperor> !log depool ms-fe2009
2023-03-27 10:03:54 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 10:04:17 <wikibugs> ('CR) ''Jbond: [C: ''+2] "thanks" [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
2023-03-27 10:05:30 <wikibugs> ('Merged) ''jenkins-bot: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
2023-03-27 10:05:33 <wikibugs> ('CR) ''Jbond: [C: ''+2] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert (''1 comment) [alerts] - ''https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
2023-03-27 10:06:12 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (''Clement_Goubert)'
2023-03-27 10:06:24 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 4 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Clement_Goubert) ''Open''In progress p:''Triage''Medium a:''Clement_Goubert'
2023-03-27 10:06:44 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/resource: Add disk space [alerts] - ''https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 10:06:54 <wikibugs> ('CR) ''Jbond: [C: ''+2] team-sre/resource: Add disk space [alerts] - ''https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 10:07:10 <wikibugs> ('CR) ''Filippo Giunchedi: "Ben, does this look good to you? thanks!" [alerts] - ''https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: ''AOkoth)'
2023-03-27 10:08:32 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] "Untested but LGTM, thank you Daniel" [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:08:44 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] releases: remove Icinga monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
2023-03-27 10:09:15 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:09:43 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:10:09 <elukey> !log dist-upgrade kafka-main1003 manually to bullseye - T332013
2023-03-27 10:10:14 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 10:10:15 <stashbot> T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013
2023-03-27 10:13:21 <wikibugs> ('CR) ''Filippo Giunchedi: "LGTM modulo alert name" [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 10:15:18 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 10:15:34 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:17:04 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+1] releases: remove Icinga monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
2023-03-27 10:17:52 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:20:11 <wikibugs> ('PS1) ''Jbond: cinga: drop nfraison from ACL's [puppet] - ''https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135)'
2023-03-27 10:20:17 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V: ''+1 C: ''+2] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - ''https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
2023-03-27 10:21:16 <wikibugs> ('CR) ''JMeybohm: k8s: Force docker storage-driver to overlay2 (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 10:21:26 <wikibugs> ('CR) ''Jbond: [C: ''+2] cinga: drop nfraison from ACL's [puppet] - ''https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135) (owner: ''Jbond)'
2023-03-27 10:21:30 <icinga-wm> PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 10:22:16 <wikibugs> ('CR) ''JMeybohm: [C: ''+1] "PCC (expected to fail on alert) https://puppet-compiler.wmflabs.org/output/902318/40337/" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 10:22:17 <jinxer-wm> (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
2023-03-27 10:22:30 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 10:24:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 10:24:44 <wikibugs> 'SRE, ''Infrastructure Security, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 10:24:49 <jinxer-wm> (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
2023-03-27 10:24:57 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 10:25:31 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 10:27:08 <_joe_> jouncebot: next
2023-03-27 10:27:09 <jouncebot> In 2 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
2023-03-27 10:27:15 <_joe_> jouncebot: now
2023-03-27 10:27:15 <jouncebot> For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
2023-03-27 10:27:17 <jinxer-wm> (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
2023-03-27 10:27:28 <_joe_> elukey: this sounds promising ^^
2023-03-27 10:27:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 10:28:00 <logmsgbot> !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
2023-03-27 10:28:02 <elukey> yep all recovered :)
2023-03-27 10:28:39 <logmsgbot> !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
2023-03-27 10:28:49 <jinxer-wm> (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
2023-03-27 10:29:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 10:29:40 <jinxer-wm> (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2023-03-27 10:30:55 <logmsgbot> !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
2023-03-27 10:31:20 <logmsgbot> !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
2023-03-27 10:32:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 10:33:49 <jinxer-wm> (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
2023-03-27 10:34:39 <jinxer-wm> (NodeTextfileStale) resolved: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2023-03-27 10:34:49 <jinxer-wm> (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
2023-03-27 10:35:50 <wikibugs> ('PS4) ''EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245)'
2023-03-27 10:36:17 <jinxer-wm> (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 10:39:22 <elukey> this is due to the roll restart --^
2023-03-27 10:39:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 10:41:17 <jinxer-wm> (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 10:41:17 <wikibugs> 'SRE-tools, ''Infrastructure-Foundations, ''Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (''SLyngshede-WMF) We're missing a "dry_run" for services and puppet, but Puppet doesn't need it as the decorator also checks for _remote_hosts.'
2023-03-27 10:41:28 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+2] mesh.configuration: add support for custom error pages [deployment-charts] - ''https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
2023-03-27 10:42:45 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 10:43:18 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond) @Dzahn can you take care of password store'
2023-03-27 10:44:58 <jinxer-wm> (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 10:45:17 <wikibugs> ('PS4) ''Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - ''https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)'
2023-03-27 10:47:02 <wikibugs> ('Merged) ''jenkins-bot: mesh.configuration: add support for custom error pages [deployment-charts] - ''https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
2023-03-27 10:48:05 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''BTullis) I will take care of the HBase/Hadoop permissions and any leftover files.'
2023-03-27 10:48:19 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 10:52:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 10:54:14 <wikibugs> ('CR) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
2023-03-27 10:55:29 <wikibugs> ('PS6) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)'
2023-03-27 10:55:37 <wikibugs> ('PS7) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)'
2023-03-27 10:56:49 <icinga-wm> PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 10:57:03 <icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 10:59:26 <wikibugs> 'SRE-tools, ''Infrastructure-Foundations, ''Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (''SLyngshede-WMF) PuppetMaster Class needs dry_run, this can be done by letting the class inherit from RemoteHostsAdapter. Service class should have a...'
2023-03-27 11:01:10 <wikibugs> ('CR) ''Tacsipacsi: [C: ''+1] "LGTM, thanks!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
2023-03-27 11:02:17 <icinga-wm> RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:02:35 <icinga-wm> RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:03:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 11:04:00 <wikibugs> ('Abandoned) ''Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: ''Samtar)'
2023-03-27 11:06:59 <wikibugs> ('PS2) ''Jbond: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)'
2023-03-27 11:07:07 <wikibugs> ('CR) ''CI reject: [V: ''-1] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:07:09 <wikibugs> ('CR) ''Jbond: team-sre/systemd: add Check systemd state rule (''1 comment) [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:07:11 <wikibugs> ('PS5) ''Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - ''https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)'
2023-03-27 11:07:13 <wikibugs> ('PS3) ''Jbond: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)'
2023-03-27 11:07:28 <wikibugs> ('CR) ''Jbond: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:07:47 <wikibugs> ('CR) ''Jbond: [C: ''+2] team-sre/hardware: Add alert for sel events [alerts] - ''https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: ''Jbond)'
2023-03-27 11:08:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 11:09:54 <wikibugs> ('PS2) ''Jbond: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764)'
2023-03-27 11:10:20 <wikibugs> ('Merged) ''jenkins-bot: team-sre/hardware: Add alert for sel events [alerts] - ''https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: ''Jbond)'
2023-03-27 11:10:35 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''BTullis) I have deleted most of the leftover files and moved the useful ones to my own home directory, but I don't have permission to update the description of this ticket...'
2023-03-27 11:11:01 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''cmooney) Personally I think it's a big conceptual change to introduce a second separate automation-pipeline for th...'
2023-03-27 11:11:27 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 11:13:23 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''cmooney) On the Netbox side I'm happy with the current status, or having it as a dropdown. I think it's good to k...'
2023-03-27 11:13:44 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: ''Kamila Součková)'
2023-03-27 11:15:37 <wikibugs> ('CR) ''Jbond: [C: ''+1] "LGTM cheers" [software/spicerack] - ''https://gerrit.wikimedia.org/r/902460 (owner: ''Volans)'
2023-03-27 11:17:32 <wikibugs> ('PS1) ''Slyngshede: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537)'
2023-03-27 11:19:07 <icinga-wm> PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:19:25 <icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:20:43 <wikibugs> ('CR) ''Jbond: [V: ''+1 C: ''+2] Remove l10nupdate support [puppet] - ''https://gerrit.wikimedia.org/r/896318 (owner: ''Majavah)'
2023-03-27 11:20:58 <jbond> taavi: fyi merging ^^
2023-03-27 11:23:37 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [cookbooks] - ''https://gerrit.wikimedia.org/r/902449 (owner: ''Volans)'
2023-03-27 11:24:40 <wikibugs> 'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''Jelto)'
2023-03-27 11:24:45 <icinga-wm> RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:25:03 <icinga-wm> RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-03-27 11:25:37 <icinga-wm> PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.044e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
2023-03-27 11:27:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 11:34:13 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Jelto)'
2023-03-27 11:36:43 <wikibugs> ('CR) ''Volans: [C: ''+1] "LGTM, thanks for the addition" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 11:38:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 11:38:55 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:39:17 <jinxer-wm> (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 11:39:21 <volans> jbond: did you run the logout cookbook? it seems to affect some puppet runs ^^^
2023-03-27 11:43:29 <wikibugs> ('CR) ''Filippo Giunchedi: "LGTM, modulo Ben's vote" [puppet] - ''https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: ''Cathal Mooney)'
2023-03-27 11:44:17 <jinxer-wm> (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
2023-03-27 11:44:24 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:44:32 <wikibugs> ('CR) ''CI reject: [V: ''-1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:45:48 <godog> fixing ^
2023-03-27 11:46:13 <wikibugs> ('PS3) ''Filippo Giunchedi: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:46:50 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:47:55 <wikibugs> ('Merged) ''jenkins-bot: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 11:48:15 <wikibugs> 'SRE, ''serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (''Clement_Goubert) ''Open→''Resolved'
2023-03-27 11:48:20 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (''Clement_Goubert)'
2023-03-27 11:55:50 <logmsgbot> !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
2023-03-27 11:57:05 <wikibugs> ('PS3) ''Clément Goubert: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - ''https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: ''Ahmon Dancy)'
2023-03-27 12:00:15 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''RobH) Not sure why procurement was added (so it showed up in my notifications) as this user isn't in the acl*procurement review, they are in the acl*sre-team so I...'
2023-03-27 12:00:22 <wikibugs> ('CR) ''Clément Goubert: [C: ''+2] Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - ''https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: ''Ahmon Dancy)'
2023-03-27 12:00:24 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''RobH) @jbond, this task isn't editable by most users (so i cannot remove the invalid project), please remove the procurement project.'
2023-03-27 12:01:06 <wikibugs> ('PS1) ''Slyngshede: Service: Ensure that dry_run is parsed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
2023-03-27 12:06:33 <wikibugs> 'SRE-OnFire, ''SRE-Sprint-Week-Sustainability-March2023, ''Gerrit, ''serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (''hashar) ''Open→''Resolved a:''Clement_Goubert I have finally filled the follow up task: {T333143} Marking this on...'
2023-03-27 12:07:39 <wikibugs> ('CR) ''Jbond: [C: ''-1] "a few nits and i think a bug" [puppet] - ''https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: ''JHathaway)'
2023-03-27 12:08:53 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 12:09:04 <wikibugs> ('CR) ''Slyngshede: [C: ''+2] Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 12:09:11 <wikibugs> ('PS1) ''Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238'
2023-03-27 12:10:02 <wikibugs> ('PS2) ''Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939)'
2023-03-27 12:12:40 <wikibugs> ('Merged) ''jenkins-bot: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 12:13:01 <wikibugs> ('PS1) ''EoghanGaffney: Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)'
2023-03-27 12:13:18 <wikibugs> ('CR) ''CI reject: [V: ''-1] Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 12:13:24 <wikibugs> ('PS2) ''EoghanGaffney: Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)'
2023-03-27 12:13:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 12:15:54 <wikibugs> ('CR) ''JMeybohm: k8s: Remove 1.16 related code (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 12:15:58 <wikibugs> ('CR) ''Jbond: [C: ''+2] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 12:17:09 <wikibugs> ('Merged) ''jenkins-bot: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
2023-03-27 12:17:16 <wikibugs> ('PS1) ''Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192'
2023-03-27 12:17:25 <wikibugs> ('CR) ''CI reject: [V: ''-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
2023-03-27 12:17:28 <wikibugs> ('CR) ''Jbond: [V: ''+2 C: ''+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
2023-03-27 12:17:41 <wikibugs> ('CR) ''CI reject: [V: ''-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
2023-03-27 12:19:10 <wikibugs> ('PS2) ''Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192'
2023-03-27 12:19:52 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40338/console" [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
2023-03-27 12:19:54 <wikibugs> ('CR) ''Jbond: [C: ''+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
2023-03-27 12:20:48 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+1] hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
2023-03-27 12:21:06 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+2] hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
2023-03-27 12:21:44 <wikibugs> ('Merged) ''jenkins-bot: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
2023-03-27 12:22:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 12:23:06 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - ''https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 12:23:42 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] k8s: Force to be explicit about k8s and calico versions [puppet] - ''https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 12:27:03 <wikibugs> ('CR) ''Jbond: "lgtm but im not sure we need this in the service class, the alertmanager instance is already set correctly which from what i see is the on" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 12:32:36 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40339/console" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 12:36:03 <jinxer-wm> (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 12:40:40 <wikibugs> ('PS4) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
2023-03-27 12:41:03 <jinxer-wm> (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 12:42:34 <godog> !log flip alert* to overlay2 - T329939
2023-03-27 12:42:39 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 12:42:40 <stashbot> T329939: alert hosts short of root disk space / docker devicemapper vs overlayfs - https://phabricator.wikimedia.org/T329939
2023-03-27 12:46:36 <jinxer-wm> (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 12:47:01 <jinxer-wm> (SystemdUnitFailed) firing: (5) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 12:49:01 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 12:49:09 <wikibugs> ('CR) ''Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 12:50:17 <wikibugs> ('CR) ''JMeybohm: [C: ''+2] k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 12:50:44 <wikibugs> ('CR) ''JMeybohm: [C: ''+2] k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 12:51:31 <jinxer-wm> (SystemdUnitFailed) firing: (15) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 12:51:32 <jinxer-wm> (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 12:52:56 <wikibugs> ('CR) ''Hashar: [C: ''+1] "Awesome! Feel free to deploy at any time. If Apache2 needs to be restarted that can be done at anytime (the impact is minimal, it is simpl" [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 12:57:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 12:58:50 <wikibugs> ('PS1) ''Btullis: Upgrade the research airflow instance [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193)'
2023-03-27 12:59:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 13:00:05 <jouncebot> RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
2023-03-27 13:00:05 <jouncebot> Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2023-03-27 13:00:14 <taavi> o/ I can deploy
2023-03-27 13:00:20 <Superpes> Hi taavi :)
2023-03-27 13:00:37 <wikibugs> ('CR) ''Btullis: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40340/console" [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
2023-03-27 13:02:17 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
2023-03-27 13:03:08 <wikibugs> ('Merged) ''jenkins-bot: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
2023-03-27 13:03:32 <logmsgbot> !log taavi@deploy2002 Started scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]]
2023-03-27 13:03:38 <stashbot> T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
2023-03-27 13:03:55 <wikibugs> ('CR) ''Hashar: "To clarify: +1 overall, the remarks I have made in the diff comment can be implemented or ruled out later ;)" [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 13:04:14 <wikibugs> ('PS2) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
2023-03-27 13:04:58 <logmsgbot> !log taavi@deploy2002 superpes and taavi: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
2023-03-27 13:05:02 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''WMF-Legal, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''oleksandr_tsyba_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public...'
2023-03-27 13:05:02 <taavi> Superpes: please test
2023-03-27 13:05:06 <Superpes> Looking
2023-03-27 13:05:19 <wikibugs> ('CR) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. (''3 comments) [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 13:05:32 <wikibugs> ('PS3) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
2023-03-27 13:06:07 <Superpes> Looks fine thanks :) taavi
2023-03-27 13:07:26 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek)'
2023-03-27 13:07:56 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek) Cleaned up the tags a bit, apologies @oleksandr_tsyba_WMDE. we have used a wrong template, again'
2023-03-27 13:08:07 <wikibugs> ('CR) ''Jbond: "thanks" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 13:08:40 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek) On that note, I endorse this request on WMDE's end.'
2023-03-27 13:11:31 <jinxer-wm> (SystemdUnitFailed) firing: (19) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 13:12:17 <logmsgbot> !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] (duration: 08m 45s)
2023-03-27 13:12:23 <stashbot> T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
2023-03-27 13:12:29 <taavi> done!
2023-03-27 13:13:44 <Superpes> Thanks taavi (maybe you have to run NamespaceDupes.php) :)
2023-03-27 13:13:51 <taavi> ohhhh right
2023-03-27 13:13:52 <taavi> a sec
2023-03-27 13:14:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 13:16:31 <jinxer-wm> (SystemdUnitFailed) firing: (64) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 13:17:29 <wikibugs> ('CR) ''Volans: [C: ''+1] "LGTM, thanks!" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 13:18:01 <taavi> hm I suspect the script might be broken, it's just printing the same few pagelinks rows over and over again
2023-03-27 13:18:05 <taavi> Amir1: ^ any clues why?
2023-03-27 13:18:18 <wikibugs> ('PS1) ''Elukey: Move kafka-jumbo1001's kafka broker to PKI certs [puppet] - ''https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064)'
2023-03-27 13:19:18 <wikibugs> ('PS1) ''Ssingh: hiera: temporarily removed dns1003 from authdns_servers [puppet] - ''https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 13:19:44 <wikibugs> ('CR) ''Elukey: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40341/console" [puppet] - ''https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: ''Elukey)'
2023-03-27 13:20:23 <taavi> looking at wmf.1 changelog I don't see anything helpful
2023-03-27 13:24:17 <Amir1> sorry I was having lunch, let me check
2023-03-27 13:24:22 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''cmooney) Hoping to kick-start some more discussion around this and try to close this out. I still firmly believe tha...'
2023-03-27 13:24:50 <taavi> basically namespaceDupes seems to not update the WHERE condition when printing the list of pagelinks rows it would need to update
2023-03-27 13:24:54 <Amir1> yeah, namespaceDupes is broken
2023-03-27 13:25:03 <zabe> I got the same issue a week or so ago (sorry, forgot to create a task), but it didn't show up when running with --fix
2023-03-27 13:25:06 <taavi> I don't know if it's pagelinks specific or a wider issue
2023-03-27 13:27:11 <TheresNoTime> oh, I got that in https://phabricator.wikimedia.org/P45894 too, about a week ago
2023-03-27 13:27:16 <Amir1> yeah it's broken, file a task and I'll take a look
2023-03-27 13:27:19 <wikibugs> ('PS1) ''Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 13:27:47 <Amir1> it used to cause data corruption, I'm fine with the current state to be honest
2023-03-27 13:27:56 <elukey> hey folks lemme know when the backport window close (no rush), after that I'll start some maintenance to redis misc clusters
2023-03-27 13:28:02 <elukey> *closes
2023-03-27 13:28:33 <taavi> elukey: we're debugging a maintenance script, might take a while
2023-03-27 13:29:14 <taavi> Amir1: yeah I think I'd prefer leaving some broken rows for now over blindly running with --fix
2023-03-27 13:29:25 <Amir1> I don't think we can fix the issue right now
2023-03-27 13:29:52 <Amir1> let it be, links tables always have some sorta drifts
2023-03-27 13:30:14 <Amir1> my hope would be to do the important fixes and the links one as an argument
2023-03-27 13:30:16 <Amir1> but meh
2023-03-27 13:30:38 <taavi> hm
2023-03-27 13:31:00 <taavi> although this is breaking access to those actual pages
2023-03-27 13:32:14 <taavi> so I don't want to leave that broken either
2023-03-27 13:34:51 <wikibugs> ('CR) ''Btullis: [V: ''+1 C: ''+2] Upgrade the research airflow instance [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
2023-03-27 13:35:23 <logmsgbot> !log fab@deploy2002 Started deploy [airflow-dags/research@d2c115d]: (no justification provided)
2023-03-27 13:35:44 <logmsgbot> !log fab@deploy2002 Finished deploy [airflow-dags/research@d2c115d]: (no justification provided) (duration: 00m 21s)
2023-03-27 13:36:31 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] rbd2backy2: clean up a debugging line [puppet] - ''https://gerrit.wikimedia.org/r/900648 (owner: ''Andrew Bogott)'
2023-03-27 13:37:12 <Amir1> taavi: can you just comment out the links updates in maint script and re-run it?
2023-03-27 13:40:37 <taavi> Amir1: I think I found the issue
2023-03-27 13:41:26 <taavi> https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829293 should have removed the addQuotes() calls from namespaceDupes.php as buildComparison does it for you
2023-03-27 13:41:51 <taavi> patch incoming
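[Editor's note] The diagnosis above (a caller quoting values with addQuotes() before handing them to buildComparison, which quotes them again) can be illustrated with a minimal sketch. This is not MediaWiki code: `add_quotes` and `build_comparison` are hypothetical Python stand-ins, the single-column signature is a simplification, and `pl_title` / `Foo` are placeholder values chosen for illustration only.

```python
def add_quotes(value: str) -> str:
    # Hypothetical quoting helper: wrap the value in single quotes and
    # double any embedded single quotes (standard SQL string escaping).
    return "'" + value.replace("'", "''") + "'"


def build_comparison(op: str, column: str, value: str) -> str:
    # Hypothetical comparison builder that quotes the value itself,
    # so callers are expected to pass the raw, unquoted value.
    return f"{column} {op} {add_quotes(value)}"


# Correct usage: pass the raw value and let the builder quote it once.
correct = build_comparison(">", "pl_title", "Foo")

# Buggy usage: the caller quotes first, so the builder quotes a second time.
double_quoted = build_comparison(">", "pl_title", add_quotes("Foo"))

print(correct)        # pl_title > 'Foo'
print(double_quoted)  # pl_title > '''Foo'''
```

In the double-quoted case the database compares against a string that literally contains quote characters rather than against the intended title. One plausible consequence, consistent with the symptom reported earlier, is that a batched query's continuation condition never advances and the same rows are selected repeatedly.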
2023-03-27 13:45:07 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''ayounsi) Overall I agree it's an improvement to have the parent interfaces defined in Netbox. I lost a bit context o...'
2023-03-27 13:46:14 <taavi> https://gerrit.wikimedia.org/r/c/mediawiki/core/+/903253
2023-03-27 13:46:44 <wikibugs> ('PS1) ''Btullis: Remove stray referece to ariflow db from research instance [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193)'
2023-03-27 13:47:57 <Amir1> thanks for catching it
2023-03-27 13:48:24 <wikibugs> ('CR) ''Btullis: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40342/console" [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
2023-03-27 13:48:50 <wikibugs> ('PS1) ''Majavah: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166)'
2023-03-27 13:49:15 <wikibugs> ('CR) ''Btullis: [V: ''+1 C: ''+2] Remove stray referece to ariflow db from research instance [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
2023-03-27 13:49:21 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by taavi@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: ''Majavah)'
2023-03-27 13:50:37 <icinga-wm> RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 2 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
2023-03-27 13:53:11 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] dumps: properly absent enterprise timers [puppet] - ''https://gerrit.wikimedia.org/r/902833 (owner: ''Majavah)'
2023-03-27 13:55:58 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''JJMC89)'
2023-03-27 13:58:22 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''cmooney) >>! In T296832#8729090, @ayounsi wrote: > I lost a bit context on how it will be done on a day to day basis,...'
2023-03-27 13:58:39 <wikibugs> ('PS1) ''Majavah: hieradata: swap eqiad1 dns server order [puppet] - ''https://gerrit.wikimedia.org/r/903257'
2023-03-27 13:58:41 <wikibugs> ('PS1) ''Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - ''https://gerrit.wikimedia.org/r/903258'
2023-03-27 14:00:21 <wikibugs> ('CR) ''Jelto: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 14:00:58 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] hieradata: swap eqiad1 dns server order [puppet] - ''https://gerrit.wikimedia.org/r/903257 (owner: ''Majavah)'
2023-03-27 14:01:22 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 14:01:25 <wikibugs> ('PS1) ''Majavah: Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - ''https://gerrit.wikimedia.org/r/903259'
2023-03-27 14:02:07 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+2] Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - ''https://gerrit.wikimedia.org/r/902078 (owner: ''Giuseppe Lavagetto)'
2023-03-27 14:04:57 <wikibugs> ('CR) ''Slyngshede: [C: ''+2] Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 14:05:59 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - ''https://gerrit.wikimedia.org/r/903259 (owner: ''Majavah)'
2023-03-27 14:06:20 <wikibugs> ('Merged) ''jenkins-bot: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: ''Majavah)'
2023-03-27 14:06:36 <logmsgbot> !log taavi@deploy2002 Started scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]]
2023-03-27 14:06:43 <stashbot> T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
2023-03-27 14:07:17 <wikibugs> ('Merged) ''jenkins-bot: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - ''https://gerrit.wikimedia.org/r/902078 (owner: ''Giuseppe Lavagetto)'
2023-03-27 14:08:00 <logmsgbot> !log taavi@deploy2002 taavi: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
2023-03-27 14:08:03 <wikibugs> ('PS1) ''Hashar: gerrit: set gitiles clone url to http (Gerrit 3.6.2) [puppet] - ''https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049)'
2023-03-27 14:09:18 <wikibugs> ('Merged) ''jenkins-bot: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
2023-03-27 14:10:57 <elukey> jouncebot: next
2023-03-27 14:10:57 <jouncebot> In 1 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
2023-03-27 14:11:19 <taavi> elukey: give me just a few more minutes please
2023-03-27 14:12:13 <elukey> sure, I was just checking next windows :)
2023-03-27 14:14:39 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:14:39 <logmsgbot> !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
2023-03-27 14:14:51 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:15:04 <logmsgbot> !log taavi@deploy2002 Finished scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] (duration: 08m 27s)
2023-03-27 14:15:09 <stashbot> T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
2023-03-27 14:15:09 <logmsgbot> !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
2023-03-27 14:15:32 <wikibugs> ('CR) ''JHathaway: [C: ''+1] "Thanks for removing this cruft, looks good to me!" [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
2023-03-27 14:16:03 <taavi> !log taavi@mwmaint2002 ~ $ mwscript namespaceDupes.php --wiki=huwiki --fix # T333083
2023-03-27 14:16:07 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 14:16:08 <stashbot> T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
2023-03-27 14:16:09 <taavi> elukey: all done!
2023-03-27 14:16:20 <elukey> nice thanks!
2023-03-27 14:16:31 <jinxer-wm> (SystemdUnitFailed) firing: (71) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 14:16:59 <Superpes> Wow wonderful taavi
2023-03-27 14:17:09 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:17:16 <Superpes> Thanks :)
2023-03-27 14:17:17 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:21:08 <wikibugs> ('PS2) ''Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)'
2023-03-27 14:21:10 <wikibugs> ('PS1) ''Andrew Bogott: Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - ''https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169)'
2023-03-27 14:21:25 <wikibugs> ('PS8) ''Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)'
2023-03-27 14:21:32 <jinxer-wm> (SystemdUnitFailed) firing: (73) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 14:24:26 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) (owner: ''Andrew Bogott)'
2023-03-27 14:24:53 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - ''https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
2023-03-27 14:27:22 <wikibugs> ('PS1) ''EoghanGaffney: Assign insetup role to new aphlict vm [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369)'
2023-03-27 14:27:28 <wikibugs> ('PS1) ''Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - ''https://gerrit.wikimedia.org/r/903265'
2023-03-27 14:27:45 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
2023-03-27 14:28:05 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
2023-03-27 14:28:14 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
2023-03-27 14:28:33 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
2023-03-27 14:28:57 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
2023-03-27 14:29:13 <wikibugs> ('PS1) ''Bking: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675)'
2023-03-27 14:29:14 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
2023-03-27 14:29:23 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
2023-03-27 14:29:39 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
2023-03-27 14:29:45 <wikibugs> ('CR) ''DCausse: [C: ''+1] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
2023-03-27 14:30:04 <wikibugs> ('CR) ''Bking: [C: ''+2] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
2023-03-27 14:30:25 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
2023-03-27 14:33:44 <wikibugs> ('PS2) ''Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - ''https://gerrit.wikimedia.org/r/903265'
2023-03-27 14:34:52 <wikibugs> ('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40344/console" [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
2023-03-27 14:35:45 <wikibugs> ('Merged) ''jenkins-bot: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
2023-03-27 14:38:31 <wikibugs> ('CR) ''Hnowlan: [C: ''+1] changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
2023-03-27 14:39:21 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:39:28 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:40:16 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:40:32 <wikibugs> ('CR) ''Slyngshede: [V: ''+1] "During sprint-week I noticed that we're not collecting Squid access logs from the urldownload servers." [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
2023-03-27 14:40:34 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
2023-03-27 14:40:56 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
2023-03-27 14:41:56 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''MNadrofsky) Approved.'
2023-03-27 14:43:28 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:43:30 <wikibugs> ('PS1) ''Andrew Bogott: Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - ''https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169)'
2023-03-27 14:43:59 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:44:29 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:44:55 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:45:06 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:45:15 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:46:27 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:46:35 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
2023-03-27 14:46:37 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - ''https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
2023-03-27 14:47:41 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
2023-03-27 14:47:55 <logmsgbot> !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
2023-03-27 14:48:05 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
2023-03-27 14:48:15 <logmsgbot> !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
2023-03-27 14:52:34 <logmsgbot> !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host aphlict1002.eqiad.wmnet
2023-03-27 14:52:35 <logmsgbot> !log eoghan@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 14:53:41 <wikibugs> 'SRE-Access-Requests, ''Lift-Wing, ''Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (''isarantopoulos)'
2023-03-27 14:55:03 <logmsgbot> !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
2023-03-27 14:55:07 <wikibugs> 'SRE-Access-Requests, ''Lift-Wing, ''Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (''isarantopoulos)'
2023-03-27 14:56:07 <logmsgbot> !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
2023-03-27 14:56:07 <logmsgbot> !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 14:56:07 <logmsgbot> !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache aphlict1002.eqiad.wmnet on all recursors
2023-03-27 14:56:10 <logmsgbot> !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict1002.eqiad.wmnet on all recursors
2023-03-27 14:57:30 <wikibugs> 'SRE, ''Data-Persistence, ''serviceops, ''Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (''Trizek-WMF)'
2023-03-27 14:57:42 <wikibugs> 'SRE, ''serviceops, ''CommRel-Specialists-Support (Jan-Mar-2023), ''Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (''Trizek-WMF) ''In progress → ''Resolved A post-action document has been created. There is nothing special to highl...'
2023-03-27 14:57:50 <wikibugs> ('CR) ''Jbond: "see inline" [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
2023-03-27 14:58:18 <wikibugs> ('PS4) ''Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - ''https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)'
2023-03-27 15:01:31 <jinxer-wm> (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:05:52 <logmsgbot> !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aphlict1002.eqiad.wmnet
2023-03-27 15:11:31 <jinxer-wm> (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:11:49 <wikibugs> ('PS10) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
2023-03-27 15:11:52 <wikibugs> ('PS20) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
2023-03-27 15:14:02 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
2023-03-27 15:14:14 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
2023-03-27 15:15:17 <wikibugs> ('PS1) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
2023-03-27 15:16:19 <wikibugs> ('PS2) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
2023-03-27 15:16:32 <jinxer-wm> (SystemdUnitFailed) firing: (26) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:16:32 <jinxer-wm> (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:17:00 <logmsgbot> !log elukey@deploy2002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 10s)
2023-03-27 15:17:45 <icinga-wm> PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 15:17:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 15:19:57 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
2023-03-27 15:20:32 <wikibugs> ('CR) ''Vgutierrez: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40345/console" [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: ''Vgutierrez)'
2023-03-27 15:20:39 <logmsgbot> !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
2023-03-27 15:20:44 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
2023-03-27 15:20:48 <logmsgbot> !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
2023-03-27 15:21:32 <jinxer-wm> (SystemdUnitFailed) firing: (53) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:21:32 <jinxer-wm> (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:22:18 <wikibugs> ('PS21) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
2023-03-27 15:22:22 <wikibugs> ('PS3) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
2023-03-27 15:22:25 <wikibugs> ('PS8) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
2023-03-27 15:22:35 <wikibugs> ('PS9) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
2023-03-27 15:22:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 15:23:27 <icinga-wm> PROBLEM - puppet last run on rdb2008 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
2023-03-27 15:24:49 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135 (owner: ''Jbond)'
2023-03-27 15:25:12 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
2023-03-27 15:26:19 <wikibugs> 'SRE-swift-storage, ''Data-Engineering-Planning, ''Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (''Ottomata) > usefulness of cross-DC replication After asking @dcausse, I unde...'
2023-03-27 15:27:16 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''Volans) Looks ok to me too, I'm not sure about all the details involved if we need to patch things like the dns genera...'
2023-03-27 15:29:13 <icinga-wm> PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 15:29:23 <icinga-wm> RECOVERY - puppet last run on rdb2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
2023-03-27 15:29:45 <wikibugs> ('CR) ''BBlack: [C: ''+1] "Seems right to me, for this testing!" [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: ''Vgutierrez)'
2023-03-27 15:30:05 <jouncebot> jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
2023-03-27 15:32:44 <wikibugs> ('PS22) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
2023-03-27 15:32:55 <wikibugs> ('PS10) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
2023-03-27 15:33:00 <wikibugs> ('PS11) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
2023-03-27 15:34:49 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
2023-03-27 15:35:46 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135 (owner: ''Jbond)'
2023-03-27 15:36:36 <wikibugs> 'SRE, ''MediaWiki-extensions-OAuth, ''Performance-Team, ''Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (''Samwalton9)'
2023-03-27 15:42:03 <wikibugs> 'SRE, ''SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (''Ottomata) Hi, just back from vacation too. @FNavas-foundation can you update the task description with exactly what you need access to? Your comment mentions a 'spe...'
2023-03-27 15:44:15 <icinga-wm> RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 15:46:31 <jinxer-wm> (SystemdUnitFailed) firing: (11) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 15:46:51 <wikibugs> ('PS1) ''Ayounsi: Varnish: prefix 403 and 429 with a unique ID [puppet] - ''https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973)'
2023-03-27 15:46:55 <wikibugs> ('PS1) ''Filippo Giunchedi: alertmanager: default to IRC for foundations [puppet] - ''https://gerrit.wikimedia.org/r/903285'
2023-03-27 15:47:02 <godog> jbond: ^
2023-03-27 15:50:26 <wikibugs> ('CR) ''Jelto: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
2023-03-27 15:50:53 <wikibugs> ('Abandoned) ''Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/903228 (owner: ''L10n-bot)'
2023-03-27 15:51:53 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ottomata) I think they need [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Levels | sql_lab role permissions in Superset ]]. Pinging @Milimetr...'
2023-03-27 15:53:35 <wikibugs> ('CR) ''Hnowlan: [C: ''+2] admin: add user kamila [puppet] - ''https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: ''Kamila Součková)'
2023-03-27 15:53:38 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''hnowlan)'
2023-03-27 15:54:04 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Clement_Goubert)'
2023-03-27 15:54:52 <logmsgbot> !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided)
2023-03-27 15:55:03 <logmsgbot> !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided) (duration: 00m 11s)
2023-03-27 15:55:44 <wikibugs> ('PS11) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
2023-03-27 15:56:40 <wikibugs> ('PS4) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
2023-03-27 15:57:13 <wikibugs> ('CR) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (''6 comments) [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
2023-03-27 15:57:56 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
2023-03-27 15:58:10 <wikibugs> ('CR) ''Filippo Giunchedi: [C: ''+2] alertmanager: default to IRC for foundations [puppet] - ''https://gerrit.wikimedia.org/r/903285 (owner: ''Filippo Giunchedi)'
2023-03-27 15:58:14 <wikibugs> ('CR) ''Dzahn: [C: ''+2] planet: Add Wikimedia category of Jan Ainali's blog [puppet] - ''https://gerrit.wikimedia.org/r/902829 (owner: ''Legoktm)'
2023-03-27 15:58:45 <wikibugs> ('CR) ''Dzahn: [C: ''+2] planet: Add Nemo_bis's new blog [puppet] - ''https://gerrit.wikimedia.org/r/902828 (owner: ''Legoktm)'
2023-03-27 15:59:16 <wikibugs> ('PS2) ''Dzahn: planet: Add Wikimedia category of Jan Ainali's blog [puppet] - ''https://gerrit.wikimedia.org/r/902829 (owner: ''Legoktm)'
2023-03-27 16:03:53 <wikibugs> ('PS12) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
2023-03-27 16:04:12 <wikibugs> ('CR) ''Dzahn: [C: ''+2] planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - ''https://gerrit.wikimedia.org/r/902832 (owner: ''Krinkle)'
2023-03-27 16:04:15 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+2] admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - ''https://gerrit.wikimedia.org/r/902066 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:04:22 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+2] changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:06:10 <wikibugs> ('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
2023-03-27 16:06:52 <wikibugs> ('CR) ''Jbond: "for the mypy alerts we need to wait for a spicerack release" [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
2023-03-27 16:08:14 <wikibugs> ('CR) ''Dzahn: [C: ''+1] "looks good, I do want to rename the role to sre_collab, but that will require rebasing one way or another" [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
2023-03-27 16:10:03 <wikibugs> ('Merged) ''jenkins-bot: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - ''https://gerrit.wikimedia.org/r/902066 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:10:05 <wikibugs> ('Merged) ''jenkins-bot: changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:14:06 <wikibugs> ('PS1) ''Alexandros Kosiaris: admin: Grant kserve API group read access to deploy user [deployment-charts] - ''https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174)'
2023-03-27 16:15:05 <wikibugs> ('CR) ''Alexandros Kosiaris: "Luca, Janis, regardless of the outcome of the discussion in the linked task, let me know if this is the preferable way of doing this." [deployment-charts] - ''https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: ''Alexandros Kosiaris)'
2023-03-27 16:20:52 <wikibugs> ('PS1) ''Alexandros Kosiaris: admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298'
2023-03-27 16:25:33 <wikibugs> ('PS1) ''Alexandros Kosiaris: admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299'
2023-03-27 16:26:14 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+2] admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:29:40 <wikibugs> ('PS1) ''Jbond: os-reports: fix yaml data for apt_repo [puppet] - ''https://gerrit.wikimedia.org/r/903301'
2023-03-27 16:30:22 <wikibugs> ('CR) ''Jbond: [V: ''+2 C: ''+2] os-reports: fix yaml data for apt_repo [puppet] - ''https://gerrit.wikimedia.org/r/903301 (owner: ''Jbond)'
2023-03-27 16:31:05 <wikibugs> ('Merged) ''jenkins-bot: admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:31:29 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+2] admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:32:35 <icinga-wm> RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 16:33:28 <logmsgbot> !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
2023-03-27 16:34:14 <logmsgbot> !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
2023-03-27 16:34:20 <logmsgbot> !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
2023-03-27 16:34:31 <jinxer-wm> (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 16:34:35 <logmsgbot> !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
2023-03-27 16:36:05 <wikibugs> ('Merged) ''jenkins-bot: admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299 (owner: ''Alexandros Kosiaris)'
2023-03-27 16:39:09 <icinga-wm> RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
2023-03-27 16:39:42 <logmsgbot> !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
2023-03-27 16:39:58 <logmsgbot> !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
2023-03-27 16:40:03 <akosiaris> hashar: changeprop-jobqueue resource-quotas doubled
2023-03-27 16:40:12 <logmsgbot> !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
2023-03-27 16:40:59 <logmsgbot> !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
2023-03-27 16:43:45 <akosiaris> sigh, pinged the wrong person, sorry Antoine
2023-03-27 16:43:50 <akosiaris> hnowlan: changeprop-jobqueue resource-quotas doubled
2023-03-27 16:44:00 <wikibugs> ('PS1) ''Jbond: idm: remove auto restart for apache-htcacheclean [puppet] - ''https://gerrit.wikimedia.org/r/903302'
2023-03-27 16:44:28 <wikibugs> ('CR) ''Jbond: [V: ''+2 C: ''+2] idm: remove auto restart for apache-htcacheclean [puppet] - ''https://gerrit.wikimedia.org/r/903302 (owner: ''Jbond)'
2023-03-27 16:45:23 <wikibugs> ('PS1) ''Jbond: Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - ''https://gerrit.wikimedia.org/r/903199'
2023-03-27 16:45:29 <wikibugs> ('CR) ''Jbond: [V: ''+2 C: ''+2] Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - ''https://gerrit.wikimedia.org/r/903199 (owner: ''Jbond)'
2023-03-27 16:47:38 <wikibugs> 'ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (''Jhancock.wm) I logged into the ilo and as of now there are no errors on that link. Papaul pointed me to T330218 where he suggested moving the network port from ge-6/0/6 to ge-6/0/1. Since this issue comes back intermittently, i...'
2023-03-27 16:48:13 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''Dzahn) @jbond Is there a process for offboarding that describes how to do it correctly? As basically every single time I am trying to edit pwstore it is blocked by an invalid key...'
2023-03-27 16:49:29 <hnowlan> akosiaris: thanks!
2023-03-27 16:49:57 <icinga-wm> RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-03-27 16:53:59 <icinga-wm> PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
2023-03-27 16:54:23 <icinga-wm> RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
2023-03-27 16:58:29 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''Dzahn) I removed both nfraison and eoghan from the .users file, re-signed it and then re-encrypted all files that I could encrypt, then pushed to repo. This does not change their...'
2023-03-27 16:59:40 <jinxer-wm> (SystemdUnitFailed) firing: (9) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 17:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
2023-03-27 17:00:05 <jouncebot> ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
2023-03-27 17:05:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 17:06:26 <sukhe> er what's this lvs failure, checking
2023-03-27 17:08:25 <sukhe> hmm ran agent manually, resolved. must be transient
2023-03-27 17:15:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 17:19:43 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
2023-03-27 17:19:57 <wikibugs> 'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond) ''In progress''Resolved > @jbond Is there a process for offboarding that describes how to do it correctly? not really the [[ https://wikitech.wikimedia.org/wiki/SRE_Of...'
2023-03-27 17:20:35 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''ayounsi) > Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer? Th...'
2023-03-27 17:23:59 <wikibugs> ('CR) ''Dzahn: "Should this be merged before the upgrade or should it wait until the upgrade?" [puppet] - ''https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: ''Hashar)'
2023-03-27 17:25:26 <wikibugs> ('CR) ''Dzahn: "We already have bullseye doc machines thanks to Andrea's work. We should just switch to those." [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
2023-03-27 17:27:55 <wikibugs> ('CR) ''Dzahn: "just means there will be a lot more rebasing because we keep adding to this. in that case it's easier to abandon it" [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 17:28:04 <wikibugs> ('Abandoned) ''Dzahn: monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 17:31:52 <wikibugs> 'SRE, ''SRE-Access-Requests: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (''ssastry)'
2023-03-27 17:34:59 <wikibugs> ('PS1) ''Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206)'
2023-03-27 17:37:39 <wikibugs> ('CR) ''Slyngshede: [V: ''+1] P:url_downloader send Squid access logs to Logstash (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
2023-03-27 17:41:05 <wikibugs> 'SRE, ''ops-eqiad, ''DBA, ''Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (''Cmjohnson) ''Open''Resolved The DIMM has been replaced, I updated the idrac and bios while it was offline.'
2023-03-27 17:45:13 <wikibugs> ('CR) ''Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 17:56:01 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 17:59:16 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 18:06:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 18:07:35 <wikibugs> ('PS3) ''Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - ''https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477)'
2023-03-27 18:09:33 <wikibugs> ('CR) ''CI reject: [V: ''-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - ''https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 18:11:15 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops, ''cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (''Papaul) We moved the first batch of servers today; all went well.'
2023-03-27 18:15:56 <wikibugs> ('PS1) ''Dzahn: alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587)'
2023-03-27 18:16:26 <wikibugs> ('PS1) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
2023-03-27 18:19:00 <wikibugs> ('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40349/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 18:19:21 <wikibugs> ('PS2) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
2023-03-27 18:20:23 <wikibugs> ('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40350/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 18:20:42 <wikibugs> ('PS3) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
2023-03-27 18:21:47 <wikibugs> ('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40351/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 18:25:05 <wikibugs> ('PS4) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
2023-03-27 18:28:50 <wikibugs> ('CR) ''Dzahn: "It looks ok in compiler, and I can check in devtools, but I don't want to get into follow-ups in deployment-prep and other projects." [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
2023-03-27 18:29:05 <wikibugs> ('CR) ''Dzahn: [C: ''+2] deployment_server: ensure Docker is installed [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
2023-03-27 18:31:27 <wikibugs> ('CR) ''Dzahn: [C: ''+1] "lgtm!" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 18:37:08 <wikibugs> ('PS14) ''Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - ''https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)'
2023-03-27 18:37:18 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops, ''cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (''Papaul) @cmooney second batch proposal below |Host|U space|Existing port|New port| |cloudcephosd2002-de...'
2023-03-27 18:37:51 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''KFrancis) Hello, @oleksandr_tsyba_WMDE, I'll be helping with this request. Would you please send your WMDE email address to kfrancis@wikimedia.org?'
2023-03-27 18:41:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 18:41:42 <wikibugs> ('CR) ''CI reject: [V: ''-1] Refactor and centralize BGPpeer config [deployment-charts] - ''https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
2023-03-27 18:43:47 <wikibugs> ('CR) ''Dzahn: [C: ''+2] "confirmed noop on production deployment servers, deploy1002 and deploy2002 - fails in devtools, mostly expected" [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
2023-03-27 18:45:35 <wikibugs> 'SRE, ''MediaWiki-extensions-OAuth, ''Datacenter-Switchover, ''Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (''larissagaulia)'
2023-03-27 18:45:48 <wikibugs> ('PS1) ''Dzahn: Revert "deployment_server: ensure Docker is installed" [puppet] - ''https://gerrit.wikimedia.org/r/903200'
2023-03-27 18:46:00 <wikibugs> ('CR) ''Dzahn: [C: ''+2] "unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives don't match any configuration option: st" [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
2023-03-27 18:51:02 <wikibugs> ('CR) ''Dzahn: "what fails about them? or rather, what bothered you about them?" [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: ''Jcrespo)'
2023-03-27 18:52:43 <wikibugs> ('PS1) ''Dzahn: Revert "bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs" [puppet] - ''https://gerrit.wikimedia.org/r/903201'
2023-03-27 18:54:42 <wikibugs> ('CR) ''Dzahn: [C: ''+2] "puppet works again on deploy-1004.devtools after reverting so should be fine in deployment-prep as well" [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
2023-03-27 18:56:30 <wikibugs> ('CR) ''Dzahn: "changes to admin groups might require access request tickets, this should be done between clinic duty and serviceops team. I don't have co" [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: ''Subramanya Sastry)'
2023-03-27 18:57:37 <wikibugs> ('CR) ''Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
2023-03-27 18:57:46 <wikibugs> ('CR) ''Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 19:06:56 <wikibugs> ('CR) ''Dzahn: zuul: fix up service enable and ensure (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
2023-03-27 19:07:48 <wikibugs> ('PS1) ''Jdlrobson: Expand list of wikis with language button at top. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777)'
2023-03-27 19:10:10 <wikibugs> ('PS1) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
2023-03-27 19:11:17 <wikibugs> ('CR) ''Dzahn: [C: ''+2] "https://puppet-compiler.wmflabs.org/output/901576/40355/" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
2023-03-27 19:15:00 <wikibugs> ('CR) ''Dzahn: [C: ''+2] "confirmed this changed nothing on all 3 contint* servers. zuul is still running on contint2001, masked on contint1002 and unknown on conti" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
2023-03-27 19:18:29 <wikibugs> ('PS2) ''Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093)'
2023-03-27 19:21:22 <logmsgbot> !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2
2023-03-27 19:21:37 <logmsgbot> !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2 (duration: 00m 14s)
2023-03-27 19:25:29 <wikibugs> ('PS1) ''Superpes15: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326'
2023-03-27 19:26:01 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 19:29:11 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 19:30:03 <wikibugs> 'SRE, ''Traffic, ''HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (''greg) Hi @BCornwall ! I'm just jumping in as an FR-Tech representative. I think I've got the summary here (basically, in the end, Shopify can't meet our hsts header needs which blocks overal...'
2023-03-27 19:40:47 <wikibugs> ('PS1) ''Dzahn: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - ''https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868)'
2023-03-27 19:43:25 <wikibugs> ('PS2) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
2023-03-27 19:44:08 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) @larissagaulia Thank you for adding the information. Does "until July" mean "until last day of June"? I uploaded a code change above that is now in review. Access requests...'
2023-03-27 19:45:13 <wikibugs> ('CR) ''Ahmon Dancy: Revert "deployment_server: ensure Docker is installed" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
2023-03-27 19:45:32 <wikibugs> ('PS3) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
2023-03-27 19:46:20 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) ''Open''In progress p:''Triage''Medium'
2023-03-27 19:46:31 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) a:''Ladsgroup'
2023-03-27 19:49:11 <wikibugs> ('CR) ''Dzahn: "this may have caused that you can't include the docker class on new hosts anymore without a puppet error. : https://gerrit.wikimedia.org/r" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 19:52:02 <wikibugs> ('CR) ''Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 19:56:01 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 19:59:11 <jinxer-wm> (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:00:04 <jouncebot> RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2000).
2023-03-27 20:00:04 <jouncebot> jdlrobson and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2023-03-27 20:00:31 <Jdlrobson> here
2023-03-27 20:00:47 <kindrobot> I can deploy
2023-03-27 20:00:54 <Superpes> Hi :)
2023-03-27 20:01:11 <kindrobot> Jdlrobson: is it safe to deploy your two together?
2023-03-27 20:01:51 <Jdlrobson> kindrobot: yep
2023-03-27 20:01:55 <kindrobot> !log start UTC late backport window
2023-03-27 20:01:58 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 20:02:56 <wikibugs> ('PS1) ''Ahmon Dancy: k8s: Use storage-driver instead of storage_driver [puppet] - ''https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803)'
2023-03-27 20:04:58 <kindrobot> Jdlrobson: what's modern-manpage?
2023-03-27 20:05:04 <wikibugs> ('CR) ''Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
2023-03-27 20:05:53 <kindrobot> Oh, mainpage. I feel silly
2023-03-27 20:06:43 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: ''Jdlrobson)'
2023-03-27 20:06:46 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
2023-03-27 20:07:31 <wikibugs> ('Merged) ''jenkins-bot: Expand list of wikis with language button at top. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: ''Jdlrobson)'
2023-03-27 20:08:50 <wikibugs> ('PS3) ''Stef Dunlap: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
2023-03-27 20:09:03 <wikibugs> ('CR) ''TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
2023-03-27 20:09:47 <wikibugs> ('Merged) ''jenkins-bot: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
2023-03-27 20:10:02 <logmsgbot> !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]]
2023-03-27 20:10:09 <stashbot> T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
2023-03-27 20:10:10 <stashbot> T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
2023-03-27 20:11:29 <logmsgbot> !log kindrobot@deploy2002 jdlrobson and kindrobot: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
2023-03-27 20:12:08 <kindrobot> Jdlrobson: ready to check
2023-03-27 20:13:18 <Jdlrobson> looking :)
2023-03-27 20:14:01 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1017.eqiad.wmnet
2023-03-27 20:14:11 <jinxer-wm> (SystemdUnitFailed) firing: (13) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:14:33 <Jdlrobson> kindrobot: LGTM!
2023-03-27 20:15:12 <kindrobot> Great, syncing.
2023-03-27 20:15:53 <kindrobot> Superpes is it safe to deploy your two patches together?
2023-03-27 20:16:01 <jinxer-wm> (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:16:44 <Superpes> Yep no issue :) kindrobot
2023-03-27 20:17:12 <wikibugs> ('PS1) ''Andrew Bogott: Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - ''https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169)'
2023-03-27 20:18:27 <wikibugs> ('CR) ''Andrew Bogott: [C: ''+2] Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - ''https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
2023-03-27 20:19:11 <jinxer-wm> (SystemdUnitFailed) firing: (15) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:19:23 <wikibugs> ('CR) ''Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: ''Subramanya Sastry)'
2023-03-27 20:20:52 <logmsgbot> !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] (duration: 10m 50s)
2023-03-27 20:21:01 <jinxer-wm> (SystemdUnitFailed) firing: (28) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:21:02 <stashbot> T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
2023-03-27 20:21:02 <stashbot> T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
2023-03-27 20:21:25 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 20:22:36 <kindrobot> Amir1: should we be worried about these systemd units failing before proceeding with the backports?
2023-03-27 20:23:20 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:24:11 <jinxer-wm> (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:25:01 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:25:01 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 20:25:02 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1017.eqiad.wmnet
2023-03-27 20:25:20 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1021.eqiad.wmnet
2023-03-27 20:26:01 <jinxer-wm> (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:27:15 <Amir1> kindrobot: which one is it?
2023-03-27 20:27:37 <Jdlrobson> thanks kindrobot ! looking good on production!
2023-03-27 20:27:54 <Amir1> the alert2001 one, it should be fine for now
2023-03-27 20:28:36 <kindrobot> (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 | (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100
2023-03-27 20:28:57 <taavi> it's a new alert, I suspect it's actually been failing for longer
2023-03-27 20:29:07 <Amir1> there are way too many systemd unit failures, sigh https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:29:19 <Amir1> it's been ongoing for a while, it seems
2023-03-27 20:29:44 <Amir1> cwhite: maybe you know what's going on? especially on alert2001
2023-03-27 20:30:00 <kindrobot> So would you advise continuing with the backports?
2023-03-27 20:31:04 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 20:31:28 <taavi> kindrobot: I'd ignore the alerts and continue
2023-03-27 20:31:40 <kindrobot> OK, thank you. :)
2023-03-27 20:31:46 <Amir1> yeah, it's not related for sure
2023-03-27 20:32:51 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] Assign insetup role to new aphlict vm [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
2023-03-27 20:33:16 <wikibugs> ('PS4) ''Stef Dunlap: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
2023-03-27 20:33:16 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:34:31 <cwhite> Amir1: thanks for the heads up, I'll look into the auto restart failure
2023-03-27 20:35:13 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:35:13 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 20:35:14 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1021.eqiad.wmnet
2023-03-27 20:35:36 <wikibugs> ('PS2) ''Stef Dunlap: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
2023-03-27 20:35:42 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
2023-03-27 20:35:44 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
2023-03-27 20:36:10 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1022.eqiad.wmnet
2023-03-27 20:37:32 <wikibugs> ('Merged) ''jenkins-bot: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
2023-03-27 20:41:26 <wikibugs> ('CR) ''JMeybohm: [C: ''+1] "Very true. Sorry for causing trouble!" [puppet] - ''https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: ''Ahmon Dancy)'
2023-03-27 20:41:49 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.dns.netbox
2023-03-27 20:43:50 <logmsgbot> !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:45:04 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
2023-03-27 20:45:04 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-03-27 20:45:05 <logmsgbot> !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1022.eqiad.wmnet
2023-03-27 20:45:19 <wikibugs> 'ops-eqiad, ''cloud-services-team, ''decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T333169 (''Andrew) a:''Andrew''Jclark-ctr'
2023-03-27 20:46:41 <wikibugs> ('PS1) ''Jgreen: payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - ''https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892)'
2023-03-27 20:50:29 <Amir1> cwhite: it might be related but we are getting systemd unit fail on db1101 but the alert doesn't make sense
2023-03-27 20:50:47 <Amir1> as it's really not failing
2023-03-27 20:51:03 <Amir1> (maybe? I'll check)
2023-03-27 20:51:24 <cwhite> From the auto-restart timer?
2023-03-27 20:51:25 <jinxer-wm> (SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 20:52:08 <wikibugs> 'SRE, ''Traffic, ''HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (''BCornwall) Hey, @greg. It's not blocking overall improvement, it's just not [[ https://wikitech.wikimedia.org/wiki/HTTPS#Current_policies_and_standards | complying with standards ]]. Since s...'
2023-03-27 20:52:36 <wikibugs> ('CR) ''Jgreen: [C: ''+2] payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - ''https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892) (owner: ''Jgreen)'
2023-03-27 20:55:34 <cwhite> Amir1: db1101 is not an s7 host anymore?
2023-03-27 20:55:54 <Amir1> probably Manuel moved it but he is not around
2023-03-27 20:56:10 <Amir1> I think he said he reset the systemd timer
2023-03-27 20:57:42 <icinga-wm> PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
2023-03-27 20:57:44 <icinga-wm> PROBLEM - Host restbase1033 is DOWN: PING CRITICAL - Packet loss = 100%
2023-03-27 20:57:56 <cwhite> I'm guessing there are auto-restart timers lingering that aren't being cleaned up by puppet.
2023-03-27 20:57:58 <icinga-wm> PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1,
2023-03-27 20:57:58 <icinga-wm> th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
2023-03-27 20:59:36 <marostegui> cwhite: it's not a S7 no
2023-03-27 20:59:40 <marostegui> it's in M1
2023-03-27 21:00:03 <marostegui> I disabled both systemd units
2023-03-27 21:00:05 <jouncebot> Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100).
2023-03-27 21:00:42 <kindrobot> Note: the backport deploy window is still in progress
2023-03-27 21:01:06 <jinxer-wm> (SystemdUnitFailed) firing: (4) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:01:12 <kindrobot> taavi: it seems like it's stalled out. It's cleared CI, but it hasn't merged
2023-03-27 21:02:25 <kindrobot> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/903326/
2023-03-27 21:02:50 <icinga-wm> RECOVERY - Host restbase1033 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
2023-03-27 21:03:46 <icinga-wm> PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-03-27 21:04:38 <icinga-wm> PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-03-27 21:07:22 <kindrobot> dancy: ^
2023-03-27 21:11:02 <tzatziki> !log moving Universal Code of Conduct/Enforcement guidelines -> Universal Code of Conduct/Enforcement guidelines/Version 1 on metawiki with `extensions/Translate/scripts/moveTranslatableBundle.php `
2023-03-27 21:11:05 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:11:17 <taavi> kindrobot: somehow the +2 was applied to PS1 while PS2 was the latest
2023-03-27 21:11:19 <tzatziki> (probably don't need to log that but just in case)
2023-03-27 21:11:25 <Superpes> Uh it doesn't want to merge it
2023-03-27 21:11:51 <Superpes> Oh
2023-03-27 21:12:17 <thcipriani> hrm
2023-03-27 21:12:33 <taavi> so just re-+2 it and probably file a bug in scap
2023-03-27 21:12:39 <kindrobot> It should probably be OK to scap backport again, eh?
2023-03-27 21:12:42 <kindrobot> OK.
2023-03-27 21:12:43 <wikibugs> ('CR) ''Andrea Denisse: [V: ''+1 C: ''+2] doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
2023-03-27 21:12:48 <thcipriani> +1
2023-03-27 21:13:10 <kindrobot> Thank you all.
2023-03-27 21:13:17 <wikibugs> ('CR) ''TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
2023-03-27 21:14:03 <jinxer-wm> (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 21:14:12 <wikibugs> ('Merged) ''jenkins-bot: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
2023-03-27 21:14:25 <logmsgbot> !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]]
2023-03-27 21:14:31 <stashbot> T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
2023-03-27 21:14:54 <logmsgbot> !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided)
2023-03-27 21:15:08 <logmsgbot> !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided) (duration: 00m 13s)
2023-03-27 21:15:49 <logmsgbot> !log kindrobot@deploy2002 kindrobot and superpes: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
2023-03-27 21:16:23 <kindrobot> Ready to check Superpes
2023-03-27 21:17:02 <Superpes> Checked both and everything is fine kindrobot! Thanks! :)
2023-03-27 21:17:22 <kindrobot> Thanks, syncing
2023-03-27 21:18:28 <icinga-wm> RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
2023-03-27 21:18:47 <wikibugs> ('CR) ''Dduvall: buildkitd: Isolate build container user/process/network namespaces (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
2023-03-27 21:19:03 <jinxer-wm> (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 21:22:31 <wikibugs> ('CR) ''Dduvall: buildkitd: Isolate build container user/process/network namespaces (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
2023-03-27 21:22:51 <logmsgbot> !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] (duration: 08m 26s)
2023-03-27 21:22:57 <stashbot> T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
2023-03-27 21:23:40 <kindrobot> Sync finished. Thanks everyone.
2023-03-27 21:23:52 <kindrobot> !log finish UTC late backports
2023-03-27 21:23:56 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:24:11 <jinxer-wm> (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:24:16 <jinxer-wm> (SystemdUnitFailed) firing: (30) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:24:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 21:24:30 <icinga-wm> RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
2023-03-27 21:24:56 <Amir1> !log start of watchlist clean up in arwiki (T328501)
2023-03-27 21:24:59 <kindrobot> Reedy, sbassett, Maryum, and manfredi backport window finished :)
2023-03-27 21:25:00 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:25:01 <stashbot> T328501: Request to clean my watchlist from articles in namespace 0 and 1 - https://phabricator.wikimedia.org/T328501
2023-03-27 21:25:15 <Superpes> Thanks for your time kindrobot :D
2023-03-27 21:25:53 <kindrobot> No problem, thank you. :)
2023-03-27 21:26:16 <jinxer-wm> (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:27:38 <icinga-wm> PROBLEM - Restbase root url on restbase1033 is CRITICAL: connect to address 10.64.48.71 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
2023-03-27 21:29:11 <jinxer-wm> (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:29:24 <icinga-wm> PROBLEM - SSH on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-03-27 21:30:56 <icinga-wm> PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-03-27 21:35:17 <wikibugs> 'SRE, ''vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (''andrea.denisse)'
2023-03-27 21:37:59 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ladsgroup) superset should be automatically done via wmf ldap group. If Jgiannelos is in the ldap group, it should be done already. Correct?'
2023-03-27 21:39:35 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''Ladsgroup) a:''Ladsgroup I'm on clinic duty this week. Waiting for signoff by Tyler. Maybe a deployment training can be arranged (or other devs in wmde can do an i...'
2023-03-27 21:40:02 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''Ladsgroup) https://wikitech.wikimedia.org/wiki/Deployments/Training'
2023-03-27 21:42:50 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''taavi) > To be able to access deployed Wiki instances and ensure that wikibase (namely wikibase client) is properly installed and configured Unless you're also planni...'
2023-03-27 21:43:53 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ottomata) Superset has its own 'roles', and I think something changed in a recent version that makes it so the default role doesn't have access to the SQL lab feat...'
2023-03-27 21:45:34 <ryankemper> !log T330165 Depooled relevant search platform hosts: `sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'`
2023-03-27 21:45:39 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:45:40 <stashbot> T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
2023-03-27 21:49:11 <jinxer-wm> (SystemdUnitFailed) firing: (13) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:56:01 <jinxer-wm> (SystemdUnitFailed) firing: (40) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 21:58:13 <urandom> !log power cycling restbase1033 — T333243
2023-03-27 21:58:17 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:58:18 <stashbot> T333243: restbase1033 is down - https://phabricator.wikimedia.org/T333243
2023-03-27 21:58:41 <maryum> !log Deploy security fix for T326952
2023-03-27 21:58:45 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 21:59:11 <jinxer-wm> (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 22:01:08 <icinga-wm> PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:01:16 <icinga-wm> PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:01:16 <icinga-wm> PROBLEM - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:01:34 <icinga-wm> RECOVERY - SSH on restbase1033 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-03-27 22:01:38 <icinga-wm> RECOVERY - Restbase root url on restbase1033 is OK: HTTP OK: HTTP/1.1 200 - 17255 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/RESTBase
2023-03-27 22:02:08 <icinga-wm> PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:02:24 <icinga-wm> PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:02:24 <icinga-wm> PROBLEM - cassandra-a service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:04:04 <icinga-wm> RECOVERY - cassandra-b service on restbase1033 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:04:18 <icinga-wm> RECOVERY - cassandra-c service on restbase1033 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:04:18 <icinga-wm> RECOVERY - cassandra-a service on restbase1033 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2023-03-27 22:04:46 <wikibugs> ('PS2) ''EoghanGaffney: Adds php and apache logs for doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245)'
2023-03-27 22:04:56 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 22:05:18 <jinxer-wm> (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2023-03-27 22:06:16 <wikibugs> ('CR) ''EoghanGaffney: Adds php and apache logs for doc machines (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
2023-03-27 22:06:54 <icinga-wm> RECOVERY - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-b valid until 2024-08-28 11:43:21 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:07:00 <icinga-wm> RECOVERY - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-c valid until 2024-08-28 11:43:23 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:07:01 <icinga-wm> RECOVERY - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-a valid until 2024-08-28 11:43:18 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
2023-03-27 22:07:02 <icinga-wm> RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886
2023-03-27 22:07:04 <icinga-wm> RECOVERY - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.152 port 9042 https://phabricator.wikimedia.org/T93886
2023-03-27 22:07:22 <icinga-wm> RECOVERY - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.153 port 9042 https://phabricator.wikimedia.org/T93886
2023-03-27 22:09:50 <wikibugs> 'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Volans) >>! In T330165#8731601, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/YxgIJY...'
2023-03-27 22:10:17 <jinxer-wm> (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2023-03-27 22:14:43 <wikibugs> ('CR) ''Dzahn: "It's not true that this removes IRC notifications, they were just sent to a test channel only. I am fixing that here: https://gerrit.wikim" [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 22:16:08 <zabe> !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Meta:WMF Support and Safety" "Meta:WMF Trust and Safety" "Zabe" --reason "per [[:phab:T330514|T330514]]" # T330514
2023-03-27 22:16:13 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 22:16:15 <stashbot> T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
2023-03-27 22:17:18 <wikibugs> ('CR) ''Dzahn: [C: ''+2] alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
2023-03-27 22:18:06 <jinxer-wm> (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 287.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
2023-03-27 22:19:47 <wikibugs> ('PS1) ''Zabe: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514)'
2023-03-27 22:21:42 <zabe> MessageIndexException from line 191 of /srv/mediawiki/php-1.41.0-wmf.1/extensions/Translate/utils/MessageIndex.php: MessageIndex: unable to acquire lock
2023-03-27 22:21:46 <zabe> :|
2023-03-27 22:22:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 22:22:27 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''fundraising-tech-ops, ''netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (''Dwisehaupt) Thanks! Verified working and runs good.'
2023-03-27 22:22:54 <herzog> zabe: I don't think we need to backport that to the current wmf branch, we can let the train do it when the time comes?
2023-03-27 22:23:34 <herzog> oh, well, you're moving the meta pages now - I wanted to wait a bit
2023-03-27 22:24:05 <herzog> well, we get this done now, good :)
2023-03-27 22:24:15 <zabe> is there anything specific you wanted to wait for?
2023-03-27 22:24:55 <mutante> runs puppet on bast1003 because that alert claims puppet fails on bastion "cluster" but also I don't get the graph :)
2023-03-27 22:27:02 <mutante> and nothing actually failed there.. so no idea
2023-03-27 22:28:08 <mutante> ah, it's bast5003 pushing things over the limit and the usual background ones https://puppetboard.wikimedia.org/nodes?status=failed
2023-03-27 22:29:15 <herzog> zabe: my idea was train -> watch for failures -> rename; but since you are backporting it now, I guess there's no need to wait :)
2023-03-27 22:30:54 <zabe> well :)
2023-03-27 22:31:10 <zabe> jouncebot: nowandnext
2023-03-27 22:31:10 <jouncebot> For the next 0 hour(s) and 28 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100)
2023-03-27 22:31:10 <jouncebot> In 3 hour(s) and 28 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200)
2023-03-27 22:31:29 <wikibugs> ('CR) ''Zabe: [C: ''+2] Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: ''Zabe)'
2023-03-27 22:42:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 22:43:32 <mutante> !log apt2001 - kill 3105; run puppet
2023-03-27 22:43:36 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 22:43:45 <mutante> !log stat1004 - kill 29291; run puppet
2023-03-27 22:43:48 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 22:46:51 <wikibugs> (Merged) jenkins-bot: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: Zabe)
2023-03-27 22:47:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 22:47:20 <logmsgbot> !log zabe@deploy2002 Started scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]]
2023-03-27 22:47:26 <stashbot> T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
2023-03-27 22:48:34 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 22:48:48 <mutante> !log stat1005 - kill 18179; run puppet ; stat1007 - kill 3346; run puppet ; stat1006 - kill 23887 run puppet
2023-03-27 22:48:52 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 22:52:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: (6) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 22:57:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 23:00:01 <logmsgbot> !log zabe@deploy2002 zabe: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
2023-03-27 23:00:11 <stashbot> T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
2023-03-27 23:02:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: (9) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 23:02:11 <wikibugs> SRE, Infrastructure Security, Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (Dzahn) We got the "widespread puppet failures" alert which made me look at some random failed hosts in the list. I found the reason was this offboarding, because: apt2001: ` Err...
2023-03-27 23:02:58 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 23:03:34 <jinxer-wm> (SystemdUnitFailed) firing: (15) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 23:07:58 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-03-27 23:08:34 <jinxer-wm> (SystemdUnitFailed) firing: (22) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 23:08:48 <logmsgbot> !log zabe@deploy2002 Finished scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] (duration: 21m 27s)
2023-03-27 23:08:54 <stashbot> T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
2023-03-27 23:09:26 <jinxer-wm> (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 23:10:11 <wikibugs> (CR) Dzahn: [C: +2] peopleweb: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
2023-03-27 23:13:34 <jinxer-wm> (SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 23:15:26 <wikibugs> (CR) Dzahn: [C: +2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 it's also not true anymore that this removes IRC notifications. they sho" [puppet] - https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
2023-03-27 23:17:29 <wikibugs> SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
2023-03-27 23:18:15 <wikibugs> (CR) Dzahn: [C: +2] etherpad: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
2023-03-27 23:21:49 <wikibugs> SRE, ops-eqiad, Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (wiki_willy) a:Jclark-ctr
2023-03-27 23:22:06 <jinxer-wm> (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
2023-03-27 23:22:35 <wikibugs> SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (wiki_willy) a:Jclark-ctr
2023-03-27 23:24:23 <wikibugs> SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (wiki_willy) a:Papaul
2023-03-27 23:24:26 <jinxer-wm> (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2023-03-27 23:24:39 <wikibugs> SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Htriedman) @MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as `analytics-platform-eng` on stat machines by using `sudo -u analytics-platform-eng <cmd>...` and am b...
2023-03-27 23:25:24 <wikibugs> ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (wiki_willy) a:Jhancock.wm
2023-03-27 23:29:25 <wikibugs> SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (wiki_willy) Hi guys - can we confirm the firmware is all up to date? Thanks, Willy
2023-03-27 23:31:12 <zabe> !log deployed patch for T330968
2023-03-27 23:31:16 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 23:33:34 <jinxer-wm> (SystemdUnitFailed) firing: (58) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 23:38:34 <jinxer-wm> (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-03-27 23:42:22 <wikibugs> SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Dzahn) Hi @Htriedman and @MoritzMuehlenhoff, the answer to this riddle is that while the special user "`analytics-platform-eng`" exists on all stat* machines, the admin group `analytics-platform-eng-admin...
2023-03-27 23:44:30 <wikibugs> (CR) Dzahn: [C: +2] "working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*people.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_inpu" [puppet] - https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
2023-03-27 23:44:42 <wikibugs> SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Ottomata) > @Htriedman I think this comes down to a new access request like "add analytics-platform-eng-admins on stat* hosts". Or ssh to an-airflow1004 and run your sudo cmd there :)
2023-03-27 23:47:08 <mutante> !log people1003 - taking down apache to provoke monitoring alert (inactive instances) and confirm IRC alerting change works
2023-03-27 23:47:11 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-03-27 23:50:06 <mutante> jinxer-wm: jinx it
2023-03-27 23:50:55 <jinxer-wm> (ProbeDown) firing: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 23:51:02 <icinga-wm> PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
2023-03-27 23:51:23 <mutante> oh, well, that worked but the Icinga part isnt gone
2023-03-27 23:51:31 <mutante> it was supposed to replace that
2023-03-27 23:52:42 <icinga-wm> RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
2023-03-27 23:55:50 <jinxer-wm> (ProbeDown) resolved: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-03-27 23:59:07 <wikibugs> (CR) Dzahn: [C: +2] "confirmed this reports on IRC on both channels and also created a ticket, as desired" [puppet] - https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
