tappof (Tiziano Fogli)
User

User Details

User Since
Jul 23 2024, 9:16 AM (44 w, 17 h)
Availability
Available
IRC Nick
tappof
LDAP User
Tiziano Fogli
MediaWiki User
Tiziano Fogli [ Global Accounts ]

Recent Activity

Yesterday

tappof updated the task description for T369122: On-call batphone escalation configuration holidays FY2024/25.
Tue, May 27, 9:33 AM · SRE Observability (FY2024/2025-Q4)
tappof added a comment to T369122: On-call batphone escalation configuration holidays FY2024/25.

SRE business-hours escalation reset to: escalate to americas/emea; if unacked for 5 minutes, escalate to batphone.

Tue, May 27, 9:31 AM · SRE Observability (FY2024/2025-Q4)
tappof added a comment to T369122: On-call batphone escalation configuration holidays FY2024/25.

overrides in case they are reset with batphone change

image.png (895×1 px, 145 KB)

image.png (593×1 px, 95 KB)
Tue, May 27, 9:28 AM · SRE Observability (FY2024/2025-Q4)

Mon, May 26

tappof added a comment to T395130: Migrate prometheus7001 to prometheus7002.

Thank you @andrea.denisse for your feedback.
I've just updated the task description; apologies for the previous version, which was drafted and pushed in a hurry.
Regarding the 15-minute delay, I took that number from here, and it seems to represent the maximum time for a Ganeti-to-Netbox sync.

Mon, May 26, 1:28 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
tappof updated the task description for T395130: Migrate prometheus7001 to prometheus7002.
Mon, May 26, 1:17 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
tappof added a comment to T387231: missing pdu infos for magru.

Ok, thank you @RobH. I’ll add some Pint directives to silence alerts for missing metrics in the DCs that don’t have Pro4X PDUs installed, and then I’ll go ahead and merge the patch today.

Mon, May 26, 8:07 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Fri, May 23

tappof added a comment to T395130: Migrate prometheus7001 to prometheus7002.
# add prometheus7002 to manifests/site.pp
node /^prometheus[3456]00[1-9]\.(esams|ulsfo|eqsin|drmrs)\./ {
    role(prometheus::pop)
}
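One way to sanity-check which hostnames a node pattern of this shape selects is to replay it outside Puppet. The sketch below uses Python's `re` module only as a stand-in for Puppet's regex matching (the pattern uses syntax common to both), and the hostnames are hypothetical. As quoted, the character class `[3456]` does not include `7`, so a hostname beginning with `prometheus7002` would not be selected whatever its site.

```python
import re

# Pattern copied from the node definition quoted above; Python's `re` is
# used here only as a stand-in for Puppet's regex matching.
NODE_PATTERN = re.compile(r"^prometheus[3456]00[1-9]\.(esams|ulsfo|eqsin|drmrs)\.")


def is_selected(fqdn: str) -> bool:
    """Return True if the node definition would apply to this hostname."""
    return NODE_PATTERN.search(fqdn) is not None


# Hypothetical hostnames, for illustration only.
for fqdn in ("prometheus3001.esams.wmnet", "prometheus7002.esams.wmnet"):
    print(fqdn, is_selected(fqdn))
```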
Fri, May 23, 3:02 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
tappof created T395130: Migrate prometheus7001 to prometheus7002.
Fri, May 23, 2:36 PM · SRE Observability (FY2024/2025-Q4), Observability-Metrics
tappof added a comment to T387231: missing pdu infos for magru.

Sure @RobH, nothing will catch fire because of this patch (or maybe it will, since we’re talking about electric current) :) Thanks for taking a look.

Fri, May 23, 9:27 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Thu, May 22

tappof added a comment to T387231: missing pdu infos for magru.

Hi @wiki_willy,
based on the breaker alerts currently configured for the Sentry4 model, I’ve set up the same alert for the Pro4X model directly in Prometheus/Alertmanager.
I’m looking for someone to review the patch on Gerrit — ideally with an eye on the electrical aspects of the setup, as I lack domain-specific knowledge in that area.
Thank you

Thu, May 22, 1:14 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Mon, May 19

tappof added a comment to T389937: ircecho (icinga-wm) was stuck on alert1002.

The issue happened again.

May 15 11:54:57 alert1002 ircecho[1319087]: Error writing: %sDropping this message: "%s"
May 15 11:54:57 alert1002 ircecho[1319087]: Exception in thread Thread-11:
May 15 11:54:57 alert1002 ircecho[1319087]: Traceback (most recent call last):
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/local/bin/ircecho", line 187, in process_IN_MODIFY
May 15 11:54:57 alert1002 ircecho[1319087]:     bot.connection.privmsg(chans, out)
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/irc/client.py", line 854, in privmsg
May 15 11:54:57 alert1002 ircecho[1319087]:     self.send_raw("PRIVMSG %s :%s" % (target, text))
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/irc/client.py", line 884, in send_raw
May 15 11:54:57 alert1002 ircecho[1319087]:     raise ServerNotConnectedError("Not connected.")
May 15 11:54:57 alert1002 ircecho[1319087]: irc.client.ServerNotConnectedError: Not connected.
May 15 11:54:57 alert1002 ircecho[1319087]: During handling of the above exception, another exception occurred:
May 15 11:54:57 alert1002 ircecho[1319087]: Traceback (most recent call last):
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
May 15 11:54:57 alert1002 ircecho[1319087]:     self.run()
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/local/bin/ircecho", line 50, in run
May 15 11:54:57 alert1002 ircecho[1319087]:     self.notifier.loop()
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/pyinotify.py", line 1376, in loop
May 15 11:54:57 alert1002 ircecho[1319087]:     self.process_events()
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/pyinotify.py", line 1275, in process_events
May 15 11:54:57 alert1002 ircecho[1319087]:     self._default_proc_fun(revent)
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/pyinotify.py", line 910, in __call__
May 15 11:54:57 alert1002 ircecho[1319087]:     return _ProcessEvent.__call__(self, event)
May 15 11:54:57 alert1002 ircecho[1319087]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/lib/python3/dist-packages/pyinotify.py", line 630, in __call__
May 15 11:54:57 alert1002 ircecho[1319087]:     return meth(event)
May 15 11:54:57 alert1002 ircecho[1319087]:            ^^^^^^^^^^^
May 15 11:54:57 alert1002 ircecho[1319087]:   File "/usr/local/bin/ircecho", line 190, in process_IN_MODIFY
May 15 11:54:57 alert1002 ircecho[1319087]:     print('Error writing: %s'
May 15 11:54:57 alert1002 ircecho[1319087]: TypeError: unsupported operand type(s) for %: 'NoneType' and 'tuple'
May 15 11:56:04 alert1002 ircecho[1319087]: Connected
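The final `TypeError` is consistent with a Python 2 style `print` statement that was turned into a function call without moving the `%` operands inside the parentheses: `print(...)` returns `None`, and the format tuple is then applied to that `None`. The first log line, where the `%s` placeholders appear unexpanded, points the same way. The sketch below reproduces that pattern and one possible fix; the function names and arguments are assumptions, since the ircecho source itself is not quoted here.

```python
def report_buggy(err, msg):
    # print() runs first and returns None; "%" is then applied to None,
    # raising TypeError after the raw template has already been printed.
    print('Error writing: %s'
          'Dropping this message: "%s"') % (err, msg)


def format_report(err, msg):
    # Format first, so the caller prints a finished string.
    return 'Error writing: %s; dropping this message: "%s"' % (err, msg)


try:
    report_buggy("Not connected", "hello")
except TypeError as exc:
    print("buggy version raised:", exc)

print(format_report("Not connected", "hello"))
```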
Mon, May 19, 10:24 AM · Observability-Alerting

Mon, May 12

tappof closed T387866: Configure PDU monitoring resources through NetBox as Resolved.
Mon, May 12, 5:46 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof closed T387866: Configure PDU monitoring resources through NetBox, a subtask of T375166: Port PDU checks to Prometheus/Alertmanager, as Resolved.
Mon, May 12, 5:46 PM · Observability-Alerting
tappof added a comment to T393894: New version of Grafana makes it not possible to remove option in long list of values.

Above the list of all the wikis and their respective 'x', you'll see a list with checkboxes to select or deselect each entry. Did you try using that?

Mon, May 12, 1:47 PM · SRE Observability (FY2024/2025-Q4), Grafana

Tue, May 6

tappof added a comment to T387231: missing pdu infos for magru.

I merged the patch for T387866: Configure PDU monitoring resources through NetBox; the PDUs located in drmrs are now split by rack.

Tue, May 6, 12:36 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Mon, May 5

tappof updated the task description for T393366: Regression in RAID10 software RAID with 6.1.135.
Mon, May 5, 1:05 PM · observability, collaboration-services, Infrastructure-Foundations, SRE

Fri, May 2

tappof edited projects for T393097: Frequent filter timeouts in superset UI, added: Data-Platform-SRE; removed Data-Platform, SRE.
Fri, May 2, 2:47 PM · Data-Platform-SRE, superset.wikimedia.org
tappof moved T393140: Update SSH key for apine from In Discussion to Awaiting User Input on the SRE-Access-Requests board.
Fri, May 2, 8:31 AM · SRE, SRE-Access-Requests
tappof added a comment to T393140: Update SSH key for apine.

Hi @cmassaro,
It looks like the "Requested group membership" field is missing from your form.
Could you please let us know which group(s) you need to be added to?
Thanks!

Fri, May 2, 8:31 AM · SRE, SRE-Access-Requests
tappof changed the status of T393140: Update SSH key for apine from Open to In Progress.
Fri, May 2, 8:13 AM · SRE, SRE-Access-Requests
tappof moved T393140: Update SSH key for apine from Untriaged to In Discussion on the SRE-Access-Requests board.
Fri, May 2, 8:12 AM · SRE, SRE-Access-Requests
tappof updated the task description for T393140: Update SSH key for apine.
Fri, May 2, 7:51 AM · SRE, SRE-Access-Requests
tappof changed the status of T393066: Requesting access to <Superset> for <SCampos-WMF> from Open to In Progress.

@Ospingou, could you please sign off on the access request? Thank you!

Fri, May 2, 7:41 AM · Data-Platform-SRE (2025.05.02 - 2025.05.23), Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T393066: Requesting access to <Superset> for <SCampos-WMF>.
Fri, May 2, 7:39 AM · Data-Platform-SRE (2025.05.02 - 2025.05.23), Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T393066: Requesting access to <Superset> for <SCampos-WMF>.
Fri, May 2, 7:34 AM · Data-Platform-SRE (2025.05.02 - 2025.05.23), Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof added a project to T393066: Requesting access to <Superset> for <SCampos-WMF>: Data-Engineering.
Fri, May 2, 7:33 AM · Data-Platform-SRE (2025.05.02 - 2025.05.23), Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof closed T392893: Requesting access to analytics-privatedata-users for madalina as Resolved.
Fri, May 2, 7:27 AM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests

Wed, Apr 30

tappof added a comment to T387231: missing pdu infos for magru.

@wiki_willy, please take a look at T387866: Configure PDU monitoring resources through NetBox. This will change how the row label is set and will also fix the splitting issue with the drmrs PDUs.

Wed, Apr 30, 1:32 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof added a comment to T392886: Revisit default Istio histogram buckets.

The only concern I have with dropping metrics based on a given label is that the _sum and _count values will no longer reflect the actual buckets stored in the TSDB.
It might not matter for us, but it's something to keep in mind.

Wed, Apr 30, 1:22 PM · SRE Observability (FY2024/2025-Q4), Patch-For-Review, Observability-Metrics
tappof updated subscribers of T387866: Configure PDU monitoring resources through NetBox.

Merging the patch https://gerrit.wikimedia.org/r/1135022 will result in the following change to the Prometheus configs:

Wed, Apr 30, 1:18 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof moved T392893: Requesting access to analytics-privatedata-users for madalina from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Wed, Apr 30, 8:56 AM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Wed, Apr 30, 8:56 AM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof added a comment to T392893: Requesting access to analytics-privatedata-users for madalina.

Hi @Madalina,
While I was checking, I noticed that you've already been added to the group.

Wed, Apr 30, 8:54 AM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests

Tue, Apr 29

tappof added a project to T392893: Requesting access to analytics-privatedata-users for madalina: Data-Engineering.
Tue, Apr 29, 3:54 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Tue, Apr 29, 3:53 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Tue, Apr 29, 3:50 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Tue, Apr 29, 3:37 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Tue, Apr 29, 3:37 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof updated the task description for T392893: Requesting access to analytics-privatedata-users for madalina.
Tue, Apr 29, 2:55 PM · Data-Engineering-Radar, Data-Engineering, SRE, SRE-Access-Requests
tappof added a comment to T387231: missing pdu infos for magru.

Actually, they're defined in Puppet like this:

Tue, Apr 29, 12:37 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Mon, Apr 28

tappof added a comment to T387231: missing pdu infos for magru.

@wiki_willy, I was able to split the PDUs in a 'per row' manner. If you're looking at a PoP, this is equivalent to 'per rack'. Only in the two main DCs is there a real distinction between rack and row.
Otherwise, I can split them 'per instance', meaning one graph per PDU.
Let me know what you prefer.

Mon, Apr 28, 3:14 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof added a comment to T387231: missing pdu infos for magru.

Hey @wiki_willy, thanks for the feedback! I'll take a look at your request and let you know.

Mon, Apr 28, 8:23 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Apr 18 2025

tappof closed T381665: module to define custom Prometheus alerts directly in Puppet, a subtask of T370153: Move kafka-mirror Prometheus-based alerts from Icinga to alerts.git, as Resolved.
Apr 18 2025, 2:40 PM · Observability-Alerting, Patch-For-Review
tappof closed T381665: module to define custom Prometheus alerts directly in Puppet as Resolved.
Apr 18 2025, 2:40 PM · SRE Observability (FY2024/2025-Q4), Observability-Alerting, Patch-For-Review
tappof added a comment to T387231: missing pdu infos for magru.

@wiki_willy I've just finished updating the dashboard to include the information scraped from Magru's PDU.
Please have a look and let me know if anything doesn't look right.

Apr 18 2025, 2:30 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Apr 16 2025

tappof added a comment to T387350: liftwing SLO performance issues.

Yes, although for the purposes of e.g. the citoid latency SLO, we need just a few labels: source_workload_namespace, destination_canonical_service, le, and site, along with the typical instance, job, etc.

So my thinking is, for example, to copy istio_request_duration_milliseconds_bucket to istio_request_duration_milliseconds_bucket_stripped and apply a labelkeep filter to retain only what we absolutely need.

Losing data shouldn't be an issue since it'd be a second metric. The original istio_request_duration_milliseconds_bucket would be left as-is.

Apr 16 2025, 2:45 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Apr 9 2025

tappof added a comment to T390676: Alert in need of triage: ProbeDown (instance ripe-atlas-codfw:0).

You're free to close this task, since all the checks have been migrated to the corresponding HTTP version, as per T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance, and no alerts have been triggered.
Thank you.

Apr 9 2025, 1:04 PM · Infrastructure-Foundations, sre-alert-triage

Apr 8 2025

tappof added a comment to T390166: Make it possible for some graphs with Navigation timing metrics to have a timespan of 1 year .
07:53:41 phedenskog │ Great, I'll do the alerts later today and send them for review. Thank you! For T390166 let us chat about that first before you start
                    │ so I can show you what we use today with the old setup.
08:50:17            │ [tappof back: gone 13:46:16]
09:04:52     tappof │ hey good morning
09:06:02     tappof │ If you'd like to chat about what you'd like to speed up in your dashboard, I'm available
10:33:42 phedenskog │ Hello! Ok. So I think there's two uses cases for speeding up queries. One is if it's possible to make a dashboard like
                    │ https://grafana.wikimedia.org/d/rum/real-user-monitoring?orgId=1 faster? There's so many different tags there that you can change (and
                    │ many different graphs). Do you think that is doable?
10:37:08 phedenskog │ The other way would be to choose a couple of metrics and tags and make it possible to look at data over long time. We used dashboards
                    │ like https://grafana.wikimedia.org/d/000000143/navtiming-overall-history?orgId=1 where we look at a metric and can see trends over
                    │ year(s).
10:37:35 phedenskog │ However as it is now I think the most important is the alerts and that we can look back at data one month and that work now.
10:55:49     tappof │ So, the widgets in this one (https://grafana-rw.wikimedia.org/d/000000143/navtiming-overall-history?orgId=1) still need to be
                    │ converted to the corresponding Prometheus expressions. I think we can come back to it once they've been migrated. Are you ok with
                    │ that?
11:01:49 phedenskog │ Yes I don't think we should do anything now and I think we will drop viewing of data like that. I think since it's not decided
                    │ organisational who/which team is responsible for those dashboards, it makes sense to not spend time on making them/speeding up data.
                    │ If  later on it is decided, then that person/team can decide what to do.
11:03:24 phedenskog │ For some of the old Graphite dashboards: I plan not to remove them, but save them with a timespan so that we can use them to see back
                    │ in time, and then I'll change the automatic message your team added about the deprecation. Is that ok? So we keep that dashboard as
                    │ long as we still have Graphite as read only.
11:05:17     tappof │ Yeah, I think it's okay. I'll also talk to the team about adding a tag to every dashboard that needs to survive the sunsetting, due to
                    │ historical data consumption.
11:11:45     tappof │ About this one: https://grafana-rw.wikimedia.org/d/rum/real-user-monitoring?orgId=1&forceLogin&editPanel=25 — it might be challenging
                    │ since you're applying a lot of filters. I'll take a look. First of all, I think we'll need to remove the timespan setting, since it's
                    │ hardcoded into the recording rule.
11:11:54     tappof │ Instead, to smooth the curves a bit, we should use an avg_over_time. I just checked, and even when pre-filtering with recording rules
                    │ only on country, continent, browser, platform, and group, we still end up with around 11k series... still too many.
11:14:48     tappof │ I'll think about that and how to rearrange the dashboard. It might be necessary to split it into multiple dashboards.
11:14:56     tappof │ A question: are you used to querying this kind of data on a 'per-country' basis?
11:24:24     tappof │ Filtering only by geo_continent, mw_skin, and ua_family will reduce the cardinality to 1,800 series. While that's still a lot, it’s
                    │ much more manageable and acceptable.
11:43:46 phedenskog │ No, the per country is not used today I think. The idea was that potentially when we had focus countries for example India, we could
                    │ have alerts to that country. What we used it in the past, is that we oversampled some countries and then looked at that data. Today we
                    │ take 1 request out of 100 and beacon back that data, but we have the functionality to send for example 100% of data for users that
                    │ access from India. But we don't
11:43:46 phedenskog │ use that today (since there's no team that work with it).
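The series counts quoted in the log (around 11k versus 1,800) depend on how many distinct label combinations survive a recording rule. A minimal sketch of that counting follows, using made-up label values rather than real NavigationTiming data; the label names `geo_continent`, `mw_skin` and `ua_family` come from the log, while `geo_country` and every value are assumptions.

```python
def series_count(series, keep):
    """Count distinct label combinations after aggregating down to `keep`."""
    return len({tuple(labels[name] for name in keep) for labels in series})


# Made-up sample of raw series label sets.
raw = [
    {"geo_continent": "Europe", "geo_country": "IT", "mw_skin": "vector", "ua_family": "Firefox"},
    {"geo_continent": "Europe", "geo_country": "SE", "mw_skin": "vector", "ua_family": "Firefox"},
    {"geo_continent": "Europe", "geo_country": "SE", "mw_skin": "minerva", "ua_family": "Chrome"},
    {"geo_continent": "Asia", "geo_country": "IN", "mw_skin": "minerva", "ua_family": "Chrome"},
]

all_labels = ("geo_continent", "geo_country", "mw_skin", "ua_family")
reduced = ("geo_continent", "mw_skin", "ua_family")

print(series_count(raw, all_labels))  # 4
print(series_count(raw, reduced))     # 3
```

Dropping a label from the grouping can only keep the count the same or lower it, which is why removing the per-country dimension helps.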
Apr 8 2025, 11:34 AM · NavigationTiming

Apr 7 2025

tappof added a comment to T354908: evaluate and migrate in-use parsoid metrics to statslib.

The 'init', 'total', and 'timePerInputKB' metrics haven't received data since 27/10/2023.

Apr 7 2025, 2:39 PM · MW-1.43-notes (1.43.0-wmf.15; 2024-07-23), Content-Transform-Team, OKR-Work, Content-Transform-Team-WIP, MediaWiki-Platform-Team (Radar), Observability-Metrics
tappof updated the task description for T354908: evaluate and migrate in-use parsoid metrics to statslib.
Apr 7 2025, 2:32 PM · MW-1.43-notes (1.43.0-wmf.15; 2024-07-23), Content-Transform-Team, OKR-Work, Content-Transform-Team-WIP, MediaWiki-Platform-Team (Radar), Observability-Metrics
tappof updated the task description for T354908: evaluate and migrate in-use parsoid metrics to statslib.
Apr 7 2025, 2:31 PM · MW-1.43-notes (1.43.0-wmf.15; 2024-07-23), Content-Transform-Team, OKR-Work, Content-Transform-Team-WIP, MediaWiki-Platform-Team (Radar), Observability-Metrics
tappof added a comment to T390672: Create recording rules for Authentication graphs.

I’ve just finished updating the dashboard: all widgets that previously relied on high-cardinality metrics are now based on the corresponding recording rules.

Apr 7 2025, 1:48 PM · MediaWiki-Platform-Team, Observability-Metrics
tappof closed T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance, a subtask of T381561: deployment site for prometheus::blackbox::check::(icmp|http|tcp) is not driven by the $site parameter, as Resolved.
Apr 7 2025, 10:16 AM · SRE Observability (FY2024/2025-Q3), Patch-For-Review, Observability-Alerting
tappof closed T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance as Resolved.
Apr 7 2025, 10:16 AM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting

Mar 27 2025

tappof closed T359381: Migrate MediaWiki.loginnotify.* to statslib, a subtask of T350592: EPIC: migrate in use metrics and dashboards to statslib, as Resolved.
Mar 27 2025, 2:54 PM · SRE Observability (FY2024/2025-Q4), MW-1.43-notes (1.43.0-wmf.21; 2024-09-03), Epic, MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics
tappof closed T359381: Migrate MediaWiki.loginnotify.* to statslib as Resolved.

dashboard migrated
https://grafana.wikimedia.org/goto/thS_QhTNg?orgId=1

Mar 27 2025, 2:54 PM · MW-1.44-notes (1.44.0-wmf.22; 2025-03-25), Community-Tech, MediaWiki-extensions-LoginNotify, Observability-Metrics

Mar 25 2025

tappof claimed T359359: Migrate AbuseFilter Extension to statslib.
Mar 25 2025, 1:46 PM · Trust and Safety Product Team, AbuseFilter, Observability-Metrics
tappof placed T359484: Migrate MediaWiki.ExternalGuidance to statslib up for grabs.
Mar 25 2025, 1:41 PM · LPL Essential (LPL Essential 2024 Jul-Oct), ExternalGuidance, Observability-Metrics
tappof claimed T359484: Migrate MediaWiki.ExternalGuidance to statslib.
Mar 25 2025, 1:38 PM · LPL Essential (LPL Essential 2024 Jul-Oct), ExternalGuidance, Observability-Metrics
tappof created T389934: Removal of PDUs from Netbox-Hiera network devices.
Mar 25 2025, 10:42 AM · Infrastructure-Foundations

Mar 19 2025

tappof added a comment to T387350: liftwing SLO performance issues.

Opened an issue on the pyrra-dev/pyrra github repository:

Mar 19 2025, 10:06 AM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Mar 18 2025

tappof claimed T359381: Migrate MediaWiki.loginnotify.* to statslib.
Mar 18 2025, 7:36 AM · MW-1.44-notes (1.44.0-wmf.22; 2025-03-25), Community-Tech, MediaWiki-extensions-LoginNotify, Observability-Metrics

Mar 17 2025

tappof closed T388680: Icinga check_curl plugin is broken on bullseye and bookworm hosts as Resolved.

It looks like the patch has fixed the problem.

Mar 17 2025, 1:57 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Traffic, SRE

Mar 13 2025

tappof added a comment to T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance.

I was implementing the checks as decided. Before proceeding, I manually ran a telnet from the proxies and found that they are unable to reach the anchors.

Mar 13 2025, 3:53 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting
tappof created T388801: Missing atlas-esams device in netbox.
Mar 13 2025, 3:10 PM · Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting
tappof added a comment to T387350: liftwing SLO performance issues.

Wouldn't stripping down labels without aggregating data lead to losing data?
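The concern can be made concrete. If labels are removed from each series individually, series that differed only in the removed labels end up with identical label sets and can no longer be told apart, whereas aggregating over the removed labels (for example by summing) keeps one well-defined value per remaining label set. A rough sketch with invented sample values: `destination_canonical_service` and `le` come from the thread, `pod` is a hypothetical extra label, and how Prometheus itself handles such collisions is not modelled here.

```python
from collections import defaultdict

KEEP = ("destination_canonical_service", "le")

# Invented bucket counters that differ only in a label that would be removed.
samples = [
    ({"destination_canonical_service": "citoid", "le": "100", "pod": "a"}, 40.0),
    ({"destination_canonical_service": "citoid", "le": "100", "pod": "b"}, 25.0),
]


def strip(labels):
    """Reduce a label set to the labels listed in KEEP."""
    return tuple((name, labels[name]) for name in KEEP)


# Stripping only: both samples map to the same key, so a plain mapping
# ends up holding just the last one.
stripped_only = {strip(labels): value for labels, value in samples}

# Stripping plus aggregation: colliding samples are summed instead.
aggregated = defaultdict(float)
for labels, value in samples:
    aggregated[strip(labels)] += value

print(stripped_only)
print(dict(aggregated))
```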

Mar 13 2025, 10:39 AM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Mar 12 2025

tappof placed T359385: Migrate MediaWiki.arclamp to statslib up for grabs.
Mar 12 2025, 9:31 AM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Mar 11 2025

tappof added a comment to T359385: Migrate MediaWiki.arclamp to statslib.

https://grafana.wikimedia.org/d/yVf-D1RWk/arc-lamp?orgId=1

Mar 11 2025, 3:21 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof claimed T359385: Migrate MediaWiki.arclamp to statslib.
Mar 11 2025, 3:14 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof reopened T359388: Migrate MediaWiki.Parsoid.wt2html.* to statslib, a subtask of T350592: EPIC: migrate in use metrics and dashboards to statslib, as Open.
Mar 11 2025, 1:44 PM · SRE Observability (FY2024/2025-Q4), MW-1.43-notes (1.43.0-wmf.21; 2024-09-03), Epic, MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics
tappof reopened T359388: Migrate MediaWiki.Parsoid.wt2html.* to statslib as "Open".

I'm reopening the task just to wait for feedback on the dashboard review.

Mar 11 2025, 1:44 PM · OKR-Work, Content-Transform-Team-WIP, Content-Transform-Team, Observability-Metrics
tappof added a comment to T359388: Migrate MediaWiki.Parsoid.wt2html.* to statslib.

This is a proposal for migrating the wt2html dashboard: https://grafana.wikimedia.org/goto/aPt4onhNR?orgId=1
Some widgets have (???) in their titles and need review.

Mar 11 2025, 1:42 PM · OKR-Work, Content-Transform-Team-WIP, Content-Transform-Team, Observability-Metrics
tappof closed T388504: Prometheus 'ext' in codfw linear growth for space used since Jan 10th as Resolved.

During the activities for T371087, the original retention time of 730h was adjusted to the standard value of 4032h.
In agreement with @fgiunchedi, we restored the original value.
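For scale, the two retention values convert to days as follows. A longer retention window means data is kept on disk for longer before being deleted, which would be consistent with the steady growth named in the task title. A small arithmetic sketch (the helper name is illustrative):

```python
def hours_to_days(hours: float) -> float:
    """Convert a retention period in hours to days."""
    return hours / 24


for label, hours in (("original", 730), ("standard", 4032)):
    print(f"{label}: {hours}h = {hours_to_days(hours):.1f} days")
# original: 730h = 30.4 days
# standard: 4032h = 168.0 days
```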

Mar 11 2025, 11:15 AM · SRE Observability
tappof added a comment to T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance.

Before addressing Filippo's suggestions, I checked the situation in Esams, where prometheus7001 is able to perform ICMP tests against the anchor in the same data center.

Mar 11 2025, 8:24 AM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting

Mar 10 2025

tappof created T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance.
Mar 10 2025, 2:50 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting
tappof closed T381561: deployment site for prometheus::blackbox::check::(icmp|http|tcp) is not driven by the $site parameter, a subtask of T370506: Replace check_ripe_atlas with prometheus alert, as Resolved.
Mar 10 2025, 1:43 PM · Patch-For-Review, Observability-Alerting
tappof closed T381561: deployment site for prometheus::blackbox::check::(icmp|http|tcp) is not driven by the $site parameter as Resolved.
Mar 10 2025, 1:43 PM · SRE Observability (FY2024/2025-Q3), Patch-For-Review, Observability-Alerting
tappof closed T381580: cloudgw ICMP checks are affected by changes to prometheus::blackbox::check::icmp, a subtask of T381561: deployment site for prometheus::blackbox::check::(icmp|http|tcp) is not driven by the $site parameter, as Resolved.
Mar 10 2025, 1:43 PM · SRE Observability (FY2024/2025-Q3), Patch-For-Review, Observability-Alerting
tappof closed T381580: cloudgw ICMP checks are affected by changes to prometheus::blackbox::check::icmp as Resolved.
Mar 10 2025, 1:43 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting
tappof added a comment to T388379: ProbeDown .

irc logs:

11:15:30    tappof │ arturo: dcaro Just a heads-up in case of any unwanted alerts: I'm merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100819
11:15:59    arturo │ tappof: thanks 🚢🇮🇹
11:15:59     dcaro │ tappof: ack!
11:52:06    arturo │ tappof: indeed we just had the alert firing
11:52:21    arturo │ https://usercontent.irccloud-cdn.com/file/SlS4y6la/image.png
11:52:34    arturo │ and T388379
11:52:34 +stashbot │ T388379: ProbeDown  - https://phabricator.wikimedia.org/T388379
11:52:46    tappof │ arturo: yes, I've seen
11:54:52    arturo │ tappof: are you able to investigate? I'm about to jump into a meeting
11:55:17    tappof │ arturo: yes, I'll check soon
11:55:22    arturo │ thanks
13:06:28    tappof │ arturo: We didn't have the IPv6 check before... https://w.wiki/DNG7 AFAICS, Prometheus in eqiad can reach cloudgw2002-dev on the IPv6 VIP, but the reply gets lost
                   │ somewhere https://snipboard.io/XJ0gn9.jpg
13:06:57    arturo │ topranks: oh, ok!
13:07:40    arturo │ we may want to create a ticket to investigate why that happens, and remove the IPv6 check meanwhile
13:10:05    tappof │ arturo: I think we can use T388379 as the task for this one. I'll link it to the task related to monitoring to keep track of the relationship.
13:10:06 +stashbot │ T388379: ProbeDown  - https://phabricator.wikimedia.org/T388379
13:10:29    tappof │ is it ok for you arturo ?
13:11:26    arturo │ ok!
Mar 10 2025, 12:12 PM · Cloud-VPS, cloud-services-team
tappof added a subtask for T381580: cloudgw ICMP checks are affected by changes to prometheus::blackbox::check::icmp: T388379: ProbeDown .
Mar 10 2025, 12:11 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting
tappof added a parent task for T388379: ProbeDown : T381580: cloudgw ICMP checks are affected by changes to prometheus::blackbox::check::icmp.
Mar 10 2025, 12:11 PM · Cloud-VPS, cloud-services-team

Mar 7 2025

tappof updated the task description for T375166: Port PDU checks to Prometheus/Alertmanager.
Mar 7 2025, 8:48 AM · Observability-Alerting
tappof added a comment to T387866: Configure PDU monitoring resources through NetBox.

All the patches needed to achieve this goal can be found in task T387231: missing pdu infos for magru, which is the task that initiated this milestone.

Mar 7 2025, 8:34 AM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Mar 6 2025

tappof updated subscribers of T388138: PDUs in Active status missing IP address information in NetBox.
Mar 6 2025, 3:53 PM · DC-Ops, Observability-Metrics
tappof moved T388138: PDUs in Active status missing IP address information in NetBox from Inbox to Radar on the Observability-Metrics board.
Mar 6 2025, 3:34 PM · DC-Ops, Observability-Metrics
tappof created T388138: PDUs in Active status missing IP address information in NetBox.
Mar 6 2025, 3:32 PM · DC-Ops, Observability-Metrics

Mar 5 2025

tappof added a comment to T387866: Configure PDU monitoring resources through NetBox.

Information about the model type was added to Netbox-Hiera through this PS: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124142

Mar 5 2025, 8:58 AM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Mar 4 2025

tappof added a comment to T387866: Configure PDU monitoring resources through NetBox.

A portion of the job has been completed to finalize task T387231: missing pdu infos for magru.

Mar 4 2025, 2:48 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof added a subtask for T375166: Port PDU checks to Prometheus/Alertmanager: T387866: Configure PDU monitoring resources through NetBox.
Mar 4 2025, 2:45 PM · Observability-Alerting
tappof added a parent task for T387866: Configure PDU monitoring resources through NetBox: T375166: Port PDU checks to Prometheus/Alertmanager.
Mar 4 2025, 2:45 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics
tappof added a comment to T387231: missing pdu infos for magru.

@Papaul No, I'm working on setting these objects only in Prometheus.

Mar 4 2025, 2:44 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof created T387866: Configure PDU monitoring resources through NetBox.
Mar 4 2025, 2:36 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics

Feb 27 2025

tappof added a project to T387231: missing pdu infos for magru: SRE Observability (FY2024/2025-Q3).
Feb 27 2025, 3:04 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof moved T387231: missing pdu infos for magru from Radar to Inbox on the Observability-Metrics board.
Feb 27 2025, 3:03 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof added a comment to T387231: missing pdu infos for magru.

I can confirm that it's a different model and responds to different MIBs (Raritan-PDU2-MIB). I'll proceed with setting up the scraping and let you know once it's done.

Feb 27 2025, 3:02 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Feb 26 2025

tappof added a comment to T387231: missing pdu infos for magru.

Thank you, @wiki_willy, for pointing me in the right direction within NetBox. It seems the PuppetQL query might need to be updated (different model and/or type?). I'll take a look tomorrow.

Feb 26 2025, 5:16 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof added projects to T387231: missing pdu infos for magru: DC-Ops, ops-magru.

@wiki_willy The data is missing because Prometheus is not configured to retrieve metrics from magru's PDUs, as they are not present in NetBox. As soon as they are added to NetBox, they will be automatically recognized by Prometheus. Once done, let us know if you need further help setting up the dashboard.

Feb 26 2025, 4:26 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof removed a project from T387231: missing pdu infos for magru: SRE Observability (FY2024/2025-Q3).
Feb 26 2025, 4:14 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
tappof moved T387231: missing pdu infos for magru from Inbox to Radar on the Observability-Metrics board.
Feb 26 2025, 4:13 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics