This feature marked a significant transition for Palo Alto Networks' offering as an end-to-end tool for diagnosing issues and pinpointing causes.


Problem

Remote work is set to expand substantially beyond pre-pandemic levels, and for good reason: during the pandemic, organizations largely saw increased productivity from a workforce working primarily from home, and employees found that they liked it.


Persona



As a network admin, I want to:

- Measure and report on any major changes in UCaaS use or end-user experience.
- Identify emerging major issues affecting users across my organization, so I can get ahead of serious outages within and outside my control and communicate clearly.
- Triage service tickets filed with inflated severity.

Scenario

The University of Miami’s (UM) campus in Coral Gables, Florida, suddenly went quiet in the spring of 2020 as students switched to remote learning at the onset of the COVID-19 pandemic. Over the next few months, some students returned to campus while others remained online, requiring the university to provide flexible learning options as the situation evolved.

“We had to go from being a mostly residential, on-campus university to different ways of teaching, including completely remote.”

- Ernie Fernandez, Chief Information Officer at UM

Faculty members began exploring unique applications of Zoom for interactive instruction. Ali Habashi, assistant professor of cinematic arts at UM’s School of Communication, encouraged remote students to use Zoom as a collaborative platform for filmmaking.











My role

I led the design of this feature for Autonomous Digital Experience Management across the web and PAN-OS early this year.

Up until May 2022, I led efforts to evolve the UI and address customer pain points related to the discovery and detection experience.

Customer Insights and Ideation

The pandemic marked a surge in Zoom usage that led to an increase in call volume. With it came call-quality issues - jitter, ….

Vision

I created a framework and prototypes to share the vision, design principles, and content strategy. This helped evangelize ideas, gain alignment across the platform, and lay the roadmap for unique application dashboards.






Planning and scope definition

I defined the feature with my product manager partners, evangelizing customer goals while balancing business goals, and prioritized and negotiated features for launch and beyond.

Zoom User Detail Dashboard

Zoom Application Summary Dashboard
















Design execution and validation

I designed for Zoom while initially unaware of requirements and technical details of the app, such as what data was available in telemetry.

I executed journeys, wireframes, prototypes, and design specs.






Leadership

I designed and presented work to gain buy-in from executives, senior stakeholders, and many other PANW teams throughout the project lifecycle.


The challenge

Create a deeper relationship with customers using and managing Zoom

Today, there is no single offering that allows network admins to monitor the performance of the full Zoom service delivery chain. ADEM cannot offer customers information about real application and network performance for Zoom, Teams, and other UCaaS applications because we’re sending low-volume HTTPS probes to servers that do not host meetings (Zoom.us and Teams.Microsoft.com).

Network admins and internal UCaaS service teams are in a reactive stance. They respond to UCaaS performance issues only after end-users have submitted tickets, and to resolve those tickets they have to correlate data from several tools and dashboards. Network admins also have to aggregate data from several dashboards in order to report on how successful their teams are at providing end-users the best possible UCaaS experience.
















Customer Profile: 

Networking Teams: In many customer organizations, the network team is responsible for UCaaS application performance. After the move to remote work, many of these smaller orgs are still scrambling to measure end-user UCaaS experience and are unsure what the biggest sources of performance degradation are for their organization. 

UCaaS Teams: In larger global companies like Ernst & Young, there is a dedicated team responsible for managing UCaaS and often VoIP applications. These teams are driving the search for a DEM product. They have access to a wide range of dashboards throughout the stack, but are still triaging and resolving UCaaS tickets by manually correlating several data sources to find a root cause. They also have very limited data about local ISP performance, and they have to manually compile aggregate stats to report on the success of their team.






Use Case:
As a network admin or UCaaS team lead, my goals are to:

Resolve or communicate likely UCaaS performance issues before they impact end-user experience
- I want to alert users when important upcoming meetings may underperform and suggest remediation.
- I want to get ahead of regional ISP and UCaaS outages.

Quickly understand the root cause of UCaaS performance tickets, and resolve them when possible
- If the end-user can resolve the issue but I cannot, I want to explain the problem and the necessary action to the end-user.
- If neither the end-user nor I can resolve the issue, I want to explain the scope and impact of the issue with specificity to the end-user and my boss.
- If an ISP or application is underperforming, I want data I can use to support my request for better performance.

Report clearly on my team's success within their scope
- I don't control local ISP performance, UCaaS application performance, what equipment the end-user is using, or how far the end-user sits from their WiFi router.











The Approach

<List design principles here>

Good

Fast 

Cheap






Development

Working backwards from a fixed launch date meant that design was subsumed into an engineering-driven process. Sign-off milestones were driven by engineering estimates, and the time to create the right design was whatever was left over. The combination of a fixed launch date and aggressive scope created an intense environment with many coordination and time challenges.











Customer Insights

Numerous calls with the Zoom team drove our planning phase

Key insights from the research:

Primary issues are related to audio.

Network admins want to know the exact point of failure: device, WiFi, LAN, or internet.

Customers only need to know about the poorest-performing calls, which informed decisions about KPIs.

Trend analysis from historical data (future scope).

Monitoring across a wide time range.






How we got there

Managing feedback was even more challenging because it felt like a swinging pendulum of viewpoints. The team spent a disproportionate amount of time debating design decisions when there wasn’t data that could easily be gathered to help drive a decision. For example, disabling was a highly destructive action because the user would lose access to all historical data, so we added show/hide functionality instead.






The impact was agony, paralysis, and a growing skepticism toward instinct in the design process.






To avoid this, I started creating documentation to help alleviate the data crutch and better articulate and distribute design rationale. Doing this was time-consuming but saved a lot of back and forth as the project progressed.

Design principles and feedback from the Zoom team helped to create visibility into my design process and galvanize the team to share in the vision for unique dashboards for each monitored app. 






Do all the heavy lifting for the customer

Make sensible decisions and provide intuitive fixes/resolutions for improving performance

Account for edge cases - meetings rejoined, …

Show actionable content at all times

Avoid dead ends






A technique I used to brainstorm content ideas was inspired by ——’s book ‘——’. He teaches that great and original ideas can emerge if we reverse the polarity of an assumption.






I imagined the worst possible perception of the Zoom dashboard - analogous to the … - useless metrics that do nothing except show a bunch of numbers.

I then distilled the absolutely useful metrics by mapping the journey of a network admin troubleshooting issues with Zoom.






From this exercise, it became clear that we could deliver more value if we crafted a curated experience for Zoom that could be applied to other applications as well.






To get buy-in for this direction, I created a set of dashboards for a few apps, uniquely identifying their issues. Although many of these concepts were not feasible at the time of launch, they were still important in giving the team a vision of the future diagnostic capabilities of ADEM.











Survey

Sniffing out important details to understand the root causes of each type of issue.

  1. There is a high degree of overlap between audio- and video-related issues.

  2. Admins follow a set process to find root causes, one that goes beyond the user interface.






Based on these insights, I designed a cascading UI where actions taken on the top widgets curate the subsequent widgets to show the user a filtered view, as sketched below.
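A minimal sketch of the cascading behavior, assuming hypothetical record fields (issue_type, root_cause) rather than the shipped schema:

```python
# Minimal sketch of the cascading-filter behavior: a selection in a top
# widget narrows what every widget below it displays. Field names here
# (issue_type, root_cause) are hypothetical, not the shipped schema.
from dataclasses import dataclass

@dataclass
class MeetingRow:
    user: str
    issue_type: str   # e.g. "audio", "video", "screen_share"
    root_cause: str   # e.g. "device", "wifi", "lan", "internet"

ROWS = [
    MeetingRow("amy", "audio", "wifi"),
    MeetingRow("raj", "video", "internet"),
    MeetingRow("amy", "audio", "device"),
]

def cascade(rows, issue_type=None, root_cause=None):
    """Apply each widget selection as a filter for the widgets below it."""
    if issue_type is not None:
        rows = [r for r in rows if r.issue_type == issue_type]
    if root_cause is not None:
        rows = [r for r in rows if r.root_cause == root_cause]
    return rows

# Selecting "audio" in the top widget curates the user list beneath it.
print([r.user for r in cascade(ROWS, issue_type="audio")])  # ['amy', 'amy']
```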






How does this fit into the overall integrated view of the PANW suite of products?

How can we utilize existing design patterns to design this curated experience?






SWOT analysis































Features: 

- UCaaS user detail view with meeting performance data for a particular ADEM user in a given *day*
- Root-cause analysis for UCaaS performance issues
- Organization-wide UCaaS performance dashboard
- User alerts and remediation suggestions for upcoming meetings likely to underperform






Data: 

Initially, we want to support UCaaS integration for Zoom; there are critical gaps in the data available from Teams.

Zoom is rolling out a push-based streaming service that will give us per-user QoS telemetry every minute (a consumption sketch follows).
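A hedged sketch of how such a per-minute push feed might be consumed; the route and payload fields below are assumptions for illustration, not Zoom's documented QSS schema:

```python
# Hedged sketch: receiving per-user, per-minute QoS telemetry over a webhook.
# The route and payload fields are assumptions, not Zoom's documented schema.
from flask import Flask, request

app = Flask(__name__)

@app.route("/zoom/qss", methods=["POST"])
def ingest_qos():
    event = request.get_json(force=True)
    # Assumed fields: user_id, meeting_id, jitter_ms, latency_ms.
    sample = {k: event.get(k)
              for k in ("user_id", "meeting_id", "jitter_ms", "latency_ms")}
    store(sample)
    return "", 204

def store(sample):
    # Stand-in for writing to the ADEM telemetry store.
    print(sample)

if __name__ == "__main__":
    app.run(port=8080)
```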






Data Volume: 

PAN has about 75k meetings a week with 12k employees - roughly 1.25 meetings per employee per workday, rounded up to about 1.5 for headroom. So for a customer with 100k employees, we should expect about 150k meetings per workday and be prepared to handle 300k meetings per day.
In order to support the existing ADEM time ranges, we need to be able to store and retrieve up to 30 days of data.
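A quick sanity check of the sizing math above:

```python
# Back-of-the-envelope check of the sizing estimate above.
meetings_per_week = 75_000
employees = 12_000
workdays_per_week = 5

rate = meetings_per_week / (employees * workdays_per_week)
print(round(rate, 2))  # 1.25 -> rounded up to ~1.5 per employee per workday

expected_daily = 100_000 * 1.5      # ~150k meetings/workday for a 100k-seat customer
planned_daily = 2 * expected_daily  # be prepared for 300k/day
retention_days = 30
print(int(expected_daily), int(planned_daily), int(planned_daily * retention_days))
```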






Root Cause Analysis 

The primary value proposition of ADEM's integration with Zoom is ADEM's ability to leverage existing synthetic test data to interpret the root cause of experience issues users have on Zoom.
We want to identify the most likely root cause, and in some cases there may be more than one. A root cause represents a problem that we believe requires attention and action.

A couple of cases to consider (sketched in code below):

- There's high latency at the LAN exit (500ms) and 550ms of latency on the ISP. The ISP only introduced 50ms of latency, so the LAN exit is the root cause.
- There's high latency at the LAN exit (500ms) and 1000ms of latency on the ISP. The ISP introduced 500ms of latency, so both the LAN and the ISP are root causes.
- There's moderate latency at the LAN exit as well as the ISP (introducing 200ms and 100ms respectively). Together, this amounts to latency that impacts Zoom, so we flag the LAN exit because it's responsible for the greatest part of the issue.
- There's moderate and identical latency introduced at the LAN exit and the ISP (150ms each), and this impacts Zoom. Because the LAN exit has high latency relative to its normal latency, AND Zoom was impacted, we flag the LAN exit.
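A minimal sketch of this attribution logic, assuming latency is measured cumulatively at each segment boundary (LAN exit first, then ISP). The thresholds are illustrative assumptions, and the baseline comparison in the last case is simplified to a tie-break toward the LAN exit:

```python
# Sketch of the root-cause attribution described above. Latency is assumed
# to be measured cumulatively (LAN exit first, then ISP); thresholds are
# illustrative assumptions, not the shipped values.
HIGH_MS = 400    # per-segment "high" latency (assumption)
IMPACT_MS = 300  # end-to-end latency that impacts Zoom (assumption)

def root_causes(lan_exit_ms: float, isp_cumulative_ms: float) -> list:
    lan_introduced = lan_exit_ms
    isp_introduced = isp_cumulative_ms - lan_exit_ms
    if isp_cumulative_ms < IMPACT_MS:   # Zoom not impacted: nothing to flag
        return []
    causes = []
    if lan_introduced >= HIGH_MS:
        causes.append("lan_exit")
    if isp_introduced >= HIGH_MS:
        causes.append("isp")
    if not causes:
        # Moderate latency everywhere: flag the larger contributor
        # (ties go to the LAN exit, as in the last case above).
        causes.append("lan_exit" if lan_introduced >= isp_introduced else "isp")
    return causes

print(root_causes(500, 550))   # ['lan_exit']          (case 1)
print(root_causes(500, 1000))  # ['lan_exit', 'isp']   (case 2)
print(root_causes(200, 300))   # ['lan_exit']          (case 3)
print(root_causes(150, 300))   # ['lan_exit']          (case 4)
```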






Enriched Self-Serve Notifications 

If an end-user has self-serve enabled AND we have Zoom QSS data for that user, we should enrich any self-serve notifications we raise on their device with Zoom information.
The trigger for these notifications should remain the existing self-serve trigger.
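A small sketch of this rule; the field names (self_serve_enabled, the payload shape) are hypothetical:

```python
# Sketch: enrich an existing self-serve notification with Zoom QSS data when
# available. The trigger itself is unchanged; field names are hypothetical.
from dataclasses import dataclass

@dataclass
class User:
    id: str
    self_serve_enabled: bool

def build_notification(user: User, base_alert: dict, qss_by_user: dict) -> dict:
    notification = dict(base_alert)  # the existing self-serve payload
    if user.self_serve_enabled and user.id in qss_by_user:
        notification["zoom"] = qss_by_user[user.id]  # enrich, don't re-trigger
    return notification

print(build_notification(User("u1", True),
                         {"msg": "High WiFi latency detected"},
                         {"u1": {"jitter_ms": 42, "mos": 2.8}}))
```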

 

Isolate problems instantly - increase productivity by quickly isolating problems

Reduce ticket escalations - use easy-to-use metrics that reduce escalations to Tier 3 support teams

Proactively notify users of performance degradation to reduce service desk ticket volume

Using meeting quality scores and network alerts

Last Updated: December 13, 2021

Account owners and admins can enable meeting quality scores or network alerts on the Dashboard for meetings and webinars.

The quality score of a meeting is based on the Mean Opinion Score (MOS), which ranges from 1 (bad) to 5 (good). Network alerts and quality scores for audio, video, and screen sharing will be displayed on the Meetings and Webinars dashboard. Zoom will use default values for network alerts.

Alternatively, admins can set custom thresholds that will trigger network alerts related to audio, video, screen sharing, and CPU usage. These alerts will be shown on the Dashboard.

This article covers:

- How to enable meeting quality scores and network alerts
- How to set custom thresholds for network alerts
- How to view alerts on the Dashboard

How to enable meeting quality scores and network alerts

  1. Sign in to the Zoom web portal.

  2. In the navigation panel, click Account Management then Account Settings.

  3. Click the Meeting tab.

  4. In the Admin Options section, verify that Meeting quality scores and network alerts on Dashboard is enabled.

  5. If the setting is disabled, click the toggle to enable it. If a verification dialog displays, click Enable to verify the change.

  6. Click one of these options to enable it:

    • Show meeting quality score and network alerts on Dashboard: Display the standard MOS metric for measuring meeting quality. Alerts will be based on the MOS. See the overview section for more information.

    • Set custom thresholds for network alerts: Set custom thresholds for alerts instead of using the standard MOS metric. See the overview section for more information. Make sure to set custom thresholds after enabling this option.

How to set custom thresholds for network alerts

If you enabled the option to Set custom thresholds for network alerts, follow this section to specify your thresholds.

Tip: To help determine your thresholds, see our recommendations for meeting and phone statistics.

  1. Sign in to the Zoom web portal.

  2. In the navigation panel, click Dashboard.

  3. At the top of the Dashboard screen, click the Meetings or Webinars tab.

  4. In the top-right corner, click Quality Settings.

  5. Click either the Audio, Video, Screen Sharing, or CPU Usage tab.

  6. Click Edit.

  7. Set the values to the desired threshold.

  8. Click Apply.

How to view alerts on the Dashboard

  1. Sign in to the Zoom web portal.

  2. In the navigation panel, click Dashboard.

  3. At the top of the Dashboard screen, click the Meetings or Webinars tab.

  4. (Optional) Click Past Meetings to access historical meeting data.

  5. Take note of the following columns:

    • Health: Displays any Warning level or Critical level issues in the meeting based on the MOS or custom threshold you've set.

    • Issue: Shows any current connection/client health warnings, including unstable audio, video, or screen sharing quality, high CPU usage, or disconnect and reconnect issues. For example, High CPU Usage or Unstable network for video.

    • If you enabled MOS instead of setting custom thresholds, you will also see the Video Quality, Audio Quality, and Screen Share Quality columns. These display a grade (Good, Fair, Poor, or Bad) based on the MOS. Click a participant's display name to view specific MOS details.
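A minimal sketch of grading from a 1-5 MOS like the columns above; the cut-offs below are illustrative assumptions, not Zoom's published thresholds:

```python
# Illustrative mapping from a 1-5 Mean Opinion Score to a displayed grade.
# The cut-offs are assumptions, not Zoom's published thresholds.
def mos_grade(mos: float) -> str:
    if mos >= 4.0:
        return "Good"
    if mos >= 3.5:
        return "Fair"
    if mos >= 2.5:
        return "Poor"
    return "Bad"

print(mos_grade(4.2), mos_grade(3.7), mos_grade(3.0), mos_grade(1.8))
# Good Fair Poor Bad
```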


Once Zoom is enabled, below the application experience chart, include a summary statistics section (rolled up as in the sketch after this list) with:

Total minutes on Zoom
- Minutes with poor performance
- Total and percentage of poor-performing minutes associated with different root causes

Total meetings in Zoom
- Total meetings (count and percentage) with poor performance
- Breakdown by:
  - Audio issues
  - Video issues
  - Screensharing issues
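A minimal sketch of that rollup from per-minute records; the record fields (meeting, poor, root_cause, issue) are hypothetical:

```python
# Sketch of the summary-statistics rollup from per-minute records.
# Record fields (meeting, poor, root_cause, issue) are hypothetical.
from collections import Counter

minutes = [
    {"meeting": "m1", "poor": True,  "root_cause": "wifi", "issue": "audio"},
    {"meeting": "m1", "poor": False, "root_cause": None,   "issue": None},
    {"meeting": "m2", "poor": True,  "root_cause": "isp",  "issue": "video"},
]

poor = [m for m in minutes if m["poor"]]
summary = {
    "total_minutes": len(minutes),
    "poor_minutes": len(poor),
    "poor_minute_pct": 100 * len(poor) / len(minutes),
    "poor_minutes_by_root_cause": Counter(m["root_cause"] for m in poor),
    "total_meetings": len({m["meeting"] for m in minutes}),
    "poor_meetings": len({m["meeting"] for m in poor}),
    "poor_meetings_by_issue": Counter(m["issue"] for m in poor),
}
print(summary)
```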

Hurdles to Overcome

Testing against Zoom's dashboard has shown that Zoom does not record every momentary issue. Wherever possible, we should aim to round up the number of minutes when a user was experiencing problems.

Feedback from stress testing

Meeting issues will be aggregated: if an issue lasted for 5 minutes, we should have a single meeting issue with a 5-minute time range (see the sketch below).
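A small sketch of that aggregation, collapsing contiguous per-minute reports of the same issue into one issue with a time range:

```python
# Sketch: collapse contiguous per-minute reports of the same issue into a
# single meeting issue with a time range.
def aggregate(samples):
    """samples: list of (minute, issue) tuples, sorted by minute."""
    issues = []
    for minute, issue in samples:
        last = issues[-1] if issues else None
        if last and last["issue"] == issue and last["end"] == minute - 1:
            last["end"] = minute  # extend the open range
        else:
            issues.append({"issue": issue, "start": minute, "end": minute})
    return issues

# Five consecutive minutes of unstable audio become one 5-minute issue.
print(aggregate([(1, "audio"), (2, "audio"), (3, "audio"),
                 (4, "audio"), (5, "audio")]))
# [{'issue': 'audio', 'start': 1, 'end': 5}]
```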

Impact

Team level - Built a completely new set of components for this feature and added them to the component library

Org level -

Product level - Extend to other apps like Microsoft Teams, Google Meet, and Cisco WebEx

Business level -