Open source observability for AWS Inferentia nodes within Amazon EKS clusters | AWS Machine Learning Blog



By Riccardo Freschi on April 17, 2024, in Amazon CloudWatch, Amazon Elastic Kubernetes Service, Amazon Managed Grafana, Amazon Managed Service for Prometheus, AWS Inferentia, AWS Neuron, AWS Trainium, AWS X-Ray

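Key takeaways

Recent advances in machine learning (ML) have produced large models, often with tens of billions of parameters. In the distributed environments these models require, observing the performance of instances and ML chips is key to tuning model performance and optimizing cost. This post shows how to use open source tools to monitor the performance of an Amazon EKS cluster with AWS Inferentia nodes, for better resource utilization and management.

Recent developments in ML have led to increasingly large models, which often require tens of billions of parameters. Although these models are more capable, training them and running inference on them requires significant compute resources. Even with advanced distributed training libraries, training and inference jobs commonly need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens to hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes essential for fine-tuning model performance and optimizing cost. Metrics let teams understand workload behavior, optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chip utilization and saturation are also relevant to capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia. The pattern shows how to monitor the performance of ML chips in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster whose data plane nodes are Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of modules that help you set up observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability patterns use Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.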

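Solution overview

The following diagram illustrates the solution architecture.

The solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-ug) command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

```bash
neuron-monitor | neuron-monitor-prometheus.py --port <port>
```

The command uses the following components:

- neuron-monitor collects metrics and statistics from the Neuron applications running on the system and streams the collected data to stdout in JSON format.
- neuron-monitor-prometheus.py maps the JSON telemetry data and exposes it in a Prometheus-compatible format.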
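To make the second stage of the pipe more concrete, the following is a minimal sketch of turning one JSON report into Prometheus text exposition lines. The report layout, the field names (`instance_id`, `cores`, `neuroncore_utilization`), and the `demo_` metric prefix are hypothetical and chosen only for illustration; they are not the actual neuron-monitor schema, and the real companion script may work differently.

```python
import json


def to_prometheus_lines(report: dict) -> list[str]:
    """Flatten a (hypothetical) per-core JSON report into Prometheus text lines."""
    lines = []
    instance = report.get("instance_id", "unknown")
    for core_id, stats in sorted(report.get("cores", {}).items()):
        labels = f'instance="{instance}",neuroncore="{core_id}"'
        for name, value in sorted(stats.items()):
            # One sample per line: metric_name{labels} value
            lines.append(f"demo_{name}{{{labels}}} {float(value)}")
    return lines


# Illustrative input; in a real pipe, each document would be read from stdin.
sample = json.loads(
    '{"instance_id": "i-0abc", "cores": {"0": {"neuroncore_utilization": 42.5}}}'
)
for line in to_prometheus_lines(sample):
    print(line)
```

A production exporter would typically serve these samples over HTTP on the configured port so that a collector can scrape them; this sketch only shows the mapping step.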

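The collected data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup, which collects and visualizes metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana, is similar to that of the other open source based patterns included in the AWS Observability Accelerator for CDK GitHub repository.

Prerequisites

You need the following to complete the steps in this post:

| Prerequisite | Description |
|---|---|
| AWS CLI | Install |
| AWS CDK | Install, using version 2.86.0 or later |
| Homebrew | Installs the required packages on macOS or Linux |
| Amazon Managed Grafana workspace | If you don't have an existing workspace, you can create one on the Amazon Managed Grafana console |
| Node | Version 20.0.0 or later |
| NPM | Version 10.0.0 or later |
| Kubectl | Install |
| Git | Download |
| Make | Install |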
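Checking the minimum versions by eye is error-prone because version strings do not compare correctly as text (for example, "2.100.0" sorts before "2.86.0" alphabetically). The following sketch compares them numerically. The installed versions shown are hypothetical sample values, not output captured from a real machine.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.86.0', 'v20.11.1', or '2.100.0 (build abc)' into a tuple of ints."""
    first_token = text.split()[0]
    return tuple(int(part) for part in first_token.lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions from the prerequisites table.
MINIMUMS = {"cdk": "2.86.0", "node": "20.0.0", "npm": "10.0.0"}

# Hypothetical values, as you might read them from each tool's --version output.
installed = {"cdk": "2.100.0", "node": "v18.19.0", "npm": "10.2.3"}

for tool, minimum in MINIMUMS.items():
    status = "ok" if meets_minimum(installed[tool], minimum) else "too old"
    print(f"{tool}: {installed[tool]} (needs >= {minimum}) -> {status}")
```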

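Set up the environment

Complete the following steps to set up your environment:

Open a terminal window and run the following commands:

```bash
export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
```

Retrieve the workspace ID of any existing Amazon Managed Grafana workspace:

```bash
aws grafana list-workspaces
```

The following is our sample output:

```json
{
  "workspaces": [
    {
      "authentication": {
        "providers": [
          "AWS_SSO"
        ]
      },
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "name": "accelerator-workspace",
      "notificationDestinations": [
        "SNS"
      ],
      "status": "ACTIVE",
      "tags": {}
    }
  ]
}
```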
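If you have several workspaces, picking the right id and endpoint out of this output by hand can be tedious. The following sketch extracts them from the JSON and prepends the protocol to the endpoint. The function name and the trimmed sample document are illustrative assumptions; only the `id`, `endpoint`, and `status` keys are taken from the sample output above.

```python
import json


def active_workspaces(cli_output: str) -> list[tuple[str, str]]:
    """Return (id, https endpoint URL) pairs for workspaces whose status is ACTIVE."""
    data = json.loads(cli_output)
    return [
        (ws["id"], "https://" + ws["endpoint"])
        for ws in data.get("workspaces", [])
        if ws.get("status") == "ACTIVE"
    ]


# Trimmed copy of the sample output; real output carries more fields.
sample_output = """
{"workspaces": [{"id": "g-XYZ",
                 "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
                 "status": "ACTIVE"}]}
"""
for workspace_id, url in active_workspaces(sample_output):
    print(f"workspace id: {workspace_id}")
    print(f"endpoint URL: {url}")
```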

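Assign the values of id and endpoint to the following environment variables:

```bash
export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, similar to the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including the protocol (https://), without quotation marks, similar to the above https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com>>"
```

COA_AMG_ENDPOINT_URL must include https://.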
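Because a missing protocol or stray quotation marks in the endpoint URL are easy to overlook, a small check before deploying can help. The following is a sketch only; the function name and error messages are hypothetical and are not part of the accelerator.

```python
from urllib.parse import urlparse


def check_endpoint_url(value: str) -> str:
    """Return the URL unchanged if it looks like a usable https endpoint, else raise ValueError."""
    cleaned = value.strip()
    if cleaned != value or cleaned.startswith(('"', "'")):
        raise ValueError("remove surrounding whitespace or quotation marks")
    parsed = urlparse(cleaned)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("value must start with https:// followed by the workspace host")
    return cleaned


# Illustrative value; in practice you would pass the content of the environment variable.
print(check_endpoint_url("https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com"))
```

To check your own shell, you could pass `os.environ["COA_AMG_ENDPOINT_URL"]` to the function.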

Create a Grafana API key from the Amazon Managed Grafana workspace:

```bash
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $COA_AMG_WORKSPACE_ID \
  --query key \
  --output text)
```
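The lifetime of 432000 seconds corresponds to 5 days, after which the key stops working and must be recreated. The following short calculation shows the conversion and when a key would expire; the creation timestamp is an illustrative value.

```python
from datetime import datetime, timedelta, timezone

SECONDS_TO_LIVE = 432000  # the key lifetime used in the command above

lifetime = timedelta(seconds=SECONDS_TO_LIVE)
print(lifetime.days)  # whole days the key stays valid


def expires_at(created: datetime, seconds_to_live: int = SECONDS_TO_LIVE) -> datetime:
    """Return the moment a key created at `created` reaches the end of its lifetime."""
    return created + timedelta(seconds=seconds_to_live)


created = datetime(2024, 4, 17, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(expires_at(created).isoformat())
```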

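Set up a secret in AWS Systems Manager:

```bash
aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
  --type "SecureString" \
  --value $AMG_API_KEY \
  --region $AWS_REGION
```

This secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.

Bootstrap the AWS CDK environment

The first step of any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) so that AWS CDK can deploy into it. Bootstrapping is required for each account and Region combination, so if you have already bootstrapped AWS CDK in a Region, you don't need to repeat the process.

```bash
cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION
```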


Deploy the solution

Complete the following steps to deploy the solution:

Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.

```bash
git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator
```

The actual settings for the Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup sets the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

Enter the following snippet in cdk.json, replacing context:

```json
"context": {
  "fluxRepository": {
    "name": "grafana-dashboards",
    "namespace": "grafana-operator",
    "repository": {
      "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
      "name": "grafana-dashboards",
      "targetRevision": "main",
      "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
    },
    "values": {
      "GRAFANA_CLUSTER_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
      "GRAFANA_KUBELET_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
      "GRAFANA_NSWRKLDS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
      "GRAFANA_NODEEXP_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
      "GRAFANA_NODES_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
      "GRAFANA_WORKLOADS_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
      "GRAFANA_NEURON_DASH_URL": "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
    },
    "kustomizations": [
      {
        "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      },
      {
        "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
      }
    ]
  },
  "neuronNodeGroup": {
    "instanceClass": "inf1",
    "instanceSize": "2xlarge",
    "desiredSize": 1,
    "minSize": 1,
    "maxSize": 3,
    "ebsSize": 512
  }
}
```
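Before deploying, it can be worth checking the edited context for simple mistakes, such as a desired node count outside the minimum and maximum. The following is a sketch of such a check. The rules it enforces (allowed instance classes, size ordering, presence of the Neuron dashboard URL) are assumptions derived from this post, not validation performed by the accelerator itself.

```python
import json

ALLOWED_CLASSES = {"inf1", "inf2"}  # the instance families this post covers


def check_node_group(context: dict) -> list[str]:
    """Return a list of problems found in the context (empty if none)."""
    group = context.get("neuronNodeGroup")
    if group is None:
        return ["neuronNodeGroup is missing"]
    problems = []
    if group.get("instanceClass") not in ALLOWED_CLASSES:
        problems.append(f"unexpected instanceClass: {group.get('instanceClass')!r}")
    try:
        if not group["minSize"] <= group["desiredSize"] <= group["maxSize"]:
            problems.append("expected minSize <= desiredSize <= maxSize")
    except KeyError as missing:
        problems.append(f"missing key: {missing}")
    values = context.get("fluxRepository", {}).get("values", {})
    if "GRAFANA_NEURON_DASH_URL" not in values:
        problems.append("fluxRepository.values.GRAFANA_NEURON_DASH_URL is not set")
    return problems


# Trimmed, illustrative context; a real check would load cdk.json and pass its "context" value.
example = json.loads("""
{"fluxRepository": {"values": {"GRAFANA_NEURON_DASH_URL": "https://example.invalid/neuron-monitor.json"}},
 "neuronNodeGroup": {"instanceClass": "inf1", "instanceSize": "2xlarge",
                     "desiredSize": 1, "minSize": 1, "maxSize": 3, "ebsSize": 512}}
""")
print(check_node_group(example) or "context looks consistent")
```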

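You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your selected Region, run the following command (adjust Values as needed):

```bash
aws ec2 describe-instance-type-offerings \
  --filters Name=instance-type,Values="inf1*" \
  --query "InstanceTypeOfferings[].InstanceType" \
  --region $AWS_REGION
```

Install the project dependencies:

```bash
npm install
```

Run the following commands to deploy the open source observability pattern:

```bash
make build
make pattern single-new-eks-inferentia-opensource-observability deploy
```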

Validate the solution

Complete the following steps to validate the solution:

Run the update-kubeconfig command. You should be able to get the command from the output of the previous step:

```bash
aws eks update-kubeconfig --name single-new-eks-inferentia-opensource --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks
```

Verify the resources you created:

```bash
kubectl get pods -A
```

The following screenshot shows our sample output.

Make sure the neuron-device-plugin-daemonset DaemonSet is running:

```bash
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
```

The following is our expected output:

```
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h
```

Confirm that the neuron-monitor DaemonSet is running:

```bash
kubectl get ds neuron-monitor --namespace kube-system
```

The following is our expected output:

```
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
```
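If you want to script this check rather than read the table by eye, the following sketch parses the tabular output and reports whether each DaemonSet has all desired pods ready. The function name is hypothetical, and parsing human-readable output is brittle; requesting structured output with `kubectl get ds -o json` would be a more robust basis for real automation.

```python
def daemonset_ready(kubectl_output: str) -> dict[str, bool]:
    """Map each DaemonSet name to True when READY equals DESIRED and is nonzero."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # These three columns precede "NODE SELECTOR", so its embedded space does not shift them.
    name_col, desired_col, ready_col = (
        header.index(column) for column in ("NAME", "DESIRED", "READY")
    )
    result = {}
    for line in lines[1:]:
        fields = line.split()
        desired, ready = int(fields[desired_col]), int(fields[ready_col])
        result[fields[name_col]] = desired > 0 and desired == ready
    return result


sample = """
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h
"""
print(daemonset_ready(sample))
```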

To verify that the Neuron devices and cores are visible, from your neuron-monitor pod (you can get the pod name from the output of kubectl get pods -A), run [neuronl