`
tanghongjun1985
  • 浏览: 55112 次
  • 性别: Icon_minigender_1
  • 来自: 成都
社区版块
存档分类
最新评论

YARN 框架源码分析

阅读更多
此博客来源于http://www.ccplat.com/?p=652


ResourceManager
管理集群资源,创建时需要一个Store存储其信息。

Store
管理和存储RM状态接口,包含以下两个子接口,同时继承NodeStore和ApplicationsStore接口。
实现目前有MemStore和ZKStore两种,它们都实现了RMState和ApplicationStore。
ApplicationInfo
可以通过该接口获取应用的ApplicationMaster、MasterContainer、ApplicationSubmissionContext和所有Container。

RMState
可以通过该接口获取RM已存储的RMNode和ApplicationInfo信息

NodeStore
管理和存储RMNode的接口,主要是添加和删除RMNode。

RMNode
Node managers information on available resources and other static information.

RM管理的节点(NM)接口,包括节点信息和节点资源信息。

NodeId
RMNode id,包括节点host和端口。

ApplicationsStore
管理和存储应用的接口,包括根据ApplicationId和ApplicationSubmissionContext来创建应用以及根据ApplicationId删除相应应用。

ApplicationId
应用id接口,包含整数id和RM的启动时间戳。

ApplicationStore
应用用来管理和存储Container的接口,包括添加和删除Container、存储MasterContainer、和ApplicationMaster联系以更新应用状态。

ClientToAMSecretManager
管理客户端令牌。

RMContainerTokenSecretManager
管理容器令牌。

ApplicationTokenSecretManager
管理应用令牌。

RMDelegationTokenSecretManager
@Private
@Unstable

A ResourceManager specific delegation token secret manager. The secret manager is responsible for generating and accepting the password for each token.

管理RM委托令牌,处理每一个令牌密码的创建和验证。

Dispatcher
Event Dispatcher interface. It dispatches events to registered event handlers based on event types.

根据事件类型将事件分发到相应的已注册的事件处理器上进行处理。

实现类AsyncDispatcher,主要包含:

private final BlockingQueue<Event> eventQueue;

存放事件的阻塞队列。

private Thread eventHandlingThread;

处理事件的线程,不断地从eventQueue中取出事件并分发到相应EventHandler上进行处理。

protected final Map<Class<? extends Enum>, EventHandler> eventDispatchers;

事件类型–事件处理器的map,用于管理相关EventHandler。



主要对外接口是register和getEventHandler。前者将事件类型和相应的EventHandler对象注册到Dispatcher上(放入eventDispatchers),后者则是获得一个GenericEventHandler,通过调用这个GenericEventHandler的handle来处理相关事件。GenericEventHandler的handle实际上只是将事件放入eventQueue,具体的事件处理是由已注册的相应EventHandler来处理的。(个人感觉这种编码模式有坏味,应该直接在Dispatcher中添加一个handle方法来表示接受事件并进行相应处理的概念,GenericEventHandler的handle方法内容移到此方法中。)

RM初始化的时候会将所有相关EventHandler注册到Dispatcher上。

ResourceScheduler
@LimitedPrivate(value={“yarn”})
@Evolving

This interface is the one implemented by the schedulers. It mainly extends YarnScheduler.

调度器接口,主要功能继承自YarnScheduler,同时继承了Recoverable,可以从指定RMState中恢复。
实现有FairScheduler和FifoScheduler。
ClientRMService
The client interface to the Resource Manager. This module handles all the rpc interfaces to the resource manager from the client.

实现ClientRMProtocol,处理客户端相关请求,包括获取新应用、获取应用报告、提交应用,杀掉应用等。

ApplicationMasterService
实现AMRMProtocol,处理ApplicationMaster相关请求,包括注册/完成ApplicationMaster请求、ApplicationMaster对RM的分配请求、注册/注销ApplicationAttemptId等。

ApplicationMasterLauncher
用来处理AMLauncherEvent,根据事件的类型(LAUNCH,CLEANUP)创建相应AMLauncher线程并放入一个BlockingQueue中,使用一个专门的线程不断从这个queue中取出线程执行。

AdminService
实现RMAdminProtocol,提供administration相关功能,包括获取集群当前的应用队列、节点、用户、服务控制等信息。

ContainerAllocationExpirer
主要继承自AbstractLivelinessMonitor,监控Container是否过期。主要方法包括根据ContainerId对Container进行注册/注销、对过期的Container提交相应事件到相关EventHandler进行处理。

NMLivelinessMonitor
同样主要继承自AbstractLivelinessMonitor,监控RMNode是否存活。主要方法负责根据NodeId对RMNode进行注册/注销、对过期的RMNode提交相应事件到相关EventHandler进行处理。

NodesListManager
实现EventHandler,可以对节点状态事件(NODE_USABLE,NODE_UNUSABLE)进行处理(将不可用的节点放入不可用节点列表中,将恢复的节点从不可用节点中移除);同时,还可以对配置中的节点信息进行访问(包括slaves和excludeds,支持对这两个配置的动态修改)。

SchedulerEventDispatcher
实现EventHandler,处理SchedulerEvent(类型包括NODE_ADDED,NODE_REMOVED, NODE_UPDATE, APP_ADDED, APP_REMOVED, CONTAINER_EXPIRED)。包含一个BlockingQueue,handle时只负责将事件放入queue;同时还包含一个EventProcessor线程和一个ResourceScheduler,EventProcessor不断从queue中取出事件交给ResourceScheduler处理。

RMAppManager
实现EventHandler,通过处理RMAppManagerEvent(APP_SUBMIT, APP_COMPLETED)来对应用进行管理。

ApplicationACLsManager
应用访问控制管理器。

WebApp
RM的web页面管理。

RMContext
RM的上下文接口,可以以之获得以上的大部分RM成员。



ResourceTrackerService
实现ResourceTracker,主要处理NM注册和心跳请求并返回相应响应信息。

NodeManager
NodeManagerMetrics
NM统计数据类。

ApplicationACLsManager
应用访问控制管理器。

NodeHealthCheckerService
The class which provides functionality of checking the health of the node and reporting back to the service for which the health checker has been asked to report.

提供检查本节点是否健康和健康报告的功能。

LocalDirsHandlerService
The class which provides functionality of checking the health of the local directories of a node. This specifically manages nodemanager-local-dirs and nodemanager-log-dirs by periodically checking their health.

提供节点本地文件夹是否健康的功能,具体就是检查NM的本地文件夹和日志文件夹。

CompositeServiceShutdownHook
JVM Shutdown hook for CompositeService which will stop the give CompositeService gracefully in case of JVM shutdown.

在JVM关闭时优雅地关闭指定服务,这里是关闭NM。

ApplicationMaster
An ApplicationMaster for executing shell commands on a set of launched containers using the YARN framework.

This class is meant to act as an example on how to write yarn-based application masters.

The ApplicationMaster is started on a container by the ResourceManager‘s launcher. The first thing that the ApplicationMaster needs to do is to connect and register itself with the ResourceManager. The registration sets up information within the ResourceManager regarding what host:port the ApplicationMaster is listening on to provide any form of functionality to a client as well as a tracking url that a client can use to keep track of status/job history if needed.

The ApplicationMaster needs to send a heartbeat to the ResourceManager at regular intervals to inform the ResourceManager that it is up and alive. The AMRMProtocol.allocate to the ResourceManager from the ApplicationMaster acts as a heartbeat.

For the actual handling of the job, the ApplicationMaster has to request the ResourceManager via AllocateRequest for the required no. of containers using ResourceRequest with the necessary resource specifications such as node location, computational (memory/disk/cpu) resource requirements. The ResourceManager responds with an AllocateResponse that informs the ApplicationMaster of the set of newly allocated containers, completed containers as well as current state of available resources.

For each allocated container, the ApplicationMaster can then set up the necessary launch context via ContainerLaunchContext to specify the allocated container id, local resources required by the executable, the environment to be setup for the executable, commands to execute, etc. and submit a StartContainerRequest to the ContainerManager to launch and execute the defined commands on the given allocated container.

The ApplicationMaster can monitor the launched container by either querying the ResourceManager using AMRMProtocol.allocate to get updates on completed containers or via the ContainerManager by querying for the status of the allocated container’s ContainerId.

After the job has been completed, the ApplicationMaster has to send a FinishApplicationMasterRequest to the ResourceManager to inform it that the ApplicationMaster has been completed.

启动、连接、注册
AM由RM的launcher启动,运行在一个container上。首先AM需要连接RM并注册自身。注册时附带以下信息:AM所在节点host和AM监听的rpc端口(AM通过此端口响应客户端请求),追踪url(客户端通过此url追踪应用的状态和历史)。

心跳
AM需要通过定期的心跳向RM报告其状态,心跳具体是通过AMRMProtocol.allocate(ApplicationMasterService)进行的。

实际中的任务处理,AM通过AllocateRequest接口向RM请求所需的containers。每个AllocateRequest包含了一个ResourceRequest列表。每个ResourceRequest代表了一个应用对一个节点的所有资源请求,包括当前请求的一个Resource和之前已分配的containers。RM通过AllocateResponse通知AM新分配的containers、已完成的containers和可用的Resource(均包含在AMResponse)。

对于每个被分配的container,AM使用ContainerLaunchContext来设置必要的加载上下文,具体包括container id、运行所需的本地资源、运行的环境变量、将要运行的命令等,同时提交一个StartContainerRequest到ContainerManager以在分配的container上加载和执行定义的命令。

AM可以通过使用AMRMProtocol.allocate向RM查询已完成的containers的更新信息,或者根据ContainerId通过ContainerManager.getContainerStatus向ContainerManager查询已分配的containers的状态。

完成
当应用完成后,AM需要向RM发送以FinishApplicationMasterRequest以通知RM自己已经完成。



AllocateRequest
@Public
@Stable

The core request sent by the ApplicationMaster to the ResourceManager to obtain resources in the cluster.

AM向RM获取资源的核心请求。

The request includes:

ApplicationAttemptId being managed by the
ApplicationMaster
应用尝试id。

A response id to track duplicate responses.
Progress information.
意义不明,目前是AM中已完成的containers。

A list of ResourceRequest to inform the
ResourceManager
about the application’s resource requirements.
资源请求列表。

A list of unused Container which are being returned.
AM释放的containers列表。

See Also:

AMRMProtocol.allocate(AllocateRequest)



ResourceRequest
Application
主要包含用户、获得的containers、应用id和状态(NEW, INITING, RUNNING, FINISHING_CONTAINERS_WAIT,APPLICATION_RESOURCES_CLEANINGUP, FINISHED)等信息。

ContainerManager
@Public
@Stable

The protocol between an ApplicationMaster and a NodeManager to start/stop containers and to get status of running containers.

If security is enabled the NodeManager verifies that the ApplicationMaster has truly been allocated the container by the ResourceManager and also verifies all interactions such as stopping the container or obtaining status information for the container.

用于处理AM和NM直接启动/停止containers和获取运行的containers状态。

启用安全验证的话,NM将验证AM确定从RM处分配到了containers,同时也会验证其他所有的交互,如停止container和获取container状态。

Container
@Public
@Stable

Container represents an allocated resource in the cluster.

The ResourceManager is the sole authority to allocate any Container to applications. The allocated Container is always on a single node and has a unique ContainerId. It has a specific amount of Resource allocated.

代表了集群中一个已分配的资源。

只有RM可以给应用分配container。分配的container只会在一个节点上,并且有唯一的ContainerId。(暂时不确定最后一句的意义)

It includes details such as:

ContainerId for the container, which is globally unique.
全局唯一。

NodeId of the node on which it is allocated.
container所在节点。

HTTP uri of the node.
节点uri。

Resource allocated to the container.
分配给 container的资源。

Priority at which the container was allocated.
container优先级。

ContainerState of the container.
container状态,NEW,RUNNING, COMPLETE。

ContainerToken of the container, used to securely verify authenticity of the allocation.
container令牌,验证分配的安全性。

ContainerStatus of the container.
container详细状态信息。

Typically, an ApplicationMaster receives the Container from the ResourceManager during resource-negotiation and then talks to the NodManager to start/stop containers.

See Also:

AMRMProtocol.allocate(org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest)

ContainerManager.startContainer(org.apache.hadoop.yarn.api.protocolrecords.StartContainerRequest)

ContainerManager.stopContainer(org.apache.hadoop.yarn.api.protocolrecords.StopContainerRequest)

具体包括:

ContainerId for the container, which is globally unique.
NodeId of the node on which it is allocated.
HTTP uri of the node.
Resource allocated to the container.
Priority at which the container was allocated.
ContainerState of the container.
ContainerToken of the container, used to securely verify authenticity of the allocation.
ContainerStatus of the container.


Resource
代表集群中的计算资源,目前仅代表一块大小可以设置的内存。

YarnClient
客户端,主要功能包括提交/杀死任务,获取相关信息(任务报告、所有任务列表、所有节点报告、队列信息等)。

ClientRMProtocol
@Public
@Stable

The protocol between clients and the ResourceManager to submit/abort jobs and to get information on applications, cluster metrics, nodes, queues and ACLs.

客户端和RM交互的协议,YarnClient的主要功能都通过其实现。协议实现是ClientRMService
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics