Job submission to Grid Computing systems

During the last years, the discussion about Grid and Cloud Computing has become very common. It’s very easy to find a paper or an article analyzing the advantages and disadvantages of using one of both choices. This blog post will not compare the similarities and diferrences between them, and neither argue for or against one of them. In this article we will try to take a deeper look at the Grid computing, and its evolution in the last decade.

Grid Computing refers to a set of distributed systems connected with each other and sharing all it’s resources,  such as processing power, data storage or amount of memory. The result is a network of computers (a grid of computers) forming a virtual single supercomputer. It’s mostly used to execute large-scale distributed applications, which would require a  large execution time in a common server or in a personal computer, like complex mathematical and scientific problems. In order to take advantage of the benefits of grid computing, the term “software scalability” becomes a determining point. Scalability refers to how the execution of a process changes by adding resources to the node it’s running in. In an ideal situation, by doubling the power and capacity of a server, the CPU time would decrease in a factor of 2 as well. However, the experience shows that trying to transform the hardware parallelism in an equivalent software parallelism is a very ambitious goal, which is not easy to achieve.

The architecture of a Grid Computing system  contains two different types of hosts. The first one is known as the control node, and its goal is to manage all the administrative tasks of the system. Clients will connect to it in order to make use of the services the Grid expose. The second type of hosts are the execution hosts, which form the grid of computers in which the jobs of the users will run. But how are the jobs dispatched to the grid and what happen if all hosts are busy? Well, this is a task for the Grid middleware. The Grid middleware is a software installed in the control node monitoring the whole grid, knowing in every moment the load of the system and which resources are free and available. When a user wants to make use of the grid services, his jobs is sent to a queue of jobs, where the middleware prioritize and schedule all the taks of the user across the network. It’s very usual to find diferent types of queue depending on the size of the task, delimiting every queue the maximum time execution, the amount of memory to use or also the number of CPUs required for the task being executed.

Figure 1. Example of Grid Computing system, formed by one server acting as the control node and two grid networks.

Grid computing is not something new. In the last decades, a large number of grids have been deployed by universities, scientific institutions and also private entities. The most common way the user could access to the grids was by establishing a SSH connection and running the jobs manually, polling the state of the job waiting for the result of the execution. Of course, as time went by, every entity developed its own different alternative. In year 2006 the Open Grid Forum (OGF) was born. OGF is a community of users, developers and vendors leading global standardization effort for distributed computing. Some of the standards created by the OGF are the Open Grid Services Architecture (OGSA), Open Grid Services Infrastrcuture (OGSI) and Job Submission Language Description (JSDL).

One of the most important definitions inside the OGSA is the Basic Execution Service (BES). BES defines a Web Service interface that can be used for creating, monitoring and managing processes running in remote infrastructures, like grids. The processes are called “Activities”, and they are described by the the Job Submission Description Language (JSDL). Also created by the OGF, JSDL contains all the necessary information in order to submit and execute an activity in a remote machine:

  • Activity Identification: Useful for the client-side, provides information to identify the activity, such as the name of the job or the project it belongs to.
  • Application information: Contains the main attributes of the application to be runned remotely, like the name of the executable, version, arguments, etc.
  • Resources information : Requirements of the machine(s) the activity will be executed in. In this section the user can specify the attributes of the host(s) in terms of system architecture, operating system, memmory, storage, etc.
  • Data Staging: List of files to be imported to the system for executing the activity, and also the list of files to be exported after it (for example: the activity result).

The BES interface is not more than a SOAP Web Service whose WSDL is defined in JSDL. Submiting a job using a BES interface means that the server will create a new process instance with the attributes specified in the JSDL, it will collect all the input files from a remote host, execute it and  transfer the result to the host the user has specified.  The methods which a user can call for creating, monitoring and controlling his activities are:

  • CreateActivities: Using the Job Submission Description Language, the user can specify the main attributes of the job(s) he wants to submit to the grid. Every job will be transformed in a process instance executed independently from each other. The BES interface will return to the user the identifier of the job, required to use the rest of the methods of the Webservice.
  • GetActivitiesStatuses: The user can poll the state of all his jobs in the grid, using the id returned when the activity was created. A job can be in one of the following states: “pending” (the activity was submited, but is not instantiazed yet, since it’s beeing scheduled or waiting for an available host), “running” (the process is running in the grid, including the staging in an staging out of files), “cancelled” (by the user or by the system), “finished” (activity execution is finished and the result was staged out if necessary) and “failed” (job execution could not finish correctly).
  • TerminateActivities : The user can cancel the job execution while the activity is in pending or running state.
  • GetActivityDocuments: The JSDL file sent by the user might have changed inside the grid system, by inserting new attributes according to policies or processes inside the server. The user can request this file in order to compare it with the original one.
  • GetFactoryAttributesDocument : This operation is used to know managing information about the Web Service.

Figure 2. State diagram of a job specified by the BES standard.

Originally published in 2008, many software solutions has been adapted and implemented using this standard, like middleware (Gridway, Gridsam)  and provider (Unicore, Globus Toolkit). It has been an important point in the standarization history og the grid environment for the providers,facilitating the implementation of a Web Service their users can consum, but also for the clients, finding the same common usual interface in the different grids.

This entry was posted in Ongoing Projects, Projects and tagged , , . Bookmark the permalink.