Timeshare Service

Overview

The timeshare service offers a central location for students, staff, and faculty to run general applications on a well-maintained Unix environment. The service is split into two major categories: low-intensity interactive jobs (e.g., email and text editing) and longer-term scientific jobs (e.g., Matlab, Mathematica, and Stata). The former requires only two to three low-to-mid-powered nodes; the latter requires a large number of powerful systems and a significant amount of software infrastructure to manage them effectively. The service currently supports about 3,000 users per month. This number is likely to increase significantly if it becomes easier for students and researchers to take advantage of the cluster for their work.

The compute density per node has continued to increase over the last several years, and this trend does not appear to be slowing significantly. The countervailing trend, however, is that the computing needs of the research community have grown to match this increase in density. More research codes are adding parallel elements, and researchers want to use them. To reduce the number of researchers who build ad-hoc clusters out of desktops in their existing work space, there must be support for the basic research that they need to do.

The campus has already prioritized centralized research computing as a primary goal over the next several years. The timeshare service is one component of that goal, both as a driver for the central facility and as a proving ground for various software packages.

Current State

There are currently 30 compute nodes, each with 8 cores and 32GB of memory, available to campus users and offered to the community as the Corn cluster. These systems run Ubuntu Linux. Additionally, there are currently three heterogeneous systems in the Cardinal cluster. These systems run Debian Linux, and are a mix of virtual and physical systems. All of these systems are maintained by the Sysadmin Team (and specifically the Timeshare Team).

Vision

In the short term, IT Services plans to implement a full queuing system on the timeshare servers, which will provide several advantages over the current system and allow the timeshare service to expand. In addition, older versions of software applications on the timeshare servers will be purged, and an improved system for installing applications on the servers will be implemented. Lastly, new dedicated mailing lists will promote better communications with users about the timeshare service.

The first step will be to deploy a queuing system for job submission. In the current system, users pick a node more or less at random, log on, and run the job directly. This has several disadvantages:

  • Tracking jobs is extremely difficult for both users and administrators.
  • There is no automatic way to restrict resource usage based on user or job type; there is also no good way to report on such usage, or to charge for it.
  • It can be difficult to find an empty node, which leads to users running jobs over each other.
  • There is no easy support for "specialized" nodes for specific projects, such as a greater need for RAM or swap space, or an improved interconnect.
  • There is no good way to handle multi-node parallel jobs.
  • Maintaining the nodes becomes more difficult because it is difficult to clear (and therefore patch/upgrade) an existing node.
  • There is no easy and consistent way to take advantage of external computing clouds (e.g. Amazon EC2 or the TeraGrid).

To address these issues, IT Services will implement an LSF (Load Sharing Facility) queuing system over the next several months. The system will initially be used to submit jobs to a small number of nodes reserved for batch work. As IT Services acquires familiarity with the system, the offering may be expanded to submit into the cloud, and/or involve the purchase of additional nodes for local jobs. The benefits of a queuing system include improved system utilization, compatibility with other queuing alternatives from other institutions and/or Stanford's HPC (High-Performance Computing) environments, and the improved maintainability and scalability of existing nodes.
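As a rough illustration of what job submission through LSF might look like for users, the sketch below assembles a `bsub` command line in Python. The flags used (`-q`, `-n`, `-J`, `-o`) are standard `bsub` options, but the queue name `batch`, the helper name `build_bsub_command`, and the example job are assumptions made for illustration, not details of the planned deployment.

```python
import shlex


def build_bsub_command(command, queue="batch", slots=1,
                       job_name=None, output_file=None):
    """Return an argument list that submits `command` through LSF's bsub.

    The default queue name "batch" is a hypothetical placeholder.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    args = ["bsub", "-q", queue, "-n", str(slots)]
    if job_name:
        args += ["-J", job_name]      # name shown in job listings
    if output_file:
        args += ["-o", output_file]   # file that receives job output
    # The job's own command and arguments follow the bsub options.
    args += shlex.split(command)
    return args


if __name__ == "__main__":
    # Hypothetical example: a 4-slot Stata batch run.
    example = build_bsub_command("stata -b do model.do", slots=4,
                                 job_name="model", output_file="model.out")
    print(shlex.join(example))
```

A wrapper of this kind could also give the help desk a single place to document recommended options, which may help with the user-education cost of the new system.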

The major cost of adding a queuing system will be in user education, as users will have to be taught to use it. The learning curve will ideally be minimized by standardizing on a single queuing system. IT Services will also need to deploy a one-to-two node master queue server. Once the queuing system is complete, Computing Services will implement a central computing cluster for large-scale jobs, which will be operated as a client of the HPC Team. It will use the same queuing system described above, and provide significantly more computing power for those who need it. It will also offer cost-recovery functionality.

The second step will be to retire all unmaintained applications from the pubsw tree. The Unix Systems team has long maintained a tree of freeware applications for the Unix clusters, known as 'pubsw' (because it is located in /usr/pubsw on the cluster systems). Thanks to improved package management tools, the software in this tree has in recent years been superseded by locally installed versions of the same packages, and the pubsw tree itself has not been adequately maintained. The Timeshare Team plans to retire the pubsw tree from service on Linux systems by removing 10 to 15 software packages each week. At the end of the process, the pubsw tree will contain only actively maintained software and symlinks to existing software locations on the timeshare nodes. Note that this will not affect the /usr/sweet tree, which is where site-licensed software is managed.

Additionally, a better system for managing installed packages on the timeshare servers will be implemented. The list of more than a thousand installed packages is currently maintained entirely through Puppet. While convenient, Puppet is not designed to handle this volume of software, especially when it comes to the order of package installation. The intent is to select packages for installation via a Debian meta-package, which will ensure that packages install in a consistent order and will improve the speed of system installs.
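To make the meta-package idea concrete, the following sketch renders a Debian binary control stanza whose `Depends` field lists the packages to install. The field names are standard Debian control fields; the package name `timeshare-apps`, the maintainer address, and the dependency list are hypothetical placeholders rather than the actual package set.

```python
def render_metapackage_control(name, version, depends, maintainer, summary):
    """Render a Debian control stanza for a dependency-only meta-package."""
    if not depends:
        raise ValueError("a meta-package needs at least one dependency")
    # Sort and de-duplicate so the generated file is stable between runs.
    unique_depends = sorted(set(depends))
    lines = [
        f"Package: {name}",
        f"Version: {version}",
        "Section: metapackages",
        "Priority: optional",
        "Architecture: all",
        f"Maintainer: {maintainer}",
        "Depends: " + ", ".join(unique_depends),
        f"Description: {summary}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are illustrative assumptions.
    print(render_metapackage_control(
        name="timeshare-apps",
        version="1.0",
        depends=["r-base", "emacs", "octave", "emacs"],
        maintainer="Timeshare Team <timeshare@example.org>",
        summary="Applications installed on timeshare nodes",
    ))
```

With this arrangement, Puppet would only need to ensure that the single meta-package is present, leaving dependency resolution and install ordering to the package manager.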

Lastly, communications about the timeshare service are currently being handled through HelpSU (client-admin) and the local message-of-the-day files (admin-client). While those communication channels must be maintained, overall communication efforts must be improved, starting with the creation and population of community mailing lists for discussion of the service.

Goals

For the queuing system project:

  • Gather requirements and metrics: What constitutes success vs failure? Who are the interested parties? How will we support them?
  • Find a pilot community to test and use the new system.
  • Implement a basic service.
  • Roll out the service to a small number of existing machines.
  • Instruct the help desk in how to teach users to use the service.
  • Roll out the service to new systems (virtual machines, additional new hardware, or cloud systems).

Additional improvements:

  • Puppet, the system management tool used for the timeshare nodes, best handles systems with a relatively small number of installed packages (200 to 500). The existing timeshare nodes have on the order of 1,500 installed packages, and Puppet does not handle this in an efficient manner. The new management system will be implemented through the ongoing LSP project.
  • The process of retiring old software packages has already begun in earnest, most recently with the retirement of fetchmail-based mail clients. A batch of software packages will continue to be retired on a bi-weekly basis until the entire tree is either retired or individually justified. The project is expected to be completed within the next six months. Any necessary software will be directly installed on the timeshare nodes.
  • Two timeshare mailing lists have been created for user communication: one for timeshare-related announcements (for all timeshare users), and one for discussions (interested users only). However, these lists still need to be populated, and procedures created to ensure that they stay populated appropriately.
  • OS updates occur every six months for the Ubuntu operating system. Under current practice, the timeshare team deploys new Ubuntu releases immediately after they come out, but this can lead to some instability. In the future, IT Services will formally split the cluster into two: a "stable" branch and a "bleeding edge" branch. The former will be deployed on the majority of the nodes; the latter will be installed on a small number of nodes for users who need the most recent software and are willing to sacrifice a bit of stability to get it. IT Services will still not support unreleased versions of Ubuntu.
  • Account compromises turn nodes into sources of spam and denial-of-service attacks, which then require significant staff time to remediate. They may also result in the nodes, or the campus network as a whole, being "blacklisted" by other Internet sites. IT Services currently uses Nagios and some customized automated scripts to help detect these compromises in their early stages and cut down on remediation work. Further development and refinement of these tools will allow for earlier detection.
  • Automated scripts are currently used to detect and remediate certain types of resource starvation on the nodes, increasing the reliability of the timeshare service and cutting down on staff time. As there will continue to be at least some need for interactive (non-queue-able) computing on the nodes, IT Services will continue to develop these scripts.
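The resource-starvation checks mentioned in the last item above might take a form like the following sketch. It is not the team's actual script: the thresholds, function names, and the choice of load average and available memory as signals are all assumptions, and it only detects a problem without attempting remediation.

```python
import os


def read_mem_available_fraction(path="/proc/meminfo"):
    """Return MemAvailable / MemTotal as reported by a Linux kernel."""
    values = {}
    with open(path) as handle:
        for line in handle:
            key, _, rest = line.partition(":")
            fields = rest.split()
            if fields:
                values[key] = int(fields[0])  # sizes are reported in kB
    return values["MemAvailable"] / values["MemTotal"]


def starvation_warnings(load_1min, cpu_count, mem_available_fraction,
                        max_load_per_cpu=2.0, min_mem_fraction=0.05):
    """Return the resources that look starved, given hypothetical thresholds."""
    warnings = []
    if load_1min > max_load_per_cpu * cpu_count:
        warnings.append("cpu")
    if mem_available_fraction < min_mem_fraction:
        warnings.append("memory")
    return warnings


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()
    print(starvation_warnings(load_1min, os.cpu_count() or 1,
                              read_mem_available_fraction()))
```

Keeping the threshold logic in a pure function separates policy from measurement, so the same check could be driven by Nagios or by a scheduled job.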

Measures of success

  • One useful measure of success is the continued user count: who is actually using the service on a regular basis. These metrics are currently being provided in monthly systems reports.
  • Moving forward, there is the need to develop additional measures of usage for timeshare systems. Specifically, once jobs are submitted through a queuing system, IT Services will be able to identify who is using these systems and what applications they are running. This information could be extremely useful for tailoring future hardware and software purchases.
  • Perhaps the most useful measure of success will be the growth of this fledgling research computing service. A well-run timeshare service will both nurture researchers and educate IT Services about the basics of scientific computing; with any luck, these researchers will then take advantage of other IT Services offerings as their computing needs grow.
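As a sketch of the usage reporting that a queuing system would make possible, the example below summarizes job records by application. The record fields (`user`, `application`, `cpu_hours`) and the sample data are hypothetical; the real accounting format would depend on what the queuing system exports.

```python
from collections import defaultdict


def summarize_usage(job_records):
    """Return per-application distinct user counts and total CPU hours."""
    users_by_app = defaultdict(set)
    cpu_hours_by_app = defaultdict(float)
    for record in job_records:
        app = record["application"]
        users_by_app[app].add(record["user"])
        cpu_hours_by_app[app] += record["cpu_hours"]
    return {
        app: {"users": len(users_by_app[app]),
              "cpu_hours": cpu_hours_by_app[app]}
        for app in sorted(users_by_app)
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"user": "alice", "application": "matlab", "cpu_hours": 2.0},
        {"user": "bob", "application": "matlab", "cpu_hours": 1.5},
        {"user": "alice", "application": "stata", "cpu_hours": 0.5},
    ]
    for app, stats in summarize_usage(sample).items():
        print(app, stats["users"], stats["cpu_hours"])
```

A summary of this kind could be folded into the existing monthly systems reports alongside the user counts.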