Systems Automation

Overview

IT Services needs to be able to scale to ever-increasing numbers of systems and applications and ever-higher ratios of systems to systems administrators. Clients need faster provisioning of new systems and faster changes to systems, but with auditing and reporting of what has changed. IT Services also needs to track new releases of operating systems and applications, which means finding robust and efficient ways of testing new software and upgrading large numbers of systems. This makes automation vital to IT Services' systems administration work.

The enemy of automation is anything that requires separate action on individual systems. As much administration as possible must be done through central management interfaces that can push changes to all affected systems. Hardware and configuration inventory similarly needs to be centralized and data collection automated to reduce or eliminate manual data entry. Actions that do need to be done manually should be exposed as a service via protocols such as remctl or SOAP (Simple Object Access Protocol), so that administrators do not have to log on to a system to make changes to it. Moreover, centralization will provide opportunities for new automation and administration tools that can enhance workflow and daily systems administration activities.

Effective automation and scaling require standardization and predictability. IT Services' Linux systems should look like standard Linux systems, leveraging the native package management frameworks and storing files in standard locations. Likewise, Windows systems should look like standard Windows systems, leveraging current Microsoft management software and techniques and making full use of the current features of Active Directory (AD) and related technologies. The CRC's (Computer Resource Consulting) Mac OS environment should align with industry standards in areas of centralized authentication, application revision control, and group policy management through AD extensions. IT Services' application installations should follow vendor recommendations and best practices wherever possible to make full use of the accumulated wisdom of other installations. Stanford-specific divergences, where needed, should be isolated and well-documented, and IT Services will continue to attempt to reduce those divergences by pushing the changes needed for Stanford into the standard versions of operating systems and applications.

Local development will not be avoided where gaps are identified; rather, such gaps will be considered as challenges and opportunities to collaborate and improve the software for everyone. Local development will be emphasized and combined with active participation in broader communities, either incorporating that development into the general product or making it available in other ways to the community using that software. The best path to standard systems is to improve the standard and software so that it works for Stanford; avoiding local development leads to an unsustainable accumulation of workarounds, patches, and manual processes to fill in gaps.

Best practices guides for the configuration management systems used for all servers will be developed and maintained, and these guides will be used to ensure that all servers are maintained consistently throughout the department and over the life of each system. IT Services' use of Puppet and Microsoft's System Center Configuration Manager (SCCM) will be documented for the broader community.

Current State

Core technologies used by IT Services include:

  • Puppet for UNIX configuration management.
  • remctl (locally developed) for UNIX service-oriented architecture.
  • SOAP and/or REST (Representational State Transfer) for Windows service-oriented architecture. SOAP interfaces are built into Microsoft OS and server products.
  • Out-of-date (locally developed) systems for package and operating system version information on UNIX systems.
  • Standardization on a quarterly patch cycle for UNIX systems and monthly application of patches for Windows systems.

IT Services is keeping pace with, or is out in front of, the following industry trends:

  • Central administration of systems through system analysis, automatic correction of divergences, and automated deployment of configuration and package updates, via Puppet and SCCM.
  • Configuration management, in the sense of having a central repository of data about systems and services and the interconnections between them. IT Services has several partial implementations of configuration management and is currently working on a comprehensive deployment based on the CMDBf framework, which will be cutting-edge from the perspective of Stanford's peer institutions.
  • Exposing system and application functions as network services. The UNIX Systems group has built a considerable amount of automation and integration on remctl. Nearly all systems and services can be managed remotely via remctl, and remctl is also used to provide the Help Desk with tools to fix user problems without needing to escalate tickets. Microsoft includes WS-Management, a SOAP-based automation protocol, with its operating systems as a built-in service, Windows Remote Management (WinRM), and provides SOAP and/or REST interfaces for its server products (Exchange, SharePoint, etc.). The Windows group is also adding SOAP or REST interfaces to locally-programmed services that benefit from integration.
  • Distributed revision control systems. After successful experiments with local wiki and Debian packaging, the UNIX Systems group is moving from CVS and Subversion towards Git for all locally-developed software and packaging.
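
The idea of exposing system and application functions as network services, described in the list above, can be illustrated with a minimal Python sketch. It is not remctl's API or configuration format: the handler names, the command table, and the dispatch function are hypothetical and only show the general pattern of mapping a command and subcommand to one narrowly-scoped operation, so that an administrator or Help Desk tool never needs a login shell on the affected system.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical handlers. Real ones would query or change a local service.
def account_status(args: List[str]) -> str:
    return f"account {args[0]}: active"

def service_restart(args: List[str]) -> str:
    return f"restart requested for {args[0]}"

# (command, subcommand) -> handler. Only operations listed here are exposed.
COMMANDS: Dict[Tuple[str, str], Callable[[List[str]], str]] = {
    ("account", "status"): account_status,
    ("service", "restart"): service_restart,
}

def dispatch(command: str, subcommand: str, args: List[str]) -> str:
    """Run the handler registered for a command, or refuse unknown ones."""
    handler = COMMANDS.get((command, subcommand))
    if handler is None:
        raise KeyError(f"unknown command: {command} {subcommand}")
    return handler(args)
```

In this sketch, dispatch("account", "status", ["jdoe"]) runs only the account status handler, and any command not in the table is rejected. A real deployment would add authentication and per-command authorization in front of the dispatch step.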

IT Services needs to catch up with these other technology trends:

  • Virtualization. This critical component of IT Services' system administration strategy is discussed in a separate strategy document, Server Virtualization.
  • Batch, grid, and other research computing. See the Research Computing strategy.
  • Bug tracking. The industry trend is to drive most software development from a comprehensive bug tracking system. IT Services has only a short-term issue tracking system (in HelpSU) and spike implementations of Jira and Microsoft's bug tracking. The current Jira deployment has serious problems and is not suitable for use outside of the limited areas in which it's currently used. Microsoft's Team Foundation Server has more promise for Windows development, but as yet it has been used for only a single project and needs more development.
  • Microsoft and other vendors are increasingly focused on interoperability and interchangeability between their products. We're also seeing slow but steady convergence among vendors about how their products are deployed and managed. This is providing more opportunity to use best-of-breed products while maintaining standardized environments and configuration.

Vision

The UNIX Systems group has already seen huge gains in productivity and ability to scale from the deployment of Puppet, a configuration management tool, and expects to see even more gains from completing that work and having all systems be managed the same way. Puppet has been a key factor in IT Services' ability to somewhat reduce rates, and the hope is to continue that trend and use the same infrastructure to offer competitively-priced research computing services. The Windows admins hope to leverage SCCM to replace similar homegrown technologies that take more time to maintain; the expected savings in maintenance time outweigh the incremental cost of purchasing SCCM. The next step for Puppet on UNIX is to use that infrastructure to standardize the process by which general changes are made across all systems that UNIX Systems runs and how those changes are communicated to clients. The result of that work will be better, more structured client communication that shows clients exactly what they're getting for their money.

Staff both inside Computing Services and elsewhere in IT Services currently waste quite a bit of time with duplicate data entry, tracking down bad data, and dealing with problems from inconsistent record-keeping around systems. The CMDB (configuration management database) project will provide a framework to solve that problem, and will also enable far better reporting and presentation of system problems that require administrator attention. This will increase perceived responsiveness to issues and allow IT Services to proactively address problems before clients notice.

System automation offers an opportunity for strategic investment that improves the efficiency of many services provided by IT Services. System administration is foundational to IT Services, and correctly chosen investments will lead to greater reliability, reduced cost, and improved flexibility in services offered.

Goals

  • All Linux and Windows systems will be managed using central management systems. For Linux systems, this will be Puppet; and for Windows, this will be System Center Configuration Manager. Routine operations that require logging in separately to multiple systems will be rare or non-existent. Changes to shared configuration will be made globally in one place and pushed out to all systems by the management software. General goals for Mac OS X are unclear at this time.
  • All systems will be configured to operate in a truly "lights-out" manner, not requiring any physical contact except for major hardware issues.
  • Core information about all systems and services will be stored in a federating configuration management database (CMDB). This will include a subset of all discoverable hardware information, network architecture, and similar data. A larger set of data will be stored in respective systems of record outside of the CMDB, and the CMDB will contain links to where that information can be obtained. The CMDB will also store information about the mappings between systems and services and other relevant relationships between data known and stored in the CMDB. Queries such as systems on an old environment, systems running an old operating system, or all systems for which a given administrator is primary will be supported by a query service that exposes the stored data.
  • Using the configuration management system and additional supporting data gathering, each system administrator will have a view into their systems that collects everything they need to act on: intrusion detection and log filtering reports, noisy system jobs, monitoring alerts, pending upgrades, and system metrics.
  • Centralized reporting of key performance metrics will be available for common services, allowing customers to see their utilization and enabling proactive capacity planning. See the Server and Application Monitoring and the Reporting strategy documents for more information.
  • Routine operations on Linux servers that do not make sense to manage through Puppet (such as querying the status of an account or service, making data changes to a service, or infrequently-used administrative operations) will be exposed via a remctl interface so that they can be run remotely without logging on to the affected system. Likewise, similar operations on Windows systems will be exposed via a SOAP or REST interface.
  • Deploy a shared IT Services bug and issue tracking system that can support local software development as well as longer-term issues and feature requests for central services. This will supplement the HelpSU system, which will continue to be used for short-term, immediate problems or client requests.
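
The kinds of CMDB queries named in the goals above (systems on an old environment, systems running an old operating system, and all systems for which a given administrator is primary) can be sketched in Python as simple filters over system records. This is only an illustration under stated assumptions: the record fields, function names, and values used here are hypothetical and do not describe the schema or query service of the actual CMDB project.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

@dataclass(frozen=True)
class SystemRecord:
    # All field names are illustrative assumptions, not the real CMDB schema.
    hostname: str
    os_release: str
    environment: str
    primary_admin: str
    services: Tuple[str, ...] = ()

def on_environment(records: Iterable[SystemRecord], environment: str) -> List[str]:
    """Hostnames of systems still pointing at the given environment."""
    return sorted(r.hostname for r in records if r.environment == environment)

def running_os(records: Iterable[SystemRecord], os_release: str) -> List[str]:
    """Hostnames of systems running the given operating system release."""
    return sorted(r.hostname for r in records if r.os_release == os_release)

def administered_by(records: Iterable[SystemRecord], admin: str) -> List[str]:
    """Hostnames of systems for which the given administrator is primary."""
    return sorted(r.hostname for r in records if r.primary_admin == admin)

def systems_for_service(records: Iterable[SystemRecord], service: str) -> List[str]:
    """Hostnames of systems underlying the given service."""
    return sorted(r.hostname for r in records if service in r.services)

# Made-up inventory used only to exercise the queries.
INVENTORY = [
    SystemRecord("web1.example.edu", "os-old", "previous-stable", "alice", ("www",)),
    SystemRecord("web2.example.edu", "os-new", "stable", "alice", ("www",)),
    SystemRecord("db1.example.edu", "os-old", "stable", "bob", ("www", "billing")),
]
```

A federating CMDB would answer such queries by combining its own core data with links into the systems of record, but the shape of the questions stays the same: each one is a filter on a small set of attributes and relationships.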

Specifically for UNIX systems:

  • Linux systems will point to one of three configuration repositories: a development branch, the current stable branch, and the previous stable branch. The development branch will be used to test both new changes to specific systems and global improvements to the shared configuration. Quarterly, the development branch will be branched and declared the new stable branch. All stable systems will be moved from the previous stable branch to the new one over the next quarter, scheduled according to the needs of that service or client. This will allow better testing and non-disruptive release of global changes and will allow IT Services to document for clients the specific changes that will go into each quarterly patch cycle.
  • All locally-developed Linux software and packaging maintained by IT Services will use the Git revision control system.
  • Revise the Puppet Best Practices document as an ongoing practice and publish internal practices for Puppet as a companion to the Puppet Best Practices document that IT Services continues to maintain for the Puppet community.
  • Improve the security (through package signatures), auditing, and consistency across architectures of local Linux package repositories.
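
The quarterly rotation of the development, current stable, and previous stable branches described above can be sketched as follows. This is a minimal Python model of the bookkeeping only, not of Puppet or Git: the class, the function names, and the branch and host names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Branches:
    development: str
    stable: str
    previous_stable: str

def quarterly_release(branches: Branches, new_stable: str) -> Branches:
    """Declare a branch cut from development as the new stable branch.

    The old stable branch becomes the previous stable branch, and the
    development branch continues unchanged.
    """
    return Branches(
        development=branches.development,
        stable=new_stable,
        previous_stable=branches.stable,
    )

def systems_to_migrate(assignments: Dict[str, str], branches: Branches) -> List[str]:
    """Hosts still pointing at the previous stable branch.

    These are the systems to schedule for a move to the new stable
    branch over the next quarter.
    """
    return sorted(
        host for host, branch in assignments.items()
        if branch == branches.previous_stable
    )
```

For example, starting from Branches("development", "2009q3", "2009q2") and calling quarterly_release with "2009q4" leaves "2009q3" as the previous stable branch, and systems_to_migrate then lists every host still assigned to "2009q3". The difference between two consecutive stable branches is also the natural source for the change list communicated to clients.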

Specifically for Windows systems:

  • Complete production roll-out of System Center Configuration Manager (SCCM) to all Windows systems.
  • Integrate the existing Windows Server “Lite-Touch” build process with SCCM.
  • Enable WS-Management (WinRM) interface on all Windows systems.
  • Upgrade Windows source code repository system to Team Foundation Server 2010.

Roadmap

Overall:

  • Deploy the results of the CMDB project.
  • Develop, on top of the CMDB information and other data captured from systems, a unified view of alert, reporting, and needed-action information for systems. From that, build a reporting view for system administrators to see that information for all the systems that they maintain.
  • Choose and deploy a bug-tracking system that's suitable for tracking both internal development work and service problems. Git integration is a mandatory feature given IT Services' strategic direction for revision control systems.

For the UNIX Systems group:

  • Implement the environment support described in Goals. Develop procedures for documenting changes to the development branch, to be used for communication to clients when systems are moved from one stable environment to the next.
  • Further enhance remctl to support command discovery via a server-side help facility, upload of arbitrary-length data through the protocol, and support for arbitrary binary data as command arguments. This will improve IT Services' ability to expose all necessary integration and administrative interfaces through remctl.
  • Migrate passwords in configuration files to wallet.
  • Migrate the Puppet infrastructure to 0.25 and take advantage of the advanced features available in that version.
  • Rewrite the Puppet Best Practices based on 0.25 deployment.
  • Convert Debian repositories from debarchiver to reprepro and enable repository signing. Add stronger checking and auditing for the repositories to prevent common package upload errors.
  • Deploy an automated Debian package build environment in order to maintain the same set of packages for x86 and x86_64 without manual work.

For the Windows Systems group:

  • Deploy System Center Operations Manager 2007 R2.
  • Deploy System Center Configuration Manager 2007 R2.
  • Develop best practices and a base set of configuration task sequences for SCCM.
  • Update data connections to Remedy CMDB to draw information from System Center.
  • Complete deployment of network-based IPMI control for all servers.
  • Roll out WS-Management (WinRM) to all Windows servers.

Measures of success

  • Central management systems should make the management of systems and applications more efficient. Success can be measured through rates, which should be reduced per-system as less time is spent managing each system, and in growth of the ratio of systems to staff members. Currently, IT Services has a 40:1 system-to-staff ratio in the UNIX and Windows groups including application support, or 55:1 adjusted for application support. With the growth of lightweight virtual systems and larger farms of identical research computing equipment, this ratio is expected to rise.
  • Success of the CMDB project can be measured by efficiency and superior knowledge of IT Services' systems. For efficiency, the number of places requiring manual data entry about systems should be drastically reduced, and the remaining data that does require manual entry should only need to be entered once, after which it will propagate to the relevant systems. System knowledge can be measured by the ease with which one can identify the systems underlying a particular service, the systems maintained by a particular administrator, and out-of-date systems. There should also be a decrease in discrepancies between systems, particularly between the tracking systems used by administrators and the billing system.
  • Success of a unified system view for administrators can be measured by improvements in the cleanliness of regular system reports. The increased awareness should reduce the number of dirty Tripwire reports, unfiltered but uninteresting syslog messages, complaining periodic jobs, and similar forms of operational noise. The system can also be judged by the length of time it takes for a supervisor to audit how well administrators are maintaining the day-to-day health and cleanliness of their systems. Currently, auditing on only one metric takes hours; that time should drop substantially.
  • Standardizing on one version control system in each group, in addition to providing access to advanced features where needed, will help with training and staff efficiency by allowing all version control actions to be done the same way. Progress for the UNIX group standardization can be measured by counting the remaining legacy (CVS or Subversion) revision control repositories that are still in active use.
  • A bug tracking system will be successful if it is routinely used by staff to log feature requests, future improvements, and needed fixes for all services and locally developed software managed by IT Services and is well-integrated into the development workflow of the groups. Conversely, if such a system exists but is not used, or if it is avoided because it is perceived as cumbersome, it will have failed.
  • Research computing success can be measured by the use of centrally-provided facilities by university researchers and, hopefully, movement from department-managed research computing facilities to central facilities.
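
Two of the measures above reduce to simple arithmetic: the ratio of systems to staff, and the count of legacy revision control repositories still in active use. The following Python sketch shows one way such figures could be computed. The function names and all input values are hypothetical. The counts in the usage note are made up, chosen only so that the result matches the 40:1 figure quoted above, and are not actual IT Services data.

```python
from typing import Dict, List

# Revision control systems treated as legacy for the UNIX group.
LEGACY_VCS = {"cvs", "svn"}

def systems_per_staff(systems: int, staff_fte: float) -> float:
    """Ratio of managed systems to staff (full-time equivalents)."""
    if staff_fte <= 0:
        raise ValueError("staff_fte must be positive")
    return systems / staff_fte

def legacy_repositories(repos: Dict[str, str]) -> List[str]:
    """Names of active repositories still on CVS or Subversion.

    `repos` maps a repository name to its revision control system.
    """
    return sorted(name for name, vcs in repos.items() if vcs in LEGACY_VCS)
```

With made-up inputs of 1100 systems and 27.5 staff, systems_per_staff returns 40.0. Tracking that value and the length of the legacy_repositories list from quarter to quarter would show whether the ratio is rising and the legacy count is falling as intended.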