Writing a Solaris Service Management Facility (SMF) service manifest

SMF services are essentially daemons that stay in the background waiting for the requests they should serve; when a request arrives, the daemon wakes up, serves it, and then waits for the next request to come.

Services are built using different development platforms and languages, but they share one common aspect, which we are going to discuss here: the service manifest, which describes the service to SMF and lets SMF manage and understand the service life cycle.

To write a service manifest we should be familiar with XML and with the service manifest schema located at /usr/share/lib/xml/dtd/service_bundle.dtd.1. This file specifies which elements can be used to describe a service to SMF.

The next thing we need is a text editor, preferably an XML-aware one. The GNOME gedit editor can do the task for us.

The service manifest file is composed of six important elements, which are listed below:

  • The service declaration: specifies the service name, type, and instantiation model
  • Zero or more dependency declaration elements: specify the service's dependencies
  • Life-cycle methods: specify the start, stop, and refresh methods
  • Property groups: specify which property groups the service description has
  • Stability element: specifies how stable the service interface is across version changes
  • Template element: provides more human-readable information about the service

To describe a service, the first thing we need to do is identify the service itself. The following snippet shows how we can declare a service named jws.

<service name='network/jws' type='service' version='1'>
<single_instance/>

The first line specifies the service name, its version, and its type. The name attribute forms the FMRI of the service, which for this instance will be svc:/network/jws. On the second line we tell SMF that it should instantiate only one instance of this service, which will be svc:/network/jws:default. We can use the create_default_instance element to control the automatic creation of the default instance.
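As a sketch of the create_default_instance element just mentioned, the declaration could look like this (the enabled='false' value is an assumption; set it to 'true' to have the default instance enabled as soon as the manifest is imported):

```xml
<service name='network/jws' type='service' version='1'>
  <!-- Create the default instance automatically, but leave it disabled -->
  <create_default_instance enabled='false'/>
  <!-- Still allow only one instance of this service -->
  <single_instance/>
</service>
```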

All of the elements which we are going to mention in the following sections of this article are immediate child elements of the service element which itself is a child element of the service_bundle element.

The next important element is the dependency declaration element. We can have zero or more of these elements in our service manifest.

<dependency name='net-physical' grouping='require_all' restart_on='none' type='service'>
<service_fmri value='svc:/network/physical'/>
</dependency>

Here we are telling SMF that our service depends on the svc:/network/physical service, which needs to be online before our service can start. The possible values for the grouping attribute include the following:

  • require_all: all services listed in this grouping must be online before our service can come online
  • require_any: it suffices for any one of the services in this grouping to be online for our service to come online
  • optional_all: the presence of the listed services is optional; our service works with or without them
  • exclude_all: lists services that conflict with our service; our service cannot come online in their presence
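For example, a require_any grouping that is satisfied as long as at least one network service is online could be sketched as follows (the dependency name and the FMRIs below are illustrative; list whichever services make sense for your own service):

```xml
<dependency name='any-net' grouping='require_any' restart_on='none' type='service'>
  <!-- Either of these being online satisfies the dependency -->
  <service_fmri value='svc:/network/physical:default'/>
  <service_fmri value='svc:/network/loopback:default'/>
</dependency>
```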

The next important elements specify how SMF should start, stop, and refresh the service. For these tasks we use three exec_method elements, as follows:

<exec_method name='start' type='method' exec='/opt/jws/runner start' timeout_seconds='60'>
</exec_method>

This is the start method; SMF invokes the command given in the exec attribute when it wants to start the service.

<exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'>
</exec_method>

SMF terminates the process it started for the service by sending it a kill signal. By default it sends SIGTERM, but we can specify our own signal, for example ':kill -9' or ':kill -HUP', or any other signal we find appropriate for terminating our service.
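For instance, a stop method that sends SIGHUP instead of the default SIGTERM could be written like this (a sketch; pick whichever signal your daemon actually handles for shutdown):

```xml
<exec_method name='stop' type='method' exec=':kill -HUP' timeout_seconds='60'>
</exec_method>
```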

<exec_method name='refresh' type='method' exec='/opt/jws/runner reload_cof' timeout_seconds='60'>
</exec_method>

The final method we should describe is the refresh method, which should reload the service configuration without disturbing its operation.

The start and stop method descriptions must be present in the manifest in order to import it into the SMF repository, but the refresh method description and implementation is optional.

The timeout_seconds='60' attribute specifies that SMF should wait 60 seconds before aborting the method execution and retrying it. We can use a longer timeout when we know that the method takes longer to execute, or a shorter one when we know that our service starts quickly.

The property group elements specify which property groups SMF should associate with the service. We can use as many of the SMF built-in property groups as we need, or define our own.

<property_group name='startd' type='framework'>
<propval name='ignore_error' type='astring' value='core,signal'/>
</property_group>

The above snippet shows how we can use a built-in framework property group. The built-in property groups and their properties have meaning for the SMF framework and change the way it deals with our service. For example, the above snippet tells SMF to ignore the service being terminated by a core dump or by a kill signal sent from another application.
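We can also define our own property group of type 'application' to carry service-specific settings. A sketch follows; the group and property names (jws, config_file, port) and their values are hypothetical examples, not part of any built-in group:

```xml
<property_group name='jws' type='application'>
  <!-- Service-specific configuration values, readable via svcprop(1) -->
  <propval name='config_file' type='astring' value='/etc/jws/jws.conf'/>
  <propval name='port' type='count' value='8080'/>
</property_group>
```

The service's start method can then look these values up, for example with svcprop -p jws/port network/jws.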

The next important element is the stability element, which specifies whether the property names, the service name, its dependencies, and other service attributes may change in the next release. For example, the following snippet specifies that the service attributes may change in the next release.

<stability value='Unstable'/>

Finally, we have the template element, which contains descriptive information about the service. For example, we can have something like the following element, which describes our service to human users.

<template>
<common_name>
<loctext xml:lang='C'>Java Network Server</loctext>
</common_name>
<documentation>
<manpage title='JWS' section='1M' manpath='/usr/share/man'/>
</documentation>
</template>

The entire element is self-describing except for the manpage element, which specifies where the man utility should look for the man page of the jws service.

All of these elements should be saved into an XML file with the following structure:
<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle ...>
<service ...>
<single_instance .../>
<dependency ... />
<exec_method ... />
<property_group .../>
<stability .../>
<template ... />
</service>
</service_bundle>

This is a rough representation of what a bare-bones service manifest can look like. The syntax is not exact, and the attributes and child elements are omitted to make it easier to scan.

The service_bundle, service, and start and stop method elements are mandatory, while the other elements are optional.

Now we should install the service into the SMF repository and then administer it. To install the service we can use the following command.

# svccfg import /path/to/manifest.xml
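Before importing, it is a good idea to validate the manifest against the DTD, and after importing we still need to enable the instance. A typical session might look like the following sketch (assuming the jws service from this article):

```shell
# Validate the manifest against the service_bundle DTD
svccfg validate /path/to/manifest.xml

# Import it into the SMF repository
svccfg import /path/to/manifest.xml

# Enable the default instance and inspect its state
svcadm enable svc:/network/jws:default
svcs -l network/jws
```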

When we import the service manifest into the service repository, SMF scans the service profiles and checks whether this service should be enabled, either directly or because another service depends on it. If it should, SMF uses the start method specified in the manifest to start the service. But before starting it, SMF checks its dependencies and recursively starts any required services that are not already online.

Reviewing the XML schema located at /usr/share/lib/xml/dtd/service_bundle.dtd.1 and staying tuned for my next few entries is recommended for a better understanding of SMF and Solaris service administration/management.

Solaris fault administration using fmadm command

In this article we will study how we can use the fmadm command to get the list of faulty components along with the detailed information about the fault.

Before starting we should have a command console open; then we can proceed with the fmadm command. The most basic form of the fmadm command is its faulty subcommand, used as follows:

# fmadm faulty

When there is no error in the system, this command shows nothing and exits normally, but with a faulty component the output is different. For example, in the following sample we have a faulty ZFS pool because some of its underlying devices are missing.

Starting from the top we have:
  • Identification record: consists of the timestamp, a unique event ID, a message ID telling us which knowledge repository article we can refer to for learning more about the problem and its troubleshooting, and finally the fault severity, which can be Minor or Major.
  • Fault class: tells us which device hierarchy is causing this fault
  • Affects: tells us which component of our system is affected and how; in this instance some devices are missing, so the fault manager takes the zpool out of service
  • Problem in: shows more detail about the root of the problem, in this case the device ID
  • Description: refers us to a knowledge base article discussing this type of fault
  • Response: shows what action(s) the fault manager took to repair the problem
  • Impact: describes the effect of the fault on overall system stability and on the component itself
  • Action: a quick tip on the next step the administrator should take to troubleshoot the problem; this step is fully described in the knowledge base article referenced in the description field

The following figure shows the output of proceeding with the suggested action.

As we can see, the same article we were referred to is mentioned here again. Two of the three devices have failed, and fpool had no replica for the failed devices to replace them automatically.

If we had a mirrored pool and one of the three devices went out of service, the system could automatically take corrective action and continue working in a degraded state until we replaced the faulty device.

The fault management framework is a pluggable framework consisting of diagnosis engines and subsystem agents. Agents and diagnosis engines contain the logic for assessing a problem, performing corrective actions when possible, and filing the relevant fault record into the fault database.

To see the list of agents and engines plugged into the fault management framework we can use the config subcommand of the fmadm command. The following figure shows the output of this command.

As we can see in the figure, there are two engines deployed with OpenSolaris: eft and zfs-diagnosis. The eft engine, whose name comes from the Eversholt Fault Diagnosis Language, is responsible for assessing and analyzing hardware faults, while zfs-diagnosis is a ZFS-specific engine that analyzes and diagnoses ZFS problems.

fmadm is a powerful utility which can do much more than what we have discussed. Here are a few of the other tasks we can perform with it.

We can use the repaired subcommand of the fmadm utility to notify fmd that a fault has been repaired, so that it changes the component status and allows the component to be enabled and utilized again.

For example to notify the FMD about repairing the missing underlying device of the ZFS pool we can use the following command.

# fmadm repaired zfs://pool=fpool/vdev=7f8fb1c77433c183

We can rotate the log files created by fmd, either to preserve a log file in a specific state or to start a fresh log, using the rotate subcommand as shown below.

# fmadm rotate errlog
# fmadm rotate fltlog

The fltlog and errlog log files reside in the /var/fm/fmd/ directory and store all event information about faults and the errors causing them.

To learn more about the fmadm command we can use the man pages and the short help provided with the distribution. The following commands show how to invoke the man page and the short help message.

# man fmadm
# fmadm --help

A good starting resource is the FM wiki page, located at http://wikis.sun.com/display/OpenSolarisInfo/Fault+Management