Information about the software in a repository should be systematically organized into a catalog that may be efficiently browsed and searched. Having a software librarian put some effort into manually classifying and cataloging the software will enable more precise searching by end users.
Software is classified by labelling it with one or more descriptors from a classification scheme. Possibilities for the classification scheme include a hierarchical taxonomy [5], a keyword thesaurus [3], a faceted thesaurus [9], or a combination of these. A hierarchical taxonomy usually classifies according to a single characteristic, such as problem area for the GAMS mathematical software taxonomy [5]. A full-featured thesaurus is a collection of terms with broader-term, narrower-term, related-term, and use(d)-for relationships. The INSPEC thesaurus is an example of a keyword thesaurus used to classify the scientific and engineering research literature. A faceted thesaurus organizes its collection of terms into characteristic groups called facets. Possible facets for software indexing terms might be problem area, application area, and algorithm/method. A faceted thesaurus may be thought of as consisting of multiple hierarchical taxonomies (i.e., the broader-term/narrower-term relationships within each facet), but with the possibility of relationships between terms in different facets.
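The structure of a faceted thesaurus can be made concrete with a small sketch. The class and method names below (Term, FacetedThesaurus, add_term, relate, descendants) and the sample terms are assumptions chosen for illustration; they are not taken from GAMS, INSPEC, or any other existing scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Term:
    """One thesaurus entry carrying the four relationship types."""
    name: str
    facet: str
    broader: Set[str] = field(default_factory=set)
    narrower: Set[str] = field(default_factory=set)
    related: Set[str] = field(default_factory=set)
    used_for: Set[str] = field(default_factory=set)  # non-preferred synonyms


class FacetedThesaurus:
    """Hypothetical container: one hierarchy per facet, plus cross-facet links."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add_term(self, name: str, facet: str, broader: Optional[str] = None) -> Term:
        term = Term(name, facet)
        if broader is not None:
            parent = self.terms[broader]
            # Broader/narrower links stay inside a single facet.
            if parent.facet != facet:
                raise ValueError("broader term must belong to the same facet")
            term.broader.add(broader)
            parent.narrower.add(name)
        self.terms[name] = term
        return term

    def relate(self, a: str, b: str) -> None:
        """Record a symmetric related-term link, which may cross facets."""
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)

    def descendants(self, name: str) -> Set[str]:
        """Return every narrower term reachable from the given term."""
        found: Set[str] = set()
        stack: List[str] = [name]
        while stack:
            for child in self.terms[stack.pop()].narrower:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Sample terms are invented for illustration only.
thesaurus = FacetedThesaurus()
thesaurus.add_term("linear algebra", "problem area")
thesaurus.add_term("linear systems", "problem area", broader="linear algebra")
thesaurus.add_term("sparse linear systems", "problem area", broader="linear systems")
thesaurus.add_term("iterative methods", "algorithm/method")
thesaurus.add_term("conjugate gradient", "algorithm/method", broader="iterative methods")
thesaurus.relate("sparse linear systems", "conjugate gradient")
```

In this sketch a search on a broad term can be expanded with descendants() to pick up software indexed under narrower terms, while the related-term link connects a problem area to a method in a different facet.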
Some repository developers will be fortunate enough to find existing classification schemes for their domains which they can use as is or easily adapt. Others will be faced with the task of creating a new scheme or carrying out non-trivial extension of an existing one. Except for very small or specialized domains, development of a high-quality classification scheme will be a large-scale effort requiring the collaboration of experts in various specialties. Classification schemes also require ongoing maintenance, because new terms must be added when new items do not fit existing categories or vocabulary.
Factors to be considered in selecting or developing a classification scheme include the following:
A large database requires the use of precision devices, such as decision trees [2] or expert advisory systems [8], [4], to effectively discriminate between large numbers of similar items.
Having authors assign controlled vocabulary terms to their own works has been shown to produce problems with indexing consistency [3].
Inexpert users can be assisted by including explanatory notes for categories and vocabulary terms and by providing automatic translation of natural language to controlled language.
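As one possible illustration of the last factor, translation from natural language to controlled language can be approximated with a use-for lookup that maps phrases in a user's query to preferred terms. This is a deliberately simple sketch, not a description of any deployed system; the USE_FOR table, the SCOPE_NOTES table, and the function to_controlled_terms are all hypothetical.

```python
import re
from typing import List

# Hypothetical use-for table: non-preferred phrase -> preferred controlled term.
USE_FOR = {
    "fft": "fast Fourier transform",
    "fast fourier transform": "fast Fourier transform",
    "eigenvalue": "eigenvalue problems",
    "eigenvalues": "eigenvalue problems",
    "linear equations": "linear systems",
    "linear systems": "linear systems",
}

# Hypothetical explanatory notes that could be shown next to each term.
SCOPE_NOTES = {
    "linear systems": "Solution of systems of linear algebraic equations.",
    "eigenvalue problems": "Computation of eigenvalues and eigenvectors.",
    "fast Fourier transform": "Fast algorithms for the discrete Fourier transform.",
}


def to_controlled_terms(query: str, max_words: int = 3) -> List[str]:
    """Map a free-text query to controlled terms, longest phrase first."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    terms: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in USE_FOR:
                preferred = USE_FOR[phrase]
                if preferred not in terms:
                    terms.append(preferred)
                i += n
                break
        else:
            i += 1  # no controlled term starts with this word
    return terms


for term in to_controlled_terms("Solve linear equations using FFT"):
    print(term, "-", SCOPE_NOTES[term])
```

A production system would need stemming, spelling tolerance, and a much larger entry vocabulary; the sketch only shows where the thesaurus use-for relationships and the explanatory notes enter the user interface.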
Catalog records for software contain values for the attributes of the cataloged software and for the relationships between software components and between software and other entities such as authors, repositories, and individual files. The attributes and relationships are best expressed in terms of an abstract data model which can then be mapped to a concrete syntax for storage, retrieval, and exchange with other repositories.
To facilitate interoperation between participating HPCC repositories, the NHSE recommends use of the Reuse Library Interoperability Group (RIG) Basic Interoperability Data Model (BIDM) for exchanging software catalog records between repositories [1]. The BIDM is an IEEE standard (1420.1) that specifies a minimal set of catalog information that a software repository should provide about its software in order to interoperate with other repositories. The BIDM is expressed in terms of an extended entity-relationship data model that defines classes for assets (the software entities), the individual elements making up assets (i.e., files), reuse libraries (i.e., repositories) that provide assets, and organizations that develop and manage libraries and assets. The model was derived from careful study and negotiation of the commonalities between existing academic, government, and commercial reuse libraries, by representatives from these libraries. Reuse libraries need not adopt the BIDM internally, although many have. They can continue to use internal search and classification mechanisms appropriate to their unique missions while using the BIDM as a uniform external interface.
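The four kinds of classes described above can be sketched as simple record types. The sketch below is only an approximation for illustration: the class names follow the prose description, but the attribute names, the export_record function, and the sample values are assumptions and should not be read as the attribute set defined by IEEE 1420.1.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Organization:
    """A body that develops or manages libraries and assets."""
    name: str


@dataclass
class Element:
    """An individual file that makes up part of an asset."""
    file_name: str


@dataclass
class Library:
    """A reuse library (repository) that provides assets."""
    name: str
    managed_by: Organization


@dataclass
class Asset:
    """A software entity described by one catalog record."""
    name: str
    abstract: str
    keywords: List[str] = field(default_factory=list)
    elements: List[Element] = field(default_factory=list)
    libraries: List[Library] = field(default_factory=list)
    creators: List[Organization] = field(default_factory=list)


def export_record(asset: Asset) -> Dict[str, List[str]]:
    """Flatten an asset into a plain mapping for exchange.

    A repository may keep any internal representation it likes; only
    this exported view has to be uniform across repositories.
    """
    return {
        "name": [asset.name],
        "abstract": [asset.abstract],
        "keyword": list(asset.keywords),
        "element": [e.file_name for e in asset.elements],
        "library": [lib.name for lib in asset.libraries],
        "creator": [org.name for org in asset.creators],
    }


# All names and values below are invented for illustration.
org = Organization("Example University")
lib = Library("Example Repository", managed_by=org)
solver = Asset(
    name="examplesolve",
    abstract="Iterative solver for sparse linear systems.",
    keywords=["sparse linear systems", "conjugate gradient"],
    elements=[Element("examplesolve.tar.gz")],
    libraries=[lib],
    creators=[org],
)
record = export_record(solver)
```

Keeping the export step separate from the internal classes reflects the point made above: a repository can retain its own search and classification mechanisms and still present a uniform external interface.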
The RIG has developed an extension to the BIDM called the Asset Certification Framework (ACF). The ACF is the basis of the NHSE Software Review Framework, which is described in a separate document. It provides classes, attributes, and relationships which may be used to express a reuse library's software evaluation policy and evaluation results so that they may be easily interpreted by other libraries. The NHSE recommends that domain-specific HPCC repositories use the RIG ACF for describing their software review policies and for expressing results of software review efforts.
Another BIDM extension, currently under development, is the Intellectual Property Rights Framework (IPRF) for describing legal restrictions on software, such as copyrights, patents, licenses, and export restrictions. The NHSE expects to adopt the IPRF in the near future.
The classification schemes discussed above currently fit into the keyword asset attribute of the BIDM. There is currently no way to specify the scheme from which the values for the keyword attribute are drawn. However, the RIG is working on a meta-model that addresses the controlled vocabulary problem and that will also allow for domain-specific extensions of the BIDM.
The RIG has developed two bindings of the BIDM and the ACF which map the abstract data model to concrete syntax specifications that can be used for exchange of asset catalog information via the World Wide Web. One binding maps the BIDM to an SGML Document Type Definition (DTD). The other maps BIDM attributes and relationships to META and LINK tags in the header of an HTML document.
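The idea behind the HTML binding can be sketched as a function that writes attribute values as META tags and relationships as LINK tags. The NAME and REL values used here (such as Asset.Name and Asset.IsMadeOf), the function render_head, and the example URL are placeholders invented for this sketch; the actual names are defined by the RIG binding specification and are not reproduced here.

```python
from html import escape
from typing import Dict, List


def render_head(attributes: Dict[str, List[str]],
                relationships: Dict[str, List[str]]) -> str:
    """Render catalog data as META and LINK tags for an HTML header.

    Attribute values become META tags; relationships to other entities
    become LINK tags whose HREF points at the related entity.
    """
    lines = []
    for name, values in attributes.items():
        for value in values:
            lines.append(
                f'<META NAME="{escape(name)}" CONTENT="{escape(value)}">'
            )
    for rel, urls in relationships.items():
        for url in urls:
            lines.append(f'<LINK REL="{escape(rel)}" HREF="{escape(url)}">')
    return "\n".join(lines)


# Placeholder names, values, and URL, for illustration only.
head = render_head(
    attributes={
        "Asset.Name": ["examplesolve"],
        "Asset.Keyword": ["sparse linear systems", "conjugate gradient"],
    },
    relationships={
        "Asset.IsMadeOf": ["http://example.org/examplesolve.tar.gz"],
    },
)
print(head)
```

Repeating a tag for each value is one simple way to carry multi-valued attributes such as keywords. Escaping the values keeps quotation marks and ampersands in catalog text from breaking the header markup.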
More information about the RIG, the BIDM, and the Web bindings is available from the RIG home page at http://www.rig.org/.
The NHSE is developing a Repository in a Box (RIB) toolkit that will assist repository developers in creating and maintaining software catalog records using the BIDM, in exchanging these records with other repositories (including the top-level virtual NHSE repository), and in providing a user interface for the software catalog. RIB is expected to be available by September 1996.
For more details about repository interoperability, see section 6.