ILS and Discovery Systems
Data aggregation Draft
Added by John Ockerbloom, last edited by Dianne McCutcheon on Feb 15, 2008

5. Data aggregation

Many discovery applications need to maintain external copies of ILS data. In this section, we define standard functions for extracting, or harvesting, ILS data in bulk.

5.1 Rationale and general issues

Many discovery applications may need to build their own index of metadata independent of the ILS, but drawing on the data managed by the ILS. They may, for instance, build indexes that allow rapid search and retrieval not supported by the ILS. They may need a selective index of some of the ILS metadata or an aggregated index that combines ILS metadata with information from other sources. Bibliographic metadata is of particular interest, though authority, holdings, and other item metadata (such as circulation information) can also be needed by external applications.

While harvested data from the ILS might not be as current as the data the ILS directly manages, it is good enough for many purposes, particularly since much bibliographic and authority metadata does not change frequently. Applications that need up-to-the-minute ILS data can use real-time search functions (the subject of the next section). If common record IDs are used, real-time searches can be used to retrieve additional or updated information related to extracted records as needed. These IDs may exist outside the actual metadata records (for example, a bibid not included in a MARC record). Including and disclosing IDs that persist over time is often important for supporting further discovery and services on exported metadata.

Harvesting all of the relevant data from an ILS can be an expensive operation. Selective harvesting, including incremental harvesting of data that has been added or changed since a certain date or time, can greatly reduce the cost and should be supported along with full harvesting when feasible. Selective harvesting based on pre-defined sets may also be useful.
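
As a rough illustration of incremental harvesting from the client's side, the sketch below keeps a "high-water mark" timestamp and asks only for records changed since the previous run. The `harvest` callable, the `index` dictionary, and the function name are hypothetical stand-ins for whatever harvesting function and local index a discovery application actually uses.

```python
from datetime import datetime, timezone


def incremental_harvest(harvest, index, last_harvest=None, now=None):
    """Pull records changed since last_harvest into index.

    harvest: hypothetical callable taking from_ and until keyword
             arguments and yielding (identifier, record) pairs.
    index:   a dict standing in for the application's local index.
    Returns the timestamp to pass as last_harvest on the next run.
    """
    now = now or datetime.now(timezone.utc)
    # last_harvest=None means "no lower bound", i.e. a full harvest.
    for identifier, record in harvest(from_=last_harvest, until=now):
        # Keying on the persistent identifier makes re-harvesting a
        # record harmless: the newer copy simply replaces the older one.
        index[identifier] = record
    return now
```

Bounding each run with `until` and reusing that same value as the next `from` means a record changed exactly at the boundary may be fetched twice, but never missed, which is usually the safer trade-off.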

Selectively filtered harvesting may be necessary in some cases if metadata records in the ILS have been licensed from a third party that does not allow redistribution. If this is the case, the client doing the harvesting needs to be aware of such filtering.

There may be multiple formats for metadata records. For example, some bibliographic records may be stored in MARC and others in MODS. It can also be useful to allow a variety of export formats for some records, such as native MARC 21 versus MARC-XML versus Dublin Core.

5.2 Sample use cases

Some possible use cases include:

  • Building a duplicate index of ILS data; for example, a Lucene index of bibliographic records that can be searched in facets using Solr.
  • Making a specialized index of material selected from the ILS; for example, a catalog of video recordings that supports special searches based on actors, directors, and other video-specific features.
  • Making an aggregated index that includes material from multiple ILS's and other databases, and can be searched on its own without requiring potentially slower or less reliable federated search mechanisms.
  • Building an index of authority records to enable subject-based browsing, name and subject suggestion features, and other discovery aids.
  • Harvesting both bibliographic and licensing information (which may be managed by the ILS or by another application such as an ERMS) to support appropriately-scoped discovery services for different audiences and uses of content.
  • Harvesting holdings, item, and circulation information for services tailored to a specific library environment, such as date-sensitive citation resolvers, or relevance rankings weighted by usage. While circulation status on an item can change minute to minute, the status of a typical item does not change frequently. (Some books stay on the shelves for years without circulating, for instance.) Some applications might find it useful to have recent extracts of such information, even if it is not fully up to date.
  • Harvesting recently added or changed bibliographic records for current awareness services.

Most of the usage scenarios we have seen involve a library harvesting the records of its own ILS. It is also possible for independent parties to harvest ILS records, though uncontrolled independent-party harvesting might put unacceptable loads on the ILS. For independent harvesters, it may be useful to document the selective harvesting options and metadata formats available for harvesting records from a particular ILS, and our binding recommendation specifies one way of doing this. However, because we foresee mainly internal and pre-arranged harvesting, we do not require functions for such documentation in our abstract profile.

5.3 Abstract Functions

5.3.1 HarvestBibliographicRecords (core)

Summary: Returns a set of bibliographic records (and their identifiers) that have been added to or changed in the source.

Parameters:

  • from (type date or time; optional): Only include records added or changed since the specified time.
  • until (type date or time; optional): Only include records added or changed up to the specified time.
  • format (type enum; optional): Specifies the metadata format to be returned.
  • set (type string; optional): Only include records in the specified set. (In many cases, this will effectively be an enumerated type.)

Returns:

  • A list of bibliographic records and their identifiers.

Exceptional conditions:

  • NotSupported: The underlying system cannot accurately answer the query with the supplied parameters.
  • InvalidRequest: The underlying system considers the supplied parameters invalid.

Side effects: None

Rationale: Many discovery systems may need to index bibliographic records independently of the ILS. This function allows all or part of an ILS' bibliographic records to be exported for aggregation. The whole catalog can be exported, or just a selection, such as records that have been changed since an earlier query.

Notes:

The records should be in a well-specified format, and have all the details that are relevant to discovery. The exact representation of the format may change, however. For example, a MARC record stored as relational table elements could be returned as native MARC 21, or in the "marc21" XML schema used by OAI-PMH.

The set of records returned may or may not include records that are suppressed from user display, or that have restricted export conditions. The documentation or binding used for this interface should make it clear whether or not such records are included, and under what conditions.

Each record should have a unique identifier, and that identifier is assumed to persist (if not forever, then as long as the records are managed by the underlying ILS, and the ILS undergoes no major changes), so that it can be used in later queries and services. The ILS' "bib id" might serve as this identifier, for instance, if it is reasonably stable.

To support the from and until parameters, the underlying system will have to be keeping track of when bibliographic records were last changed (or added for the first time).

The NotSupported condition may be needed to signal the caller that a date or set restriction can't actually be calculated correctly. The caller may be able to get an answer by removing the parameter that can't be handled. This is probably preferable to simply returning an answer that ignores the supplied parameter, without any warning (in an exception or other documentation) that an incorrect answer is being supplied. There may also be an InvalidRequest response indicating that the system recognizes the request to be invalid (which is distinct from not supporting the request).

These two conditions may be further specialized, into messages indicating unsupported sets, time formats, metadata formats, etc. We do not at this time specify the full set of conditions.
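
To make the parameters and exceptional conditions concrete, here is a minimal in-memory sketch of the abstract function. It is not a binding or a reference implementation: the `BibRecord` class, the `SUPPORTED_FORMATS` and `KNOWN_SETS` values, the inclusive treatment of the time bounds, and the choice of which condition to raise in each case are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


class NotSupported(Exception):
    """The system cannot accurately answer the query as parameterized."""


class InvalidRequest(Exception):
    """The system considers the supplied parameters invalid."""


@dataclass
class BibRecord:
    identifier: str         # persistent ID, e.g. the ILS "bib id"
    last_changed: datetime  # when the record was added or last changed
    sets: frozenset         # names of the sets the record belongs to
    data: str               # the record itself, in some agreed format


SUPPORTED_FORMATS = {"marc21"}  # hypothetical: formats this toy store emits
KNOWN_SETS = {"video"}          # hypothetical: sets this toy store defines


def harvest_bibliographic_records(store, from_=None, until=None,
                                  metadata_format="marc21", set_=None):
    """Return (identifier, data) pairs for records added or changed
    within the requested window (bounds treated as inclusive here)."""
    if from_ is not None and until is not None and from_ > until:
        raise InvalidRequest("from is later than until")
    # Signal restrictions we cannot honor instead of silently ignoring them.
    if metadata_format not in SUPPORTED_FORMATS:
        raise NotSupported(f"metadata format: {metadata_format}")
    if set_ is not None and set_ not in KNOWN_SETS:
        raise NotSupported(f"set: {set_}")
    results = []
    for rec in store:
        if from_ is not None and rec.last_changed < from_:
            continue
        if until is not None and rec.last_changed > until:
            continue
        if set_ is not None and set_ not in rec.sets:
            continue
        results.append((rec.identifier, rec.data))
    return results
```

A caller that receives NotSupported for a set restriction could retry without the `set_` argument and filter the results itself, which matches the fallback described in the notes above.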

The set parameter may be relevant for exporting well-defined subsets of the catalog. For example, for supporting a video catalog, an ILS might place some bibliographic records in the "video" set, and support full or incremental harvesting of just those records. More general-purpose sets might include sets specifying only unsuppressed records (in cases where the implementation gives the option of including or not including suppressed records). We do not here specify which sets are defined, and how they are defined, but making it possible and convenient for special-purpose subsets to be available for applications that need them is a useful ILS feature.

It may be useful for there to be functions that return the sets and metadata formats available for harvesting, but it is not required in this profile. (In the OAI-PMH binding, the ListSets and ListMetadataFormats verbs are suitable for this.)

Possible Bindings:

  • OAI-PMH binding: The functionality above maps fairly straightforwardly to OAI-PMH. Detailed specification will be given in an OAI-PMH binding profile (specified in the Binding details subsection below).
  • Other bindings: Since this is an expensive, data-intensive operation, it may also be useful to have more specific library implementations closer to the ILS. A Java or Perl object library could be more efficient, for example. However, if it is not unacceptably inefficient, a web service binding (whether OAI-PMH or some other implementation) is likely to be more portable and robust.

5.3.2 HarvestAuthorityRecords (core)

Summary: Returns a set of authority records (and their identifiers) that have been added to or changed in the source.

Parameters:

  • from (type date or time; optional): Only include records added or changed since the specified time.
  • until (type date or time; optional): Only include records added or changed up to the specified time.
  • format (type enum; optional): Specifies the metadata format to be returned.

Returns:

  • A list of authority records and their identifiers.

Exceptional conditions:

  • NotSupported: The underlying system cannot accurately answer the query with the supplied parameters.
  • InvalidRequest: The underlying system considers the supplied parameters invalid.

Side effects: None

Rationale: ILS authority records give important supplementary information to the bibliographic records. For instance, they provide alternative forms of names and subjects, which are very important to support in a robust search (since users may well use the alternate forms rather than the "authorized" forms.) They also include a wealth of other information that may be important for discovery applications, including notes on scope and relationships between entities.

While some authority records can be downloaded from third parties, these records are not easily downloadable in many cases. Furthermore, the authority records stored directly in the ILS may give more precise and relevant information about authorized and related forms that are relevant to searching the catalog managed by that ILS.

Notes:

The records should be in a well-specified format, and have all the details that are relevant to discovery. The exact representation of the format may change, however. For example, a MARC record stored as relational table elements could be returned as native MARC 21, or in the "marc21" XML schema used by OAI-PMH.

Each record should have a unique identifier, and that identifier is assumed to persist (if not forever, then as long as the records are managed by the underlying ILS, and the ILS undergoes no major changes), so that it can be used in later queries and services. The ILS' "authority id" might serve as this identifier, for instance, if it is reasonably stable.

To support the from and until parameters, the underlying system will have to be keeping track of when authority records were last changed (or added for the first time).

The NotSupported condition may be needed to signal the caller that a date restriction can't actually be calculated correctly. The caller may be able to get an answer by removing the parameter that can't be handled. This is probably preferable to simply returning an answer that ignores the supplied parameter, without any warning (in an exception or other documentation) that an incorrect answer is being supplied.

Possible Bindings:

  • OAI-PMH binding: The functionality above maps fairly straightforwardly to OAI-PMH. Using the OAI-PMH binding, authority records could be defined as a specific set or class of records to be harvested. Detailed specifications are in the Binding details subsection below.
  • Other bindings: The same recommendations for other bindings of HarvestBibliographicRecords apply to other bindings of this function.

5.3.3 HarvestExpandedRecords (core)

Summary: Returns a set of bibliographic records and supplementary information about the described bibliographic items that have been added to or changed in the source.

Parameters:

  • from (type date or time; optional): Only include records added or changed since the specified time.
  • until (type date or time; optional): Only include records added or changed up to the specified time.
  • format (type enum; optional): Specifies the metadata format to be returned.
  • set (type string; optional): Only include records in the specified set.

Returns:

  • A list of expanded records, including their bibliographic identifiers. The expanded records can include:
    • The bibliographic identifier
    • The bibliographic record
    • Identifiers for items under this record
    • For each of these items (or, where more suitable, for the entire record):
      • Whether the item circulates
      • Location (library building, and location within building)
      • Call number and scheme (e.g. LC, Dewey, SuDoc, NLM...)
      • Format
      • Barcode
      • Item notes
      • Item creation date/time
      • Summary circulation/availability information:
        • Status (available, checked out, library use only)
        • Date due
        • Total number of loans
        • Item last activity date/time
      • Other special information about the book kept by the ILS and important for discovery
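
One way to picture the structure listed above is as a pair of simple data types. The sketch below is only an illustration: the class and field names are hypothetical, only a subset of the listed fields is modeled, and no particular serialization is implied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemInfo:
    item_id: str
    circulates: Optional[bool] = None
    location: Optional[str] = None            # building and shelving location
    call_number: Optional[str] = None
    call_number_scheme: Optional[str] = None  # e.g. "LC", "Dewey"
    status: Optional[str] = None              # e.g. "available", "checked out"
    date_due: Optional[str] = None
    total_loans: Optional[int] = None


@dataclass
class ExpandedRecord:
    bib_id: str      # the bibliographic identifier
    bib_record: str  # the bibliographic record, in some agreed format
    items: List[ItemInfo] = field(default_factory=list)

    def available_locations(self):
        """Locations holding at least one item with status 'available'."""
        return sorted({item.location for item in self.items
                       if item.status == "available" and item.location})


record = ExpandedRecord(
    bib_id="b1001",
    bib_record="<record>...</record>",
    items=[
        ItemInfo("i1", location="Main Stacks", status="available"),
        ItemInfo("i2", location="Science Library", status="checked out"),
    ],
)
```

A discovery index could use a helper like `available_locations()` to populate location and availability facets at indexing time.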

Exceptional conditions:

  • NotSupported: The underlying system cannot accurately answer the query with the supplied parameters.
  • InvalidRequest: The underlying system considers the supplied parameters invalid.

Side effects: None

Rationale: Searchable indexes to library holdings often need information beyond what's in a book's MARC record. Searchers may want to filter or sort by location, availability, usage, and other ancillary data kept by an ILS but not present in the bibliographic record. We therefore need ways of extracting this information as well.

Notes:

Of the expanded information examples given above, the most important elements, based on the libraries and applications we have surveyed, appear to be location and availability status, as both of these are commonly used for filtering or facet limitation. The other items are useful, but less essential. (Format is also sometimes used for facet limitations, but can often be derived or deduced from the bibliographic record alone.)

Some readers have questioned why one would want to harvest availability status, instead of just querying it in real time, since availability can change minute to minute. However, we have heard of applications that use it in their search indexing to provide quick filtering of relevant materials. It can be supplemented with real-time queries, or incremental expanded record harvesting, to keep discovery displays up to date.

It is possible that an implementation of this function may offer multiple record formats in the return values, with more or less expanded detail provided in each. The more information is returned, the more useful the records may be for the clients, but the more work is required both to pull together the records and (for incremental harvesting) to determine the relevant changed records within a specified time frame.

Possible Bindings: As for HarvestBibliographicRecords, above.

5.3.4 HarvestHoldingsRecords (core)

Summary: Returns a set of holdings records (and their identifiers) that have been added to or changed in the source.

Parameters:

  • from (type date or time; optional): Only include records added or changed since the specified time.
  • until (type date or time; optional): Only include records added or changed up to the specified time.
  • format (type enum; optional): Specifies the metadata format to be returned.
  • set (type string; optional): Only include records in the specified set.

Returns:

  • A list of holdings records and identifiers for holdings and associated bibliographic records.

Exceptional conditions:

  • NotSupported: The underlying system cannot accurately answer the query with the supplied parameters.
  • InvalidRequest: The underlying system considers the supplied parameters invalid.

Side effects: None

Rationale: Many discovery applications can use data from holdings records in order to provide such information as call numbers, location of materials and extent of serial holdings. This function allows all or part of an ILS' holdings records to be harvested for aggregation. The full set of holdings in the catalog can be harvested, or just a selection, such as records that have been changed since an earlier query. (Exactly what the useful sets to harvest are here, and how they would be defined, is still up for discussion.)

Notes:

The records should be in a well-specified format, and have all the details that are relevant to determine the extent of holdings inasmuch as the data from the ILS can provide. The exact representation of the format may change, however. For example, a MARC record stored as relational table elements could be returned as native MARC 21, a MARC Holdings record in MARC XML, or in an ISO holdings XML schema (soon to be available).

Each holdings record should have a unique identifier, and that identifier is assumed to persist (within the underlying ILS), so that it can be used in later queries. If the function returns holdings records only, then the unique identifier of the associated bibliographic record must also be included.

To support the from and until parameters, the underlying system will have to be keeping track of when holdings records were last changed (or added for the first time). The NotSupported exception may be needed to signal the caller that a date or set restriction can't actually be calculated correctly. The caller may be able to get an answer by removing the parameter that can't be handled. This is probably preferable to simply returning an answer that ignores the supplied parameter.

If the implementation of HarvestExpandedRecords includes holdings records, it is permissible for this function to be implemented by the same underlying method.

Possible Bindings:

  • OAI-PMH binding: The functionality above can be implemented in OAI-PMH. Holdings export support could be implemented as a set within the larger context of available records. Detailed specification will be given in an OAI-PMH binding profile (specified in the Binding details subsection below).

5.4 OAI binding

OAI-PMH gives a fairly straightforward binding for many of the functions above.

The HarvestBibliographicRecords function, for instance, can be bound to OAI-PMH's ListRecords verb, with the from and until parameters modeled by ListRecords's from and until arguments, the format parameter by its metadataPrefix argument, and the set parameter by its set argument (whose value is a setSpec). The exceptions above would be implemented by the more specific error codes of OAI-PMH (badArgument, noRecordsMatch, etc.).

In the return value, the OAI-PMH record header identifier would consist of a constant prefix followed by the bibliographic identifier, and the record metadata element would consist of an XML encoding of the bibliographic record. In the common case where the ILS maintained MARC 21 records, the marc21 record schema could be used.
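
As a sketch of what this binding might look like from a client, the code below builds a ListRecords request URL and recovers the bibliographic identifier from an OAI-PMH record identifier. The `verb`, `metadataPrefix`, `from`, `until`, and `set` arguments are standard OAI-PMH; the base URL, the identifier prefix, and the function names are hypothetical placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://ils.example.org/oai"  # hypothetical endpoint
ID_PREFIX = "oai:ils.example.org:"        # hypothetical constant prefix


def list_records_url(base_url, metadata_prefix,
                     from_=None, until=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_ is not None:
        params["from"] = from_      # UTC datestamp, e.g. "2008-01-01"
    if until is not None:
        params["until"] = until
    if set_spec is not None:
        params["set"] = set_spec    # e.g. "bib:video"
    return base_url + "?" + urlencode(params)


def bib_id_from_oai_identifier(identifier, prefix=ID_PREFIX):
    """Strip the constant prefix to recover the ILS bibliographic identifier."""
    if not identifier.startswith(prefix):
        raise ValueError(f"unexpected identifier: {identifier}")
    return identifier[len(prefix):]


url = list_records_url(BASE_URL, "marc21",
                       from_="2008-01-01", set_spec="bib:video")
```

A complete harvester would also need to follow the resumptionToken that OAI-PMH returns for large result lists; that is omitted here for brevity.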

If a single OAI-PMH interface bound multiple kinds of aggregation functions, it would need to use set prefixes to distinguish object spaces. If both bibliographic and authority records are returned via the same interface, for example, the setSpec "bib" could be used for bibliographic records, and the setSpec "auth" for authority records. Then, sets in each function would be modeled by subsets; for example, the set "video" in the HarvestBibliographicRecords function, if defined, would be modeled in the OAI-PMH interface by the set "bib:video".
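
The prefixing scheme can be expressed as a small mapping from abstract function and set name to an OAI-PMH setSpec. The "bib" and "auth" prefixes come from the example above; the dictionary and function names are hypothetical.

```python
# Top-level setSpec used for each aggregation function (from the example
# above; other functions would need their own agreed prefixes).
COLLECTION_PREFIXES = {
    "HarvestBibliographicRecords": "bib",
    "HarvestAuthorityRecords": "auth",
}


def to_set_spec(function_name, set_name=None):
    """Map an abstract function and optional set name to a setSpec.

    OAI-PMH expresses set hierarchy with colons, so the "video" set of
    bibliographic records becomes "bib:video".
    """
    prefix = COLLECTION_PREFIXES[function_name]
    if set_name is None:
        return prefix
    return f"{prefix}:{set_name}"
```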

A special-purpose element in the OAI-PMH Identify return value would describe the functions supported, the set namespaces used, if applicable, and other exceptional conditions and implementation details a client should be aware of. (This is how the NotSupported exceptional condition would be handled in this case, since OAI-PMH does not support general-purpose exception handling.)

We define the element as follows:

  • ilsharvest is the top level element. It contains one or more of the following elements:
    • collection element identifies a particular collection of records available for harvesting via this service.
    • It has the attribute type which describes the type of collection. The following attribute values are supported here:
      • "bibliographic" for bibliographic records
      • "authority" for authority records
      • "expanded" for expanded records
      • "holdings" for holdings records
    • Collection also has an optional set element. The content is a string that identifies the setSpec that should be used to harvest the collection, if any. (If omitted, it's assumed that the entire repository consists of this collection.)
    • Collection also has an optional fullmdformats element. The content is one or more fullmdformat elements that contain strings that identify the recommended metadata formats for harvesting full information from this collection. This is to distinguish them from other metadata formats listed in the OAI ListMetadataFormats response for the set that might contain reduced information. For example, marc21 format may contain information that's stripped out of the oai_dc version.
    • Collection also has an optional suppressed element, that specifies what is done with records that are suppressed from the user's view.
      • The suppressed element has an attribute included that says whether or not suppressed records are included. Defined values are "true" and "false".
      • The suppressed element also has an optional inset element. Its content is the name of the set used to harvest only suppressed records, if such a set exists. This would ordinarily be a subset of the set used for this collection.
      • The suppressed element also has an optional outset element. Its content is the name of the set used to harvest only non-suppressed records, if such a set exists. This would ordinarily be a subset of the set used for this collection.
    • Collection also has an optional embargoed element, that says whether or not there are identifiable records visible to the user, but that should not be exported externally. Its attributes and substructure correspond to those of the suppressed element.
    • Collection also has an optional notes element. Its content includes notes about what kinds of records should be expected in this collection, and other notes about the collection of interest to harvesters. One possible use of this is to describe what kinds of information is in the "expanded" set, for instance. It might be useful to have a structured description of that, but for now we'll just make it a free-text element.
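
To illustrate the element described above, the sketch below parses one possible ilsharvest description and summarizes each collection. The sample XML is an interpretation of the prose definition, not a normative schema: no namespace is shown, and the set names, notes text, and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ilsharvest>
  <collection type="bibliographic">
    <set>bib</set>
    <fullmdformats>
      <fullmdformat>marc21</fullmdformat>
    </fullmdformats>
    <suppressed included="true">
      <inset>bib:suppressed</inset>
      <outset>bib:unsuppressed</outset>
    </suppressed>
    <notes>All bibliographic records in the catalog.</notes>
  </collection>
  <collection type="authority">
    <set>auth</set>
    <suppressed included="false"/>
  </collection>
</ilsharvest>"""


def describe_collections(xml_text):
    """Return one summary dict per collection element."""
    root = ET.fromstring(xml_text)
    summaries = []
    for coll in root.findall("collection"):
        suppressed = coll.find("suppressed")
        summaries.append({
            "type": coll.get("type"),
            # A missing set element means there is no setSpec to use.
            "set": coll.findtext("set"),
            "full_formats": [e.text for e in
                             coll.findall("fullmdformats/fullmdformat")],
            # None means the description does not say either way.
            "suppressed_included": (
                None if suppressed is None
                else suppressed.get("included") == "true"),
            "unsuppressed_set": (
                None if suppressed is None
                else suppressed.findtext("outset")),
        })
    return summaries


collections = describe_collections(SAMPLE)
```

A harvester that wants only records visible to users could read `unsuppressed_set` for the bibliographic collection and pass it as the set argument when harvesting.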

We do not here specify the exact format to use for expanded records, but we expect it would be some XML that includes both the bibliographic record and the associated holdings and/or item data. We are not aware of a current standard that includes all of the information we specify for HarvestExpandedRecords above, but the return value should include pointers to applicable schemas or documents so that client implementors know how to retrieve relevant data.
