SobekCM’s Enhanced Findability/Discoverability with Schema.org Microdata and More (Or, SobekCM’s exceptional support for scholars, librarians, archivists, curators, and others in making sure their work can be found and used)

SobekCM Version 4.0 Alpha builds on SobekCM’s existing support for Search Engine Optimization (SEO) with updated static pages and enhanced support for Schema.org’s microdata. SobekCM has been designed, and is continually enhanced, for optimal findability and usability through integration with other systems. It supports a range of metadata (METS/MODS, MARC, Qualified DC, and so much more as shown in this infographic) and a rich range of connections (APIs, feeds, system-level integration, etc.) for connecting SobekCM records with library catalogs, for harvest/ingest into various repository and discovery systems, and for SEO, so that materials are easy to find through web search engines like Google, Bing, and Yahoo.
Digital libraries, collections, and scholarly projects powered by SobekCM at UF and around the world receive most of their user traffic from Google and other search engines. Currently, SobekCM at UF has 393,440 pages indexed by Google, so items and their contents are easy to find for those searching directly for them and easy for others to discover serendipitously.
Schema.org Microdata
As explained on the Schema.org site:

This site provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages.

Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.
A shared markup vocabulary makes it easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts. So, in the spirit of sitemaps.org, search engines have come together to provide a shared collection of schemas that webmasters can use.
It’s hugely important for digital library, repository, and other systems supporting materials and scholarly projects to support microdata. Roy Tennant wrote an article entitled “Why Microdata, Not RDF, Will Power the Semantic Web” where he provides a solid explanation for the importance of microdata, including “the key reasons why I think microdata will end up powering the semantic web”.
SobekCM’s Use of Schema.org Microdata
SobekCM’s use of Schema.org microdata is most notable in the citation views, like: http://ufdc.ufl.edu/UF00082548/00001/citation
SobekCM’s support for microdata includes standard data, as well as several item properties which are not yet standard (notes; identifiers; classifications) and which will hopefully become standard in the future. In addition to the microdata (which is structured and actionable), all of the citation pages in SobekCM also include the main thumbnail image, both for nice display for human users and for ease of connection/collection by robots and others.
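As a rough illustration of what this kind of markup looks like, here is a minimal Python sketch that renders a small citation block with Schema.org microdata attributes. The function name, the choice of the CreativeWork type, the selected properties, and the sample values are all assumptions for illustration; this is not SobekCM’s actual citation template.

```python
from html import escape


def citation_microdata(title: str, creator: str, date_published: str, thumbnail_url: str) -> str:
    """Render a tiny citation block marked up with Schema.org microdata.

    Hypothetical helper: the property selection is an assumption,
    not SobekCM's own output.
    """
    lines = [
        # itemscope/itemtype declare that this element describes one CreativeWork
        '<div itemscope itemtype="https://schema.org/CreativeWork">',
        # each itemprop labels a value so that robots can read it as structured data
        f'  <img itemprop="thumbnailUrl" src="{escape(thumbnail_url)}" alt="">',
        f'  <h2 itemprop="name">{escape(title)}</h2>',
        f'  <span itemprop="creator">{escape(creator)}</span>',
        f'  <time itemprop="datePublished" datetime="{escape(date_published)}">'
        f'{escape(date_published)}</time>',
        '</div>',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values only
    print(citation_microdata("Sample title", "Sample creator", "1900",
                             "http://example.org/thumb.jpg"))
```

The same page serves both audiences: a human reader sees the thumbnail, title, creator, and date, while a search engine reads the itemprop labels to recover the structured record behind them.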
SobekCM’s Static Pages
SobekCM has long supported static pages specifically for use by search engine robots as part of the overall SEO support. In the past, the SobekCM application would identify the requester as a robot and then serve the static page. This worked well with a couple million views per month, but response time matters for SEO down to the millisecond, even when human users can’t detect any difference. It’s hard to shave additional milliseconds off already fast response times, but the SobekCM developers are always looking for new ways to do just that.
The most recent SobekCM release improves response time by automatically writing all of the static pages daily and whenever new items load, so the static pages are ready to serve as-is. Previously, static pages were provided only after a connection to the application; skipping that step cuts some milliseconds from each request. When a robot is identified as requesting an item, the request doesn’t even reach the application. The rewriter identifies the robot, and the request is rewritten to point to the static page directly. This is quicker and takes the load off the SobekCM app. In addition to helping SEO now, this also ensures optimal support as usage continues to increase by leaps and bounds, with 5, 6, and a growing number of millions of user views each month. This is an example of the new static pages: http://ufdc.ufl.edu/data/IR/00/00/35/45/IR00003545_00001.html
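The routing idea can be sketched in a few lines of Python. This is only a sketch of the decision being made, not SobekCM’s rewriter: the crawler pattern, the function names, the item path /IR00003545/00001, and the lookup table of pre-written pages are all assumptions for illustration.

```python
import re

# Assumed, non-exhaustive list of crawler signatures found in User-Agent headers
ROBOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|yandexbot", re.IGNORECASE)


def is_robot(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like a known crawler."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def rewrite_request(path: str, user_agent: str, static_pages: dict) -> str:
    """Return the path that should actually be served.

    Robots are pointed at the pre-written static page when one exists;
    every other request passes through to the application unchanged.
    """
    if is_robot(user_agent) and path in static_pages:
        return static_pages[path]
    return path


if __name__ == "__main__":
    # Hypothetical mapping from an item path to its pre-written static page
    pages = {"/IR00003545/00001": "/data/IR/00/00/35/45/IR00003545_00001.html"}
    print(rewrite_request("/IR00003545/00001", "Googlebot/2.1", pages))
    print(rewrite_request("/IR00003545/00001", "Mozilla/5.0 Firefox/115.0", pages))
```

Because the check happens before the application is involved, a robot request costs little more than a file read, which is where the saved milliseconds come from.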
The static pages for all items are in this format, which follows the Pairtree directory structure convention for digital preservation. Starting from the citation page for an item, the format is the base URL, followed by /data, then the Pairtree path (IR/00/00/35/45 in the example above), and finally the itemID.html.
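The path format described above can be sketched in Python as follows. The function name and parameter names are hypothetical, and two details are inferred from the example URL rather than stated outright: the item identifier is split into two-character segments, and the file name joins the identifier and volume number with an underscore. The full Pairtree specification also defines character escaping, which this sketch omits.

```python
def static_page_url(base_url: str, bib_id: str, vid: str) -> str:
    """Build the static page URL for an item.

    Hypothetical helper: assumes the identifier is split into
    two-character directory segments and the file is named <bib_id>_<vid>.html.
    """
    # "IR00003545" -> ["IR", "00", "00", "35", "45"]
    segments = [bib_id[i:i + 2] for i in range(0, len(bib_id), 2)]
    pair_path = "/".join(segments)
    return f"{base_url.rstrip('/')}/data/{pair_path}/{bib_id}_{vid}.html"


if __name__ == "__main__":
    print(static_page_url("http://ufdc.ufl.edu", "IR00003545", "00001"))
```

Run with the values from the example item, this prints http://ufdc.ufl.edu/data/IR/00/00/35/45/IR00003545_00001.html.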
These are a few examples:

Notes
Mark V. Sullivan (lead developer of UF’s SobekCM Team, and original creator) provided much of the information for this blog entry, and so this was collaboratively written by Mark V. Sullivan and Laurie N. Taylor, UF Digital Humanities & Data Curation Librarian.
