Some thoughts on institutional research software management and persistent identification
Pablo de Castro, Open Access Advocacy Librarian at U Strathclyde (with thanks to Alan Morrison, Research Data Management Officer, for the explanation on institutional workflows around research software at Strathclyde Uni)
See also this previous StrathOA blog post by Alan Morrison "Depositing, distributing & citing software and code (A Zenodo – GitHub integration)"
A webinar on "DOIs for research software" will be organised by DataCite on Wed May 24th (in two weeks at the time of writing). This forthcoming event provides a good opportunity to share some thoughts on research software and the attempts to persistently identify it. These thoughts address the way institutions may or may not be specifically supporting research software management, with some specific considerations on persistent identifiers thrown on top. These latter thoughts are driven by one of the questions immediately raised by the event title: why is it called "DOIs for software" instead of "persistent identifiers for software"?
a. On research software management and its support from institutions
1. Research software is a key part of the gradually developing European Open Science Cloud (EOSC). It is also a critical element in any discussion of research reproducibility.
Slide from the presentation “Software – a different kind of research object?” delivered by Neil Chue-Hong (Software Sustainability Institute) at the Lancaster University 3rd Data Conversation linked below.
2. While data repositories have also been collecting software for quite some time, this tends to be a researcher-led task. Proactive institutions are definitely able to support their academics for this specific purpose, usually within a wider conversation on Research Data Management – see for instance this inspiring 3rd Data Conversation "Software as data" held by colleagues at Lancaster University on Oct 3rd, 2017.
3. It is not that common, however, for institutions to address research software management as a separate area with its own workflows and resources; it is usually handled as part of RDM-related work. RDM policies are quite widespread – including a recently issued RDMS policy at Strathclyde – but they tend not to include specific sections devoted to research software management.
4. When discussing general strands of work like persistent identifiers for research software, the perspective of the institution is very relevant. Members of an institutional Open Research team are arguably best placed to deliver the sort of advice on Open Research implementation that would ensure that research software is always persistently identified. Critically, institutional Open Research teams are able to provide this advice in a discipline-agnostic way. This places them at the forefront of any dissemination activity around PIDs, not just for software but for any other entity too (including datasets but also projects or research equipment and facilities).
5. The intersection between persistent identification and institutional advocacy offers Open Research implementation teams a potential way into a more holistic support for the adequate management of the various research outputs produced by research groups, departments and schools.
b. On persistent identifiers for research software (or for any other entity in the area of "emerging PIDs" such as geosamples, conferences or research equipment and facilities)
A prominent research information management workflow modeller made the following remark during a discussion on PIDs at a recent euroCRIS event: "After extensively discussing the issue within the team, we decided not to implement a PID-issuing feature for all sorts of entities in [specific commercial CRIS solution] – which we could easily do from a technical perspective – because we could add to the confusion by enabling a mechanism to inadvertently create duplicate unique persistent identifiers for those entities".
An interesting example of this risk of duplication is provided by the VasoTracker software developed by researchers at the Universities of Strathclyde and Durham within the 'Optical Cannula' Wellcome-funded project, persistent grant ID https://doi.org/10.35802/202924 (among other acknowledged funding sources). As described on the VasoTracker website, this is a collection of open source tools for studying vascular physiology. The motivation for its development is also explained on the website's homepage.
Since the VasoTracker software has not been deposited in the system [Pure] that Strathclyde uses as a data repository, it has no DOI. The reasons why it hasn't been deposited probably come down to (i) the frequent misconception among researchers that datasets only cover supplementary data underpinning publications and (ii) the probable wish to avoid having to keep what has quickly become a live software package updated in several places at the same time – which may have led to choosing the website (and its associated GitHub repository) as the default 'containers' for the code.
So would Strathclyde researchers developing code and their institutional Open Research support teams learn any new tricks at a webinar on "DOIs for research software"? Presumably yes, even if it were just on how Zenodo can help with the deposit of code, its maintenance and versioning. Plus perhaps DataCite will soon start supporting the issuing of DOIs for research software via Fabrica like it's already doing for geosamples and might one day do for research instruments and facilities.
There is however one interesting aspect regarding this VasoTracker software in line with the remark above on the risk of "inadvertently creating duplicates for unique PIDs". VasoTracker already has a PID. It's an RRID and not a DOI, granted (hence the nuanced title for the DataCite webinar?) but still a persistent identifier. How this RRID: SCR_017233 came to be assigned is not easy to tell. It's highly unlikely that it resulted from an outreach effort by the researchers involved in its development – it looks rather as if it had been automatically identified by some algorithm searching all across the Internet, including all GitHub repositories.
In fact, AI-driven PID cross-linking routines could quite quickly assemble and display the PID Graph that we are so painstakingly building these days. The SciCrunch portal that hosts all these RRIDs is already able to crawl the published research literature for references to a specific 'identified entity' (a software package in this case, but also a research instrument or facility or an antibody), with the caveat that the literature needs to be available Open Access, otherwise even the super-clever modern algorithm will crash into the old-fashioned profit-driven paywalls.
The SciCrunch identification of the research publications that cite this RRID-tagged piece of software is not perfect, or not yet: the list on the RRID webpage only includes two of the seven references shown on the VasoTracker webpage (as identified by the software creators themselves). The fact that these references appear at all on the very same SciCrunch page where this RRID: SCR_017233 is described is huge progress anyway and a hint at what we will be able to achieve in the not-so-distant future.
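To make the idea of crawling the literature for RRID mentions a bit more concrete, here is a minimal Python sketch of how RRID citations could be spotted in the full text of an Open Access paper. It is not a description of how SciCrunch actually works: the function name find_rrids and the sample sentence are made up for illustration, and the pattern is a simplifying assumption that only covers identifiers of the form seen above (a letter prefix, an underscore and digits, such as SCR_017233).

```python
import re
from typing import List

# Assumed, simplified pattern: "RRID:" followed by optional whitespace,
# a letter prefix, an underscore and a run of digits (e.g. SCR_017233).
# Real-world RRIDs come in more shapes than this sketch handles.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_\d+)")


def find_rrids(text: str) -> List[str]:
    """Return the distinct RRIDs mentioned in a text, in order of first appearance."""
    found: List[str] = []
    for match in RRID_PATTERN.finditer(text):
        # Normalise "RRID: SCR_017233" and "RRID:SCR_017233" to one spelling.
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


# Hypothetical sentence of the kind found in a methods section.
sample = (
    "Vessel diameter was recorded with VasoTracker (RRID: SCR_017233); "
    "all traces were analysed in the same software (RRID:SCR_017233)."
)
print(find_rrids(sample))
```

Both spellings in the sample sentence are normalised to the same identifier, which is the kind of small detail any crawler would need to handle before it can count citations reliably.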
The risks of duplication highlighted by the research information management colleague at the euroCRIS meeting should however be kept in mind during the process of expanding the DOI coverage. While duplication is not necessarily an issue per se, it would make sense for the different PID initiatives to enable some (reasonably simple) mechanism to map duplicate entries to each other.
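As a rough illustration of what a "reasonably simple" mapping mechanism might look like, the Python sketch below keeps a small equivalence table in which any two identifiers can be declared to point at the same entity. It is only a sketch under stated assumptions: the class PidEquivalence and its methods are invented for this post, no PID provider is known to expose such an interface, and the DOI used in the example is a hypothetical placeholder (VasoTracker has no DOI, as noted above).

```python
class PidEquivalence:
    """Toy registry that groups PIDs referring to the same entity (union-find)."""

    def __init__(self):
        # Maps each known PID to a parent PID; a root PID is its own parent.
        self._parent = {}

    def _find(self, pid):
        # Register unseen PIDs on the fly, then walk up to the group's root,
        # shortening the path along the way.
        self._parent.setdefault(pid, pid)
        while self._parent[pid] != pid:
            self._parent[pid] = self._parent[self._parent[pid]]
            pid = self._parent[pid]
        return pid

    def link(self, pid_a, pid_b):
        """Declare that two PIDs identify the same entity."""
        root_a, root_b = self._find(pid_a), self._find(pid_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, pid_a, pid_b):
        """True if the two PIDs have been linked, directly or transitively."""
        return self._find(pid_a) == self._find(pid_b)

    def aliases(self, pid):
        """All known PIDs for the entity that `pid` identifies, sorted."""
        root = self._find(pid)
        return sorted(p for p in list(self._parent) if self._find(p) == root)


registry = PidEquivalence()
# The RRID is the one discussed above; the DOI is a hypothetical placeholder.
registry.link("RRID:SCR_017233", "doi:10.0000/hypothetical-vasotracker")
print(registry.aliases("doi:10.0000/hypothetical-vasotracker"))
```

Links are transitive in this sketch, so if a third identifier were later linked to either of the two, all three would resolve to the same entity without any of the PID providers having to give up their own scheme.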















