Author identity

The main project I'm focused on at work right now relates to uniquely identifying the people who write and referee articles for our journals. Our referee database is pretty good, but even that has a number of duplicate entries, as I've been finding. In one case the name was the same but with last and first names switched; in another we had somehow created a record with a slightly modified version of the surname (and a note that the name was wrong). It's a lot trickier than one might at first imagine, but an open public solution for researcher identification could be a big help. There seem to be several projects in the works in that regard.

There are at least two general issues:

(1) Identity of authors of previously published articles

  • some sort of claim/validation process is needed. We are trying to simplify that as much as possible for our articles by some heuristics on names, contact information, subject areas, etc. - in particular, linking an author with an email address seems to be the most reliable identifier, though not perfect.
  • However, this sort of simplification/automation is far easier for the corresponding author - for other authors we only have names (and some record of affiliation and subject) to go by, so there will definitely need to be a more complex "claims" process in that case
  • All of this is complicated by authors changing affiliations, using several slightly different names or even changing names, changing corresponding authors on a paper or having a mismatch between the listed corresponding author and the contact information, occasionally sharing email addresses with others, etc.

We think we can get most of our records linked in this way to unique instances of authors, but we're inevitably going to have some percentage of erroneous relationships: some papers linked to the wrong author, and some authors listed two or more times as being different people.
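As a rough illustration of the kind of heuristic involved, here is a minimal sketch that flags candidate duplicates by shared email address or by an order-insensitive name key (which catches first and last names entered in the wrong fields). The `AuthorRecord` layout and the function names are assumptions made for this example, not our actual database schema, and the pairs it returns are only candidates for review, since people do occasionally share email addresses.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorRecord:
    """Hypothetical record layout; the field names are assumptions."""
    record_id: int
    first_name: str
    last_name: str
    email: str = ""


def name_key(record):
    """Order-insensitive, accent-insensitive key for a name.

    Sorting the tokens makes "Jane Smith" and "Smith Jane" collide,
    which catches first/last names entered in the wrong fields.
    """
    raw = f"{record.first_name} {record.last_name}"
    # Strip accents: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = plain.lower().replace(".", " ").replace(",", " ").split()
    return tuple(sorted(tokens))


def likely_duplicates(records):
    """Return sorted pairs of record ids sharing an email or a name key."""
    seen = {}
    pairs = set()
    for rec in records:
        keys = [("name", name_key(rec))]
        if rec.email:
            keys.append(("email", rec.email.strip().lower()))
        for key in keys:
            for other_id in seen.get(key, []):
                pairs.add((other_id, rec.record_id))
            seen.setdefault(key, []).append(rec.record_id)
    return sorted(pairs)


records = [
    AuthorRecord(1, "Jane", "Smith", "js@example.org"),
    AuthorRecord(2, "Smith", "Jane"),                    # names switched
    AuthorRecord(3, "J.", "Smyth", "JS@example.org"),    # modified surname
    AuthorRecord(4, "Bob", "Jones", "bob@example.org"),
]
print(likely_duplicates(records))
```

Exact matching like this will miss many real variants (initials versus full names, name changes), so it is best seen as a first pass ahead of a human or author-driven claims step.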

ResearcherID in a sense doubles the problem - ISI makes no attempt to verify authorship claims, and it doesn't even seem to provide a way to uniquely list articles more than 10 years old (I uploaded all my 1995 and earlier publications there, and it doesn't seem to understand anything about them - no citation data etc.). At least CrossRef would solve the "what article are you talking about" problem by ensuring unique article identifiers in the first place. And then, how do we uniquely associate our author records with a ResearcherID number? ISI doesn't currently provide a protocol for an individual to prove to a third party that they own a particular ResearcherID, and the web services they do provide are tied to WebOfScience subscriptions. An individual can acquire more than one ResearcherID, and list subsets of the same publications under multiple identities, if they choose (or if they just forgot they'd registered previously).

AuthorClaim seems to be based on an internal database of articles (most of mine were not there, though it found over 1000 matching my name!), so it controls the article side of things - doing this with the CrossRef database and DOIs as article identifiers would make sense. That doesn't seem to be what it's doing yet, but at least articles should be uniquely identified there, so it doesn't have the ResearcherID problem on that front. On the other hand it does seem to have similar issues with claims and potential for author duplication.

OpenID, generically, does not help either, although it does provide the third-party proof-of-ownership piece that ResearcherID is missing right now. An individual can have many different OpenIDs, just as they can have many different email addresses, and an OpenID associated with an individual is in practice probably about as useful as an email address for uniquely identifying them. We already have email addresses for essentially all our (corresponding) authors of the last decade, and two decades for a good fraction, and it's still tough to figure out exactly who's who.

Unless there's some strong motive for researchers to stick to a unique non-shared ID in self-identifying, or other actors in the research system force such a unique ID somehow, this issue of duplicate records for older work is not going away.

(2) Identity of authors from the point of submission through publication/citation etc.

  • this ensures the integrity of the links between author(s) and article by getting it in up front - we have this now based on email addresses (and names and affiliations), at least for corresponding authors, but those can change, so a more permanent unique ID would be useful.
  • However, this doesn't completely solve the duplicate ID problem; an author can still acquire more than one ID. The advantage in a going-forward system is that the author would have some motive to unite their records; otherwise their publications would be split between the separate IDs.
  • Co-authors (other than the corresponding author) would also need to have some way to prove their identities and authorship if we want the same reliable connection for other than the corresponding author. This has the advantage that all authors should be aware of a publication in progress. On the other hand, this could be quite a burden for large collaborations, and we'd need to make exceptions for papers where one of the authors becomes unavailable or deceased during the publication process.
The requirements for handling issue (2) are technically straightforward but perhaps practically difficult due to the need for cooperation:

    • Provide every author with a unique permanent identifier (OpenID, ResearcherID, email address...)
    • Provide a mechanism for third-party validation of that identifier (OpenID and email allow this - email just by having the third party send a validation email)
    • Have publishers require authors go through that validation in the process of submitting articles for publication
    • Record and preserve the unique permanent identifiers as part of the article record (associated with a given author of that article), and distribute as part of article metadata etc.
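The validation and recording steps above can be sketched roughly as follows, using email as the identifier. This is only an illustration under stated assumptions: the publisher holds a secret key, emails a token derived from the address and a manuscript id to the claimed address, and accepts the author once the token comes back. `SECRET_KEY`, the function names, and the metadata layout are all hypothetical, and actually sending the email is left out.

```python
import hashlib
import hmac

# Assumption: a secret known only to the publisher's submission system.
SECRET_KEY = b"replace-with-a-real-secret"


def make_token(email, manuscript_id):
    """Token the publisher would email to the claimed address."""
    message = f"{email.strip().lower()}|{manuscript_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def verify_token(email, manuscript_id, token):
    """True only if the token matches this email and this manuscript."""
    expected = make_token(email, manuscript_id)
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(expected, token)


def record_validated_author(metadata, author_name, email, token):
    """Attach the validated identifier to the article's author list."""
    if not verify_token(email, metadata["manuscript_id"], token):
        raise ValueError("identifier not validated")
    metadata.setdefault("authors", []).append(
        {"name": author_name, "identifier": "mailto:" + email.strip().lower()}
    )
    return metadata


# Hypothetical article record at submission time.
article = {"manuscript_id": "MS-1"}
emailed_token = make_token("A.Author@Example.org", "MS-1")
record_validated_author(article, "A. Author", "a.author@example.org", emailed_token)
print(article["authors"])
```

The same shape would apply to an OpenID or similar identifier: the publisher only needs some third-party check that the submitter controls the identifier before storing it with the article.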

The requirements I'd suggest to get full authoring relationships for historical data, issue (1), are technically trickier but perhaps practically easier:

    • Heuristically group author-article pairs by author name, affiliation, subject, email, and whatever other article-based data is available that could uniquely identify each author
    • Use non-article data, if available (history of a particular person's affiliations or email addresses, for instance) to join these groups together into best-match single-author clusters
    • For authors still working as researchers, apply the unique identifier used for current work (issue 2, above) to have them validate themselves and claim to be particular authors, and correct or approve the single-author cluster(s) belonging to them.
    • Ensure that only one validated claim (one unique author identifier) is allowed per author of any article
    • Update article metadata with the author identifiers
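The grouping, joining, and one-claim-per-author steps above might look something like the sketch below. It is a simplified illustration, not a description of any existing system: author-article pairs are merged with a small union-find structure when they share an email address or an exact name plus affiliation, extra "alias" groups stand in for non-article data such as a person's history of email addresses, and a claims table refuses a second identifier for the same author slot. All field names and function names are assumptions.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find used to merge author-article pairs."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps the trees shallow.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def _keys(pair):
    """Matching keys for one author-article pair (hypothetical fields)."""
    name = pair.get("name", "").strip().lower()
    affiliation = pair.get("affiliation", "").strip().lower()
    email = pair.get("email", "").strip().lower()
    keys = []
    if name and affiliation:
        keys.append(("name+affiliation", name, affiliation))
    if email:
        keys.append(("email", email))
    return keys


def cluster_pairs(pairs, email_aliases=()):
    """Group (article, position) slots into best-guess single-author clusters.

    email_aliases: groups of addresses known from non-article data
    to belong to one person.
    """
    sets = DisjointSet()
    first_seen = {}
    for pair in pairs:
        slot = (pair["article"], pair["position"])
        sets.find(slot)  # register the slot even if nothing matches it
        for key in _keys(pair):
            if key in first_seen:
                sets.union(first_seen[key], slot)
            else:
                first_seen[key] = slot
    for group in email_aliases:
        slots = [first_seen[("email", e.strip().lower())]
                 for e in group if ("email", e.strip().lower()) in first_seen]
        for other in slots[1:]:
            sets.union(slots[0], other)
    clusters = defaultdict(list)
    for pair in pairs:
        slot = (pair["article"], pair["position"])
        clusters[sets.find(slot)].append(slot)
    return sorted(sorted(members) for members in clusters.values())


def claim(claims, slot, identifier):
    """Allow only one validated identifier per author slot of an article."""
    existing = claims.get(slot)
    if existing is not None and existing != identifier:
        raise ValueError(f"{slot} is already claimed by {existing}")
    claims[slot] = identifier


pairs = [
    {"article": "A1", "position": 1, "name": "J. Smith",
     "affiliation": "Univ X", "email": "js@x.edu"},
    {"article": "A2", "position": 2, "name": "J. Smith",
     "affiliation": "Univ X", "email": ""},
    {"article": "A3", "position": 1, "name": "Jane Smith",
     "affiliation": "Univ Y", "email": "jane@y.edu"},
    {"article": "A4", "position": 1, "name": "J. Smith",
     "affiliation": "Univ Z", "email": ""},
]
print(cluster_pairs(pairs))
print(cluster_pairs(pairs, email_aliases=[("js@x.edu", "jane@y.edu")]))
```

Requiring name plus affiliation (rather than name alone) is a deliberate choice here to avoid merging different people with a common name; the cost is more fragmented clusters, which the author's own validated claim can then join or correct.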

There's a lot of potential here, but it's going to be slow going without a widely used and agreed upon unique, validatable, permanent author identifier.

Comments

Turns out there's lots more out there on this, even just recently...

Just this past Monday Geoffrey Bilder at CrossRef posted a "what do people want" with a variety of links to continuing and previous discussion on the subject. Most notably I'd like to quote from Bilder's interview a couple of months back. He also sees two major categories of "use cases" for author ids, divided roughly as I have, into "knowledge discovery" and "authentication" purposes. And he makes it clear we need a centralized authority of some sort on this:

my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is- how do you make the centralized system that eventually emerges accountable? This is, of course, a social issue more than a technical issue and involves making sure that whatever entity emerges has clearly defined data portability policies and a “living will” that attempts to guarantee that the service can be run in perpetuity- even if by another organization.

There was also a follow-up summary of the openwetware discussion from Cameron Neylon with many more links... some good quotes from that:

any system that works will have to be credible and trustworthy to researchers as well as other users, and have a solid and sustainable business model ...
the majority view appeared to be that CrossRef would be right place to start ...
This is not a problem unique to research and one in which a variety of messy and disparate solutions are starting to arise. Maybe the best option is to sit back and wait to see what happens. I often say that in most cases generic services are a better bet than specially built ones for researchers because the community size isn’t there and there simply isn’t a sufficient need for added functionality. My feeling is that for identity that there is a special need, and that if we capture the whole research community that it will be big enough to support a viable service.