Elsevier URNs

Specification version 3.0 (Target versions: PS 7.x/MSR 3.x/MS 3.x)

Elsevier URNs serve as unique identifiers of Pathway Studio (PS) nodes. To simplify their use, Elsevier URNs are composed so that a simple, case-sensitive string comparison (strcmp) can be used to compare them for equality. This requires a formal approach to their creation and stricter checks of their validity (with possible conversion).

Elsevier URNs conform to the URN syntax specification (RFC 2141, RFC 2396):

URN ::= urn:NID:NSS

As a general guideline, and unless explicitly specified otherwise, alphabetical parts of Elsevier URNs should be written in lower case. Also, the NSS portion is always urnencoded, so that it contains only characters from the following set: 0-9 a-z A-Z - _ . ! ~ * ' ( ) (not counting percent signs used for encoding).

The urn: part is required and should be in lower case.

The Namespace Identifier (NID) part of Elsevier URNs identifies the type of the object and dictates the format of the corresponding Namespace Specific String (NSS). All Elsevier NIDs have the agi- prefix to minimize namespace clashes with non-Elsevier URNs. A lower-case alphanumerical string for the type follows the prefix; specific types and corresponding strings are listed below.
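The prefix rules above can be checked mechanically. The following is a minimal sketch, not part of the specification; the function name elsevier_urn_nss is hypothetical. It accepts only a lower-case "urn:" followed by an "agi-" NID with a non-empty lower-case alphanumerical type, and returns a pointer to the NSS part.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the NSS part of an Elsevier URN, or NULL if the
 * "urn:" prefix or the "agi-" NID is malformed. On success *nid_len is
 * set to the length of the NID (the text between the two colons).
 * Hypothetical helper; the NSS itself is not validated here. */
const char *elsevier_urn_nss(const char *urn, size_t *nid_len)
{
    const char *nid, *p;

    assert(urn);
    assert(nid_len);
    if (strncmp(urn, "urn:agi-", 8) != 0)
        return NULL;                 /* wrong case or missing prefix */
    nid = urn + 4;                   /* NID starts at "agi-" */
    p = nid + 4;                     /* type string follows the prefix */
    while ((*p >= 'a' && *p <= 'z') || (*p >= '0' && *p <= '9'))
        p++;
    if (p == nid + 4 || *p != ':')
        return NULL;                 /* empty type or illegal character */
    *nid_len = (size_t)(p - nid);
    return p + 1;
}
```

A caller would typically compare the NID against the list of known types before applying the type-specific NSS checks described in the sections below.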

The format of the Namespace Specific String (NSS) depends on the type. In most cases it is an identifier belonging to a fixed nomenclature or dictionary, uniquely defined by the NID part. If the nomenclature allows letters and is, by its nature, case-insensitive, the case to use in Elsevier URNs is fixed in the corresponding section of this document. Also, no matter what nomenclature is used, spaces and other special characters in the original entry must be urnencoded [1] before being included in the NSS part. Encoding replaces each such character with three characters: a percent sign (%) and two lowercase hexadecimal digits specifying the ASCII code of the character [2]. The plus sign (+) must not be used to encode space characters; use %20.

[1] PA 2.5 and earlier did not urnencode NSS parts, letting in spaces, colons and other characters explicitly deprecated by this specification. See Appendix B for details on conversion/validation issues.

[2] Unicode characters outside the ASCII range should first be converted to the shortest UTF-8 sequence; each resulting byte is then urnencoded in the usual manner.

To guarantee that strcmp can be used for URN equality tests, the set of characters that need to be encoded is fixed. Characters in the set 0-9 a-z A-Z - _ . ! ~ * ' ( ) must always be included as is; every character not in this set must be urnencoded (sample code for this is in Appendix A at the end of the document). ASCII NUL ('\0') is not allowed in any form.

The rest of the document specifies the format for particular types of entities.

Protein URNs

PS's protein nodes (node type “Protein”) rely on Entrez GeneID (formerly LocusLink) as a fixed nomenclature for the NSS part:

ProteinURN ::= urn:agi-llid:EntrezGeneID

EntrezGeneID is a positive integer in decimal notation with no leading zeroes; no URN encoding is needed. GeneIDs from three organisms are used in the following order of preference: human, mouse and rat. The following URNs are also acceptable if an EntrezGeneID for the protein does not (yet) exist:

urn:agi-prot:HugoID

urn:agi-prot:HugoSymbol

urn:agi-gbprot:GenBankID

Note: HUGO Symbols and GenBank IDs may contain special characters (HUGO allows #); these and other such symbols must be urnencoded. Also, both nomenclatures specify that the canonical letter case is UPPER, so upper case must be used in NSS parts. For GenBank, accession numbers without versions should be used. Although HUGO IDs (HGNC IDs), HUGO Symbols and GenBank URNs are all acceptable, they should not be used for proteins that have assigned Entrez GeneIDs, because URNs from different nomenclatures will not be mapped to each other in the PS database.

Examples:

urn:agi-llid:51004 recommended

urn:agi-prot:UGT1A13

urn:agi-gbprot:AAH35562
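The EntrezGeneID rule (positive decimal integer, no leading zeroes) is easy to enforce before a URN is built. The sketch below is illustrative only; is_plain_decimal_id and make_protein_urn are hypothetical names, and the ID is assumed to arrive as a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if s is a positive decimal integer with no leading zeroes. */
int is_plain_decimal_id(const char *s)
{
    assert(s);
    if (*s < '1' || *s > '9')
        return 0;                    /* empty, zero, leading zero or sign */
    while (*s >= '0' && *s <= '9')
        s++;
    return *s == '\0';               /* reject trailing garbage */
}

/* Writes "urn:agi-llid:<id>" into out; returns 0 on success, -1 if the
 * ID is malformed or the buffer is too small. */
int make_protein_urn(const char *gene_id, char *out, size_t outsz)
{
    int n;

    assert(gene_id);
    assert(out);
    if (!is_plain_decimal_id(gene_id))
        return -1;
    n = snprintf(out, outsz, "urn:agi-llid:%s", gene_id);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Rejecting "051004" rather than silently stripping the zero keeps the strcmp-based equality guarantee visible to the data provider.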

Small Molecule URNs

Most chemicals use the CAS registry number for URN:

SmallMolURN ::= urn:agi-cas:CASnumber

As an alternative, chemicals that are not covered by the CAS registry, or whose CAS number is unknown, may use URNs based on NLM's PubChem compound or substance IDs (CIDs, SIDs):

SmallMolURN ::= urn:agi-pccid:PubChemCID

SmallMolURN ::= urn:agi-pcsid:PubChemSID

PubChemCID (PubChem Compound ID, as assigned after April 6, 2005) is a decimal number with no leading zeroes. So is PubChemSID (PubChem Substance ID); substance IDs should not be used for substances describing a single compound with a known CID.

As a last resort, a small molecule URN can be derived from its internal MedScan ID or name:

urn:agi-smol:InternalID

urn:agi-smol:name

The InternalID is a numerical small molecule ID (range: 1,000,000 ≤ ID < 2,000,000). IDs are sequentially assigned by Elsevier as new entities are entered into Elsevier's internal registry. This nomenclature coincides with IDs used by MedScan and entries of MedScan dictionaries. ID assignment is stable, but not extendable by third parties.

The chemical name used for a URN should be a case-sensitive string identifiable in text by the preprocessor (lower case is preferred when case is unimportant). The string should be urnencoded, as required by general Elsevier URN conventions.

Examples:

urn:agi-cas:58-08-2 recommended

urn:agi-pccid:213587 recommended alternative

urn:agi-pcsid:6850756 recommended alternative for substances

urn:agi-smol:1808978

urn:agi-smol:diacyl%20lipopeptide

Cell Object URNs

PS's cell object nodes (node type “CellObject”) rely on Gene Ontology's cellular_component IDs for unique identification:

CellObjectURN ::= urn:agi-gocellobj:GoID

where GoID is a 7-digit numerical identifier assigned by GO (without the GO: prefix). If for a certain cell object a GO ID does not exist, the following alternative can be used:

urn:agi-cellobj:InternalID

which uses numerical InternalIDs in the range 2,000,000 ≤ ID < 3,000,000. IDs are sequentially assigned by Elsevier as new entities are entered into Elsevier's internal registry. This nomenclature coincides with IDs used by MedScan and entries of MedScan dictionaries. ID assignment is stable, but not extendable by third parties.

Examples:

urn:agi-gocellobj:0005739 recommended

urn:agi-cellobj:2000929
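GO-derived URNs recur in several sections below, always with a seven-digit NSS and no GO: prefix. The following sketch shows one way to normalize an accession into such a URN; make_go_urn is a hypothetical helper, and the NID is passed in so the same code can serve the other GO-based namespaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "urn:<nid>:<7-digit GO ID>" from a GO accession given either as
 * "GO:0005739" or as bare "0005739". Returns 0 on success, -1 on error. */
int make_go_urn(const char *nid, const char *go_id, char *out, size_t outsz)
{
    size_t i;
    int n;

    assert(nid && go_id && out);
    if (strncmp(go_id, "GO:", 3) == 0)
        go_id += 3;                  /* the prefix is not part of the NSS */
    if (strlen(go_id) != 7)
        return -1;                   /* leading zeroes must be present */
    for (i = 0; i < 7; i++)
        if (go_id[i] < '0' || go_id[i] > '9')
            return -1;
    n = snprintf(out, outsz, "urn:%s:%s", nid, go_id);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

An ID with its leading zeroes stripped (for example "5739") is rejected rather than padded, on the assumption that the caller should be told about malformed input.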

Protein Complex URNs

Protein complexes covered by Gene Ontology use GO-derived URNs of the following form:

ComplexURN ::= urn:agi-gocomplex:GoID

GoIDs are seven-digit numbers; leading zeroes, if any, must be present.

If a GO ID does not exist for a certain well-known protein complex, the following alternative can be used:

urn:agi-complex:InternalID

which is derived from MedScan internal numerical IDs (3,000,000 ≤ ID < 4,000,000). IDs are sequentially assigned by Elsevier as new entities are entered into Elsevier's internal registry. This nomenclature coincides with IDs used by MedScan and entries of MedScan dictionaries. ID assignment is stable, but not extendable by third parties.

If a complex has no GO ID and its identity is based solely on its content, its URN can be derived from the URNs of its parts as follows:

1) the list of components' URNs is sorted alphabetically

2) sorted URNs are concatenated into a single string using space char as a separator

3) MD5 digest is calculated for the concatenated string, and transformed into hexadecimal form (32 chars, all lowercase)

4) the output urn is formed by concatenating "urn:", the namespace prefix ("agi-complex:"), the string "urnhash-" and the digest in hex form:

ComplexURN ::= urn:agi-complex:urnhash-MD5Digest
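The steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: "sorted alphabetically" is taken to mean byte-wise strcmp order, the function names urnhash_input and urnhash_format are hypothetical, and step 3 (the MD5 digest of the joined string) is left to whatever MD5 library is available.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_urn(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Steps 1-2: sorts urns[] in place (strcmp order, an assumption) and joins
 * them with single spaces. Returns a malloc'ed string that the caller
 * frees, or NULL on allocation failure. */
char *urnhash_input(const char **urns, size_t count)
{
    size_t i, len = 1;               /* 1 for the terminating NUL */
    char *joined;

    assert(urns);
    qsort(urns, count, sizeof *urns, cmp_urn);
    for (i = 0; i < count; i++)
        len += strlen(urns[i]) + 1;  /* URN plus one separator */
    joined = malloc(len);
    if (!joined)
        return NULL;
    joined[0] = '\0';
    for (i = 0; i < count; i++) {
        if (i > 0)
            strcat(joined, " ");
        strcat(joined, urns[i]);
    }
    return joined;
}

/* Step 4: formats a 16-byte MD5 digest as "urn:<nid>:urnhash-<32 hex>".
 * out must hold at least strlen(nid) + 46 bytes. */
void urnhash_format(const char *nid, const unsigned char digest[16], char *out)
{
    int i, n;

    assert(nid && digest && out);
    n = sprintf(out, "urn:%s:urnhash-", nid);
    for (i = 0; i < 16; i++)
        n += sprintf(out + n, "%02x", digest[i]);   /* lowercase hex */
}
```

The caller computes the MD5 digest of the string returned by urnhash_input and passes the 16 raw bytes to urnhash_format with the NID "agi-complex".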

Examples:

urn:agi-gocomplex:0000119 recommended

urn:agi-complex:urnhash-ed8290a1a6d723b489f231f942a7a67d recommended alt.

urn:agi-complex:3000001

Cell Process URNs

Cellular processes covered by Gene Ontology use GO-derived URNs of the following form:

CellProcessURN ::= urn:agi-gocellproc:GoID

GoIDs are seven-digit numbers; leading zeroes, if any, must be present.

If a GO ID does not exist for a certain well-known cellular process, the following alternative can be used:

urn:agi-cellproc:name

which uses urnencoded, lowercase, human-readable short (usually single-word) names of cellular processes, assigned by Elsevier as new entities are entered into Elsevier's internal registry.

Examples:

urn:agi-gocellproc:0006916 recommended

urn:agi-cellproc:sarcolemma%20excitability

Pathway URNs

In Pathway Studio, pathways are document-like objects which usually are not assigned URNs. The only situation where a URN might be needed is to specify a pathway's place in the folder structure when making database dumps. For this purpose, a URN can be derived from a unique UUID (128-bit Universally Unique Identifier, as per RFC 4122: http://www.ietf.org/rfc/rfc4122.txt; also known as a GUID in the Windows world):

PathwayURN ::= urn:agi-pathway:uuid-UUID

UUIDs have the following form: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, where each x stands for one lowercase hexadecimal digit.
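Because URNs are compared with strcmp, a UUID written with uppercase digits would produce a different URN for the same pathway. The following is a minimal sketch of a format check; is_lowercase_uuid is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if s has the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with
 * lowercase hexadecimal digits only, 0 otherwise. */
int is_lowercase_uuid(const char *s)
{
    size_t i;

    assert(s);
    if (strlen(s) != 36)
        return 0;
    for (i = 0; i < 36; i++) {
        char c = s[i];
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (c != '-')
                return 0;            /* hyphens at fixed positions */
        } else if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) {
            return 0;                /* uppercase or non-hex digit */
        }
    }
    return 1;
}
```

A string that passes the check can be appended directly to "urn:agi-pathway:uuid-", since hyphens and lowercase hex digits never need urnencoding.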

As an alternative, a pathway URN can be derived from the name of the pathway. For this purpose, it is recommended to make the name from a list of one or more protein name(s) separated by whitespace followed by the word “pathway”.

urn:agi-pathway:name

Examples:

urn:agi-pathway:uuid-7d444840-9dc0-dec1-beef-5ffdce74fad2 recommended

urn:agi-pathway:JAK%20pathway

Protein Functional Class URNs

Protein functional classes covered by Gene Ontology use GO-derived URNs of the following form:

FunctionalClassURN ::= urn:agi-go:GoID

GoIDs are seven-digit numbers; leading zeroes, if any, must be present.

If a GO ID does not exist for a certain well-known functional class, and the class corresponds to a well-known enzyme with an EC number, an EC-based URN can be used:

FunctionalClassURN ::= urn:agi-enz:ECNumber

For the ECNumber part, the “EC” prefix should not be used; only the numerical dot-separated part goes into the NSS.

If a functional class has no GO ID and its identity is based solely on its membership, its URN can be derived from the URNs of its members as follows:

1) the list of members' URNs is sorted alphabetically

2) sorted URNs are concatenated into a single string using space char as a separator

3) MD5 digest is calculated for the concatenated string, and transformed into hexadecimal form (32 chars, all lowercase)

4) the output urn is formed by concatenating "urn:", the namespace prefix ("agi-protfc:"), the string "urnhash-" and the digest in hex form:

FunctionalClassURN ::= urn:agi-protfc:urnhash-MD5Digest

As a last-resort alternative, the URN for a functional class can be derived from its name. Note that name-based URNs should not be used if an EC- or GO-based ID can be assigned.

urn:agi-protfc:name

Examples:

urn:agi-go:0008545 recommended

urn:agi-enz:1.14.99.7 recommended

urn:agi-protfc:urnhash-a6d7e889f2fa67d2942a73123db490a1 recommended alt.

urn:agi-protfc:tyrosine%20kinase

Ontology Group URNs

Ontology groups representing concepts of Gene Ontology use GO-derived URNs of the following form:

OntologyGroupURN ::= urn:agi-gogroup:GoID

GoIDs are seven-digit numbers; leading zeroes, if any, must be present.

In Pathway Studio, Elsevier ships two additional ontologies, AO (a simple alternative to GO) and PO (plant-specific). AO's and PO's internal IDs have the same form as GO IDs, and the URNs are derived in a similar manner:

OntologyGroupURN ::= urn:agi-aogroup:AoID

OntologyGroupURN ::= urn:agi-pogroup:PoID

AoIDs and PoIDs are seven-digit numbers; leading zeroes, if any, must be present.

Examples:

urn:agi-gogroup:0008545 recommended

Treatment URNs

Pathway Studio's treatment nodes (node type “Treatment”) use URNs derived from MedScan internal numerical IDs (13,000,000 ≤ ID < 14,000,000):

urn:agi-treatment:InternalID

IDs are sequentially assigned by Elsevier as new entities are entered into Elsevier's internal registry. This nomenclature coincides with IDs used by MedScan and entries of MedScan dictionaries. ID assignment is stable, but not extendable by third parties.

As an alternative, treatment URNs can be based on urnencoded, lowercase, human-readable short (usually single-word) names of treatments:

urn:agi-treatment:name

Examples:

urn:agi-treatment:13000938

urn:agi-treatment:cold%20shock

Disease URNs

PS's diseases (node type “Disease”) covered by the MeSH disease branch (C) use MeSH-based URNs of the following form:

DiseaseURN ::= urn:agi-meshdis:MeSHheader

MeSH headings are urnencoded in their original text form.

Note: Headings should be taken as-is, with original case, spaces, commas etc., and urnencoded according to this specification. The best way to ensure correct case and punctuation is to copy the heading verbatim from the “MeSH Heading” field of the NLM entry page for the term, e.g. http://www.nlm.nih.gov/cgi/mesh/2005/MB_cgi?mode= &term=Breast+Neoplasms,+Male&field=entry. Do not copy the encoded query from the above URL; urnencode the actual heading!

All other diseases use MedScan internal numerical IDs (9,000,000 ≤ ID < 10,000,000). IDs are sequentially assigned by Elsevier as new entities are entered into Elsevier's internal registry. This nomenclature coincides with IDs used by MedScan and entries of MedScan dictionaries. ID assignment is stable, but not extendable by third parties.

urn:agi-disease:InternalID

Examples:

urn:agi-meshdis:Carcinoma%2c%20Merkel%20Cell recommended

urn:agi-disease:9123456
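The internal MedScan ID ranges quoted in the sections above do not overlap, so a numerical NSS can be sanity-checked against its NID with a small table. This is a sketch only; the table and the function name internal_id_in_range are hypothetical, and the ranges are copied from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct id_range { const char *nid; long lo; long hi; };  /* lo <= ID < hi */

static const struct id_range INTERNAL_RANGES[] = {
    { "agi-smol",       1000000L,  2000000L },
    { "agi-cellobj",    2000000L,  3000000L },
    { "agi-complex",    3000000L,  4000000L },
    { "agi-disease",    9000000L, 10000000L },
    { "agi-treatment", 13000000L, 14000000L },
};

/* Returns 1 if id lies in the range reserved for nid, 0 if it does not,
 * and -1 if nid has no internal-ID range in the table. */
int internal_id_in_range(const char *nid, long id)
{
    size_t i;

    assert(nid);
    for (i = 0; i < sizeof INTERNAL_RANGES / sizeof INTERNAL_RANGES[0]; i++)
        if (strcmp(nid, INTERNAL_RANGES[i].nid) == 0)
            return id >= INTERNAL_RANGES[i].lo && id < INTERNAL_RANGES[i].hi;
    return -1;
}
```

Note that agi-smol and agi-treatment also accept name-based NSS parts, so a failed numerical check there only means that the NSS is not an internal ID.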

Appendix A: Sample URN encoding procedure

This is an example of the correct URN encoding procedure for the NSS part. Translate it to your favorite language/library.

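    #include <assert.h>
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* encode s, using outbuf to collect the result;
       cbuf_t and the cb*() calls are character-buffer helpers
       that are not defined in this document */
    char* urnencode(const char *s, cbuf_t* outbuf)
    {
        const char *mks = "-_.!~*'()"; /* rfc2396 */
        assert(s);
        assert(outbuf);
        cbclear(outbuf);
        while (*s) {
            /* read as unsigned char: a negative value would be invalid for
               isalnum() and would overflow buf for bytes above 0x7f */
            int c = (unsigned char)*s++;
            if (isalnum(c) || strchr(mks, c) != NULL) {
                /* as-is; assumes the "C" locale, where isalnum() is ASCII-only */
                cbputchar(c, outbuf);
            } else {
                /* encode as %xx with lowercase hex digits */
                char buf[4];
                sprintf(buf, "%%%.2x", c);
                cbputs(buf, outbuf);
            }
        }
        return cbdata(outbuf);
    }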
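For readers without the buffer helpers used above, the same procedure can be written against a caller-supplied array. This is a sketch under assumptions: the name urnencode_buf and its return convention are hypothetical, the input is assumed to be ASCII or already converted to UTF-8, and the character tests avoid isalnum so that the result does not depend on the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Percent-encodes s into out (capacity outsz bytes, including the NUL).
 * Returns the encoded length, or (size_t)-1 if out is too small. */
size_t urnencode_buf(const char *s, char *out, size_t outsz)
{
    static const char hex[] = "0123456789abcdef";   /* lowercase digits */
    static const char mks[] = "-_.!~*'()";          /* rfc2396 marks */
    size_t n = 0;

    assert(s && out);
    if (outsz == 0)
        return (size_t)-1;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        int alnum = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
                    (c >= 'A' && c <= 'Z');
        if (alnum || strchr(mks, c) != NULL) {
            if (n + 1 >= outsz)
                return (size_t)-1;
            out[n++] = (char)c;                     /* as-is */
        } else {
            if (n + 3 >= outsz)
                return (size_t)-1;
            out[n++] = '%';                         /* encode */
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 15];
        }
    }
    out[n] = '\0';
    return n;
}
```

Each byte of a multi-byte UTF-8 sequence is encoded separately, so an input of the two bytes 0xc3 0xa9 becomes %c3%a9.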

Appendix B: Conversion and Validation

To force adoption of this specification for PS 3.0, PS's input routines should provide automatic conversion of malformed URNs, complemented by the user feedback that data providers need to fix their bugs and make such conversion unnecessary in the future. Common problems and their solutions:

- missing urn: prefix: add it
- malformed NID: replace obsolete forms if possible; make sure the agi- prefix is in place
- malformed NSS, case 1: if the NSS part contains any characters other than the ones explicitly allowed for the NSS part and no percent signs, it is likely that it has not been URN-encoded; encode it to fix the problem
- malformed NSS, case 2: if the NSS part contains any characters other than the ones explicitly allowed for the NSS part and has percent signs, it is likely that it has been URN-encoded with the wrong character set; decode and re-encode it to fix the problem
- malformed NSS, case 3: type-specific checks may be needed to make sure that the NSS uses the correct case and conforms to type-specific requirements.
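The first two NSS cases can be told apart with a single pass over the string. The sketch below is one possible reading of those heuristics, not a prescribed algorithm; the enum and the function name classify_nss are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum nss_state { NSS_OK, NSS_NEEDS_ENCODE, NSS_NEEDS_REENCODE };

/* 1 if c belongs to the set that must appear as-is in an NSS part. */
static int is_allowed(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c != 0 && strchr("-_.!~*'()", c) != NULL);
}

/* Classifies an NSS string: no illegal characters, illegal characters
 * without percent signs (likely never encoded), or illegal characters
 * with percent signs (likely encoded with the wrong character set). */
enum nss_state classify_nss(const char *nss)
{
    int has_percent = 0, has_illegal = 0;

    assert(nss);
    for (; *nss; nss++) {
        if (*nss == '%')
            has_percent = 1;
        else if (!is_allowed((unsigned char)*nss))
            has_illegal = 1;
    }
    if (!has_illegal)
        return NSS_OK;
    return has_percent ? NSS_NEEDS_REENCODE : NSS_NEEDS_ENCODE;
}
```

This check does not verify that every percent sign is followed by two lowercase hexadecimal digits, and it says nothing about case 3; both would need additional, type-specific code.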