Sitting in one of the nice pubs in Edinburgh during KohaCon 2012, I had the
idea to add MARC records as a proper datatype to the PostgreSQL database
server. For those not familar wit MARC, it is a standard format for bibliographic data.
If MARC records are a proper datatype, that means they are stored right
in the database, are backed-up, can be restored, replicated etc. just
with the standard database tools. If then a function is provided to
access individual fields of a MARC record, then this can be used in SQL
expressions, e.g. for selects or to create views etc. As PostgreSQL
supports functional indexes, you can create indexes on individual MARC
fields, giving you super fast access to your data.
How does it look?
There is a datatype called 'marc' for now:
CREATE TABLE books (
id serial,
sig varchar(16),
marc_record marc
);
MARC records are loaded into the database as raw, binary records,
encoded in hexadecimal:
INSERT INTO books (sig, marc_record) VALUES ('a01b', '3030383830.....');
To access individual MARC fields, the function 'marc_field()' was
created, it returns VARCHAR:
SELECT id, marc_field(marc_record, '020') AS isbn FROM books WHERE sig =
'a01b';
Of course MARC fields can be used to search data:
SELECT SELECT id, marc_field(marc_record, '020') AS isbn FROM books
WHERE marc_field(marc_record, '245') like 'whatever%';
An index on a specific field can easily be created:
CREATE INDEX books_isbn_idx ON books (marc_field(marc_record, '020'));
As a syntactic sugar, the expression marc_record@'020' is equal to
marc_field(marc_record, '020').
Marijana Glavica kindly let me use a MARC database of ~250'000 records
to make some tests, here are some real world examples:
books=# \d test_marc
Table "public.test_marc"
Column | Type | Modifiers
--------+---------+--------------------------------------------------------
id | integer | not null default nextval('test_marc_id_seq'::regclass)
marc21 | marc |
books=# select count(id) from test_marc;
count
--------
246727
(1 row)
How many croatian books do they have?
books=# select count(id) from test_marc where
substring(marc_field(marc21, '008'), 37, 3) = 'hrv';
count
-------
52582
(1 row)
Of course much more complex queries are possible with this, and using
the right indexes it is really, really fast. One query I tested when
from 8.4 seconds to 0.21 ms with the right index. That's a speedup of
40'000.
Thanks to Marijana, Dobrica, and Marc for feedback, interesting
discussions and ideas!
Feedback and suggestions are of course more than welcome.



In a past life I wrote some MARC format convertors... apparently different countries do different things with different fields.
But as I recall, MARC itself is little more than a structured storage format.
And your examples seem to show that you've only implemented the physical format, not the "intelligence" (which is good).
I also seem to recall that within a MARC field (which are denoted by a number like 020 I think?) there can be all sorts of subfields. Do you envisage making anything to make those easier to extract and indexable or will that be the domain of substr and friends?
And yes, I have plans to add proper subfield support as well, and also support for altering fields (or subfields). What I have now is only the beginning. And there is the question whether PostgreSQL's text search functions can be used for full text title search etc.
I'm sure we could make use of a MARC extension, too. I'll keep my eyes open
My guess is that you will end up having to shred the MARC out into separate tables, too, so that you can have meaningful values for "title" and "sorting title" and the umpteen different kinds of subjects and authors and other minutely granular access points to which librarians insist upon having access. (I'm a librarian, so count me among the guilty parties on that front...)
As an example, queries against fields will need to support indicators. Drawing from http://www.loc.gov/marc/bibliographic/bd245.html for extracting the title of a work in MARC21:
245 04 $aThe Year book of medicine
The sortable title would be "Year book of medicine", as determined by the second indicator "4" which instructs you to ignore the first 4 characters for sorting purposes. But you'll also want to provide access to the full title (which will also potentially include subfields b, c, and others, but excluding h because that tells us a fact about the medium in which the work is expressed and isn't a part of the title).
And so the MARC2MODS XSL transform does a lot of heavy lifting for us in sorting out all of the semantics of MARC21, and all of the day-to-day queries operate on the shredded values, not the original XML. I suspect anyone using the MARC type will have to recreate the same sort of rules and will require the same sort of granular access to the indicators... and repeating subfields... in the proper order.
Augh, I'm flashing back to all the headaches that were involved in providing a MARC library for PHP that supported all of these ins and outs (File_MARC in PEAR). The horror! The horror!