Sitting in one of the nice pubs in Edinburgh during KohaCon 2012, I had the
idea to add MARC records as a proper datatype to the PostgreSQL database
server. For those not familar wit MARC, it is a standard format for bibliographic data.
If MARC records are a proper datatype, that means they are stored right
in the database, are backed-up, can be restored, replicated etc. just
with the standard database tools. If then a function is provided to
access individual fields of a MARC record, then this can be used in SQL
expressions, e.g. for selects or to create views etc. As PostgreSQL
supports functional indexes, you can create indexes on individual MARC
fields, giving you super fast access to your data.
How does it look?
There is a datatype called 'marc' for now:
CREATE TABLE books (
id serial,
sig varchar(16),
marc_record marc
);
MARC records are loaded into the database as raw, binary records,
encoded in hexadecimal:
INSERT INTO books (sig, marc_record) VALUES ('a01b', '3030383830.....');
To access individual MARC fields, the function 'marc_field()' was
created, it returns VARCHAR:
SELECT id, marc_field(marc_record, '020') AS isbn FROM books WHERE sig =
'a01b';
Of course MARC fields can be used to search data:
SELECT SELECT id, marc_field(marc_record, '020') AS isbn FROM books
WHERE marc_field(marc_record, '245') like 'whatever%';
An index on a specific field can easily be created:
CREATE INDEX books_isbn_idx ON books (marc_field(marc_record, '020'));
As a syntactic sugar, the expression marc_record@'020' is equal to
marc_field(marc_record, '020').
Marijana Glavica kindly let me use a MARC database of ~250'000 records
to make some tests, here are some real world examples:
books=# \d test_marc
Table "public.test_marc"
Column | Type | Modifiers
--------+---------+--------------------------------------------------------
id | integer | not null default nextval('test_marc_id_seq'::regclass)
marc21 | marc |
books=# select count(id) from test_marc;
count
--------
246727
(1 row)
How many croatian books do they have?
books=# select count(id) from test_marc where
substring(marc_field(marc21, '008'), 37, 3) = 'hrv';
count
-------
52582
(1 row)
Of course much more complex queries are possible with this, and using
the right indexes it is really, really fast. One query I tested when
from 8.4 seconds to 0.21 ms with the right index. That's a speedup of
40'000.
Thanks to Marijana, Dobrica, and Marc for feedback, interesting
discussions and ideas!
Feedback and suggestions are of course more than welcome.