Sunday, August 3. 2008Why DBMSs are so complex
Developing a database application means converting your data into types understood by the DBMS, generating SQL code, sending it to the DBMS for evaluation, retrieving the result, and converting the result into types understood by your application language. It's even more strange when you use parameterized SQL: you generate some code with placeholders instead of literals, and then you send the literals separately (by "more strange," I certainly don't mean "worse," -- parameterized SQL is a huge improvement).
If this design were suggested for any purpose other than interacting with a DBMS, others would react with immediate (and justifiable) suspicion. The use of "eval" is strongly avoided, and many popular languages don't even support it. So why is this design so widely accepted for database management systems? The reason is a combination of the rich semantics of a language like SQL that supports constraints, arbitrary relational expressions, isolated transactions with rollback, etc.; and the centralized nature of a DBMS. It's not an unsolvable problem, but it's often approached from the wrong direction. Continue reading "Why DBMSs are so complex" Sunday, February 24. 2008Data Labels and Predicates
Here are a couple common representations of data sets:
or We naturally think of the former as representing a relation, but both representations are logically equivalent, even if the latter is overly verbose. In both cases, we've essentially just labeled the data, by which I mean we've briefly documented the meaning of individual properties, independent of the other properties. Continue reading "Data Labels and Predicates" Tuesday, December 18. 2007Database Formalities
Most application development can essentially be described as formalizing a business process. You take an informal process, remove the ambiguity that's invariably present, adapt it to computer interfaces, and restate that process in a formal language. This has many benefits, among them:
Friday, December 14. 2007ruby-pg is now the official postgres ruby gem
ruby-pg is now the official rubyforge project for the "postgres" ruby
gem. See the project here: http://www.rubyforge.org/projects/ruby-pg or install the gem directly: # gem install --remote postgres The previous project has gone unmaintained for a long time, which lead to the fork. This gem includes some important fixes, most notably the ability to compile against PostgreSQL 8.3. Continue reading "ruby-pg is now the official postgres ruby gem" Tuesday, December 11. 2007Terminology Confusion
I recently read the following article, another opinion in the long-standing surrogate key vs. natural key "debate":
http://rapidapplicationdevelopment.blogspot.com/2007/08/in-case-youre-new-to-series-ive.html I put "debate" in quotes because it's filled with so much confusion, I don't think most people understand what they're arguing over. People use different definitions for surrogate keys, often in the same discussion, so for the purposes of this article I'll define surrogate key to mean: "A system generated key that is not derived from the business rules, hidden from the business, but visible to the application". The surrogate key is used to take the place of a natural key, which is derived from the business rules. In other words, a surrogate key is a pointer. The article seems objective and reasonable at first pass, and the author seems to be honestly attempting to analyze the trade-offs. But all of the arguments the author makes are really business arguments for a generated unique identifier, not arguments for an identifier that is hidden from the business. Making a positive match between people is not always easy, because people don't want it to be. It's called privacy, and it has rational benefits (it makes it harder for people to steal your identity) and psychological benefits (people naturally like privacy). So what do businesses and government agencies do in order to get the job done? They each assign individuals a generated identity specific to that organization. An average person probably has many of these identities: SSN, grocery store savings card, student identification number, driver's license number, etc. Having so many identities helps preserve our privacy, because it's not easy to obtain a mapping between the various identities. I know about these many identities because they are not hidden from the business, they are explicitly laid out in the business rules by which these organizations operate. Hidden surrogate keys do not model reality, and do not help the business operate at all. You still have the same problems that you did before. Continue reading "Terminology Confusion" Monday, December 3. 2007On ORMs and "Impedence Mismatch"
The solution ORMs provide is to "map" an object class, which is a type (or domain), onto a table, which is a relation variable (a.k.a. relvar). This supposedly abstracts away an "impedance mismatch" between the two. ORMs are already off to a bad start, mapping a type to a variable, but I'll continue.
The real impedance mismatch is a set of very fundamental differences between application data and data in a well-designed relational database. Application data:
Saturday, November 17. 2007700 queries for one page load?
My employer is evaluating RT. So, we have a test instance set up, of course, and it's in a working state. The application is written in Perl with the Mason framework, it has support for many databases (including PostgreSQL), and generally has a decent feature set.
We need to migrate from our previous system, so we need to import the data into RT. RT's tools and APIs do not meet our importing needs, so we look into the schema to do the import directly. I decide the easiest way to figure out how the schema fits together is to just turn on query logging. So, we reload the page, and all of a sudden a deluge of output appears in our "tail -f". At first I thought it was just the normal number of queries generated by some abstraction layer, and we could dig through them. But it was about 700 queries. Continue reading "700 queries for one page load?" Monday, October 29. 2007Databases and Data Types
Hypothetically, if I were to use a much simpler database (let's say SQLite or BerkeleyDB) for a project, rather than a powerful RDBMS like PostgreSQL, what would be the first thing that I would miss from PostgreSQL? Not "miss" in a theoretical, long term sense, but "miss" in that it would cause a practical impediment to actually constructing even a prototype application?
Would it be triggers? Certainly not. Server side functions? Those would certainly be missed, although may projects go along fine without them. SQL's relational algebra and calculus that make a relational database relational? Maybe, that would certainly be preferable to procedural code using BerkeleyDB, but I don't think this would be the first thing I'd miss (digression: actually, it wouldn't be missed until later, when I was actually trying to debug a difficult problem or look at the data for analysis). The first thing I'd miss, very specifically, would be a TIMESTAMPTZ (or similar) datatype. Everything else pales in comparison to the lack of this single datatype. SQLite and BerkeleyDB both really just store text/binary data, and don't really have datatypes. What's strange is that even numbers are probably not as important of a type. How often do I find myself sorting by numerical value, or selecting records where some numerical value falls within some range? It happens, but not nearly as often as performing the same operations with time data. I don't think that I'm unique in this, either. This makes me think that types are very important to consider during development. And these ideas are a big part of what motivated me to work on the "t_interval" type (a project I'm working on with Selena and grzm), something that I think is needed in all RDBMSs. I've spent a significant amount of time (especially lately) thinking about data types, not just for databases but also for languages. I consider SQLite or BerkeleyDB to be analogous to assembly language: very simple, just pushes bytes around. It's very simple to do simple things, you can pretty much draw a picture of exactly what's happening (usually the picture would consist of a lot of boxes and arrows). It's much harder, however, to actually write complex systems in assembly, because it's harder to prove that the assumptions you're making later in the software are actually true. The exact same thing is true of simple databases: I put the bytes in, I get them back out; simple, right? But any complex software built on a "simple" database usually becomes a mess of inconsistent data, and the meaning of the data usually drifts with time. The complexity of PostgreSQL can really simplify things. Back to PostgreSQL: writing your own data type is one of the most interesting things that you can do with PostgreSQL (aside from working on the source code itself). PostgreSQL treats your data type the same way that it treats an internal data type, and that's very important. The data type you write is not a second-class citizen, and you can make a "myinteger" type that's every bit as good as the built-in integer type. While constructing a data type, you need to interact with C functions, types, operators, operator classes, input parsing, and text representation. You might even want to build the routines necessary for B-Tree, GiST, or GIN indexing. And in the process, you learn a lot about PostgreSQL. More than just learning PostgreSQL internals, you also learn a lot about types in general, and how your application can make better use of types. If you're a postgresql power user who wants to know how postgres works in greater depth, I highly recommend building a data type.
« previous page
(Page 1 of 1, totaling 8 entries)
next page »
|
Calendar
QuicksearchArchivesCategoriesSyndicate This BlogBlog Administration |
|||||||||||||||||||||||||||||||||||||||||||||||||
