Enough alliteration already!
I was looking at Regina Obe's
excellent piece on how to use psql to do an import on fixed-width data, and thought to myself, "self, how would you do this with fewer copies, just in case you happened across a few hundred gigs of data?"
Here's what I came up with:
As I often do in situations like these, I applied some constructive
laziness, some Perl and some shell tools.
First, open up the file and start to make an unpack() pattern, which
is Perl's way to deal with fixed-width data. It's fast.
perl -le 'my $a="";
map {s/\D*(\d+)-(\d+).*/$a.="A".(1+$2-$1). " "/e} split(/\n/,<<'EOT');
Where does the pattern come from? Well, the page in Regina's post has
descriptions of all the fields, so I just pasted those in.
Columns 1-2: United States Postal Service State Abbreviation
Columns 3-4: State Federal Information Processing Standard (FIPS) code
Columns 5-9: Place FIPS Code
Columns 10-73: Name
Columns 74-82: Total Population (2000)
Columns 83-91: Total Housing Units (2000)
Columns 92-105: Land Area (square meters) - Created for statistical purposes only.
Columns 106-119: Water Area(square meters) - Created for statistical purposes only.
Columns 120-131: Land Area (square miles) - Created for statistical purposes only.
Columns 132-143: Water Area (square miles) - Created for statistical purposes only.
Columns 144-153: Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
Columns 154-164: Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude
EOT
OK, the pattern's made. Let's use iconv to turn that file into UTF-8
on the fly.
open my $f, "-|", "iconv -f ISO88591 -t UTF8 < places2k.txt";
Now, just a little more work. Unpack using the pattern, then chop off
what we don't want. Regina already supplied the table this is going
into, so we have that already made in our database.
while(<$f>){
chomp;
my $l = join("|", map{s/\s+//;s/\s+$//;$_} unpack($a, $_));
$l =~ s/^([^|]+|[^|]+|[^|]+|[^|]+).*/$1/;
print $l;
}
OK, we've now got a stream of pipe-delimited data. Let's be nice,
close that filehandle, and feed it directly to COPY.
close $f;' | psql -c "COPY places FROM stdin DELIMITER '|'"
And we're done!
A small typo: on my freebsd station i had to change encodings' names in iconv:
ISO88591 -> ISO-8859-1
UTF8 -> UTF-8.
In the second regexp, i escaped the pipes otherwise it returned only the first field (understood as OR in catch pattern). There were certainly removed during the post submission.
$l =~ s/^([^|]+\|[^|]+\|[^|]+\|[^|]+).*/$1/;
I luv' perl
is Perl's way to deal with fixed-width data. It's fast.
FYI: open can do the conversion all by itself, no need to read from iconv