A Genealogical Data Model

Sooner or later, most engaged genealogists outgrow their paper napkin diagrams, typed charts and even filing cards as preferred method of record keeping; they move on to some form of genealogical software. In my case, that step happened relatively early. I did a survey of freeware and commercially available software and was not impressed. No package offered all the features I desired, and none had adequate extension capabilities. Obviously, I had to design my own. But to tell the truth, necessity wasn’t the only reason for my DIY choice. Designing software gives me as much pleasure as watching porn movies or propagating rare orchids may give to others.

The die was cast. I would write my own. But what about context? I wanted the program for myself, to meet my needs, and I wasn’t going to commercialize it. It had to run on a PC platform; after all, real men don’t use Macs! My choice for programming language was between C++ and Visual Basic. I am too lazy to keep track of pointers. So I was pretty much limited to VB6 for the first version.

The next big issue was the basic data model for genealogical records. Obviously, my research led me to GEDCOM, version 5.5.2 . GEDCOM goes back to the 1980s, when it was developed by the LDS church for public ancestry files. It has evolved since and become the de facto standard for serialized genealogical records.  For several years there was talk of GEDCOM XLM, but it never made a mark; and why should it? GEDCOM 5.5.2 is perfect – perfect for serializing genealogic records!

And there’s the rub. Efficient software design involving large data sets requires a relational database. Yes, GEDCOM can be forced into the Procrustes bed of relational tables. But basing a genealogical program entirely on the GEDCOM structure means sacrificing execution time – big time. Drawing a kinship tree iterating through the basic GEDCOM model is like sucking treacle through a straw. On the other hand, taking a steam roller to GEDCOM to force the data into a flat table is no answer either. Resulting rows would contain hundreds of columns, most of them empty.

This conundrum gave me quite a few sleepless nights. But my epiphany arrived just in time: I would keep the information in a GEDCOM structure in the form of a simple, three column table corresponding to the three columns of a GEDCOM record: “Level”, “Tag” and “Value”. Actually, I lied. There are four more columns to make things work: “Node”, “Parent”, “Root” and “Cheat”. Each record has a unique index number called “Node”. To maintain the hierarchical structure of GEDCOM, any record can act as a parent to child records. Child records carry the node number of their parent record in their parent field. In the context of genealogy, this terminology can be confusing. ‘Parent’ and ‘Child’ refer only to a relation between records, not people. Any record with level zero is considered a root record. Its value goes into the “Root” field of itself and all its descendants. That leaves us with the “Cheat” field. It is a boolean, usually ‘false’. I wanted to leave myself the option of cheating, i.e. to use non LDS-sanctioned tags. But I wanted to cheat honestly by signalling where a tag wasn’t ‘kosher’ with a ‘Cheat’ value of ‘true’!

The last ingredient to the data model are the two index tables, one for individuals, the other for families. But having the same data in more than one table of a relational database would probably cause Edgar Codd to turn in his grave. In fact, you can’t even call the database relational. OK, so I call it pseudo-relational. There is a price, though, I have to pay for this act of transgression: every transaction involving writing to a root record or changing a family composition must pass through a compulsory integrity check. One can also wrap this operation into an SQL transaction. But it is a small price to pay.

The program worked like a charm. It was robust, fast and could perform all the functions I wanted. Since then I have written a second version in C#. It is even faster and easier to maintain and extend. It runs on all our networked computers. My wife and I both use it simultaneously. The database runs in an MS-SQL-Express server on one of our machines. When we travel, we create a temporary copy of the database on the laptop.

This entry was posted in Computing, Genealogy. Bookmark the permalink.

One Response to A Genealogical Data Model

  1. ‘Seems re-inventing the wheel is all I have to work with at this time, but I’m looking to get constructive criticism. I’ve a basic structure, and am putting out what I’ve got, and hoping your input may help. I’m using MS Access 2003 (relational database). It seems I must use the “male” in the relationship in the primary field
    GIVEN:
    {datatype}
    three tables, and the primary fields indented

    nm
    nmID “Name ID” {AutoNumber}
    nmLast “Last Name” {Text}
    nmFirst “First Name” {Text}
    nmMiddle “Middle Name” {Text}
    nmDOB “Date of Birth” {Date/Time}
    nmDOD “Date of Death” {Date/Time}
    nmGndr “Gender” {binary} (at the time of parenthood)
    etc. (other fields)
    prtnr
    prtnrID “Partners ID” {AutoNumber}
    prtnrPrsn “Primary Person or Male” {Long Integer} (uses nmID)
    prtnrPrtnr “the Partner” {Long Integer} (uses nmID also) This is the partner of prtnrPrsn
    prtnrWed Wedding date or date of first official “relationship togetherness” for the non-wed folks {Date/Time}
    prtnrStts Status of Relationship (Wed, Roommates, Involved, Divorced, Separated, etc.) {Number} (uses another table for dropdown)
    chldrn
    chldID “Child ID” {AutoNumber}
    chldNm “Child Name” {Long Integer} (uses nmID)
    chldPrnts “Parents” {Long Integer} (uses prtnrID)

Leave a Reply

Your email address will not be published. Required fields are marked *