The quantity of machine-readable census material has increased dramatically over the last decade. In Britain, for example, the 1881 census alone contains over 30 million data records; for Norway both the 1801 and the 1900 censuses are now fully transcribed, totalling about 3 million data records.
Two approaches are usual when working with census material for historical and demographic research. The first uses small subsets of a census with all the original data present and is based on the census lists. The second type of study usually has a greater geographical coverage but uses data at aggregate levels and is based on existing published reports. An example of the first approach might be a close examination of working-class families in part of a town; the second approach might be typified by a comparative economic history of several towns.
The large-scale computerisation of the original census listings changes the relationship between these two approaches, the micro-study and the macro-study. In theory, as one has the whole census in machine-readable format, one could consider using the small-scale approach on a larger scale, or on the whole census. It is also quite possible to redo previously conducted aggregations, and in addition to change them and make many new types of aggregation. This also allows much greater freedom in sampling, since samples need no longer be rigorously defined by the administrative units of the census.
However, in order to use small-scale approaches on larger parts of a census one also needs to computerise the preparation of the source material, i.e. the examination of each record in order to code, classify and/or aggregate for historical analysis. Many of the techniques developed are based on close examination of each individual in the census lists: What is each person's gender and marital status? What is the relationship to the head of the family, e.g. 'son' or 'mother in law'? What is the occupation of this individual? And so on.
This paper deals with the problems encountered in formalising some types of family and household studies (the small-scale approach), in which kin relationships, household composition and household structure are among the key objects of study. More specifically, it deals with the approach to studying family and household developed by the Cambridge Group for the History of Population and Social Structure. The group was established in 1964, and its methods and techniques have become quite widespread. This paper seeks to answer two important questions: To what degree can these techniques for household studies be carried out automatically by computer programmes? Which processes can be formalised, and what are the problems of using computers to make "low-level" interpretations of a historical source? These "low-level" semantics could be rather "high-level" for computers.
A census list has a simple structure: it is basically a list of persons with relatively few attributes, namely name, sex, marital status, relationship to head of family, age and occupation, and sometimes also birthplace, disease and disabilities. The census lists of the last half of the 19th century were also often written on some kind of pre-printed or tabulated form with clearly defined rows and columns. In a way they look very much like a spreadsheet or a table in a database. A final typical feature of the censuses is the way persons are listed: they are very often grouped together in some type of co-resident group, often denoted as a family or household.
Figure 1 shows an example of this type of source, the 1881 census of Great Britain. Figure 1 is one page of the Census Enumerators' Book, which really is a transcription of the household schedules; one schedule was delivered to each household and collected within a week. The reader should take a little time to study the fourth to the seventh columns (labelled Houses and Condition). The '\\' (or '\') of the fourth column indicates the end of a house or household, the fifth column is the name of a person, the sixth the relation to the head of family, and the seventh the marital status. Together with age and sex (eighth and ninth columns) these are important when one studies household structure.
Figure 1 Example of a page in the Census Enumerators' Book of the 1881 Census, Great Britain.
|22 Page 2||Municipal Borough of||(U.) or Building||Male||Female||(2) Blind, (3) Imbecile or [Idiot]|
|1||The Prince William Beer House||/||Charles Smart||Head||Mar||44||Publican||Great Gadesden Herts|
|Patience Smart||Wife||Mar||56||Publican Wife||South Perrott Dorset|
|Louisa Smart||Daur||Unm||16||Dressmaker||Shenley Herts|
|Albert Barton||Stepson||Unm||25||Shoemaker||Shenley Herts|
|Henry Field||Lodger||Widr||65||General Labr||Shenley Herts|
|James Webb||Lodger||Unm||66||General Labr||Codenham Suffolk|
|\\||Joseph Auntler||Lodger||Unm||66||General Labr||Shenley Herts|
|2||Shoe Shop||/||William Hackett||Head||Mar||72||Boot Maker||Shenley Herts|
|\\||Sophia Hackett||Wife||Mar||73||Boot Maker Wife||Aldenham Herts|
|3||Private House||/||Henry Clarke||Head||Mar||48||Blacksmith||Beighton Norfolk|
|Phoebe Clarke||Wife||Mar||41||---||Green Of Shenley Herts|
|Isabel Clarke||Daur||Unm||18||Scholar||Shenley Herts|
|\\||John Bartram||Nursechild||-||7||Scholar||Tottridge ---|
|4||Mary Smith||Head||W||66||---||Chesham Bucks|
|Issak Smith||Son||Unm||25||Gardeners Labr||Shenley Herts|
|\\||Robert Entickross (?)||Lodger||Unm||24||Gardeners Labr||Sussex ---|
|5||Chapel Yard||/||William Greenham||Head||Mar||60||Farm Labr||Hatfield ---|
|\\||Charles Greenham||Son||Unm||18||Farm Labr||St Petters Herts|
|6||George Lawford||Head||Mar||54||Gardener||Smallford Fat... Herts|
|Sarah Lawford||Wife||Mar||50||Gardeners Wife||Enfield Middlesex|
|James Perry||Boarder||Unm||54||Labr||Shenley Herts|
|\\||Abraham Auntler||Boarder||Unm||37||Labr||Shenley Herts|
|7||Thomas Perry||Head||Mar||30||Gardener||Shenley Herts|
|Martha Perry||Wife||Mar||32||Gardeners Wife||North Mimms ---|
|Total of Houses..||4||Total of Males and Females...||16||8|
Table 1 Household structure, Bergen Norway 1865-1875, Ipswich and Hull, England 1881.
|Hammel-Laslett classification scheme|
|1||1a) Widowed||490||7,9 %||708||8,1 %||626||6,1 %||1198||7,9 %|
|Solitaries||1b Single||444||7,1 %||1167||13,4 %||317||3,1 %||454||3,0 %|
|2 No family||2a coresident siblings||34||0,5 %||56||0,6 %||134||1,3 %||175||1,1 %|
|2b coresident relatives, other||33||0,6 %||85||1,0 %||267||2,6 %||323||2,1 %|
|3 Simple family||3a Married couple alone||766||12,3 %||950||10,9 %||1590||15,4 %||2262||14,8 %|
|house-holds||3b Married couple with child(ren)||3270||52,6 %||4051||46,5 %||4836||47,0 %||6449||42,3 %|
|3c Widowers with child(ren)||252||4,1 %||247||2,8 %||249||2,4 %||420||2,8 %|
|3d Widows with child(ren)||575||9,3 %||920||10,6 %||780||7,6 %||1598||10,5 %|
|3e||53||0,9 %||110||1,3 %||60||0,6 %||94||0,6 %|
|4 Extended||4a Extended Upwards||88||1,4 %||113||1,3 %||311||3,0 %||438||2,9 %|
|family house-||4b Extended downwards||79||1,3 %||80||0,9 %||613||6,0 %||850||5,6 %|
|holds||4c Extended laterally||91||1,5 %||187||2,2 %||333||3,2 %||605||4,0 %|
|4d Comb-inations||7||0,1 %||12||0,1 %||61||0,6 %||127||0,8 %|
|5 Multiple Family||5a Secondary unit(s) UP||2||0,0 %||9||0,1 %||11||0,1 %||23||0,2 %|
|house-holds||5b Secondary units(s) DOWN||29||0,5%||8||0,1 %||97||0,9 %||192||1,3 %|
|5c Units all on one level||6||0,0 %|
|5d Fréréches||1||11||0,1 %||36||0,2 %|
|5e Other||1||0,0 %|
|No. of house-holds||6213||100 %||8704||100 %||10296||100 %||15251||100 %|
|No. of person records||30403||39717||48143||70206|
Sources: Census of Bergen, Norway 1865 and 1875 transcribed by Statsarkivet i Bergen. Great Britain census 1881, History Data Service, University of Essex. Public Record Office equivalents RG 11/1848-1878 (Ipswich registration district, Suffolk, England), RG 11/4765-4780 (Hull registration district, Yorkshire, England).
One 'product' of a household analysis is a table like Table 1, which shows the number and the frequency distribution of households according to the Hammel-Laslett classification scheme.
The sample behind Table 1 is extracted from a (hypothetical) study of North Sea ports between 1865 and 1881: Bergen in Norway, and Ipswich and Hull in England, using three different censuses. A table like this could be part of a comparative study of household structure, but contrary to many studies of this kind the coding of the census data is partly done by a programme, and the classification of households is done solely by a computer programme. A brief look at the last row of the table indicates that large populations have been examined by the software, larger than in a household study where the coding and classification are done by hand. Both the size of the data and the time-consuming process of coding and classifying can restrict a manual study and force limitations on its scope and perspective. By using large populations, however, the researcher can also quite easily select sub-populations, e.g. comparing only male-headed households where the head is 25-30 years old, only the households of an occupational group such as mariners, or both. In a computer environment the selection of such sub-populations can be changed and refined at any point.
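The selection of sub-populations described above can be sketched in a few lines. This is only an illustration, assuming that each household record already carries the sex, age and occupation of its head together with its Hammel-Laslett class; the record layout, the field names and the sample households are hypothetical and are not taken from the software discussed in this paper.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical household record; all field names are assumptions.
    household_id: int
    head_sex: str         # "MALE", "FEMALE" or "UNKNOWN"
    head_age: int         # whole years
    head_occupation: str
    hl_class: str         # Hammel-Laslett code, e.g. "3b"


def select(households, sex=None, min_age=None, max_age=None, occupation=None):
    """Return the households whose head matches every criterion that is set."""
    result = []
    for h in households:
        if sex is not None and h.head_sex != sex:
            continue
        if min_age is not None and h.head_age < min_age:
            continue
        if max_age is not None and h.head_age > max_age:
            continue
        if occupation is not None and h.head_occupation.lower() != occupation.lower():
            continue
        result.append(h)
    return result


# Invented sample data, for illustration only.
households = [
    Household(1, "MALE", 27, "Mariner", "3b"),
    Household(2, "MALE", 44, "Publican", "3b"),
    Household(3, "FEMALE", 66, "", "3d"),
    Household(4, "MALE", 29, "Gardener", "3a"),
]

young_male = select(households, sex="MALE", min_age=25, max_age=30)
young_mariners = select(households, sex="MALE", min_age=25, max_age=30,
                        occupation="mariner")
```

Because the criteria are plain arguments, the selection can be changed and re-run at any point without touching the coded data.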
The process of going from the census list in Figure 1 to the figures in Table 1 is long and seldom straightforward, and will not be described in detail here. For a detailed description of the method see Peter Laslett and Richard Wall (eds.), Household and family in past time, Cambridge 1972, pp. 1-86. Briefly, in a computer-assisted analysis of this kind, the steps would be:
The term never-married is used here because it is more specific than unmarried (which can also include the widowed), and offspring can include both own children and step-children. Many households contain only one CFU and these will end up in group 3, cf. Table 1. If there are two CFUs within a household, e.g. husband and wife together with a married son, his wife and child, the household will be classified as 5b, secondary unit disposed downwards from head.
In order to classify households this way the researcher must sometimes also consider the persons' ages and surnames if the specified relationship to the head is ambiguous or unclear. Typically one wants to create a new table or file where each household is a record. Both the person file and the household file can be kept in one database or exported into a statistical package like SPSS or SAS.
Done manually this process is rather time-consuming, and most household analyses are therefore quite limited in the size of the population studied. A typical household study covers one or a few parishes using two or three censuses, i.e. a total population ranging from 300 to 4000 persons and 60 to 800 households in each census.
It is quite obvious that applying this method to larger samples both takes a lot of time and invites mistakes. Computer programs have therefore been written to carry out parts of steps 3-6: the coding and classification of relationship to head, and the classification of households. Common to all of these attempts is that they are highly specialised: they have been applied to a limited number of censuses and to highly proprietary databases, because in each case one was looking for one solution for one census or one type of census. I therefore presume that little effort has been put into making a general tool or a general model for this type of historical computing. The purpose of this paper is partly to highlight some of the problems of making a more general software tool for this type of analysis, one that can be used for a wider range of censuses and for a variety of census languages and databases.
Firstly what features do we need to do a computer-aided analysis?
Not all the information on each individual found in the census is relevant for the classification of households, so I will focus on those fields that are used first to code relationship to head of household and secondly to classify households according to the Hammel-Laslett scheme. For a general discussion of the information in the British censuses relating to individuals see Edward Higgs, A clearer sense of the census, London 1996. For the Norwegian censuses cf. Ståle Dyrvik, Historisk Demografi, Bergen 1983 and Gunnar Thorvaldsen, Håndbok i registrering og bruk av historiske data, Oslo 1996.
Which attributes are needed in this type of household analysis and what codes or semantic categories must they be given? As far as possible the codes given to each individual attribute will follow the proposal for the coding of machine readable sources by Manfred Thaller.
SEQUENCE. Data must appear in the same sequence as in the source. If this information is not part of the data, a sequential number must be added to hold the source order.
SEX. In the Hammel-Laslett scheme (Table 1) the SEX of the head is necessary to classify households into types 3c Widowers with never-married child(ren) or 3d Widows with never-married child(ren). The values in the source must be coded into the semantic categories MALE, FEMALE or UNKNOWN. Persons of unknown sex should be coded manually by using the first name or other variables that indicate sex, e.g. occupation.
MARITAL STATUS. Marital status is a crucial variable because it identifies the CFUs (Conjugal Family Units). It is also used to distinguish between types 1a and 1b, see Table 1. The values for marital status in the source must be coded into the values UNMARRIED, MARRIED, MARRIED SPOUSE ABSENT, WIDOWED, DIVORCED or UNKNOWN MARITAL STATUS.
AGE. Age will be used to verify relationships specified by the relationship to head attribute. Age in censuses can appear as a date of birth, as an age in years, months or days, or as a year of birth. In the 1865 census of Bergen all three forms can be found. The age must be represented as a number of whole years; neither decimals nor a particular date format are really necessary.
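As an illustration of the coding of SEX, MARITAL STATUS and AGE just described, the following sketch maps source strings to the semantic categories and reduces the different age notations to whole years. The lookup tables cover only the English abbreviations visible in Figure 1 plus a few assumed spellings; the function names and the census days used in the example are my own assumptions, and another census language would need its own tables.

```python
from datetime import date

# Assumed lookup tables; only "Mar", "Unm", "Widr" and "W" occur in Figure 1.
SEX_CODES = {"m": "MALE", "male": "MALE", "f": "FEMALE", "female": "FEMALE"}
MARITAL_CODES = {"unm": "UNMARRIED", "mar": "MARRIED",
                 "w": "WIDOWED", "widr": "WIDOWED"}


def code_value(raw, table, unknown):
    """Map a source string to a semantic category, or to the 'unknown' value."""
    key = raw.strip().rstrip(".").lower()
    return table.get(key, unknown)


def age_in_whole_years(census_day, years=None, months=None, days=None,
                       birth_date=None, birth_year=None):
    """Reduce the different age notations to a number of whole years."""
    if years is not None:
        return int(years)
    if months is not None:
        return int(months) // 12
    if days is not None:
        return int(days) // 365
    if birth_date is not None:
        had_birthday = ((census_day.month, census_day.day)
                        >= (birth_date.month, birth_date.day))
        return census_day.year - birth_date.year - (0 if had_birthday else 1)
    if birth_year is not None:
        # Only the year is known, so the result can be one year too high.
        return census_day.year - int(birth_year)
    return None  # age missing in the source


status = code_value("Widr", MARITAL_CODES, "UNKNOWN MARITAL STATUS")
age = age_in_whole_years(date(1881, 4, 3), birth_date=date(1870, 6, 1))
```

Values that are not found in a table fall through to the UNKNOWN category rather than being guessed, so that they can be inspected and coded by hand.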
SURNAME. Surname can also be used for verifying kin relationships. However, surname can be a problematic attribute for identifying or confirming relationships between household members, because of spelling differences and anomalies of different origins. In the 1881 census of Great Britain the surnames seem to be quite consistent, i.e. there are few spelling variations within a family and all members belonging to a family have the same surname (with exceptions in Wales). Used with care, surnames can identify relationships that are ambiguous or need confirmation. Surnames in Norway in the 19th century are a totally different story. In rural areas patronyms were used as last name or surname, but for towns there can be three 'types' of surnames.
|First name||Surname/Last name||Relationship|
Even though they are a family of four, they have four different 'surnames'.
In an English context surnames can be used to identify a family if the relationships are missing, ambiguous or need higher precision; e.g. whether a child is an adopted child, a step-child or a biological child of the head can be spotted this way. Another way of using surnames is to distinguish between the head's family and a lodger's family (especially the children). In the Norwegian context, however, surname is more problematic, because the 1865-1875 censuses fall in the midst of a process of change.
RELATIONSHIP TO HEAD OF HOUSEHOLD is the main source of information for classifying households, but to use it the text string found in the census must be coded and/or classified. There are two main problems. Firstly, the text in this column is not as easy to code automatically as SEX, AGE and MARITAL STATUS, mainly because there are more data, i.e. the relationship is described with several words; more values, i.e. several types of relationship are described; and ambiguous data, i.e. the relationship itself can have several meanings, and the 'correct' meaning can depend on several factors. Secondly, several coding schemes exist. The set of code values proposed by Thaller in his draft proposal is too small, and a semantically richer set of code values for kin relations is needed in order to classify households.
In Norway there exist at least two official coding schemes: a) the scheme used for the 1801 census, with about 24 categories, and b) a simpler scheme based on the 1801 scheme, proposed by The Norwegian Historical Data Centre. The scheme currently used to code the relationship to head of household in the 1881 census of Great Britain is much more detailed and is derived from Michael Anderson's scheme used on the 1851 census. This new 1881 scheme has about 150 codes for kin relationship to head of household, and the total number of unique codes will probably pass 500. However, a software tool needs to rely internally on one coding scheme in order to carry out these semantic operations automatically, i.e. to classify persons into households. This can be achieved by using conversion tables, i.e. code X in scheme A equals code Y in the internal scheme. At some stage Manfred Thaller's proposals for the Coding of Machine Readable Sources must also be extended to be semantically rich enough to handle the aforementioned operations.
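A conversion table of the kind mentioned above can be sketched as a simple lookup from an external scheme to the internal scheme. The scheme names and all code values below are invented for the illustration; they do not reproduce the 1801 scheme, the Essex scheme or Thaller's proposal.

```python
# Hypothetical conversion tables: external code -> internal code.
CONVERSION = {
    "scheme_a": {"01": "HEAD", "02": "WIFE", "03": "SON", "04": "DAUGHTER"},
    "scheme_b": {"H": "HEAD", "W": "WIFE", "S": "SON", "D": "DAUGHTER",
                 "L": "LODGER"},
}


def to_internal(scheme, code):
    """Translate a code from an external scheme into the internal scheme."""
    return CONVERSION[scheme].get(code, "UNKNOWN")


internal = [to_internal("scheme_a", c) for c in ["01", "02", "04", "99"]]
```

A code with no counterpart in the table is reported as UNKNOWN, which makes visible where an external scheme is not semantically rich enough for the household classification.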
However, when done by hand, the coding and classification of relation to head of household is not based solely on the values in one field; at least three types of 'external' information are used.
The problem, of course, is to formalise some of this contextual information, e.g. information that can be inferred from sequence, and external information like common sense. We will have a brief look at some of this external knowledge, knowledge that is not necessarily in the data itself.
Categorising households means that we have to make decisions about relationships: who is married to whom, whose children are these, are these two really married? Often the relationship specified in the source is ambiguous, and when classifying we sometimes use 'common sense' or 'common historical knowledge'. Both to code relationships and to categorise households we also tend to use some more or less explicit assumptions about the historical past. These come into use when information in the source is missing, is ambiguous, or needs confirmation or higher precision. Some are quite trivial and obvious: if a child aged 5 is missing marital status, we assume that the child is unmarried (assuming the age data can be trusted). Also, if someone is said to be the mother of a child and the age difference is 8 years, we 'know' it is a step-child, perhaps after checking the surname or, in a Norwegian case, after looking at the patronyms of other family members. The computer software also needs somehow to have this 'understanding' and knowledge of the 'real world'. Examples of these 'facts' are the age of menopause, the age of menarche, the lowest age for marriage, a 'reasonable' age gap between spouses and the legal age of marriage. I will describe this knowledge as biological and cultural variables.
Table 2 Biological and cultural variables for England and Norway in late 19th century
|Constant||Suggestive values for Hull and Ipswich 1881||Suggestive values for Bergen 1865-1875|
|Mean age of menarche|
|Age of menopause|
|"Maximum" age gap between spouses|
|Age when less than 2% is married, female|
|Age when no less than 2% is married, male|
|Lowest age for male headship|
For England and Norway in the second half of the 19th century we could initially set these variables to values like those shown in Table 2. One problem with these figures is that they change over time. In Norway the age of menarche fell from 17,5 in the 1830s to 16,5 in the 1860s. In Great Britain it fell from about 15,5 in the 1890s to 14,5 in the 1920s. The age of the youngest brides will also differ over time, between places and between social groups, and we must therefore be able to change these variables easily depending on the date and place of the census.
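One way to keep these variables easy to change is to hold them as a parameter set per place and census year. The sketch below is an assumption about how this could be organised, not a description of the existing software. Apart from the age of menarche of 16.5 for Norway in the 1860s, which is quoted above, every number is a placeholder and not a value from Table 2.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CensusParameters:
    # Field names are assumptions that mirror the rows of Table 2.
    age_of_menarche: float
    age_of_menopause: float
    max_spouse_age_gap: int
    min_marriage_age_female: int
    min_marriage_age_male: int
    min_male_headship_age: int


# Placeholder defaults, NOT historical estimates.
DEFAULT = CensusParameters(16.0, 50.0, 25, 16, 18, 18)

# Per-census overrides keyed by (place, census year).
PARAMETERS = {
    ("Bergen", 1865): replace(DEFAULT, age_of_menarche=16.5),
}


def parameters_for(place, year):
    """Return the parameter set for a census, falling back to the defaults."""
    return PARAMETERS.get((place, year), DEFAULT)


bergen = parameters_for("Bergen", 1865)
```

Keeping the values outside the classification code means that a new census, or a revised estimate, only requires a new entry in the table.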
How can the software programme use these variables? The main purpose is to resolve ambiguous relations. The ages of menopause and menarche can be used to decide the type of relationship between a mother and a child. The age at first marriage can be used to assume that persons below this age are unmarried if marital status is missing. The age gap between spouses can be used to decide whether it is likely that two persons in a household are married, e.g. if there is a second 'couple' in the household and the relationship to head of household does not give any good indication of whether they are man and wife or not. A typical decision to make them into a couple is if:
But there are more problems to solve, and the next issue is the importance and semantics of sequence.
In a census the string in the relationship column normally refers to the head of household, as intended. However, quite often the factual relationship and the stated relationship are not the same, because the stated relationship is indirect or goes through another person, e.g. Table 3.
Table 3 Example of contextual inference, using biological and cultural variables
|Person no||Relationship||sex||age||Marital status|
In the example in Table 3 we will assume that the second daughter is really the granddaughter of the head; this relationship, however, is inferred. By using our biological variables and checking the ages we can either reject person number three in Table 3 as a daughter of the head, or even assign the third person as granddaughter of the head.
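The inference just described can be sketched as an age check. The sketch assumes a female head, ages in whole years, and placeholder limits for menarche and menopause; the function names and the three possible outcomes are mine and are not part of the classification program.

```python
def plausible_mother(mother_age, child_age, menarche=16.5, menopause=50.0):
    """True if the mother's age at the birth lies within the fertile span.

    The default limits are placeholders; ages are whole years, so the
    age at birth is only approximate.
    """
    age_at_birth = mother_age - child_age
    return menarche <= age_at_birth <= menopause


def recode_stated_daughter(head_age, earlier_daughter_ages, child_age):
    """Accept, re-assign or leave open a person stated to be 'Daughter'."""
    if plausible_mother(head_age, child_age):
        return "DAUGHTER"
    for daughter_age in earlier_daughter_ages:
        if plausible_mother(daughter_age, child_age):
            return "GRANDDAUGHTER"
    return "UNRESOLVED"  # left for manual inspection


# Invented household: head aged 62, a daughter aged 24, a 'daughter' aged 2.
result = recode_stated_daughter(62, [24], 2)
```

A result of UNRESOLVED marks the cases where neither reading is plausible, so that they are not silently forced into one category.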
Table 4 Example of contextual inference, the semantics of sequence
|Person no.||Surname||Relationship||sex||age||Marital status|
Table 4 exemplifies the semantics of sequence. In this case there are two individuals described as "Daughter", but are both daughters of the head? Doing the classification by hand, one will assume that the last person in the household is the daughter of the lodger (person no. 5) and not the daughter of the head, even though the relationship column holds the same text for person no. 3 and person no. 6. This is also a type of case where the surname (see the discussion of the usage of surnames above) will tell us who is the parent of person no. 6. This type of inference is contextual: it is not based purely on the person in question and this person's relation to the head.
One way of making the program cope with this type of problem is to check the consistency or 'logic' of a household, e.g. if there is a person called lodger and the next person is a wife, then the wife becomes the wife of the lodger. In this process one can also check for other types of 'errors' or inconsistencies. A typical one concerns sex. In the 19th century census lists there is not always a separate field for gender; instead the age is written in one of two separate columns, one for males and one for females. A quite common error, made both by enumerators and by transcribers, is to swap the sex of one or several household members, e.g. the head of the household is female and the next person, described as a wife, is also female. This type of error can also be corrected by a contextual analysis of the attributes of the members of the household.
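The two consistency checks mentioned here, re-attaching a wife or child to a preceding lodger and spotting a 'wife' of the same sex as her partner, can be sketched as follows. This is a deliberately naive illustration: the sets of relationship codes are assumptions, and the rule relies on sequence alone, without the surname check discussed earlier.

```python
# Assumed internal relationship codes.
NON_KIN = {"LODGER", "BOARDER", "VISITOR"}
DEPENDENT = {"WIFE", "SON", "DAUGHTER"}


def resolve_references(members):
    """For each member (in source order) return the index of the person
    the stated relationship most likely refers to; index 0 is the head."""
    refs = []
    anchor = 0  # person that dependent relationships currently point to
    for i, member in enumerate(members):
        rel = member["relationship"]
        if rel in NON_KIN:
            refs.append(0)   # a lodger is a lodger of the head ...
            anchor = i       # ... but later dependents belong to the lodger
        elif rel in DEPENDENT:
            refs.append(anchor)
        else:
            refs.append(0)
    return refs


def same_sex_couples(members, refs):
    """Indices of wives recorded with the same sex as their partner."""
    return [i for i, m in enumerate(members)
            if m["relationship"] == "WIFE"
            and m["sex"] == members[refs[i]]["sex"]]


# Invented household in the style of Table 4.
household = [
    {"relationship": "HEAD", "sex": "MALE"},
    {"relationship": "WIFE", "sex": "FEMALE"},
    {"relationship": "DAUGHTER", "sex": "FEMALE"},
    {"relationship": "SON", "sex": "MALE"},
    {"relationship": "LODGER", "sex": "MALE"},
    {"relationship": "DAUGHTER", "sex": "FEMALE"},
]
refs = resolve_references(household)
```

The second daughter is attached to the lodger purely because of her position in the list. A child of the head who happens to be listed after a lodger would be attached wrongly, which is why a surname comparison would be a natural refinement.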
So far we have discussed ways of making a software program able to process a household. We need to feed the program with:
In order to do:
The last but not least element in this process is to classify the household.
It is beyond the scope of this paper to explain in detail the algorithm used to classify households, so I will restrict myself to mentioning a couple of the key features. The classification algorithm was originally written in the SAS script language by Kevin Schürer, and later refined and rewritten in C/C++ by Arne Solli. This will make it possible to run the software tool on different machine platforms and with different file formats.
As input the programme reads a tab-delimited file, and the user must specify which columns or fields hold marital status, sex, age, surname and relationship to head of household, and also the column(s) that uniquely identify each person, possibly just a sequential number. Appendices A-1 and A-2 give a sample of an input data file and an input description file.
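Reading a tab-delimited file with a user-supplied column mapping can be sketched as below. The mapping is given here as a plain dictionary; this is my own simplification and does not reproduce the format of the input description file in the appendix. The column names and the sample rows are hypothetical (the names are borrowed from Figure 1, the sex values are added for the illustration).

```python
import csv
import io


def read_persons(stream, columns):
    """Read tab-delimited person records and rename the source columns.

    'columns' maps internal field names to column names in the file.
    A sequence number is added to preserve the source order.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    persons = []
    for seq, row in enumerate(reader, start=1):
        person = {name: row[source] for name, source in columns.items()}
        person["sequence"] = seq
        persons.append(person)
    return persons


# Hypothetical input file and column mapping.
SAMPLE = (
    "hh\tname\tsurname\trel\tcond\tage\tsex\n"
    "1\tCharles\tSmart\tHead\tMar\t44\tM\n"
    "1\tPatience\tSmart\tWife\tMar\t56\tF\n"
)
COLUMNS = {
    "household": "hh",
    "surname": "surname",
    "relationship": "rel",
    "marital_status": "cond",
    "age": "age",
    "sex": "sex",
}

persons = read_persons(io.StringIO(SAMPLE), COLUMNS)
```

With a real file the `io.StringIO` object would be replaced by an opened file, e.g. `open(path, newline="")`.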
The classification itself is based on a sequential examination of each household. First comes a pass to check the consistency of the fields sex, marital status and relationship to head of household. Thereafter come several passes to identify the members of conjugal family units (see the explanation of CFUs above), to mark the 'head' of each CFU, and to identify the relationship between the first CFU (the CFU of the household head) and any other CFUs within the household, e.g. whether the second CFU is related upwards or downwards from the CFU of which the household head is a member. The last stage of the examination is to use this aggregated information to classify the household into the Hammel-Laslett categories, like 3b Simple family household, married couple with child(ren), or 5a Multiple family household, secondary unit(s) up, cf. Table 1.
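To give an impression of the last stage, the following sketch classifies only the simplest cases of Table 1 (types 1a, 1b and 3a-3d). It is not the algorithm of the actual program: it assumes that relationships are already coded as HEAD, SPOUSE or CHILD, that the head is listed first, and it returns None for every household it does not handle, including all households with lodgers, other relatives or more than one CFU.

```python
def classify_simple(members):
    """Classify a household into 1a, 1b or 3a-3d, or return None.

    Each member is a dict with 'relationship', 'sex' and 'marital_status',
    all coded with assumed internal values.
    """
    head, others = members[0], members[1:]
    if not others:
        if head["marital_status"] == "WIDOWED":
            return "1a"
        if head["marital_status"] == "UNMARRIED":
            return "1b"
        return None
    spouses = [m for m in others if m["relationship"] == "SPOUSE"]
    children = [m for m in others
                if m["relationship"] == "CHILD"
                and m["marital_status"] == "UNMARRIED"]
    if len(spouses) + len(children) != len(others) or len(spouses) > 1:
        return None  # outside the scope of this sketch
    if spouses:
        return "3b" if children else "3a"
    if head["marital_status"] == "WIDOWED":
        if head["sex"] == "MALE":
            return "3c"
        if head["sex"] == "FEMALE":
            return "3d"
    return None


# Invented households for illustration.
couple = [
    {"relationship": "HEAD", "sex": "MALE", "marital_status": "MARRIED"},
    {"relationship": "SPOUSE", "sex": "FEMALE", "marital_status": "MARRIED"},
]
widow_with_son = [
    {"relationship": "HEAD", "sex": "FEMALE", "marital_status": "WIDOWED"},
    {"relationship": "CHILD", "sex": "MALE", "marital_status": "UNMARRIED"},
]
classes = [classify_simple(couple), classify_simple(widow_with_son)]
```

The extended and multiple types (groups 4 and 5) need the CFU identification described above and are left out here on purpose.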
As output the user gets two files or tables: a) a modified person file/table and b) a household file/table, including the re-coded fields and the household classification. All the data from the input are also replicated in the output files/tables. Appendix A-3 gives a sample of the corresponding output files.
There are several problems that have not yet been discussed but must be solved to reach a level of generality that makes it worthwhile to invest time in this type of historical computing.
Firstly, a general language component must be attached to the system so that attributes like SEX and MARITAL STATUS can be 'understood' and handled automatically by the software. Currently English and Norwegian are partly 'understood' by the software, but not in a general and fully automatic way.
Secondly a way to attach different coding schemes and conversion tables between coding schemes must be implemented. Currently one works partly under the assumption that some of the variables are already coded into the 'Essex-scheme', see above.
Thirdly, more effort must be put into making the software infer from sequence and in this way understand relationship to head of household correctly. The program now depends on the coding of relationship to head of household being done in advance, but a fairly reasonable extension is to supply the program with information on language, a larger dictionary, a coding scheme and more advanced parsing techniques, so that the program can code relationship to head of household automatically. A 100% automatic coding of relationship to head of household is perhaps neither advisable nor possible.
One also needs ways of handling the data even if some fields or types are missing, e.g. what to do if the gender field is missing, or how to cope with coding schemes for relationship to head of household that are not as semantically rich as one would wish.
Support for reading census data as tables in commercial DBMSs like MS Access, Dbase, Paradox or Foxpro must be implemented. Easy reporting is also lacking and must be implemented. At present only tab-delimited exports from these products can be handled.
An important element in this type of computing is also to develop ways of finding and measuring the differences between manually coded and classified data and the same coding and classification done by software: man against machine! This can also help us make the software better.
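A first, simple measure of the difference between a manual and a machine classification is the share of households given the same class, together with a count of each kind of disagreement. The sketch below assumes that both classifications are available as mappings from a household identifier to a Hammel-Laslett code; the identifiers and codes are invented.

```python
from collections import Counter


def compare(manual, machine):
    """Return the agreement rate and a count of (manual, machine) mismatches
    for the households present in both classifications."""
    shared = manual.keys() & machine.keys()
    mismatches = Counter((manual[h], machine[h])
                         for h in shared if manual[h] != machine[h])
    if not shared:
        return 0.0, mismatches
    agreement = (len(shared) - sum(mismatches.values())) / len(shared)
    return agreement, mismatches


# Invented classifications of four households.
manual = {1: "3b", 2: "3a", 3: "4a", 4: "5b"}
machine = {1: "3b", 2: "3a", 3: "3b", 4: "5b"}

agreement, mismatches = compare(manual, machine)
```

The mismatch counts show where the two classifications diverge systematically, e.g. extended households read as simple ones, which is the kind of information that can guide improvements to the software.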
Lastly, but very importantly, the presumptions currently built into the software are rather few and simple compared to what is quite often needed. Presumptions about marital status, offspring and kinship will vary in both time and space (England differs from Norway, see the discussion on surnames), and they are necessary when data are ambiguous or partly missing.
This paper must be treated as a highly preliminary report from ongoing research at the history departments of both the University of Essex and the University of Bergen. A lot of work is still to come. But so far some more general conclusions on historical computing can be drawn.
The focus on these four aspects of historical computing is needed in order to use small-scale techniques on large-scale data sets. The work of formalising these techniques is perhaps a never-ending task, and differences between a computer-aided and a manual coding and classification will occur, just as differences occur between two manual classifications. The question of what can be formalised is perhaps not that interesting, because, as part of this paper shows, formalising the low-level interpretation of a source can also be rather time-consuming. In the end it is perhaps the short-term cost in time that decides whether a method or technique is worth trying to computerise or whether it is better to do the work by hand. In some ways that is a pity, because it often means reinventing the wheel.
Section "SYSTEM" describes general file format attributes, section "EXTRACT" describes the fields as extracted from the source, section "CLASSES" describes which fields correspond to the semantic types SEX, MARITAL STATUS, RELATIONSHIP TO HEAD OF HOUSEHOLD and AGE, and section "KEYFIELDS" defines which fields determine the order (or sequence) of the input. There are also two sections named "HHOLD" and "PERSON" that describe the output; these are not replicated here.