[1] The purpose of this paper is to review practical usage of and experience with a new data documentation standard which is commonly used by other social science data archives. The technological development of this standard has had a major influence on the nature and set-up of the Slovenian Data Archives (ADP). We can draw a parallel in this respect with the situation in the 1960s, when the first national data archives were established under the influence of technological innovations. This was the period when the first initiatives for establishing common standards started. The lack of persistent coordination efforts resulted in a sloppy use of standards and non-compatible catalogues at the start of the Internet. This encouraged data archive organizations to formulate a standard based on new technology, taking account of all previous problems. The new DDI (Data Documentation Initiative) Standard was written in XML. The DDI Standard was exactly the model which the Slovenian Data Archives were looking for. We took part in the beta test and adopted the DDI as our internal format for organizing metadata. After four years of experience in using DDI, the first results have begun to emerge. In what follows, I will mention both the advantages and disadvantages of this emerging standard. After a brief examination of the DDI DTD (Document Type Definition) document structure, I will provide a detailed description of the actual flow of information based on DDI in our archive, the so-called DDI XML Codebook production line. Finally, I will introduce the NESSTAR Catalogue as a tool which is based on the DDI standard and is available free of charge to members of the Council of European Social Science Data Archives (CESSDA).
[2] Social science data archives have the task of storing raw data and instigating secondary analysis. Their development is greatly influenced by technological developments and technical standards in the particular period 1 (Note1: Tanenbaum, E. and Taylor, M. F. (1990): “Developing social science archives”, in: International Social Science Journal12.). The idea of a data archive arose in the late 1950s with the need to store IBM punch cards. In the 1960s, however, computers and new storage media came on the scene and that also spelled the end of storage standards. Data conversion and exchange became key tasks of data archives at that time. They matured as organizations which possessed know-how on various storage formats. One advantage for users was that data archives were capable of providing them with raw data according to the format which users required.
[3] In the early 1990s, at the start of the Internet, history repeated itself with some minor alterations. This time the problem was no longer the lack of compatibility between data storage standards as such, but rather the incompatibility between different metadata languages and formats, and thus between catalogues and codebooks produced by different organizations around the world. The CESSDA electronic codebook specification was an early attempt to overcome this problem. The next attempt was the OSIRIS Codebook Dictionary. For a long time this was a working model of a common standard for describing variables. Another one, a Standard Study Description scheme, provided data archives with at least some guidelines on what information to include in a catalogue of their data holdings. However, a lack of coordination resulted in non-compatible catalogues. 2 (Note2: DDI Committee (2002): The Data Documentation Initiative (DDI) Version 1.1., 2001. The New Specification for Social Science Metadata, Project Description, Data Documentation Initiative. A Project of a Social Science Community, http://www.icpsr.umich.edu/DDI;)
[4] The situation in 1997, when the Social Science Data Archive (ADP) was established, is neatly captured in the following evaluation: “Multiplicity of classificatory languages, search techniques and standards for documenting data”. 3 (Note3: DDI Committee (2002): The Data Documentation Initiative (DDI) Version 1.1., 2001. The New Specification for Social Science Metadata, Project Description, Data Documentation Initiative. A Project of a Social Science Community, http://www.icpsr.umich.edu/DDI;) Every organization adopted its own dialect of existing standards. Creativeness inside local boundaries resulted in many products which were adapted to local needs and were hardly transferable to other places. The CESSDA Integrated Data Catalogue was a solitary example of still existing integration efforts in the 1990s.
[5] When the founder of the first European data archive wrote a historical overview describing the support an established data archive can give to a new one, he talked about a “Midwife function”. 4 (Note4: Scheuch, E. K. (1990): “From a data archive to an infrastructure for the social sciences”, in: International Social Science Journal 123, 93-111; ) The role of the Central Archive for Empirical Social Research (ZA), Cologne, in the late 1960s, when five new archives were established in Europe, was to offer to “share experiences, especially as regards past errors” and provide “technical information on data storage and retrieval”. Using the Internet, new archives in the 1990s were able to see that technical information on cataloguing and codebook production was almost totally unreliable. We were in a twilight zone. We either had to invent our own technical solution and start from scratch, or wait and look for a new and better standard which promised to overcome the problems of non-compatibility between different national archives and to open up possibilities for real cooperation.
[6] At that time, however, DDI was already being discussed. In March 1999 the DDI beta version became operable. Beta testing of the DDI DTD was conducted between March and August 1999. The test was funded through a grant from the US National Science Foundation. The ADP applied for a grant, which secured six months of intensive theory and practice regarding the production of its own XML codebooks. The results of the beta test led to the successful implementation of the first ten XML codebooks. A production line for routine codebook production was enhanced. At the end of the beta test, we began to prepare our own program in XSL (Extensible Stylesheet Language) for XML Codebook presentation on the Internet. This enables our users to browse codebooks prepared using the DDI beta test version. In March 2000 the slightly improved DDI DTD Version 1.0 was published, incorporating suggestions from the beta test. Immediately after that we converted DDI DTD Beta XML Codebooks into the 1.0 version using a special script. Our main emphasis as a new archive was to continue producing XML Codebooks, in order to fill our collections with the most important datasets from Slovenia from the past and present.

[7] Meanwhile, the NESSTAR tool was being refined in parallel, a fact which promised to add functionality to a growing collection of XML Codebooks. Finally, at the end of 2001, an ADP server catalogue was successfully configured. NESSTAR now makes a collection of over 100 Codebooks in DDI XML fully operable. It facilitates searches across Codebooks in a collection as well as simultaneous searches across different collections at different sites. It also serves as an interface for simple data analysis, enables data files to be downloaded and permits controlled access to specific datasets for our users.
[8] There are some advantages and disadvantages in using an existing standard. The DDI DTD is like any other standard when faced with pressure to revise and add new features. It is still an emerging standard. Some of the main data producers, software producers and data archives are still considering whether to use it. One obvious advantage from our point of view is that there is no need to (re)invent local catalogue rules. Use of the standard makes it possible to cooperate in document production. A properly described DDI DTD Codebook document on an international study, prepared in one archive, can be shared between all sites that possess the data for the same study. The adoption of a standard makes it possible to use existing and prototype software tools suitable for the standard environment. A virtual catalogue of different sites, all using the same standard for study description, is manageable. Users can therefore use just one entry point when searching for data. There is a growing list of conversion tools from SPSS and CAI software files, which are made available to users of the same standard.
[9] There are also some drawbacks. There is a danger of being isolated if other bodies do not adopt the same standard. This is what happened with the above-mentioned CESSDA electronic codebook specification in the 1960s. There is of course less potential to add specific emphasis according to local needs. When you are using a standard, the inventive initiative has to be slowed down to concentrate more on efforts aimed at adding value to existing information stored in a standard structure. By their very nature, revisions of a standard must be slow and agreed by a community of users, while bearing in mind compatibility with previous versions of the same standard. When a revision is accepted, cooperative efforts preparing tools for a transfer to a new version move quickly. There is a danger of dependency on someone else’s timetable in the dynamics of tool production. For example, NESSTAR was late in fully adopting the UTF-8 Convention, which was crucial to us. This was the reason why we waited another year before configuring our own NESSTAR server.
[10] What is XML? It is a special computer language for storing various types of information in a structured way. It is especially practical for environments where documents of the same type are produced. It possesses the quality of an ordered information system such as a database, while retaining the flexibility of almost free-text authoring of the documents. In short, “XML is to a document’s intellectual content what HTML is to the physical structure of that document”. 5 (Note5: Thomas, W. L. and Block, W. C. (2001): ) XML becomes an option especially in environments where many specialists, each responsible for a specific content area, are required to contribute information. One reason is that XML can be used without professional knowledge of computer engineering. Different authors can contribute to a document, each with specialist knowledge of his or her subject area. All obey the same content structure. It is therefore user-friendly. When an XML document is finished, it is already prepared for presentation in multiple formats, e.g. a printed book, the Internet etc.
[11] DTD is a generic term for document type definition. DDI DTD in this respect means a special Data Documentation Initiative XML Codebook Document Type Definition (DTD). The XML document must be “well-formed” and “valid”. These are the only requirements which a user of a specific DTD must satisfy when authoring a document.
[12] Well-formed means that a document must follow the XML syntax. The main features are:
[13] Any marked-up document, e.g. an HTML page, can be made well-formed.
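Well-formedness can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard library; the helper name `is_well_formed` and the two sample fragments are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. obeys the XML syntax."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested, every element closed, attribute value quoted.
good = '<var name="v1"><labl>Age of respondent</labl></var>'
# The <labl> element is never closed, so the nesting is broken.
bad = '<var name="v1"><labl>Age of respondent</var>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check of this kind says nothing about validity: a well-formed document may still use elements that the DTD does not allow.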
[14] A valid document also conforms to a specific DTD. The specific DTD to which a particular document conforms is shown in a DOCTYPE declaration. This is illustrated by the underlined path call in the following example:
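As a hedged illustration of such a declaration, the sketch below embeds a hypothetical DOCTYPE line in a minimal document and reads it back with Python's standard library. The root element name `codeBook` and the relative path `codebook.dtd` are assumptions made for the example; an actual codebook would point to wherever the DTD file is kept.

```python
from xml.dom import minidom

# Hypothetical codebook fragment; the DOCTYPE line names the DTD
# that the document claims to conform to.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE codeBook SYSTEM "codebook.dtd">
<codeBook/>
"""

doc = minidom.parseString(DOCUMENT)
doctype = doc.doctype

print(doctype.name)      # root element the DTD is declared for
print(doctype.systemId)  # path or URL of the DTD file
```

The parser used here only reports the declaration; it does not validate. Checking a document against the DTD requires a validating parser or an XML editor.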
[15] When authoring a new document, there is no need to look at the “machine-readable” “codebook.DTD” file. An XML editor helps to check well-formedness and document validity. It helps to choose appropriate elements in accordance with the DTD while editing a document. A “human-readable” Tag Library, which consists of element definitions with practical examples, gives you guidance on the type and form of information that you need. Let us now take a quick look inside the DDI DTD document structure. A DDI Codebook document integrates different levels of information in the same document. There are five main levels of information:
[16] Firstly, a DDI XML Codebook specifies the catalogue contents which are suitable for input to virtual catalogues at different sites, which are produced on various platforms. Secondly, it specifies codebook contents (variables description) which are suitable for input to the “virtual library of all individual measurements in the studies in a collection”. 6 (Note6: DDI Committee (2002): The Data Documentation Initiative (DDI) Version 1.1., 2001. The New Specification for Social Science Metadata, Project Description, Data Documentation Initiative. A Project of a Social Science Community, http://www.icpsr.umich.edu/DDI;)
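The overall layout of a codebook can be sketched as an empty skeleton. The example below assumes the top-level section names `docDscr`, `stdyDscr`, `fileDscr`, `dataDscr` and `otherMat` from the DDI Tag Library; the short descriptions attached to them are paraphrases, and the function name `empty_codebook` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Top-level sections of a DDI codebook, in document order.
SECTIONS = {
    "docDscr": "description of the XML codebook document itself",
    "stdyDscr": "study description (catalogue-level information)",
    "fileDscr": "description of the data files",
    "dataDscr": "variable description (codebook-level information)",
    "otherMat": "other study-related material",
}


def empty_codebook() -> ET.Element:
    """Create a codeBook root with one empty child per section."""
    root = ET.Element("codeBook")
    for tag in SECTIONS:
        ET.SubElement(root, tag)
    return root


codebook = empty_codebook()
print([child.tag for child in codebook])
```

In this picture a virtual catalogue would draw on the `stdyDscr` section, while a library of individual measurements would draw on `dataDscr`.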
[17] In 1990 Erwin Scheuch talked about a dilemma of a library concept versus a data service concept, with a strong preference for the latter 7 (Note7: Scheuch, E. K. (1990): “From a data archive to an infrastructure for the social sciences”, in: International Social Science Journal 123, 93-111; ). In the first approach the unit of storage is the “study”, while in the second emphasis is placed on the variable as the primary unit of storage. The beauty of a DDI XML Codebook is that it encompasses both in an integrated way. It is flexible enough to allow for individual styles in approaches to the description of material in a data archive. It is adaptable to the specific needs of a data archive. We can choose to put special emphasis on the study or the variable level inside the same collection according to the nature or importance of a specific dataset. In a DDI DTD XML codebook you can integrate meta-information about the intellectual contents of a study, its scope, methodological details, retrieval and dissemination policies, file location and format, i.e. very important information for users of any data archive. It also includes references to accompanying documents, e.g. reports on methodology, publications, classification lists, questionnaires, computer syntax files, tables of results, etc. It can include cross-references inside and outside a document, which can function as hyperlinks on the WWW. You can achieve this by using the ID, IDREFS and URI attributes defined in the DDI DTD.
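How such cross-references can be followed by a program is sketched below. The fragment is hypothetical: the attribute names `var` (on a variable group) and `files` (on a variable) are assumed to be IDREFS attributes as described in the DDI Tag Library, and the identifiers, labels and file path are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical codebook fragment with ID, IDREFS and URI attributes.
SOURCE = """
<codeBook>
  <fileDscr ID="F1" URI="data/study01.sav"/>
  <dataDscr>
    <varGrp ID="VG1" var="V1 V2"><labl>Demography</labl></varGrp>
    <var ID="V1" name="sex" files="F1"><labl>Sex of respondent</labl></var>
    <var ID="V2" name="age" files="F1"><labl>Age of respondent</labl></var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(SOURCE)

# Index every element that carries an ID attribute.
by_id = {el.get("ID"): el for el in root.iter() if el.get("ID")}


def resolve(element: ET.Element, attribute: str) -> list:
    """Follow a whitespace-separated IDREFS attribute to its target elements."""
    return [by_id[ref] for ref in element.get(attribute, "").split()]


group = by_id["VG1"]
members = [var.get("name") for var in resolve(group, "var")]
data_file = resolve(by_id["V1"], "files")[0].get("URI")
print(members, data_file)
```

A presentation layer could render each resolved reference as a hyperlink, which is how the cross-references become navigable in a browser.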
[18] To sum up, XML is similar to HTML and is easy to use, fairly flexible, broadly accessible and hyper-textual. Its structure of document contents is also both computer-readable and human-readable, as well as comprehensible.

[19] During the routine production of codebooks you can use templates and entities which are repeated across documents to economize the production line. The first step in acquiring a new dataset involves entering basic information on the new dataset file, the depositor and accompanying material in the ADP Inventory book (an Access database). After choosing the most suitable predefined XML DDI Codebook template (e.g. one from a previous study in a series), we extract the information from the database to the draft XML Codebook. The resulting Codebook is transferred to an Internet catalogue so that quick information on the new study is available. Viewing is supported by a referenced XSL stylesheet using a standard Internet browser.
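The first step can be sketched roughly as follows. This is not the ADP's actual procedure: an in-memory SQLite table stands in for the Access inventory database, the table and column names are hypothetical, and the element names (`titl`, `IDNo`, `depositr`) are assumed from the DDI Tag Library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory stand-in for the inventory database (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (study_id TEXT, title TEXT, depositor TEXT)")
db.execute(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    ("STUDY01", "Example Survey 2001", "Example Research Institute"),
)

# Template reused across studies; only the study-specific fields are empty.
TEMPLATE = """
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl/><IDNo/></titlStmt>
      <distStmt><depositr/></distStmt>
    </citation>
  </stdyDscr>
</codeBook>
"""


def draft_codebook(study_id: str) -> ET.Element:
    """Fill the template with the inventory record for one study."""
    row = db.execute(
        "SELECT study_id, title, depositor FROM inventory WHERE study_id = ?",
        (study_id,),
    ).fetchone()
    if row is None:
        raise KeyError(study_id)
    root = ET.fromstring(TEMPLATE)
    root.find("./stdyDscr/citation/titlStmt/IDNo").text = row[0]
    root.find("./stdyDscr/citation/titlStmt/titl").text = row[1]
    root.find("./stdyDscr/citation/distStmt/depositr").text = row[2]
    return root


draft = draft_codebook("STUDY01")
print(ET.tostring(draft, encoding="unicode"))
```

Keeping the fixed parts in a template means that only the fields that differ between studies are pulled from the database.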
[20] In the second step a full study description section is produced. A depositor is requested to complete a MS Word form, containing elements corresponding to the DDI DTD study description section. A draft XML Codebook from the previous step is edited using the XMetaL® XML editor. Missing pieces of information are added manually by typing them in or by using the "drag and drop" facility.
[21] The third step sees the addition of a codebook data description section which is generated from the SPSS data file. The SPSS data file, which includes variables and value labels, is converted using the free software NSD XML Generator® to an XML data description section of the DDI Codebook and is integrated in the previous study description.
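The idea behind this conversion can be sketched in a few lines. The example does not reproduce the NSD XML Generator; a plain Python list stands in for the variable and value labels read from an SPSS file, the variables themselves are hypothetical, and the element names (`var`, `labl`, `catgry`, `catValu`) are assumed from the DDI Tag Library.

```python
import xml.etree.ElementTree as ET

# Stand-in for metadata read from an SPSS file (hypothetical variables).
VARIABLES = [
    {"name": "sex", "label": "Sex of respondent", "values": {1: "Male", 2: "Female"}},
    {"name": "age", "label": "Age in years", "values": {}},
]


def build_data_description(variables) -> ET.Element:
    """Turn variable and value labels into a dataDscr section."""
    data_dscr = ET.Element("dataDscr")
    for number, variable in enumerate(variables, start=1):
        var = ET.SubElement(data_dscr, "var", ID=f"V{number}", name=variable["name"])
        ET.SubElement(var, "labl").text = variable["label"]
        for code, text in variable["values"].items():
            category = ET.SubElement(var, "catgry")
            ET.SubElement(category, "catValu").text = str(code)
            ET.SubElement(category, "labl").text = text
    return data_dscr


def integrate(codebook: ET.Element, data_dscr: ET.Element) -> None:
    """Append the generated section to an existing study description."""
    codebook.append(data_dscr)


codebook = ET.fromstring("<codeBook><stdyDscr/></codeBook>")
integrate(codebook, build_data_description(VARIABLES))
print(ET.tostring(codebook, encoding="unicode"))
```

Because the section is generated rather than typed, it can be regenerated whenever the labels in the data file are corrected.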
[22] In the fourth step, but only for the most important datasets, full question text is entered in the codebook data description section. We use macros to avoid repeating the process. This task is still the most labor-intensive part of codebook production. If a CAI (Computer Assisted Interviewing) computer-readable file is available, we use a conversion tool to change from the CAI format to the DDI XML format. Two documents, i.e. Slovene and English language DDI XML Codebooks, are finally converted into a NESSTAR-compliant format. They are published in a NESSTAR catalogue along with the data file. Users can search through datasets and catalogues, perform some simple online analysis of data and download files.
[23] Figure 2 shows how these steps are integrated around the final DDI XML Codebook. DDI XML Codebook documents form the center of a "knowledge" production and dissemination system. All the information and material connected to one specific study are integrated in an XML codebook. This opens up possibilities of further restructuring of this information across codebooks.
[24] Issues relating to the DDI XML Codebook production line must be taken into consideration when writing one’s own DDI Codebook. XML editors sometimes lack UNICODE support, which presents a problem to all non-western countries. The use of entities in XML documents helps to standardize document production and makes it faster and easier to translate into English. According to our experience, it is more natural to have two separate documents for a Slovenian and an English language codebook, even though some of the information can be entered in the same document and then distinguished by the use of the xml:lang attribute.
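Both mechanisms can be illustrated with a small, hypothetical fragment: an entity declared once and reused in the text, and two abstracts in the same document distinguished by xml:lang. The entity name `archive`, the sample abstracts and the helper `texts_in` are assumptions made for the example; the element names are taken from the DDI Tag Library as far as we understand it, and the fragment has not been validated against the DTD.

```python
import xml.etree.ElementTree as ET

# Attribute name as ElementTree reports it for the predefined xml: prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SOURCE = """<!DOCTYPE codeBook [
  <!ENTITY archive "Arhiv družboslovnih podatkov (ADP)">
]>
<codeBook>
  <stdyDscr>
    <citation>
      <distStmt><distrbtr>&archive;</distrbtr></distStmt>
    </citation>
    <stdyInfo>
      <abstract xml:lang="sl">Kratek opis raziskave.</abstract>
      <abstract xml:lang="en">Short description of the study.</abstract>
    </stdyInfo>
  </stdyDscr>
</codeBook>"""

root = ET.fromstring(SOURCE)


def texts_in(language: str) -> list:
    """Collect the text of every element tagged with the given xml:lang."""
    return [el.text for el in root.iter() if el.get(XML_LANG) == language]


# The parser expands the entity reference into its declared text.
print(root.findtext("./stdyDscr/citation/distStmt/distrbtr"))
print(texts_in("en"))
```

The same filter run with "sl" would return the Slovenian text, which is how a single bilingual document could serve both audiences if that layout were preferred.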
[25] DDI DTD is attracting increasing attention in the community, a fact which will ensure the production of new tools for enhancing its use. Despite continuing developments and overlapping archive standards, DDI 1.0 is state-of-the-art technology which promises to ensure the longevity of XML Codebook 1.0 documents. The ADP in Slovenia has used its experience of DDI to shape its organization, having adopted the standard at a time when it was the only one commonly available.
26 April 2002