[1] The last meeting of the Eastern European Data Archives was the Forum at the International Association for Social Science Information Service and Technology (IASSIST) Conference, held in Amsterdam in May 2001. Many applications were submitted, some of them well structured and some only at the drawing-board stage. There was considerable interest in creating a network of social science data archives in this part of Europe, both from EU countries and from the former Eastern European bloc. Scientists are starting to realize the crucial importance of having and sharing reliable data for formulating social policies, in spite of a deep-rooted tendency to protect data from any other person or institution.
[2] In Romania this type of open approach emerged from the Institute for Quality of Life in Bucharest. A great deal of social research was carried out after 1990, resulting in a large number of datasets which needed care and attention. After some of the old datasets, and some variables from others, had been lost, a specialized data bank was created within the Institute.
[3] These datasets were not properly stored; instead, they were merely saved in various folders on one computer or another. From 1994 onwards the idea of a specialized data archive emerged, based on the model of the existing data archives in Western Europe. We began to talk about codebooks, cleaning, indexing and cataloguing, but only as separate operations, without a compact system to include all of them. There was no search engine and no method of searching for relevant variables or indicators; keywords were listed only in the printed study descriptions, and they referred solely to the datasets and not to the variables within them. In practice the only users were researchers from the Institute and some academics from the University of Bucharest who knew that these datasets existed. Any other user who wanted information about variables or datasets on a given topic rarely obtained it.
[4] In 2000 and 2001 two useful training visits were paid to the UK Data Archive (UKDA) and the Central Archive for Empirical Research (ZA) in Cologne, Germany. These valuable opportunities to obtain fresh information and know-how on archives opened up new prospects for the emerging data archive in Romania. Up-to-date standards such as the DDI (Data Documentation Initiative) were to be implemented from the very beginning, with the advantage of not having to migrate the datasets from one standard to another. Extremely useful software was generously shared both by the UKDA (the Networked Social Science Tools and Resources system, NESSTAR) and by the ZA (Codebook Explorer). The NESSTAR system was of particular interest to us because it contains everything we did not have: it is based on the new standard in data archiving, the DDI; it has a search engine; and it uses new compact codebooks based on the structure of the DDI/DTD. Software for preparing NESSTAR-compliant codebooks (NSD XML Generator, NESSTAR Publisher) was downloaded from the NESSTAR website, completing the suite of specialized software.
[5] In principle, this was exactly what we needed in order to create a specialized data archive. The only thing left was to gain some practice in using the software.
[6] We were faced with inevitable difficulties regarding our plans and we tried to find effective solutions, in spite of an acute lack of funding (at that time). In fact, these acute financial problems are very common in Eastern European countries.
[7] One option is to wait for some funding to start the process – the worst possible choice in our opinion; the other option is to find quick solutions to specific problems, make as much progress as possible and hope that funding will appear sooner or later.
[8] In our case, this strategy produced good results; the Romanian Social Science Data Archive (RODA) is now on a very promising path.
[9] The major difficulties (apart from the funding problem) we encountered were as follows:
[10] Lack of regular staff: in order to obtain continuous results on time, we need people working as much as possible for the data archive. In fact, employees of the Institute for Quality of Life are currently working on different projects with field work and overlapping deadlines, in addition to the papers and theses they have to prepare. Working for the data archive is more like a temporary job; people are not very motivated by what they perceive as the boring process of cleaning, documenting and indexing datasets, especially when little money is involved. We had to deal with the situation in these circumstances and try to make progress one step at a time.
[11] Lack of training: the software which we acquired for our work, although (relatively) simple, still required a certain amount of training. Notions such as the DDI and its structure, NESSTAR, XML Generator, NESSTAR Publisher, XML, XSL and XML editors can be quite hard to deal with at short notice. This is why, in the first two months, we barely managed to assemble a small team which knew how to work with all this software.
[12] Lack of information about “old” datasets: for datasets from 1990 onwards, information becomes harder to obtain the further back in time we go. For some datasets from 1990, 1991, 1992 and 1993 there is no longer any information available, because at that time nobody knew how important it was to document a dataset. This is why, although it is very frustrating, we have to admit that some datasets will probably never be recovered, and that for some variables in other datasets we do not have the labels, etc. We have spent a great deal of time trying to find this kind of information, and we are still doing so.
[13] Lack of a fast Internet link: in order to properly distribute data to the academic community and interested persons, we definitely need a fast Internet connection. On the one hand, this means a website which can be loaded on any machine; on the other hand, the NESSTAR system requires (at least for the Explorer version) 128MB of RAM. Of course, any person could use the Light version, but there is a need for a fast and continuous Internet connection for the downloading process.
[14] With the academic community anxiously waiting for the data archive to be constructed, two obvious questions arise, which are also common to all emergent data archives in Eastern Europe:
[15] One initial idea is to examine the other data archives on the Internet, see how they do it and try to duplicate whatever they do. We have done this by analyzing dozens of data archives all over the world. We concluded that each archive is based on the same fundamental principles. This was good news, because it gave us some principles to apply. At the same time, however, each data archive applied those principles in a more or less different way. This was not such good news and was not very helpful.
[16] Having set out this objective, we ascertained a number of critical issues for any emergent data archive:
[17] We will try to cover all these issues, bearing in mind those two questions I mentioned earlier: How to…? and How much…?
[18] The website: in this case, the answer is straightforward. There are a great many young people with excellent computing knowledge in Romania, and probably in other countries too. A simple website can easily be created with minimal knowledge of HTML and some JavaScript scripts which can be found freely on the Internet. The cost differs from one country to another; we cooperated, for example, with a student who wanted some experience, and for a student any amount of money can be significant, depending on the particular country. The website which we created for the Romanian Social Science Data Archive (RODA) is extremely simple, but this was our specific objective, so that it can be loaded on any machine, even with a very slow Internet connection. It is far from a professional site, but it nevertheless exists and functions properly. We could make it more complex in future if our Internet connection improves.
[19] The data retrieval system: The DDI solution. As I mentioned above, each archive uses one system or another. Some systems are more efficient and some less so, but all of them aim at the same result: finding the relevant data and distributing the data to interested people. In our case, we had to choose between creating an original system and implementing one which already exists. The first option is not viable at all, because it involves a huge effort and is time consuming. We have to deal with the lack of qualified personnel, and this is the case for all institutions involved in data archiving in Eastern Europe: where can you find one or two individuals who are willing to work on creating a system for practically nothing, because the salaries are so low? We have neither the time nor the money to train the personnel to create such a system, which almost certainly would be less efficient than an already existing professional system. If we can agree on this, it only remains to decide which existing system to implement.
[20] We chose the DDI because it is based on a simple logical structure, it is efficient, it is available, it does not cost anything, and it is now being used, or is starting to be used, by large data archives all over the world.
[21] At first glance, the DDI structure may look complicated, but everything becomes easier and clearer with some training and personal effort. It must be emphasized here that not all the tags in the DDI structure need be filled in, only the essential ones. The rest of them are optional. Once the essential tags have been established (generally based on the Standard Study Description), the only problem is how to fill in the information to create the codebook.
[22] There are four sections in the DDI that need to be filled with information. The first two of them (Document description and Study description) can be edited with any text editor, but we recommend using a dedicated XML editor as it is much easier. We are using Microsoft XML Notepad, a free program that can be downloaded from the Internet. The other two sections (Data file description and Variables description) are even easier to complete using the NSD XML Generator.
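As an illustration only, the four sections can be sketched in a few lines of Python using the standard library. The element names (codeBook, docDscr, stdyDscr, fileDscr, dataDscr, var, labl) follow the DDI codebook structure as we understand it; the study title, file name and variables are invented for the example, and the output has not been validated against the DDI DTD. This is not a substitute for an XML editor or the NSD XML Generator.

```python
import xml.etree.ElementTree as ET


def add_title(section, title):
    """Add a citation/titlStmt/titl chain to a description section."""
    citation = ET.SubElement(section, "citation")
    titl_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = title


def build_codebook(study_title, file_name, variables):
    """Return a minimal DDI-style codebook element.

    `variables` maps variable names to their labels.
    Only a handful of essential tags are filled in.
    """
    root = ET.Element("codeBook")

    # 1. Document description: describes the codebook itself.
    add_title(ET.SubElement(root, "docDscr"), "Codebook for: " + study_title)

    # 2. Study description: describes the study behind the data.
    add_title(ET.SubElement(root, "stdyDscr"), study_title)

    # 3. Data file description: describes the physical data file.
    file_txt = ET.SubElement(ET.SubElement(root, "fileDscr"), "fileTxt")
    ET.SubElement(file_txt, "fileName").text = file_name

    # 4. Variables description: one <var> per variable, with its label.
    data_dscr = ET.SubElement(root, "dataDscr")
    for name, label in variables.items():
        var = ET.SubElement(data_dscr, "var", name=name)
        ET.SubElement(var, "labl").text = label

    return root


# Hypothetical study, file and variables, for illustration only.
codebook = build_codebook(
    "Example survey (hypothetical)",
    "example_survey.sav",
    {"age": "Age of respondent", "sat_life": "Satisfaction with life"},
)
xml_text = ET.tostring(codebook, encoding="unicode")
```

Printing `xml_text` shows the skeleton that the remaining, optional tags would be added to.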
[23] The DDI codebook is then simply published in the NESSTAR retrieval system with NESSTAR Publisher, and the user can browse the data using the NESSTAR Explorer either in the stand alone version or in the NESSTAR Light version which works in a simple web browser.
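To make concrete what a variable-level search over such codebooks means, here is a small Python sketch. It is not how NESSTAR works internally; it only shows that once variable labels sit in a structured codebook, a keyword search over variables becomes straightforward. The codebook content and the function name `find_variables` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up codebook fragment in DDI-style markup.
CODEBOOK_XML = """
<codeBook>
  <dataDscr>
    <var name="v1"><labl>Satisfaction with income</labl></var>
    <var name="v2"><labl>Age of respondent</labl></var>
    <var name="v3"><labl>Household income last month</labl></var>
  </dataDscr>
</codeBook>
"""


def find_variables(codebook_xml, keyword):
    """Return (name, label) pairs for variables whose label contains keyword."""
    root = ET.fromstring(codebook_xml)
    needle = keyword.lower()
    hits = []
    for var in root.iter("var"):
        # A variable without a <labl> element is treated as having an empty label.
        label = (var.findtext("labl") or "").strip()
        if needle in label.lower():
            hits.append((var.get("name"), label))
    return hits


matches = find_variables(CODEBOOK_XML, "income")
for name, label in matches:
    print(name, "-", label)
```

The same loop, run over every codebook in an archive, gives the kind of search by variable that printed study descriptions could not offer.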
[24] All this software is available free of charge. However, this does not mean that it costs nothing to produce: many people have put a lot of effort into creating this software and significant sums of money have been spent. If we can acquire it for nothing, this is equivalent to another method of funding.
[25] The major difficulty that we are facing with the NESSTAR Server (and as we learned from the NESSTAR team, this situation is about to be resolved) involves the Access Control Unit (ACU), which combines a set of logical conditions in order to create user accounts with specific access levels, and certain rules to restrict access to some datasets, irrespective of the user level.
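The kind of logic involved can be sketched as follows. This is our own simplified illustration in Python, not the NESSTAR implementation: the class names, the numeric levels and the idea of an explicit allow-list for restricted datasets are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    level: int  # assumed convention: a higher number means broader access


@dataclass(frozen=True)
class Dataset:
    dataset_id: str
    min_level: int                        # lowest user level allowed to read it
    restricted: bool = False              # if True, the user level is ignored
    allowed_users: frozenset = frozenset()  # explicit exceptions for restricted data


def can_access(user, dataset):
    """Combine the two kinds of condition into one access decision."""
    if dataset.restricted:
        # Restriction rule: applies irrespective of the user's level.
        return user.name in dataset.allowed_users
    # Ordinary rule: the account's level must be high enough.
    return user.level >= dataset.min_level


# Hypothetical accounts and datasets.
student = User("student", level=1)
researcher = User("researcher", level=3)
depositor = User("depositor", level=1)

public_survey = Dataset("survey_public", min_level=1)
internal_survey = Dataset("survey_internal", min_level=2)
embargoed_survey = Dataset(
    "survey_embargoed", min_level=1, restricted=True,
    allowed_users=frozenset({"depositor"}),
)
```

Here `can_access(researcher, embargoed_survey)` is false even though the researcher has the highest level, while `can_access(depositor, embargoed_survey)` is true.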
[26] The solution which we found is functional, but we do not claim it is universal. In our case it simply worked. In our experience, there are some human resources available who are easy to motivate with the right incentives: the students. But why would students be motivated to do this? In the case of our Department of Sociology at the University of Bucharest, students are eager to gain experience in using computers, databases and statistical techniques. Working for a data archive represents a valuable opportunity to acquire this experience. Another reason is that they are awarded a grade for the Practical Sociology Lab. Each student is obliged to take part in this Lab, which is also their main interest. Finally, one other method which I used to attract students is a basic statistics course on Saturdays. Many of my Saturdays have been spent teaching it. Using our Faculty’s computer lab, I shared my experience in using SPSS to carry out basic statistical analysis. It was a great success, with around 25 students attending each Saturday. It is these students who clean and document the data. They get what they want, i.e. more statistical knowledge and a grade, and we get what we want, i.e. well-cleaned and documented datasets. At the time of writing this report, the number of students wanting to attend the group is increasing.
[27] We have now managed to finish about 30 to 35 NESSTAR compliant codebooks, and the others are in progress.
[28] We received excellent support from the Faculty of Sociology and Social Work: using a computer lab in one of the Faculty’s buildings, we can work together with the students in preparing the datasets for archiving. With 10 networked computers at our disposal, we achieved good results in a reasonable amount of time. All the other expenses are covered by the Faculty as well.
[29] The latest good news in our close relationship with the Faculty is that we have a new funding source for the data archive, which came about through one of the Faculty’s projects.
[30] Our next aims are to be accepted into the Council of European Social Science Data Archives (CESSDA) and the International Federation of Data Organizations (IFDO), and to implement the LIMBER (Language Independent Metadata Browsing of European Resources) Thesaurus in our data archive. To date we have only worked with a self-designed thesaurus; LIMBER is the future.
26 April 2002