you are here: Knowledge Base Home > Full Text Archive > Empirical Social Research in Romania, RODA: Data Archive Report 1

The Romanian Social Science Data Archive (RODA)

by
Adrian Dusa


1. The Archive’s situation

[1]  The last meeting of the Eastern European Data Archives was the Forum at the International Association for Social Science Information Service and Technology Conference which was held in Amsterdam in May 2001. Many applications were submitted, some of them well-structured and some only at the drawing board stage. There was considerable interest in creating a network of social science data archives in this part of Europe, both from EU countries and the former Eastern European bloc. Scientists are starting to realize the crucial importance of having and sharing reliable data for formulating social policies, in spite of a deep-rooted instinct to guard data from any other person or institution.

[2]  In Romania this type of open approach emerged from the Institute for Quality of Life in Bucharest. A great deal of social research was carried out after 1990, resulting in a large number of datasets which needed care and attention. After some of the old datasets had been lost, along with some variables from others, a specialized data bank was created within the Institute.

[3]  These datasets were not properly stored; instead, they were merely saved in different folders from one computer to another. From 1994 onwards the idea of a specialized data archive emerged, based on the model of the already existing data archives in Western Europe. We began to talk about codebooks, cleaning, indexing and cataloguing, but only as separate operations without a compact system to include all of them. There was no search engine and no method of searching for relevant variables or indicators; keywords were listed only in the printed study descriptions and they referred solely to the datasets and not the variables within. In practice, the only users were researchers from the Institute and some academics from the University of Bucharest who knew that these datasets existed. If another user wanted information about variables or datasets on one topic or another, they could rarely obtain it.

[4]  In 2000 and 2001 two useful training visits were paid to the UK Data Archive (UKDA) and the Central Archive for Empirical Research (ZA) in Cologne, Germany. These valuable opportunities to obtain fresh information and know-how on archives opened up new prospects for the emerging data archive in Romania. Up-to-date standards such as DDI (Data Documentation Initiative) were to be implemented from the very beginning, with the advantage of not having to migrate the datasets from one standard to another. Extremely useful software was generously shared both by the UKDA (with the Networked Social Science Tools and Resources System – NESSTAR) and by the ZA (with Codebook Explorer). The NESSTAR system was of particular interest to us because it contains everything we did not have: it is based on the new standard in data archiving, the DDI; it has a search engine; and it uses new compact codebooks based on the structure of the DDI/DTD. Software for preparing the NESSTAR compliant codebooks (NSD XML Generator, NESSTAR Publisher) was downloaded from the NESSTAR website, completing the suite of specialized software.

[5]  In principle, this was exactly what we needed in order to create a specialized data archive. The only thing left was to gain some practice in using the software.

2. Major difficulties

[6]  We were faced with inevitable difficulties regarding our plans and we tried to find effective solutions, in spite of an acute lack of funding (at that time). In fact, these acute financial problems are very common in Eastern European countries.

[7]  One option is to wait for some funding to start the process – the worst possible choice in our opinion; the other option is to find quick solutions to specific problems, make as much progress as possible and hope that funding will appear sooner or later.

[8]  In our case, this strategy produced good results; the Romanian Social Science Data Archive (RODA) is now on a very promising path.

[9]  The major difficulties (apart from the funding problem) we encountered were as follows:

[10]  Lack of regular staff: in order to obtain continuous results on time, we need people working as much as possible for the data archive. In fact, the current situation is that Institute for Quality of Life employees are working on different projects with field work and combined deadlines, in addition to the planned papers and theses they have to prepare. Working for the data archive is more like a temporary job; people are not very motivated by what they perceive as the boring process of cleaning, documenting and indexing datasets, especially if there is not much money involved. We had to deal with the situation in these circumstances and try to make progress a step at a time.

[11]  Lack of training: the software which we acquired for our work, although (relatively) simple, still required a certain amount of training. Concepts and tools such as DDI and its structure, NESSTAR, XML Generator, NESSTAR Publisher, XML, XSL and XML editors can be quite hard to deal with at short notice. This is the reason why in the first two months we barely managed to assemble a small team which knew how to work with all this software.

[12]  Lack of information about “old” datasets: in the case of datasets from 1990 onwards, information becomes harder to acquire the further back we go. For some datasets from 1990, 1991, 1992 and 1993 there is no longer any information available because at that time nobody knew how important it was to document a dataset. This is why, although we are very frustrated, we have to admit that some datasets will probably never be recovered, and that for some variables in other datasets the labels are missing. We spent a great deal of time trying to find this kind of information, and we are still doing it.

[13]  Lack of a fast Internet link: in order to properly distribute data to the academic community and interested persons, we definitely need a fast Internet connection. On the one hand, this means a website which can be loaded on any machine; on the other hand, the NESSTAR system requires (at least for the Explorer version) 128MB of RAM. Of course, any person could use the Light version, but there is a need for a fast and continuous Internet connection for the downloading process.

3. Major achievements

[14]  With the academic community anxiously waiting for the data archive to be constructed, two obvious questions arise, which are also common to all emergent data archives in Eastern Europe:

  • How to reach a certain level in order to be recognized as a professional data archive by the international community? And:
  • How can we achieve this goal at little or no cost because our financial resources are very limited?

[15]  One initial idea is to examine the other data archives on the Internet, see how they do it and try to duplicate whatever they do. We have done this by analyzing dozens of data archives all over the world. We have concluded that each archive is based on the same fundamental principles. This was good news because we found some principles to apply. At the same time, however, each data archive applied those principles in a more or less different way. This was not such good news and was not very helpful.

[16]  Having set out this objective, we ascertained a number of critical issues for any emergent data archive:

  • A website which can be viewed by everyone with an interest;
  • A search and data retrieval system, so that the users can find relevant data and download them as quickly as possible;
  • A fast Internet connection, so that the first two conditions are functional and do not just simply exist;
  • Knowledge of how to create these things;
  • Some level of financial support, both domestic and international;
  • Well-documented and cleaned datasets.

[17]  We will try to cover all these issues, bearing in mind those two questions I mentioned earlier: How to…? and How much…?

[18]  The website: in this case, the answer is straightforward. There are a great many young people with excellent computing knowledge in Romania and probably other countries, too. A simple website can easily be created with minimal knowledge of HTML and some JavaScript scripts which can be found freely on the Internet. The cost differs from one country to another; we cooperated, for example, with a student who wanted some experience, and for a student any amount of money can be significant depending on the particular country. The website which we created for the Romanian Social Science Data Archive (RODA) is extremely simple, but this was our specific objective, so that it can be loaded on any machine, even with a very slow Internet connection. It is far from a professional site, but it nevertheless exists and functions properly. We could make it more complex in future if our Internet connection improves.

[19]  The data retrieval system: The DDI solution. As I mentioned above, each archive uses one system or another. Some systems are more efficient and some not so, but all systems come up with the same result: finding the relevant data and distributing the data to interested people. In our case, we had to choose between creating an original system or implementing one which already exists. The first option is not viable at all, because it involves huge efforts and is time consuming. We have to deal with the lack of qualified personnel, and this is the case for all institutions involved in data archiving in Eastern Europe: where can you find one or two individuals who are willing to work on creating a system for practically nothing because the salaries are so low? We have neither the time nor the money to train the personnel to create such a system, which almost certainly will be less efficient than an already existing professional system. If we can agree on this, it only remains to decide what existing system to implement.

4. Why DDI?

[20]  Because it is based on a simple logical structure, it is efficient, it is available, it does not cost anything, and it is now being used or is starting to be used by all large data archives all over the world.

[21]  At first glance, the DDI structure may look complicated, but everything becomes easier and clearer with some training and personal effort. It must be emphasized here that not all the tags in the DDI structure need be filled in, only the essential ones. The rest of them are optional. Once the essential tags have been established (generally based on the Standard Study Description), the only problem is how to fill in the information to create the codebook.
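The idea of filling in only the essential tags can be illustrated with a small check. This is a minimal sketch in Python, not part of the archive's own tooling: the list of "essential" element paths is hypothetical (each archive would choose its own, for example from the Standard Study Description), and the sample codebook is assumed to use DDI element names without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of "essential" elements; each archive would define its own.
ESSENTIAL_PATHS = [
    "stdyDscr/citation/titlStmt/titl",  # study title
    "stdyDscr/stdyInfo/abstract",       # study abstract
    "fileDscr/fileTxt/fileName",        # name of the data file
]


def missing_essentials(codebook_xml, paths=ESSENTIAL_PATHS):
    """Return the essential paths that are absent or contain no text."""
    root = ET.fromstring(codebook_xml)
    missing = []
    for path in paths:
        element = root.find(path)
        if element is None or not (element.text or "").strip():
            missing.append(path)
    return missing


# Illustrative codebook fragment: the title is filled in, the abstract is
# blank, and the file description is absent altogether.
SAMPLE = """<codeBook>
  <stdyDscr>
    <citation><titlStmt><titl>Example survey 1995</titl></titlStmt></citation>
    <stdyInfo><abstract>  </abstract></stdyInfo>
  </stdyDscr>
</codeBook>"""

print(missing_essentials(SAMPLE))
```

Optional elements are simply left out of the list, so a codebook that omits them still passes the check.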

[22]  There are four sections in the DDI that need to be filled with information. The first two of them (Document description and Study description) can be edited with any text editor, but we recommend using a dedicated XML editor as it is much easier. We are using Microsoft XML Notepad, a free program that can be found and downloaded from the Internet. The other two sections (Data file description and Variables description) are even easier to complete using the NSD XML Generator.
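To make the four sections concrete, the following is a minimal sketch in Python of a codebook skeleton. It only illustrates the structure and is not a substitute for XML Notepad or the NSD XML Generator. The element names (`codeBook`, `docDscr`, `stdyDscr`, `fileDscr`, `dataDscr`) are the DDI names for these sections; the function `build_codebook`, the study title, the file name and the variables are hypothetical, and a real codebook would carry many more elements.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, file_name, variables):
    """Build a skeleton codebook with the four top-level DDI sections.

    `variables` maps variable names to their labels.
    """
    root = ET.Element("codeBook")

    # 1. Document description: describes the codebook document itself.
    doc = ET.SubElement(root, "docDscr")
    doc_stmt = ET.SubElement(ET.SubElement(doc, "citation"), "titlStmt")
    ET.SubElement(doc_stmt, "titl").text = "Codebook: " + title

    # 2. Study description: describes the study that produced the data.
    study = ET.SubElement(root, "stdyDscr")
    study_stmt = ET.SubElement(ET.SubElement(study, "citation"), "titlStmt")
    ET.SubElement(study_stmt, "titl").text = title

    # 3. Data file description: describes the physical data file.
    file_dscr = ET.SubElement(root, "fileDscr")
    ET.SubElement(ET.SubElement(file_dscr, "fileTxt"), "fileName").text = file_name

    # 4. Variables description: one <var> element per variable, with its label.
    data = ET.SubElement(root, "dataDscr")
    for name, label in variables.items():
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label

    return root


# Hypothetical study and variables, for illustration only.
codebook = build_codebook(
    "Example survey 1995",
    "example1995.sav",
    {"v1": "Age of respondent", "v2": "Household income"},
)
print([section.tag for section in codebook])
print(ET.tostring(codebook, encoding="unicode"))
```

The first two sections are short enough to edit by hand, whereas the last two grow with the number of variables, which is presumably why generating them from the data file is the easier route.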

[23]  The DDI codebook is then simply published in the NESSTAR retrieval system with NESSTAR Publisher, and the user can browse the data using the NESSTAR Explorer either in the stand alone version or in the NESSTAR Light version which works in a simple web browser.

[24]  All this software is available to us free of charge. However, this does not mean that it cost nothing to produce: many people have put a lot of effort into creating this software and significant sums of money have been spent. If we can acquire it for nothing, this is equivalent to another method of funding.

[25]  The major difficulty that we are facing with the NESSTAR Server (and as we learned from the NESSTAR team, this situation is about to be resolved) involves the Access Control Unit (ACU), which combines a set of logical conditions in order to create user accounts with specific access levels, and certain rules to restrict access to some datasets, irrespective of the user level.
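The kind of rule described here, access levels per user combined with restrictions that apply irrespective of level, can be sketched as follows. This is a hypothetical illustration in Python and does not reflect how the NESSTAR Access Control Unit is actually implemented; the level scale, the `embargoed` flag and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    level: int  # hypothetical scale: 0 = guest, 1 = registered, 2 = staff


@dataclass(frozen=True)
class Dataset:
    name: str
    min_level: int = 0       # lowest user level allowed to access the data
    embargoed: bool = False  # if True, access is denied irrespective of level


def can_access(user, dataset):
    """Combine the two conditions: an embargo overrides everything,
    otherwise the user's level must reach the dataset's minimum level."""
    if dataset.embargoed:
        return False
    return user.level >= dataset.min_level


guest = User("guest", level=0)
staff = User("archivist", level=2)
public_survey = Dataset("public survey")
restricted_survey = Dataset("restricted survey", min_level=1)
embargoed_survey = Dataset("embargoed survey", embargoed=True)

print(can_access(guest, public_survey))      # True
print(can_access(guest, restricted_survey))  # False
print(can_access(staff, restricted_survey))  # True
print(can_access(staff, embargoed_survey))   # False
```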

  • A fast Internet connection: this is still a problem because we have not yet managed to find a cheap enough ISP which will provide a fast enough connection. From Germany, for example, one has to wait for five minutes to load my home institution’s website. In Romania it’s no problem, but it gets stuck when it is loaded from abroad. It is true that the Institute for Quality of Life website uses frames and large images which take time to load; because the Romanian Social Science Data Archive (RODA) website is so simple, we have high expectations that it’ll work much faster. The only aspect which worries me is the data upload from RODA, which is equivalent to the user’s download. I cannot say how much it will cost; it depends on the individual ISP.
  • Knowledge of how to create these things: We have the knowledge we need. All the information about DDI, NESSTAR and all the software mentioned above is on the Web. Workshops and training seminars are further useful opportunities for us to obtain more knowledge, exchange our experiences and share our software. We can now say that we have the “know-how” to create codebooks, use software and ultimately create a data archive. However, we must admit that this knowledge is still at a minimum level, and we need more training as we encounter new obstacles. Specific questions on how to link NESSTAR and DDI with the web page forms need to be answered.
  • Some level of financial support: this is a different story from one country to another. To date it appears that creating a data archive does not cost very much. In order to use the NESSTAR system, at least one powerful computer is needed, if not a dedicated server. A powerful computer costs around $1000; this is not so much money for an institution. We have received funding from the Ministry of Education and Research via the Information Society (INFOSOC) Project, Romania. The project which we submitted was successful, and funding started in February this year and will run until February next year. We also hope to obtain funding from the MOST Participation Program, and more financial resources are likely to follow once we have proved what we can do. This is very helpful to us, but we have to search for stable financing if we want this data archive to continue in the long term.
  • Well-documented and cleaned datasets: this is a problem common to all Eastern European countries. How can the data be documented when an enormous amount of work is required and too few people are available to do it? This is certainly a problem, and one of the toughest challenges we had to face. On the one hand, there were not enough people working on the data archive. On the other hand, the people who are working are also conducting other research. Documenting and cleaning datasets are time-consuming tasks and boring for the majority of people, a fact which raises the problem of motivation. The people working in this field need to be highly motivated as otherwise they will not work at all. The most complicated job now is not how to use the specialized software, but how to get people to work. Of course, one motivating factor would be money; but money is something none of us has. So the question is: how do we get work done with scarce resources?

[26]  The solution which we found is functional, but we do not claim it is universal. In our case it simply worked. In our experience, there are some human resources available which are easy to motivate with the right incentives. These are the students. But why would students be motivated to do this? In the case of our Department of Sociology of the University of Bucharest, students are eager to gain experience in using computers, databases and statistical techniques. Working for a data archive represents a valuable opportunity to acquire this experience. One other reason is that they are awarded a grade for the Practical Sociology Lab. Each student is obliged to take part in this Lab, which is also their main interest. Finally, one other method which I used to attract students is a basic statistics course on Saturdays. Many of my Saturdays have been spent teaching these statistics. Using our Faculty’s computer lab, I shared my experience in using SPSS to carry out basic statistical analysis. It was a great success, with around 25 students attending each Saturday. It is these students who clean and document the data. They get what they want, i.e. more statistical knowledge and a grade, and we get what we want, i.e. well-cleaned and documented datasets. At the time of writing of this report, the number of students wanting to attend the group is increasing.

[27]  We have now managed to finish about 30 to 35 NESSTAR compliant codebooks, and the others are in progress.

[28]  We received excellent support from the Faculty of Sociology and Social Work: using a computer lab in one of the Faculty’s buildings, we can work together with the students in preparing the datasets for archiving. With 10 networked computers at our disposal, we achieved good results in a reasonable amount of time. All the other expenses are covered by the Faculty as well.

[29]  The latest good news in our close relationship with the Faculty is that we have a new funding source for the data archive, which came about through one of the Faculty’s projects.

[30]  Our next aims are to be accepted into the Council of European Social Sciences Data Archives (CESSDA) and the International Federation of Data Organizations (IFDO), and to implement the LIMBER (Language Independent Metadata Browsing of European Resources) Thesaurus in our data archive. To date we have only worked with a self-designed thesaurus; LIMBER is the future.

26 April 2002
