"Open List": How to Collect Primary Data on Soviet Terror

Mishina Ekaterina, Lomonosov Moscow State University, OpenList.wiki, Russian Federation, zyu@inbox.ru, 2019-04-12T19:30:00Z

Short Paper. Keywords: Soviet terror; database; crowdsourcing; Open list; biography. Topics: databases & dbms; history and historiography; crowdsourcing; English; digital humanities (history, theory and methodology); public and oral history.

Political terror in the USSR remains a painful subject for Russian society. Both academic and public history widely discuss the successive waves of terror and the contested official statistics on victims. Many documents that accompanied terror operations, and even some investigative case files, remain classified in archives. Almost 30 years after the collapse of the USSR, we still know only about 20% of the names of victims of political terror, i.e. about 3 million names. These names are recorded in the various “Books of Memory”, which were created in almost every region of the former USSR and consist of short biographical cards for every known victim. All “Books of Memory” data are collected in a unified database maintained by the International Society “Memorial” (http://base.memo.ru/). It is a large SQL-based dataset of victims that is updated every few years and supports only the Russian language. We initiated the creation of another database, “Open list”, to help “Memorial” collect names and eliminate mistakes in its data.

"Open list" wiki-like open dataset (https://openlist.wiki) is created on the basis of very heterogeneous and diverse dataset by International “Memorial”. That is the only unified dataset on political terror victims in Russia from 1917 to 1991 containing more than 3.1 million records and which has four pages on national languages: Russian, Ukrainian, Belorussian and Georgian. A data card of each wiki-page is unified for any of national subsections of the project. Some historical sources were not included in “Memorial” database, names and bios from these sources are mentioned only in “Open list”. The main advantage of “Open list” is that people can add new names and correct already existing information online. “Open list” updates daily. Users can edit pages via special form with user-friendly interface with 31 fields for personal data and description of arrests, or use wiki-markup to add fields mentioned above manually and upload files on pages. Editors are to verify all crowd-sourced data manually using documents and files that people provide when making their corrections. Most useful documents are digital copies from investigative cases or rehabilitation certificates. If a user cannot provide any file, page remains unapproved with special disclaimer on it.

Crowdsourcing is a very important part of the project. Users can contribute in several ways: they can add new people to the list, parse data from biographies into the fields of the biographical form, or add templates such as "repressed relatives", which use interwiki links to connect the pages of members of one family. A special algorithm automatically identifies potential relatives by similar surnames, patronymics and rehabilitation dates (a kind of record linkage approach). We also provide tools for users to identify and merge duplicate pages. Correctly telling apart descriptions of two different arrests usually requires considerable historical knowledge, and the result of such research is almost always checked manually by editors. Only a few people work with duplicate pages, while we have about 100 thousand duplicates, so ideally we need automatic algorithms to merge them. These algorithms vary depending on the amount of data on a particular page and the type of repression mentioned. The simplest method is to compare full name, birth year and birthplace, but additional parameters are required. To find one person in two different historical sources, we add the dates of arrest and conviction. Where full names and dates of birth may contain mistakes, we rely on full coincidence of biographical data in primary sources. Historians specify these algorithms and IT specialists implement them in Python.
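The two-stage comparison described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the `Record` fields, the `normalize` rules and the "strong"/"weak" labels are assumptions made for illustration, and the sample records are invented. Pages are first grouped by full name, birth year and birthplace; within a group, a matching arrest or conviction date raises the confidence of a candidate pair.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one personal page."""
    page_id: str
    full_name: str
    birth_year: Optional[int]
    birth_place: str
    arrest_date: Optional[str] = None      # ISO date string, if known
    conviction_date: Optional[str] = None  # ISO date string, if known


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and treat 'ё' as 'е'."""
    return " ".join(text.lower().replace("ё", "е").split())


def blocking_key(rec: Record) -> Tuple[str, Optional[int], str]:
    """Simplest comparison from the text: full name, birth year, birthplace."""
    return (normalize(rec.full_name), rec.birth_year, normalize(rec.birth_place))


def candidate_duplicates(records: List[Record]) -> List[Tuple[str, str, str]]:
    """Return (page_id, page_id, level) for pages that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        if rec.birth_year is not None:  # skip pages too sparse to compare
            blocks[blocking_key(rec)].append(rec)

    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            # Additional parameters: a shared arrest or conviction date.
            same_arrest = a.arrest_date is not None and a.arrest_date == b.arrest_date
            same_conviction = (a.conviction_date is not None
                               and a.conviction_date == b.conviction_date)
            level = "strong" if same_arrest or same_conviction else "weak"
            pairs.append((a.page_id, b.page_id, level))
    return pairs


# Invented sample data, not taken from the database.
records = [
    Record("p1", "Иванов Иван Иванович", 1900, "Москва", "1937-08-01"),
    Record("p2", "иванов  иван иванович", 1900, "москва", "1937-08-01"),
    Record("p3", "Иванов Иван Иванович", 1900, "Москва", "1949-02-10"),
    Record("p4", "Петров Пётр Петрович", 1901, "Тула"),
]
pairs = candidate_duplicates(records)
```

In this sketch a "weak" pair may be one person arrested twice or two different people, which is exactly the case the text leaves to an editor; the code only proposes candidates and never merges pages.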

The advanced search in “Open list” contains 21 search fields; all text fields accept logical operators. A query can return a long list of personal pages. The search is not strict and sometimes returns more pages than requested. Data visualization is currently possible only for certain types of information, such as dates of birth, arrest or conviction, so text fields should be normalized in the near future. Making analysis of these data easier and more consistent is also one of the project goals. "Occupation" is one of the most difficult fields to normalize. We use a classification that historians developed from the materials of the 1937 and 1939 All-Union censuses and from internal NKVD instructions on classifying occupations. HISCO is not used at this stage because it does not seem suitable for linking occupations to Soviet social stratification of the 1930s, and this linking is a necessary step in analyzing the social portrait of terror victims. Moreover, such work with HISCO has never been done on materials from the early Soviet period. At present we are only preparing to normalize the data and plan to do most of the work automatically.
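To make the idea of automatic normalization concrete, the following Python sketch maps free-text occupation values to categories with a keyword table. The category labels and keyword stems are hypothetical placeholders; they are not the historians' classification based on the 1937 and 1939 censuses and NKVD instructions, which the text does not spell out.

```python
import re
from typing import Optional

# Hypothetical categories and Russian keyword stems, for illustration only.
OCCUPATION_RULES = [
    ("peasant", ("крестьян", "колхозни", "единоличн")),
    ("worker", ("рабоч", "слесар", "токар")),
    ("clergy", ("священник", "дьякон", "монах")),
    ("employee", ("бухгалтер", "счетовод", "служащ")),
]


def normalize_occupation(raw: str) -> Optional[str]:
    """Return the first category whose stem occurs in the text, else None."""
    text = re.sub(r"\s+", " ", raw.lower().replace("ё", "е")).strip()
    for category, stems in OCCUPATION_RULES:
        if any(stem in text for stem in stems):
            return category
    return None  # left for manual review


examples = ["Слесарь  завода", "колхозница", "без определенных занятий"]
normalized = [normalize_occupation(value) for value in examples]
```

Values that match no rule stay `None` rather than being forced into a category, so they can be routed to manual review. The first matching rule wins, so the order of the table matters in this sketch.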

As the database grows, it can serve academic needs and deepen knowledge of political repression in the USSR in several ways. “Open list” makes it possible to draw samples of data and construct a social portrait of terror victims. Pages with templates, as well as the investigative case files published on some pages, could be objects of network analysis. The body of biographies could be a source for studying family history. We can also analyze the geography of terror using the “place of living” field, which in some sources contains exact addresses. “Open list” itself carries out archival work, in cooperation with a Russian state archive, on compiling a unified electronic “Book of Memory” of Moscow and the Moscow region. This work consists of two parts. First, the materials of archival investigative case files are digitized. Second, crowdsourcing comes in: our volunteers transcribe information from the archival documents and create new pages in the list or contribute to existing ones. They follow instructions provided by professional historians and based on the principles of source criticism of investigative cases. Editors verify all new records in the electronic books. Thus a new historical source, with names that had never been mentioned before, is emerging online on the “Open list” website. This project helps not only to supplement the list of names but also to correct mistakes in the “Books of Memory”.