The present version of the Russian Newspaper Corpus includes 29506 samples from 9 Russian newspapers; the total number of text wordforms is 7 452 100. The breakdown is as follows:
code | newspaper | samples | wordforms |
iz | Izvestiya | 4240 | 1 144 100 |
lg | Literaturnaya Gazeta | 1035 | 483 200 |
mk | Moskovskiy Komsomolets | 5407 | 1 457 700 |
ne | Nezavisimaya Gazeta | 3621 | 1 161 800 |
no | Novaya gazeta | 659 | 390 100 |
pr | Pravda 5 | 1551 | 370 300 |
rv | Rossiyskie Vesti | 1394 | 352 700 |
sg | Segodnya | 6096 | 1 116 600 |
sp | Sanktpeterburgskiye Vedomosti | 5503 | 975 600 |
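The per-newspaper figures can be checked against the stated totals. The following is a minimal sketch that simply copies the table above into a Python list; the names `BREAKDOWN`, `total_samples` and `total_wordforms` are illustrative and not part of the corpus distribution.

```python
# (code, newspaper, samples, wordforms), copied from the breakdown table.
BREAKDOWN = [
    ("iz", "Izvestiya", 4240, 1_144_100),
    ("lg", "Literaturnaya Gazeta", 1035, 483_200),
    ("mk", "Moskovskiy Komsomolets", 5407, 1_457_700),
    ("ne", "Nezavisimaya Gazeta", 3621, 1_161_800),
    ("no", "Novaya gazeta", 659, 390_100),
    ("pr", "Pravda 5", 1551, 370_300),
    ("rv", "Rossiyskie Vesti", 1394, 352_700),
    ("sg", "Segodnya", 6096, 1_116_600),
    ("sp", "Sanktpeterburgskiye Vedomosti", 5503, 975_600),
]

# Sum the samples column and the wordforms column.
total_samples = sum(row[2] for row in BREAKDOWN)
total_wordforms = sum(row[3] for row in BREAKDOWN)

print(total_samples, total_wordforms)  # 29506 7452100
```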
Each sample is preceded by an index. The first two symbols identify the newspaper, and the next symbol identifies the month (a = January, b = February, etc.).
Distribution of text wordforms over months (in thousands):
January | 42 | August | 867 |
April | 43 | September | 1093 |
May | 25 | October | 1543 |
June | 345 | November | 1320 |
July | 1116 | December | 1058 |
The two digits after the month symbol give the day of publication.
The day is followed by a three-symbol topic code:
acc | accident | fem | feminism | occ | occult knowledge |
adv | adventure | fin | finance | ped | pedagogy |
agr | agriculture | hap | happening | poe | poetry |
ani | animals | hea | health | pol | politics |
ant | anthropology | his | history | pre | press |
arc | architecture | hum | humanities | pro | prose |
arm | army | jur | journalism (incl. polemics) | psy | psychology |
art | visual arts | lab | labour | rel | religion |
bib | bibliography | law | law | sca | scandal |
che | Chechnya | lei | leisure | sci | science |
cin | cinema | lif | life story | sem | semiotics |
com | computers | lit | literature | soc | society |
con | consumerism | lng | language | spa | space |
cor | corruption | mas | mass media | spo | sport |
cri | crime | max | maxims | spy | spying |
cul | culture | med | medicine | sta | statistics |
cur | curiosity | mem | memoir | tel | television |
doc | document | mil | military complex | the | theatre |
ecn | economics | min | minorities | tow | town |
eco | ecology | mor | moral | tra | tradition |
edu | education | mus | music | tur | tourism |
eng | engineering | nat | nature | uni | universal |
ess | essay | nec | obituary | war | war |
fas | fashion | new | news |
This eight-symbol index may be followed by optional symbols that give further information:
-a | announcement | -i | interview | -v | home/NIS |
-b | book review | -l | letter | -w | NIS/foreign |
-d | dispute | -m | memoir | -x | advertisement |
-f | foreign | -p | person | -y | history |
-g | home/foreign | -r | region |
-h | humour | -u | NIS |
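The index layout described above can be decoded mechanically. The sketch below is only an illustration under several assumptions that the description does not confirm: that the eight symbols are written without separators, that all letters are lowercase, that month letters continue alphabetically through l = December, and that zero or more optional symbols may follow, each with its hyphen. The names `INDEX_RE`, `SampleIndex` and `parse_index` are hypothetical.

```python
import re
from typing import NamedTuple, Tuple

# Newspaper codes from the breakdown table.
NEWSPAPERS = {
    "iz": "Izvestiya", "lg": "Literaturnaya Gazeta",
    "mk": "Moskovskiy Komsomolets", "ne": "Nezavisimaya Gazeta",
    "no": "Novaya gazeta", "pr": "Pravda 5", "rv": "Rossiyskie Vesti",
    "sg": "Segodnya", "sp": "Sanktpeterburgskiye Vedomosti",
}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Assumed layout: 2 letters (newspaper) + 1 letter (month) + 2 digits (day)
# + 3 letters (topic) + optional "-x" symbols, with no separators.
INDEX_RE = re.compile(
    r"^(?P<paper>[a-z]{2})(?P<month>[a-l])(?P<day>\d{2})"
    r"(?P<topic>[a-z]{3})(?P<flags>(?:-[a-z])*)$"
)


class SampleIndex(NamedTuple):
    newspaper: str
    month: str
    day: int
    topic: str             # three-symbol topic code, e.g. "pol"
    flags: Tuple[str, ...]  # optional symbols without the hyphen


def parse_index(index: str) -> SampleIndex:
    """Split a sample index into its parts (assumed layout, see above)."""
    m = INDEX_RE.match(index)
    if m is None or m["paper"] not in NEWSPAPERS:
        raise ValueError(f"unrecognised index: {index!r}")
    month = MONTHS[ord(m["month"]) - ord("a")]  # a -> January, b -> February
    flags = tuple(f for f in m["flags"].split("-") if f)
    return SampleIndex(NEWSPAPERS[m["paper"]], month, int(m["day"]),
                       m["topic"], flags)
```

Under these assumptions, a made-up index such as `izj15pol-i` would decode as an Izvestiya interview on politics published on 15 October. The topic code is returned as is; mapping it to its description would need the topic table as a further dictionary.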