GICR | General Internet-Corpus of Russian

The General Internet-Corpus of Russian (GICR) is a megacorpus (more than 15 GT) built with fully automated technology for collecting and tagging texts from the Russian Internet, drawing on recent advances in computational linguistics.

As of autumn 2021, there are two versions of the corpus: a working version 1.0, which contains social media material (the VKontakte network and LiveJournal blogs) and the texts of Журнальный Зал (https://magazines.gorky.media/), and version 2.0, which is under development.

To request access to version 1.0, apply by e-mail: geekrya@gmail.com.

The project has educational and research status. Students of the Department of Computational Linguistics at RSUH and of MIPT take part in its development, as do MSU and the University of Leeds (UK).

The project is open to external researchers (currently with some limitations, since it is under active development and testing).

The project is accompanied by scientific seminars, which are open to anyone interested in contributing to the creation of GICR or in conducting linguistic experiments on it.