Tgcontest

Telegram Data Clustering contest solution by Mindful Squirrel

TGNews

Demo

Russian: https://ilyagusev.github.io/tgcontest/ru/main.html
English: https://ilyagusev.github.io/tgcontest/en/main.html

Install

Prerequisites: CMake, Boost, JsonCpp, a UUID library, Protobuf

$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev

For macOS:

$ brew install boost jsoncpp ossp-uuid protobuf

If you got the code as a zip archive, skip to building the binary.

To download code and models:

$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule update --init --recursive
$ bash download_models.sh
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip

For macOS, use https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip instead.

To build the binary (in the "tgcontest" dir):

$ mkdir build && cd build && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. && make -j4

To download datasets:

$ bash download_data.sh

Run on sample (from the "tgcontest" dir):

$ ./build/tgnews top data --ndocs 10000

Training

Russian FastText vectors training: VectorsRu.ipynb
Russian FastText category classifier training: CatTrainRu.ipynb
Russian text embedder with triplet loss training (v3):
English FastText vectors training: VectorsEn.ipynb
English FastText category classifier training: CatTrainEn.ipynb
English text embedder with triplet loss training (v3):
PageRank rating calculation: PageRankRating.ipynb
Russian ELMo-based sentence embedder training (not used):
XLM-RoBERTa pseudo-labeling for categorization:
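
For readers unfamiliar with triplet loss, the idea behind the text embedder training can be sketched as follows. This is a minimal illustration, not the training code from this repository: the Euclidean distance, the margin value and the toy 2-d vectors are all assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (hypothetical): the positive is at distance 1, the negative at distance 5.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negative = [3.0, 4.0]

loss_easy = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, clipped to 0
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: 5 - 1 + 1 = 5
```

In practice the anchor and positive would be embeddings of texts about the same story and the negative an embedding of an unrelated text, so that minimizing the loss pulls related texts together.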

Models

Language detection model (round 2): lang_detect_v10.ftz
Russian FastText vectors (round 2): ru_vectors_v3.bin
Russian categories detection model (round 2): ru_cat_v5.ftz
English FastText vectors (round 2): en_vectors_v3.bin
English categories detection model (round 2): en_cat_v5.ftz
PageRank-based agency rating: pagerank_rating.txt
Alexa agency rating: alexa_rating_4_fixed.txt
XLM-RoBERTa for categorization (pytorch-lightning checkpoint): xlmr_en_ru_cat_v1.tar.gz
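
As a rough illustration of what a PageRank-based rating involves, here is a minimal power-iteration sketch. It is not the code from PageRankRating.ipynb: the link graph, the host names, the damping factor and the iteration count are all assumptions made for the example, and the format of pagerank_rating.txt is not described here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    `links` maps a node to the list of nodes it links to. Nodes without
    outgoing links spread their rank evenly over all nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: an edge means "this agency's articles link to that agency".
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(links)
top_agency = max(ranks, key=ranks.get)
```

The ranks sum to one, so they can be compared directly across agencies; here `c.example` comes out on top because three of the four hosts link to it.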

Data

Russian news from 01.11.2019 to 10.05.2020 with gaps: ru_tg_1101_0510.jsonl.tar.gz
Russian news from 11.05.2020 to 17.05.2020: ru_tg_0511_0517.jsonl.tar.gz
English news from 01.11.2019 to 10.05.2020 with gaps: en_tg_1101_0510.jsonl.tar.gz
English news from 11.05.2020 to 17.05.2020: en_tg_0511_0517.jsonl.tar.gz
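
The file names suggest gzipped tar archives containing JSON Lines files, which can be read without unpacking them to disk. The sketch below assumes exactly that layout (one JSON object per line) and makes no assumption about field names; the keys in the sample lines are placeholders.

```python
import json
import tarfile


def parse_jsonl_lines(lines):
    """Yield one parsed JSON object per non-empty line (str or bytes)."""
    for raw_line in lines:
        if isinstance(raw_line, bytes):
            raw_line = raw_line.decode("utf-8")
        line = raw_line.strip()
        if line:
            yield json.loads(line)


def iter_archive_records(archive_path):
    """Yield records from every regular file inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                yield from parse_jsonl_lines(archive.extractfile(member))


# Placeholder lines standing in for real documents.
sample_lines = ['{"a": 1}\n', "\n", b'{"b": 2}\n']
sample_records = list(parse_jsonl_lines(sample_lines))
```

With a downloaded archive, `next(iter_archive_records("ru_tg_0511_0517.jsonl.tar.gz")).keys()` would be one way to inspect which fields the documents actually carry.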

Markup

Russian categories raw train markup: ru_cat_v4_train_raw_markup.tsv
Russian categories aggregated train markup: ru_cat_v4_train_annot.json
Russian categories aggregated train markup in fastText format: ft_ru_cat_v4_train.txt
Russian categories manual train markup: ru_cat_v4_train_manual_annot.json
Russian categories manual train markup in fastText format: ft_ru_cat_v4_train_manual.txt
Russian categories raw test markup: ru_cat_v4_test_raw_markup.tsv
Russian categories aggregated test markup: ru_cat_v4_test_annot.json
Russian categories aggregated test markup in fastText format: ft_ru_cat_v4_test.txt
English categories aggregated train markup: en_cat_v4_train_annot.json
English categories aggregated train markup in fastText format: ft_en_cat_v4_train.txt
English categories aggregated test markup: en_cat_v4_test_annot.json
English categories aggregated test markup in fastText format: ft_en_cat_v4_test.txt
Russian clustering pairs: ru_pairs_raw_markup.tsv
English clustering pairs: en_pairs_raw_markup.tsv
Russian clustering pairs for one day (0517): ru_clustering_0517.tsv
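
The files in fastText format follow the convention of fastText supervised training data: each line holds one or more labels with the `__label__` prefix plus the text. A small sketch for splitting such lines and counting labels is shown below; the category names and texts are hypothetical, and the files themselves may apply tokenization or preprocessing that is not shown here.

```python
from collections import Counter

LABEL_PREFIX = "__label__"  # default label prefix used by fastText


def parse_fasttext_line(line):
    """Split one line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


def count_labels(lines):
    """Count how often each label occurs across the given lines."""
    counts = Counter()
    for line in lines:
        labels, _ = parse_fasttext_line(line)
        counts.update(labels)
    return counts


# Hypothetical examples; real category names may differ.
sample = [
    "__label__sports team wins the cup",
    "__label__economy markets fall sharply",
    "__label__sports coach resigns",
]
parsed_first = parse_fasttext_line(sample[0])
label_counts = count_labels(sample)
```

Counting labels this way is a quick check of class balance before training a classifier on one of the train files.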

Misc

Flamegraph: https://ilyagusev.github.io/tgcontest/flamegraph.svg

Other contestants

Round 2
- II place
  - Daring Frog: https://github.com/a-l-e-x-k/data_clustering_contest, article: https://medium.com/@alexkuznetsov/2nd-place-solution-for-telegram-data-clustering-contest-f28d55b98d30
  - Swift Skunk: https://github.com/sorrge/tg_news_cluster
- III place
  - Mindful Kitten: https://danlark.org/2020/07/31/news-aggregator-from-scratch-in-2-weeks/
- IV place
  - Bossy Gnu: https://github.com/maxoodf/tgnews
- Other
  - Large Crab: https://github.com/ilya-ustinov/tgcontest

Round 1
- III place
  - Kooky Dragon: https://github.com/nick-baliesnyi/tgnews
- IV place
  - Sharp Sloth: https://github.com/thehemen/telegram-data-clustering
- Other
  - Desert Python: https://github.com/crazyleg/telegram_data_clustering_2019
  - Funky Peacock: https://github.com/Stepka/telegram_clustering_contest
  - Unknown animal: https://github.com/roman-rybalko/telegram-data-clustering-contest
  - Unknown animal: https://github.com/MarcoBuster/data-clustering-contest
  - Unknown animal: https://github.com/sudevschiz/tgnews
  - Unknown animal: https://github.com/77ph/tgnews
  - Unknown animal: https://github.com/akash-joshi/telegram-cluster
  - Unknown animal: https://github.com/dremovd/telegram-clustering

Contacts

Telegram: @YallenGusev

Repository

IlyaGusev/tgcontest

License

Apache-2.0

Homepage

https://contest.com/docs/data_clustering2