SemDaX - POS-tagged (adjectives, nouns and verbs only), supersense-tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
Danish Propbank - Commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in Tekststrukturering på italiensk og dansk.
Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
Common Voice - Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at https://commonvoice.mozilla.org/sentence-collector. Common Voice is described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia).
FT Speech - Described in FT SPEECH: Danish Parliament Speech Corpus (Scholia).
NST
NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
Stavekontrolden - Word list with 160,132 Danish words. Used, e.g., for spelling suggestions in LibreOffice. Licensed under GPL, LGPL, and MPL.
The Concise Danish Dictionary/The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO) - Word list created by Skåne Sjælland Linux User Group and distributed under a GPL license.
In Debian-based distributions the word list may be installed with sudo aptitude install aspell-da and extracted with aspell -d da dump master.
Interactive Terminology for Europe (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
Sketch Engine - Cloud service with word lists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union; paid service for commercial use.
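A word list dumped with the aspell command mentioned above can be loaded into Python for simple lookups. The following is a minimal sketch, not part of any listed package: the function names `parse_wordlist` and `is_known` are hypothetical, and it assumes the dump has one entry per line, optionally followed by affix flags after a slash (which the sketch simply drops rather than expands).

```python
def parse_wordlist(lines):
    """Collect words from an iterable of text lines into a set.

    Assumption: one entry per line, optionally written as "word/FLAGS".
    Affix flags are discarded here, not expanded into inflected forms.
    """
    words = set()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        word = entry.split("/", 1)[0]  # keep only the part before any flags
        words.add(word)
    return words


def is_known(word, words):
    """Return True if the word, or its lowercase form, is in the word set."""
    return word in words or word.lower() in words
```

With the dump saved to a file (the file name `da.txt` and the UTF-8 encoding are assumptions), this could be used as `words = parse_wordlist(open("da.txt", encoding="utf-8"))` followed by `is_known("Hus", words)`.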
Word sets
Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in danlp.
Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in danlp.
Four words - 100 odd-one-out sets of 4 words or phrases.
Byte-Pair Encoding embedding - Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300).
NLPL word embeddings repository - NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
punctfix - "Adds punctuation and capitalization for a given text."
Named entity recognition
ScandiNER - Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese.
Babelfy - Web app and service for linking words and entities.
DBpedia Spotlight - DBpedia-based entity linker. Described in Improving Efficiency and Accuracy in Multilingual Entity Extraction (Scholia)
Sentiment analysis
afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with pip install afinn.
Hisia - Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel.
senda - Python package with transformer-based sentiment analysis from Ekstra Bladet Analyse. As of 2021, it reported state-of-the-art performance on one dataset.
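Lexicon-based tools such as afinn score a text from per-word sentiment values. The sketch below illustrates that general idea only: the tiny lexicon and its scores are made up for this example and are not taken from AFINN, the function name `score` is hypothetical, and the real package is more elaborate than a plain sum over tokens.

```python
import re

# Toy valence lexicon for illustration only. These entries and scores are
# invented for this sketch and are NOT the AFINN values.
TOY_LEXICON = {"god": 3, "elsker": 3, "dårlig": -3, "hader": -3}

# \w matches Unicode letters in Python 3, so æ, ø and å are kept in tokens.
TOKEN_RE = re.compile(r"\w+")


def score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lowercased token found in the lexicon."""
    tokens = TOKEN_RE.findall(text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)
```

A positive total suggests positive sentiment and a negative total the opposite; words missing from the lexicon contribute zero. Negation and multiword expressions are not handled in this sketch.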
danspeech - DeepSpeech2-based Danish speech recognition in Python
kaldi-sprakbanken - A recipe for training a speech recogniser for Danish (state-of-the-art as of 2017) based on the 16kHz NST database.
Speech Synthesis (text-to-speech)
espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The service is currently free for limited, non-commercial use.
Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.
Fundamental processing
DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
StanfordNLP - Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
spaCy - Python-based natural language processing package