The data are provided free of charge for online use and download. All data are available as plain text files and can be imported into a MySQL database using the provided import script. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. Dirk Goldhahn, Thomas Eckart, Uwe Quasthoff, Natural Language Processing Group, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012. Corpus and language statistics: the Leipzig Corpora Collection provides corpora in different languages using the same format and comparable sources. Preprocessing relied mainly on language-independent methods, which were also used for corpora in other languages. For a more detailed view or description of the data, this page contains a variety of statistics pages. Although the Wortschatz Leipzig team provides a WSDL file for its web service, that file alone is not enough to use the service. Wikipedia's content can be downloaded safely as a whole in at least two forms.
Louw, was digitized and enhanced by and under the supervision of Prof. The term Wortschatz translates as "treasure of words", and because words are in fact precious, I make a point of handling them with respect and according to their nature. This corpus file originally contains 300,000 sentences from Indonesian online newspapers. How to cite the Leipzig Corpora Collection: for the whole collection, please cite the following general paper: Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff (2012).
Deutscher Wortschatz is a German database of text corpora that can be used to analyze and contextualize words in the thesaurus. GermaNet is a semantically oriented dictionary of German, similar to WordNet. The ASV Toolbox is a modular collection of tools for the exploration of written language data. Sonja Bosch (University of South Africa), and converted from CSV files to this RDF dataset by Thomas Eckart and Bettina Klimek (Leipzig University, Germany). Processing Extensive Text Data at the Leipzig Corpora Collection. Dirk Goldhahn, MLODE 2014, Leipzig, Natural Language Processing Group, Institute of Computer Science. We describe the Leipzig Corpora Collection (LCC), a freely available resource for corpora and corpus statistics that currently covers more than 20 languages. In this paper, the Leipzig Corpora Collection is introduced as a contribution to the idea that multilingual language resources need standardization. The corpus is a random subset of 25,000 sentences from one of the Indonesian Leipzig corpora files.
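Drawing a random subset of sentences from a larger corpus file, as mentioned above for the 25,000-sentence Indonesian corpus, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used to build that corpus: it assumes a plain text file with one "id<TAB>sentence" pair per line, and the function name sample_sentences and the file name in the usage comment are hypothetical.

```python
import random
from typing import Iterable, List


def sample_sentences(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick k sentences uniformly at random using reservoir sampling.

    Assumption: each line has the form id<TAB>sentence. A line without
    a tab is treated as a bare sentence.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    reservoir: List[str] = []
    seen = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        sentence = line.split("\t", 1)[-1]  # drop the leading id column
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k sentences.
            reservoir.append(sentence)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = sentence
    return reservoir


# Usage with a hypothetical file name:
# with open("sentences.txt", encoding="utf-8") as handle:
#     subset = sample_sentences(handle, 25_000, seed=42)
```

Reservoir sampling reads the file once and never holds more than k sentences in memory, which helps when the source file has hundreds of thousands of lines.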
Corpus Portal for Search in Monolingual Corpora. Uwe Quasthoff. The paper describes the production process for three dictionaries for which these corpus data were used. We explain the steps of building, processing and presenting corpora of comparable sizes and in a uniform format. The list is compiled from a variety of published sources and would probably be somewhat different from a list of the most common 10,000. English and German each have their own flow, and time and again I find it fascinating to carry the true meaning of a piece over into the other language.
It was created at the NLP group, Leipzig University, and is no longer actively developed. Leipzig Corpora Collection: 271 corpus-based monolingual dictionaries for 236 languages. Corpus-based monolingual dictionary of the German language, with 26,142,898 sentences. Download at the Language Technology Group, Universität Hamburg.