Cross-Linguistic Data Formats

CLDF logo

Why?

Standardized data formats are necessary to allow exchange of cross-linguistic data and to decouple the development of tools and methods from that of databases.

Once established, these data formats could become a foundation not only for tools but also for instruction material in the spirit of Data Carpentry for historical linguistics and linguistic typology.

What?

The main types of cross-linguistic data we are concerned with here are tabular data which are typically analysed using quantitative (automated) methods or made accessible using software tools like the `clld` framework, such as

  • wordlists (or more complex lexical data including e.g. cognate judgements),
  • structure datasets (e.g. WALS features),
  • simple dictionaries.
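
To make the wordlist case concrete, here is a minimal sketch in Python using only the standard library. The column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`) and the language identifiers are illustrative assumptions, not prescribed by this document; in practice a `Language_ID` would reference something like a Glottocode.

```python
import csv
import io

# A minimal wordlist, sketched as CSV. Column names and the
# language identifiers (abcd1234, efgh5678) are placeholders.
WORDLIST_CSV = """\
ID,Language_ID,Parameter_ID,Form
1,abcd1234,hand,Hand
2,efgh5678,hand,main
3,abcd1234,foot,Fuss
"""

def read_rows(text):
    """Parse a CSV wordlist into a list of dicts, one per form."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(WORDLIST_CSV)

# Group forms by the language they belong to.
by_language = {}
for row in rows:
    by_language.setdefault(row["Language_ID"], []).append(row["Form"])

print(by_language["abcd1234"])  # → ['Hand', 'Fuss']
```

More complex lexical data, e.g. cognate judgements, could be expressed as an additional table referencing the `ID` column of such a form table.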

Design principles

  • Data should be both editable "by hand" and amenable to reading and writing by software (preferably software the typical linguist can be expected to use correctly).
  • Data should be encoded as UTF-8 text files.
  • If entities can be referenced, e.g. languages through their Glottocode, this should be done rather than duplicating information like language names.
  • Identifiers should be resolvable HTTP URLs where possible. If not, they should be documented in the metadata.
  • Compatibility with existing tools, standards and practice should always be kept in mind.

Since we are concerned with tabular data here, CLDF is built on W3C's Model for Tabular Data and Metadata on the Web and Metadata Vocabulary for Tabular Data.
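
As a sketch of what such metadata could look like, here is a minimal table description following the W3C Metadata Vocabulary for Tabular Data (CSVW). The file name and column names are illustrative assumptions, not part of any specification discussed here:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "forms.csv",
  "tableSchema": {
    "columns": [
      {"name": "ID", "datatype": "string"},
      {
        "name": "Language_ID",
        "datatype": "string",
        "valueUrl": "http://glottolog.org/resource/languoid/id/{Language_ID}"
      },
      {"name": "Form", "datatype": "string"}
    ],
    "primaryKey": "ID"
  }
}
```

Note how `valueUrl` turns the `Language_ID` column into resolvable identifiers, in line with the design principles above.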

One of the main goals of the CLDF specification is a useful delineation of data and tools. A CSV-based format makes it easy to use the data in a UNIX-style pipeline of data transformation commands. This pipeline style of data transformation and analysis seems to be at the core of typical workflows in historical linguistics, e.g. with LingPy or QLC.
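
A single stage of such a pipeline can be sketched in a few lines of Python, using only the standard library. The `Language_ID` column and its values are illustrative assumptions; in a real pipeline the stage would read `sys.stdin` and write `sys.stdout` so it can be chained with other commands.

```python
import csv
import io

def filter_rows(infile, outfile, column, value):
    """One pipeline stage: copy only the CSV rows whose `column` equals `value`."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row[column] == value:
            writer.writerow(row)

# Demonstration on an in-memory table; hook the function up to
# sys.stdin/sys.stdout to use it as a stage in a shell pipeline.
source = io.StringIO("ID,Language_ID,Form\n1,abcd1234,Hand\n2,efgh5678,main\n")
sink = io.StringIO()
filter_rows(source, sink, "Language_ID", "abcd1234")
print(sink.getvalue())
```

Because each stage reads and writes plain CSV, stages can be recombined freely, which is exactly the property the pipeline style relies on.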

If suitable text- and line-based formats are available, this pipeline style also allows for easy extensibility: e.g. a workflow for automatic cognate judgements based on LingPy functionality could be extended with phylogenetic analysis and post-processing via phyltr, which processes sets of phylogenetic trees represented in the Newick format, or ete.

If cross-linguistic comparisons proceed in the footsteps of bioinformatics, workflows based on UNIX pipelines may at some point be formalized using a common workflow language.

History

While data formats to exchange linguistic data have been around for some time, e.g. the SFM or Standard Format used by Toolbox, new developments in the area of language diversity research have motivated this push for a new set of formats:

  • A new interest in standardizing tabular data on the web, with a particular focus on CSV.
  • A trend towards using computational methods to analyse large scale cross-linguistic data.
  • The clld framework, developed within the CLLD project, has shown that many different cross-linguistic databases can be built on top of the same core data model. CLDF is an attempt to externalise this data model.

Thus, following up on discussions from the first workshop on Language Comparison with Linguistic Databases, a second workshop in Leipzig focused on the idea of a very simple CSV-based format to exchange very simple cross-linguistic data.

Simplicity was the main design goal from the start, so the formats under consideration will evolve, starting out as simple as possible. Thus, the first milestone to be reached is a proof of usefulness: with the first release of the CLDF spec we hope to provide a baseline for experiments, and possibly even this proof.