Table of Contents

  1. Language Word List Database
  2. Query Word Lists
  3. Linking to ComparaLex
  4. Submitting Word Lists
  5. Editing Word Lists
  6. Regular Expressions
  7. Bug/Feature Tracker

Didn't find the answer?

Visit our FAQ page to see answers to common questions. Or contact us with your question.

ComparaLex Help

Language Word List Database

Metadata - Data about data

The website database stores specific information about each language word list. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:

  1. Word list name (Dialect/Village): A unique name for this word list. Usually the dialect name or the village/region where the data was collected. You can have more than one word list per language, but the name field for each must be unique.
  2. ISO 639-3 code for language: The three-letter ISO 639-3 code for the language (see www.ethnologue.org).
  3. Standard list used for elicitation: The standard word list used for elicitation (e.g., Swadesh, Comparative African Word List, etc.).
  4. Location of idiolect: The village/city/region, country, etc. where the speaker spent their linguistically formative years.
  5. Country: The country where the language is spoken.
  6. Date elicited: Approximate date the word list was elicited.
  7. Acknowledgments: Organizations and/or individuals who should be acknowledged regarding this word list.
  8. Elsewhere published: Any other places where this word list is published.
  9. Additional Comments: Any other information that may be helpful to understand the data. For example:
    • Is this a tonal language?
    • Representation markup used for phonemic tone
    • Quality of the audio recordings
  10. Contributing researcher(s): The name of the researcher who prepared this data.
  11. Copyright holder: Individual/organization that owns the copyright for the word list data and media files.
  12. May we display the researcher's name?: If 'Yes', the name of the researcher will appear with the language word list information. Otherwise it will be hidden.
  13. May we display the audio recordings (if any)?: If 'Yes', the audio recordings will be displayed for the public. Otherwise they will be hidden.
  14. Do you have permission of the copyright holder to submit this word list?: If 'No', explain in #16 why this data should be considered for publishing.
  15. Do you have written permission from your data sources to distribute transcriptions and recordings of their utterances?: Answer 'Yes' or 'No'. If 'No', explain in #16 why this data should be considered for publishing.
  16. If 'No' to #14 or #15, please provide an explanation why this data should be considered for publishing: Here you provide an explanation for #14 and/or #15.
  17. Are you finished editing the word list?: Answer 'Yes' or 'No'. If 'No', you cannot answer 'Yes' to #18.
  18. Do you agree to the terms of service and grant permission to CanIL to publish your word list and associated sound files?: Answer 'Yes' or 'No'. Your word list cannot be approved for publishing until you answer 'Yes'. You can review the terms on the ComparaLex Terms of Service page.
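
The dependencies between the permission and consent fields can be checked before you fill in the form. The following is only a sketch: the dictionary keys (iso_code, has_copyright_permission, and so on) are hypothetical names chosen for illustration, not the names ComparaLex uses internally.

```python
import re

def validate_metadata(form):
    """Return a list of problems found in a metadata form held as a plain dict.

    All dictionary keys are hypothetical; they only mirror the numbered fields.
    """
    problems = []

    # Field 2: an ISO 639-3 code consists of exactly three letters.
    if not re.fullmatch(r"[A-Za-z]{3}", form.get("iso_code", "")):
        problems.append("ISO 639-3 code must be exactly three letters")

    # Fields 14-16: a 'No' to either permission question needs an explanation.
    has_all_permissions = (form.get("has_copyright_permission", False)
                           and form.get("has_source_permission", False))
    if not has_all_permissions and not form.get("explanation", "").strip():
        problems.append("Explain in #16 why the data should be considered")

    # Fields 17-18: consent to publish requires that editing is finished.
    if form.get("agrees_to_terms", False) and not form.get("finished_editing", False):
        problems.append("Cannot answer 'Yes' to #18 while #17 is 'No'")

    return problems
```

An empty list means the form passes these three checks; it does not mean the word list will be approved.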

Note: ComparaLex reserves the right to NOT publish submitted data on the ComparaLex website for any reason. Refer to the ComparaLex Terms of Service for more information.

Status

Before your word list can be published, it needs to go through the approval process. You can view the status by clicking the "Edit My Account" button and viewing the status column of your word list. There are three stages:

  • Not finished: Indicates that you have not completed your data submission/edits and have not given us permission to publish the data on ComparaLex.
  • Finished but not approved: Indicates that you have finished editing and given consent to publish your data but it has not yet been approved by the ComparaLex staff.
  • Finished and approved: Indicates that a ComparaLex reviewer has approved the project. The project is now available for public use in ComparaLex.

Note: If for some reason you want to remove your project from public access, you can change the status from green to red after logging in.

Data Fields

The website database stores specific information about each word. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:
  1. Standard Word List ID: This is the reference code of the standard word list that was used for elicitation.
  2. Gloss: The definition (in English) used for collection.
  3. Audio: An audio file (.WAV and/or .MP3) of someone pronouncing the word. For more information, see the section on audio formats.
    • If an MP3 is available you can listen to it by clicking the icon.
    • If a WAV is available you can download it by clicking the icon.
    • If the field shows a question mark, an audio file has been specified but cannot be found on the server.
  4. Phonetic: Phonetic representation of the word (segmental form only).
  5. Phonemic: Phonemic representation of the word (segmental form only).
  6. Phonetic Pitch: Surface pitch transcribed as numbers 1-7 (1=lowest, 7=highest). The numbers will be displayed graphically as lines in between square brackets. For example:
    • An utterance with two syllables, the first of which is pronounced at level 3 pitch and the second of which is pronounced at level 5, would be entered as [3 5] and would appear as the corresponding pitch lines between square brackets. You don’t need to enter the square brackets, as this will be done automatically when you submit your data.
    • For each contour pitch, type in two numerals without an intervening space. For example, a single-syllable utterance with a falling contour pitch that begins at level 6 and falls to level 4 would be entered as [64] and would appear as a single falling line between square brackets.
    • An utterance with three syllables, the first of which is a contour pitch rising from level 2 to level 4, the second of which is a level 4 pitch, and the third of which is a level 6 pitch, would be entered as [24 4 6] and would appear as three pitch lines between square brackets.
  7. Phonemic Tone: Analyzed surface tone (representation is flexible but please explain).
  8. Word Category: Noun, verb, etc.
  9. Noun/Verb Class: Identify the noun or verb class of the word if the language has a noun or verb class system.
  10. Phonetic Plural: Plural segmental form only.
  11. Phonemic Plural: Plural segmental form only.
  12. Plural Phonetic Pitch: Surface pitch. For more information, see the section on phonetic pitch.
  13. Plural Phonemic Tone: Analyzed surface tone.
  14. Noun/Verb Class Plural
  15. Orthographic: Orthographic representation.
  16. Comments
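
To make the phonetic pitch notation concrete, here is a minimal sketch of how an entry could be read into per-syllable start and end levels. The function name parse_pitch is hypothetical, and the sketch assumes each syllable is written with one numeral (level pitch) or two numerals (contour pitch), as in the examples above.

```python
def parse_pitch(entry):
    """Parse a pitch entry such as '24 4 6' into (start, end) levels per syllable."""
    # Square brackets are optional because they are added automatically on submit.
    text = entry.strip().strip("[]")
    syllables = []
    for token in text.split():
        # Each syllable is one numeral (level) or two numerals (contour).
        if len(token) not in (1, 2) or any(ch not in "1234567" for ch in token):
            raise ValueError(f"invalid pitch token: {token!r}")
        levels = [int(ch) for ch in token]
        # A level pitch starts and ends on the same level.
        syllables.append((levels[0], levels[-1]))
    return syllables


print(parse_pitch("3 5"))       # [(3, 3), (5, 5)]
print(parse_pitch("[64]"))      # [(6, 4)]
print(parse_pitch("[24 4 6]"))  # [(2, 4), (4, 4), (6, 6)]
```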

Query Word Lists

To access the lexical data in the ComparaLex database you need to perform a query. This is done by browsing the language selector, selecting the language data, and specifying the output format.

1 Select Language(s)

This is a hierarchical list of all languages in the Ethnologue organized by language family. Click a language family to open it and display the families/languages underneath. Submitted word lists are attached to languages in the tree.

Each type of item in the tree/search list has a unique icon:

The blue circle with a number indicates how many submitted word lists are beneath.

Search for

Use the search box to quickly find what you're looking for in the Ethnologue browser. Start typing and the search box will show a list of all items that match. You can search for language name, alternate name, dialect, language family, submitter/copyright holder, or ISO 639-3 code. The more letters you type, the fewer results will be displayed. To search for a specific ISO 639-3 code, start by typing [ then the three letters. Search is case insensitive. Scroll to the one you want and click on it to find it in the tree.

Hide Empty

Click this button to temporarily hide all languages/families that don't have any word list data. Click it again to show them.

Expand/Collapse All

Click these buttons to open/close ALL the nodes of the tree.

Tooltips

Hover over any item in the tree, except those at the first level, and you'll see a tooltip with more information.

Selection

Clicking/checking a submitted word list will insert that list into the Selected Languages section (see below).

2 Output Format

Here you can choose how you want the query results filtered and sorted. There are four options:

  1. Word List - This filters the results to match the standard word list you chose. You can select the English list gloss and/or French list gloss. If the language doesn't have a word that matches the standard word list, blanks are inserted. In some cases there may be entire rows that don't have any language data. To hide these blank lines, check either Hide partially empty rows or Hide completely empty rows. Lists that are blue match the word list data you've selected. You can restrict the results even more by typing in the standard word list id numbers. Ranges are accepted (e.g., 1-5, 7).
  2. Domain - This filters the results according to the semantic domain you choose from the list. These semantic domains are taken from the Rapid Word Collection method.
  3. Search - This filters the results to show only records that contain a search string that you specify. You select the field to search on and the text to look for. Regular expressions are supported.
  4. None - This doesn't apply any filter and shows the language word lists 'as is'. If only one language is selected, then the list is displayed according to the standard word list that was used for elicitation. If more than one language is selected, they are referenced together and sorted alphabetically by gloss.
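
As an illustration of the range notation accepted by the Word List filter, the sketch below expands a string such as "1-5, 7" into individual id numbers. The function name parse_id_ranges is hypothetical, and the sketch assumes that ids are whole numbers; it is not a description of how ComparaLex itself parses the field.

```python
def parse_id_ranges(spec):
    """Expand a range string like '1-5, 7' into a sorted list of ids."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range is reversed: {part!r}")
            ids.update(range(start, end + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_id_ranges("1-5, 7"))  # [1, 2, 3, 4, 5, 7]
```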

3 Select Columns & Reorder

Here you can select which columns you want displayed for each word list. Greyed-out columns are empty and cannot be selected.

Drag & Drop

You can reposition the word lists by clicking and dragging.

Find in the tree

Clicking the title of each word list will find it in the Ethnologue browser.

Linking to ComparaLex

Submitted Word Lists

Every submitted word list has a metadata page that can be displayed by using a link such as: http://www.comparalex.org?w1=123 where 123 is the id number of the word list. You can find this page by hovering over the word list in the tree and clicking "View Word List" in the tooltip.

Language

You can use links that will open up the ComparaLex Ethnologue tree and reveal the language and any word lists that are in the database. Use the format http://www.comparalex.org?iso=abc where abc is the ISO 639-3 code for the language.
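
If you need to generate many such links, for example for a bibliography or a project page, the two URL patterns above can be built programmatically. This is a small sketch; the function names are hypothetical, and the parameter names w1 and iso are copied from the example links above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.comparalex.org"

def word_list_link(word_list_id):
    """Link to the metadata page of a submitted word list."""
    return BASE_URL + "?" + urlencode({"w1": word_list_id})

def language_link(iso_code):
    """Link that opens the Ethnologue tree at the given ISO 639-3 code."""
    return BASE_URL + "?" + urlencode({"iso": iso_code})


print(word_list_link(123))   # http://www.comparalex.org?w1=123
print(language_link("abc"))  # http://www.comparalex.org?iso=abc
```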

Submitting Language Word Lists

Preparing Word List Data for Submission

Word list data must be organized following one of the standard word lists in the ComparaLex database (e.g., SIL Comparative African Word List, Swadesh 100 Word List, etc. See all standard word lists). Transcription should be carried out using IPA symbols, with any deviations from the IPA spelled out in the “Additional Comments” section of the metadata form. For additional information, see the data encoding and data format sections below. Your data must minimally contain the following fields for each item:

In addition, the following fields are highly recommended:

Submission Procedure

Submitting data to the server involves four steps:

  1. Metadata form - Supply the details about the language word list such as name, country, ISO 639-3 code, etc. See above for a complete description of all the fields.
  2. Upload data file(s) - The word list data and sound files are uploaded and stored in a temporary location on our server.
  3. Verify import process - Here you specify the field definitions for your data, verify that it imported correctly, and make adjustments if necessary.
  4. Await approval - Your word list will wait in a queue until one of our staff approves it and enters it into the database.

To find out what stage your language submission is at, see the Status section above.

Data File Format

ComparaLex can accept files in the following formats:

Files may be zipped to save space. The maximum upload file size is 127 MB. You can upload multiple data files, and each data set will be added to the language.

Tab separated is recommended over comma separated because it is more likely that the comma will be used in the data in addition to serving as the data separator. If you nevertheless still choose to use comma separated and you have fields that contain commas, please enclose the entire field in double quotation marks (").

Example: 26,"voice box, larynx, Adam's apple"
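
If you export your data with a script rather than a spreadsheet, a standard CSV library will handle the quoting for you. The sketch below uses Python's csv module; the function name to_delimited is hypothetical, and the single row is the example shown above.

```python
import csv
import io

def to_delimited(rows, delimiter="\t"):
    """Serialize rows as tab-separated (default) or comma-separated text."""
    buffer = io.StringIO()
    # The default quoting mode only quotes fields that contain the delimiter,
    # a double quotation mark, or a line break.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [[26, "voice box, larynx, Adam's apple"]]
print(to_delimited(rows, ","), end="")  # 26,"voice box, larynx, Adam's apple"
print(to_delimited(rows), end="")       # tab-separated, no quotes needed
```

With tabs as the separator the commas inside the gloss need no special treatment, which is why tab separated is the recommended format.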

If your data is stored in a spreadsheet like Microsoft Excel then you can convert it to tab or comma separated format by clicking File:Save As and changing the format to "Unicode Text" or something similar.

Data Encoding

Unicode files are preferred, but ANSI/Windows 1252 and SIL IPA93 data can also be imported and automatically converted to UTF-8.

Warning: If you have been using a hacked font based on another encoding standard, your data cannot be imported into this system. You will have to convert it to UTF-8. A good tool to help with this is SIL TECKit.
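
For data in a standard legacy encoding such as Windows 1252 (not a hacked font, which needs a mapping tool such as TECKit), converting to UTF-8 yourself before upload is straightforward. The following is a sketch only; convert_to_utf8 is a hypothetical helper name, and cp1252 is Python's name for the Windows 1252 codec.

```python
def convert_to_utf8(raw_bytes, source_encoding="cp1252"):
    """Re-encode bytes from a legacy encoding as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    """
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")


legacy = "café".encode("cp1252")  # b'caf\xe9'
print(convert_to_utf8(legacy))    # b'caf\xc3\xa9'
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result to a new file so the original stays untouched.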

Audio File Format

ComparaLex can accommodate both MP3 and WAV audio files.

Recordings are preferred in WAV format with a sampling rate of 44 kHz or higher and a bit depth of 16 bits. ComparaLex will automatically generate MP3 files from your WAV files. MP3 files are created at 32 kHz sampling, variable bit rate, quality=4, mono. The maximum allowed size of an audio clip for a word is 1000 kB. If you only have MP3 files, please upload them anyway.
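
Before uploading, you may want to check your WAV files against these preferences. The sketch below uses Python's standard wave module; the function names are hypothetical, and reading "1000 kB" as 1,000,000 bytes is an assumption, since the exact byte limit is not stated here.

```python
import wave

MIN_SAMPLE_RATE_HZ = 44000       # "44 kHz or higher"
PREFERRED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
MAX_CLIP_BYTES = 1000 * 1000      # assumption: 1000 kB taken as 1,000,000 bytes

def recording_problems(sample_rate_hz, sample_width_bytes, size_bytes):
    """Return a list of ways a recording differs from the stated preferences."""
    problems = []
    if sample_rate_hz < MIN_SAMPLE_RATE_HZ:
        problems.append("sampling rate is below 44 kHz")
    if sample_width_bytes != PREFERRED_SAMPLE_WIDTH_BYTES:
        problems.append("samples are not 16 bits")
    if size_bytes > MAX_CLIP_BYTES:
        problems.append("clip is larger than 1000 kB")
    return problems

def inspect_wav(path):
    """Read the sampling rate (Hz) and sample width (bytes) from a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth()
```

A typical use is to call inspect_wav on each file, pass the two values together with the file size to recording_problems, and fix any file for which the returned list is not empty.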

Multiple files can be zipped together to save uploading time. The maximum upload file size is 127 MB. You can upload files more than once and each one will be added to the collection that already exists for that language. If a file with the same name already exists, it will be overwritten.

When you upload audio files, ComparaLex will automatically search through the audio field of the language word list and look for matches between sound file names and audio field names. If a match is found, then a link is created. If an audio file is specified but cannot be found, then a question mark will be displayed.
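
The matching step can be pictured with the sketch below. It is not the ComparaLex implementation: the function name link_audio is hypothetical, and comparing names case-insensitively while ignoring the file extension is an assumption made for illustration. A value of None corresponds to the question mark shown for a missing file.

```python
import os

def link_audio(audio_fields, uploaded_files):
    """Map each audio field value to a matching uploaded file name, or None."""
    # Index uploaded files by lower-cased name without extension (assumption).
    by_stem = {os.path.splitext(name)[0].lower(): name for name in uploaded_files}
    links = {}
    for field in audio_fields:
        stem = os.path.splitext(field)[0].lower()
        links[field] = by_stem.get(stem)  # None when no file matches
    return links


print(link_audio(["001_hand.wav", "002_foot"], ["001_hand.wav", "003_eye.wav"]))
# {'001_hand.wav': '001_hand.wav', '002_foot': None}
```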

Viewing & Editing Your Language Word Lists

Editing Word List Metadata

If you have already submitted a language but need to make some changes, it’s easy with ComparaLex. To edit your language metadata, log in and click on the Edit your account button at the top right of the screen. This will take you to your user account page, and at the bottom you should see a table of all the languages that belong to you.

The language list has the following columns:

  • Finished: Shows a check mark when you have clicked 'Yes' to the question "... are you finished editing this word list?".
  • Consent: Shows a check mark when you have clicked 'Yes' to the question "...grant permission to CanIL to publish".
  • Status: Shows the publishing status of the word list. For more information about publishing status, see the Status section above.
  • Edit Details: Takes you to the edit window where you can edit the metadata for the language (fields like name, country, ISO 639-3 code, etc.).
  • Edit Words: Takes you to a page where you can edit all the fields of this language. Use this page to add or delete records.
  • Upload: Use this to upload new data and/or audio files to the language.
  • Delete Words: Deletes all the words from this word list.
  • Delete Audio: Deletes all the audio files from this word list.
  • Delete All: Deletes the entire word list and all related files.

Editing Words

ComparaLex has a built in editor that allows you to make changes to your word list after you have uploaded your data. We think you'll find the editor quite easy to use. In fact, you could even create an entire word list from scratch without uploading anything.

If you are logged in and you have submitted a language word list, then any time the language appears on ComparaLex the data should have a light-green background. Double-clicking on one of these cells will switch it to editing mode.

Double-click the cell to switch to edit mode. Click the save icon or press 'ENTER' to save. Click the cancel icon or press 'ESC' to cancel. If you're editing the word list from the 'Edit your account' page, you'll see a column at the far right with two more icons.

Special Behavior

Standard List ID - The standard list ID field will be turned into a drop-down list selector of the standard word list being used. Select the appropriate list record and click the save button.

Phonetic Pitch - Editing phonetic pitch fields is straightforward. Double-click the cell and the idealized pitch trace will be converted to a sequence of numbers 1-7. Make the changes you need and save. The values will be converted back to an idealized pitch trace. You don't have to type the square brackets as these will be added automatically when you save. For more information, see the section on phonetic pitch.

Audio - Double click the audio cell and it will switch to editing mode. Click the browse button to upload a new audio file for the word. The maximum allowed size of an audio clip for a word is 1000 kB. If you upload a WAV file it will automatically create an MP3 for you. For more information, see the section on audio formats.

Adding or Deleting records

The only way to add or delete records is from your Account Details page. Click the Edit your account button at the top right corner of the page. Click the icon to edit words in that language. The word list with all fields will appear and at the far right will be two icons:

Regular Expressions

Regular expressions are a powerful way of specifying a pattern for a complex search. Here is a chart to help you get started with understanding the codes:

Each entry gives the code, its description, and an example, grouped by category.

Counts (apply to the previous character)

  • +  One or more occurrences. Example: a+rt matches 'art', 'aart', 'aaart', etc.
  • *  Zero or more occurrences. Example: da*d matches 'dd', 'dad', 'daaaad', etc.
  • ?  Either zero or one occurrence. Example: be?an matches 'ban' and 'bean' but NOT 'beean'.
  • {min,max}  Between min and max occurrences. Example: n{1,3} matches 'n', 'nn', 'nnn' but NOT 'nnnn'.

Note: The ? also modifies any of the above to be 'non-greedy'. This is useful with wildcards like . or classes [...]. Example: .+?z matches as few characters as possible (at least one) followed by a 'z'.

Position

  • ^  Beginning of a string. Example: ^The matches 'The' when it occurs at the beginning of the string. (Note: ^ must be the first character of the search string.)
  • $  End of a string. Example: beard$ matches 'beard' when it occurs at the end of the string. (Note: $ must be the last character of the search string.)
  • \b  Word boundary. Example: \barm matches 'arm' and 'army' but NOT 'farm'.

Class & Group (any one character within the range)

  • .  Any character (including carriage return and newline). Example: b.d matches 'bad', 'bed', 'bZd'.
  • [...]  Any single character within the brackets. Example: [4-9]th matches '4th', '5th', '6th', etc.
  • [^...]  Any single character except those within the brackets. Example: b[^ae]d matches 'bid', 'bud' but NOT 'bed' or 'bad'.
  • (...)  Treat the contents of (...) as a single unit. Also stores the contents to be referred to later. Example: band(stand)? matches 'band' and 'bandstand'.

Other

  • |  Separates alternate possibilities. Example: jogg(ing|ed) matches 'jogging' and 'jogged'.
  • \  Literal. When used before one of the special characters (above), it treats that character as a literal. Example: 1\+1 matches '1+1'.
  • \s  Whitespace characters (space, tab, line break, carriage return). Example: \s{2,} matches 2 or more spaces, tabs or new lines.
  • \d  Digits 0-9. Example: \d\d matches any two digits, '00' through '99'.
 
Examples

As you can see, when you combine the power of the above codes, you can do some amazing searches. For example:

For more information, there are many regular expression resources on the web. Please note that while different variations of regular expressions exist, they share the same core syntax.
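
If you want to try a pattern before using it in a ComparaLex search, you can test it locally. The sketch below uses Python's re module as a stand-in; the regular expression engine behind ComparaLex is not stated here and may differ in details (for example, in Python the . does not match a newline unless the DOTALL flag is set). The helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


print(found(r"a+rt", "aaart"))               # aaart
print(found(r"be?an", "beean"))              # None
print(found(r"\barm", "the army"))           # arm
print(found(r"\barm", "farm"))               # None
print(found(r".+?z", "abczdz"))              # abcz
print(found(r"jogg(ing|ed)", "she jogged"))  # jogged
print(found(r"1\+1", "1+1=2"))               # 1+1
```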

Bug/Feature Tracker

The Tracker is like a 'to do' list for ComparaLex. It's where you go when you want to report a problem or make a suggestion for a change/improvement to the site. It's where the developers go when they want to know what to work on next. It's also a place where you can discuss changes and offer comments on the development of ComparaLex.

Anyone can browse the list of tracker items and read the comments. Only registered users can contribute new items and make comments on others. Click here to register for a new account.

How does it work?

Here's a sample scenario of how the tracker feature should be used:

  1. Someone is using ComparaLex and encounters a bug or has a suggestion for change/improvement
  2. The person logs in and creates a new item in the tracker, filling out all the fields (see below)
  3. An email is automatically sent to the administrator notifying them of a new tracker item
  4. The administrator checks over the tracker item and...
    • edits it for clarity and accuracy
    • emails the user for more info if needed
    • assigns a priority for this item

Over time, the tracker will accumulate more and more items. This continues until a milestone is reached. Then a developer is contracted to implement the tracker items in order of priority. Once the item has been completed/implemented it will be marked as closed.

Tracker Fields

Please!

Before creating a new item, it would be helpful if you would browse the tracker items to see if someone else has already reported the same thing. This cuts down on our work.