Table of Contents
- Language Word List Database
- Query Word Lists
- Linking to ComparaLex
- Submitting Word Lists
- Editing Word Lists
- Regular Expressions
- Bug/Feature Tracker
Didn't find the answer?
Visit our FAQ page to see answers to common questions, or contact us with your question.
ComparaLex Help
Language Word List Database
Metadata - Data about data
The website database stores specific information about each language word list. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:
1 | Word list name (Dialect/Village): | A unique name for this word list. Usually the dialect name or village/region where the data was collected. You can have more than one word list per language but the name field for each must be unique. |
2 | ISO 639-3 code for language: | The 3 letter ISO 639-3 code for the language (see www.ethnologue.org) |
3 | Standard list used for elicitation: | The standard word list used for elicitation (e.g., Swadesh, Comparative African Word List) |
4 | Location of idiolect: | The village/city/region, country, etc. where the speaker spent their linguistically formative years |
5 | Country | The country where the language is spoken |
6 | Date elicited | Approximate date the word list was elicited |
7 | Acknowledgments | Organizations and/or individuals who should be acknowledged regarding this word list |
8 | Elsewhere published | Any other places where this word list is published |
9 | Additional Comments | Any other information that may be helpful to understand the data. For example:
|
10 | Contributing researcher(s) | The name of the researcher who prepared this data. |
11 | Copyright holder | Individual/organization that owns the copyright for the word list data and media files. |
12 | May we display the researcher's name? | If 'Yes' then the name of the researcher will appear with the language word list information. Otherwise it will be hidden. |
13 | May we display the audio recordings (if any)? | If 'Yes' then the audio recordings will be displayed for the public. Otherwise they will be hidden. |
14 | Do you have permission of the copyright holder to submit this word list? | If 'No' please provide an explanation why this data should be considered for publishing in #16. |
15 | Do you have written permission from your data sources to distribute transcriptions and recordings of their utterances? | Answer 'Yes' or 'No'. If 'No' please provide an explanation why this data should be considered for publishing in #16. |
16 | If 'No' to #14 or #15 please provide an explanation why this data should be considered for publishing | Here you provide an explanation for #14 and/or #15. |
17 | Are you finished editing the word list? | Answer 'Yes' or 'No'. If 'No' then you cannot answer 'Yes' to #18. |
18 | Do you agree to the terms of service and grant permission to CanIL to publish your word list and associated sound files? | Answer 'Yes' or 'No'. Your word list cannot be approved for publishing until you answer 'Yes'. To review the terms of service click here |
Note: ComparaLex reserves the right to NOT publish submitted data on the ComparaLex website for any reason. Refer to the ComparaLex Terms of Service for more information.
Status
Before your word list can be published, it needs to go through the approval process. You can view the status by clicking the "Edit My Account" button and viewing the status column of your word list. There are three stages:
Not finished | Indicates that you have not completed your data submission/edits and have not given us permission to publish the data on ComparaLex. |
Finished but not approved | Indicates that you have finished editing and given consent to publish your data but it has not yet been approved by the ComparaLex staff. |
Finished and approved | Indicates that a ComparaLex reviewer has approved the project. The project is now available for public use in ComparaLex. |
Note: If for some reason you want to remove your project from public access, you can change the status from green to red after logging in.
Data Fields
The website database stores specific information about each word. This information is provided by the registered users of ComparaLex and checked by the administrators. Here is an explanation of each field:
1 | Standard Word List ID | This is the reference code of the standard word list that was used for elicitation. |
2 | Gloss | The definition (in English) used for collection. |
3 | Audio | An audio file (.WAV and/or .MP3) of someone pronouncing the word. For more information, see the section on audio formats. |
4 | Phonetic | Phonetic representation of the word (segmental form only). |
5 | Phonemic | Phonemic representation of the word (segmental form only). |
6 | Phonetic Pitch | Surface pitch transcribed as numbers 1-7 (1=lowest, 7=highest). The numbers will be displayed graphically as lines in between square brackets. For example:
|
7 | Phonemic Tone | Analyzed Surface tone (representation is flexible but please explain). |
8 | Word Category | Noun, verb, etc. |
9 | Noun/Verb Class | Identify the noun or verb class of the word if the language has a noun or verb class system. |
10 | Phonetic Plural | Plural segmental form only. |
11 | Phonemic Plural | Plural segmental form only. |
12 | Plural Phonetic Pitch | Surface pitch. For more information, see the section on phonetic pitch |
13 | Plural Phonemic Tone | Analyzed surface tone. |
14 | Noun/Verb Class Plural | |
15 | Orthographic | Orthographic representation. |
16 | Comments |
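The Phonetic Pitch field (field 6 above) can be checked before submission. The following is a minimal sketch, assuming each level pitch is a single numeral 1-7 separated by single spaces; the function name `parse_pitch` is hypothetical and is not part of ComparaLex.

```python
VALID_LEVELS = {"1", "2", "3", "4", "5", "6", "7"}


def parse_pitch(pitch: str) -> list[int]:
    """Parse a pitch string such as "3 5 1" into a list of integers.

    Assumption: every token is one level pitch from 1 (lowest) to
    7 (highest). Surrounding square brackets are optional and ignored.
    """
    tokens = pitch.strip().strip("[]").split()
    for token in tokens:
        if token not in VALID_LEVELS:
            raise ValueError(f"invalid pitch level: {token!r}")
    return [int(token) for token in tokens]


print(parse_pitch("3 5 1"))  # [3, 5, 1]
```

A value outside 1-7, such as "8", raises a `ValueError` instead of being passed along silently.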
Query Word Lists
To access the lexical data in the ComparaLex database you need to perform a query. This is done by browsing the language selector, selecting the language data, and specifying the output format.
1 Select Language(s)
This is a hierarchical list of all languages in the Ethnologue organized by language family. Click a language family to open it and display the families/languages underneath. Submitted word lists are attached to languages in the tree.
Each type of item in the tree/search list has a unique icon:
- Language family
- Language (ISO 639-3 codes are in [])
- Submitted word list
The blue circle with a number indicates how many submitted word lists are beneath.
Search for
Use the search box to quickly find what you're looking for in the Ethnologue browser. Start typing and the search box will show a list of all items that match. You can search for language name, alternate name, dialect, language family, submitter/copyright holder, or ISO 639-3 code. The more letters you type, the fewer results will be displayed. To search for a specific ISO 639-3 code, start by typing [ then the three letters. Search is case insensitive. Scroll to the one you want and click on it to find it in the tree.
Hide Empty
Click this button to temporarily hide all languages/families that don't have any word list data. Click it again to show them.
Expand/Collapse All
Click these buttons to open/close ALL the nodes of the tree
Tooltips
Hover over any item in the tree, except the first-level, and you'll see a tooltip with more information.
Selection
Clicking/checking a submitted word list will insert that list into the Selected Languages section (see below).
2 Output Format
Here you can choose how you want the query results filtered and sorted. There are four options:
- Word List - This filters the results to match the standard word list you chose. You can select the English list gloss and/or French list gloss. If the language doesn't have a word that matches the standard word list, blanks are inserted. In some cases there may be entire rows that don't have any language data. To hide these blank lines, check either Hide partially empty rows or Hide completely empty rows. Lists that are blue match the word list data you've selected. You can restrict the results even more by typing in the standard word list ID numbers. Ranges are accepted (e.g., 1-5, 7).
- Domain - This filters the results according to the semantic domain you choose from the list. These semantic domains are taken from the Rapid Word Collection method.
- Search - This filters the results to show only records that contain a search string that you specify. You select the field to search on and the text to look for. Regular expressions are supported.
- None - This doesn't apply any filter and shows the language word lists 'as is'. If only one language is selected, then the list is displayed according to the standard word list that was used for elicitation. If more than one language is selected, they are referenced together and sorted alphabetically by gloss.
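The range syntax accepted by the Word List filter (e.g., 1-5, 7) can be read as a set of ID numbers. The sketch below shows one plausible interpretation; how ComparaLex itself parses the input is not documented here, and the function name `parse_id_ranges` is hypothetical.

```python
def parse_id_ranges(spec: str) -> list[int]:
    """Expand a specification such as "1-5, 7" into a sorted list of IDs.

    Assumption: items are separated by commas, and a range is written
    as start-end with both ends included.
    """
    ids: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range is reversed: {part!r}")
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_id_ranges("1-5, 7"))  # [1, 2, 3, 4, 5, 7]
```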
3 Select Columns & Reorder
Here you can select which columns you want displayed for each word list. Greyed-out columns are empty and cannot be selected.
Drag & Drop
You can reposition the word lists by clicking and dragging.
Find in the tree
Clicking the title of each wordlist will find it in the Ethnologue browser
Linking to ComparaLex
Submitted Word Lists
Every submitted word list has a metadata page that can be displayed by using a link such as: http://www.comparalex.org?w1=123 where 123 is the id number of the word list. You can find this page by hovering over the word list in the tree and clicking "View Word List" in the tooltip.
Language
You can use links that will open up the ComparaLex Ethnologue tree and reveal the language and any word lists that are in the database. Use the format http://www.comparalex.org?iso=abc where abc is the ISO 639-3 code for the language.
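Both link formats can be generated programmatically. This is a small sketch that follows the two URL patterns given above; the helper names `word_list_link` and `language_link` are hypothetical, and the check on the ISO code (three ASCII letters, lowercased) is an assumption about what counts as valid input.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.comparalex.org"


def word_list_link(word_list_id: int) -> str:
    """Link to the metadata page of a submitted word list."""
    return f"{BASE_URL}?{urlencode({'w1': word_list_id})}"


def language_link(iso_code: str) -> str:
    """Link that opens the tree at a language, given its ISO 639-3 code."""
    code = iso_code.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter code: {iso_code!r}")
    return f"{BASE_URL}?{urlencode({'iso': code})}"


print(word_list_link(123))   # http://www.comparalex.org?w1=123
print(language_link("abc"))  # http://www.comparalex.org?iso=abc
```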
Submitting Language Word Lists
Preparing Word List Data for Submission
Word list data must be organized following one of the standard word lists in the ComparaLex database (e.g., SIL Comparative African Word List, Swadesh 100 Word List, etc. See all standard word lists). Transcription should be carried out using IPA symbols, with any deviations from the IPA spelled out in the “Additional Comments” section of the metadata form. For additional information, see the data encoding and data format sections below. Your data must minimally contain the following fields for each item:
- Standard list number - Data in this field are the numbers for each word employed by the standard word list the data has been gathered against.
- Gloss - Data in this field are the glosses of the standard word list against which the data has been gathered.
- Phonetic - Data in this field are phonetic transcriptions of each word in the list, including phonetic pitch if language is tonal or has a pitch-accent system. If the “Phonetic Pitch” field is employed, then phonetic pitch does not need to be included in the “Phonetic” field.
In addition, the following fields are highly recommended:
- Audio - Data in this field are words that correspond identically to the file names of the corresponding individual digital recordings of each word. The preferred filename is one that includes both the standard list number and the gloss (e.g., 0001_body.wav). Each word must have a separate audio file. Ideally, the recordings should include the gloss and one token of each word. For additional information, see the audio format section below.
- Phonetic Pitch - Data in this field are preferably encoded using the numerals 1-7 as follows. For each level pitch, type in one numeral (1 = lowest pitch, 7 = highest pitch) with a single space between each of the pitches. For more information, see the section on phonetic pitch.
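The preferred audio filename described above (standard list number plus gloss, e.g., 0001_body.wav) can be generated consistently. The sketch below assumes a four-digit, zero-padded number, as in that example; replacing spaces and punctuation in the gloss with underscores is an additional assumption, and `audio_filename` is a hypothetical helper.

```python
import re


def audio_filename(list_number: int, gloss: str, ext: str = "wav") -> str:
    """Build a filename such as 0001_body.wav from list number and gloss."""
    # Assumption: anything other than ASCII letters, digits, or hyphens
    # in the gloss is collapsed to a single underscore.
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", gloss.strip()).strip("_")
    return f"{list_number:04d}_{slug}.{ext}"


print(audio_filename(1, "body"))        # 0001_body.wav
print(audio_filename(26, "voice box"))  # 0026_voice_box.wav
```

Whatever naming scheme is used, the value in the Audio field must be identical to the name of the uploaded file.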
Submission Procedure
Submitting data to the server involves four steps:
- Metadata form - Supply the details about the language word list such as name, country, ISO 639-3 code, etc. See above for a complete description of all the fields.
- Upload data file(s) - The word list data and sound files are uploaded and stored in a temporary location on our server.
- Verify import process - Here you specify the field definitions for your data and verify that it imported correctly and make adjustments if necessary.
- Await approval - Your word list will wait in a queue until one of our staff approves it and enters it into the database.
To find out what stage your language submission is at, click here.
Data File Format
ComparaLex can accept files in the following formats:
- Field Linguist’s Toolbox (.db) - See the Resources menu tab to download a sample database
- Comma Separated (.txt, .csv)
- Tab Separated (.txt, .tsv)
Files may be zipped to save space. The maximum upload file size is 127 MB. You can upload multiple data files, and each data set will be added to the language.
Tab separated is recommended over comma separated because it is more likely that the comma will be used in the data in addition to serving as the data separator. If you nevertheless still choose to use comma separated and you have fields that contain commas, please enclose the entire field in double quotation marks (").
Example: 26,"voice box, larynx, Adam's apple"
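The quoting rule can be left to a CSV library rather than applied by hand. As a minimal sketch, Python's standard `csv` module quotes a field only when it contains the separator, so the same row needs quotation marks in comma-separated output but not in tab-separated output. The helper name `to_delimited` is hypothetical.

```python
import csv
import io

# The row from the example above: list number and gloss.
row = (26, "voice box, larynx, Adam's apple")


def to_delimited(rows, delimiter: str) -> str:
    """Serialize rows as delimited text, quoting fields only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


comma_text = to_delimited([row], ",")
tab_text = to_delimited([row], "\t")

print(comma_text, end="")  # 26,"voice box, larynx, Adam's apple"
print(tab_text, end="")    # the same fields separated by a tab, no quotes
```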
If your data is stored in a spreadsheet like Microsoft Excel then you can convert it to tab or comma separated format by clicking File:Save As and changing the format to "Unicode Text" or something similar.
Data Encoding
Unicode files are preferred, but ANSI/Windows 1252 and SIL IPA93 data can also be imported and automatically converted to UTF-8.
Warning: If you have been using a hacked font based on another encoding standard, your data cannot be imported into this system. You will have to convert it to UTF-8. A good tool to help with this is SIL TECkit.
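For data in a standard legacy encoding, conversion to UTF-8 can also be done before uploading. The sketch below assumes the source is Windows 1252 (`cp1252` in Python); it does not handle hacked fonts, which need a custom mapping such as one built with TECkit. The function name `to_utf8_bytes` is hypothetical.

```python
def to_utf8_bytes(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in the
    source encoding, rather than guessing.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Byte 0xE9 is "é" in Windows 1252; in UTF-8 it becomes two bytes.
print(to_utf8_bytes(b"caf\xe9"))  # b'caf\xc3\xa9'
```

For whole files, the same function can be applied to the result of reading the file in binary mode, writing the returned bytes to a new file.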
Audio File Format
ComparaLex can accommodate both MP3 and WAV audio files.
- MP3 - lossy, low bandwidth format for listening online.
- WAV - lossless format to download to your computer for detailed analysis.
Recordings are preferred in WAV format with a sampling rate of 44 kHz or higher and a bit depth of 16 bits. ComparaLex will automatically generate MP3 files from your WAV files. MP3 files are created at 32 kHz sampling, variable bit rate, quality=4, mono. The maximum allowed size of an audio clip for a word is 1000 kB. If you only have MP3 files, please upload them anyway.
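The preferred WAV properties can be checked locally before uploading. This is a rough sketch using Python's standard `wave` module; the function name `check_wav` is hypothetical, and the threshold of 44000 Hz is a literal reading of "44 kHz or higher" (standard 44.1 kHz recordings pass).

```python
import wave


def check_wav(path_or_file, min_rate: int = 44000, sample_bytes: int = 2) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine.

    Accepts a filename or an open binary file object.
    """
    problems = []
    with wave.open(path_or_file, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes per sample; 2 bytes = 16 bits
        if rate < min_rate:
            problems.append(f"sampling rate {rate} Hz is below {min_rate} Hz")
        if width != sample_bytes:
            problems.append(f"bit depth is {width * 8} bits, not {sample_bytes * 8}")
    return problems
```

Calling `check_wav("0001_body.wav")` on a 44.1 kHz, 16-bit recording would return an empty list.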
Multiple files can be zipped together to save uploading time. The maximum upload file size is 127 MB. You can upload files more than once and each one will be added to the collection that already exists for that language. If a file with the same name already exists, it will be overwritten.
When you upload audio files, ComparaLex will automatically search through the audio field of the language word list and look for matches between sound file names and audio field names. If a match is found, then a link is created. If an audio file is specified but cannot be found, then a question mark will be displayed.
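The matching step can be previewed locally to catch missing recordings before upload. The following is only an illustration of the idea, not the server's implementation: it assumes an exact, case-sensitive comparison between Audio field values and file names, and the function name `match_audio` and the sample file names are hypothetical.

```python
def match_audio(audio_fields: list[str], uploaded_files: list[str]) -> dict[str, bool]:
    """Map each non-empty Audio field value to whether a file with that name exists."""
    uploaded = set(uploaded_files)
    return {name: name in uploaded for name in audio_fields if name}


fields = ["0001_body.wav", "0002_head.wav", ""]
files = ["0001_body.wav"]

for name, found in match_audio(fields, files).items():
    # A missing file corresponds to the question mark shown on the site.
    print(name, "linked" if found else "?")
```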
Viewing & Editing Your Language Word Lists
Editing Word List Metadata
If you have already submitted a language but need to make some changes, it’s easy with ComparaLex. To edit your language metadata, log in and click on the Edit your account button at the top right of the screen. This will take you to your user account page, and at the bottom you should see a table of all the languages that belong to you.
Finished | This shows a check mark when you have clicked 'Yes' to the question "... are you finished editing this word list?". |
Consent | This shows a check mark when you have clicked 'Yes' to the question "...grant permission to CanIL to publish". |
Status | This shows the publishing status of the word list. For more information about publishing status, click here. |
Edit Details | Takes you to the edit window where you can edit the metadata for the language (fields like name, country, ISO 639-3 code... etc.) |
Edit Words | Takes you to a page where you can edit all the fields of this language. Use this page to add or delete records. |
Upload | Use this to upload new data and/or audio files to the language. |
Delete Words | Deletes all the words from this word list. |
Delete Audio | Deletes all the audio files from this word list. |
Delete All | Deletes the entire word list and all related files. |
Editing Words
ComparaLex has a built in editor that allows you to make changes to your word list after you have uploaded your data. We think you'll find the editor quite easy to use. In fact, you could even create an entire word list from scratch without uploading anything.
If you are logged in and you have submitted a language word list then anytime the language appears on ComparaLex the data should have a light-green background. Double-clicking on one of these cells will switch it to editing mode.
Double-click the cell to switch to edit mode. Click the save icon or press 'ENTER' to save. Click the cancel icon or press 'ESC' to cancel. If you're editing the word list from the 'Edit your account' page then you'll see a column at the far right with two more icons.
Special Behavior
Standard List ID - The standard list ID field will be turned into a drop-down list selector of the standard word list being used. Select the appropriate list record and click the save button.
Phonetic Pitch - Editing phonetic pitch fields is straightforward. Double-click the cell and the idealized pitch trace will be converted to a sequence of numbers 1-7. Make the changes you need and save. The values will be converted back to an idealized pitch trace. You don't have to type the square brackets as these will be added automatically when you save. For more information, see the section on phonetic pitch.
Audio - Double click the audio cell and it will switch to editing mode. Click the browse button to upload a new audio file for the word. The maximum allowed size of an audio clip for a word is 1000 kB. If you upload a WAV file it will automatically create an MP3 for you. For more information, see the section on audio formats.
Adding or Deleting records
The only way to add or delete records is from your Account Details page. Click the Edit your account button at the top right corner of the page. Click the Edit Words icon for that language. The word list with all fields will appear and at the far right will be two icons:
- one to add a new word/row below the current row.
- one to delete the word/row.
Regular Expressions
Regular expressions are a powerful way of specifying a pattern for a complex search. Here is a chart to help you get started with understanding the codes:
Category | Code | Description | Examples | |
Counts Applies to the previous character | ||||
+ | One or more occurrences | a+rt | matches art, aart, aaart... etc | |
* | Zero or more occurrences | da*d | matches 'dd', 'dad', 'daaaad'... etc | |
? | Either zero or one occurrence | be?an | matches 'ban' and 'bean' but NOT 'beean' | |
{min,max} | Match between min and max occurrences | n{1,3} | matches on 'n', 'nn', 'nnn' but NOT 'nnnn' | |
Note: The ? also modifies any of the above to be 'non-greedy'. Useful when used with wildcards like . or classes [...] | .+?z | matches any characters up to and including the first 'z' | ||
Position | ||||
^ | Beginning of a string | ^The | matches on 'The' when it occurs at the beginning of the string (Note: ^ must be the first letter of the search string) | |
$ | End of a string | beard$ | matches on 'beard' when it occurs at the end of the string (Note: $ must be the last letter of the search string) | |
\b | Word boundary | \barm | matches 'arm' and 'army' but NOT 'farm' | |
Class & Group Any one character within the range | ||||
. | Any character (including carriage return and newline) | b.d | matches 'bad', 'bed', 'bZd' | |
[...] | Any single character within the brackets | [4-9]th | matches '4th', '5th', '6th' etc. | |
[^...] | Any single character except those within the brackets | b[^ae]d | matches 'bid', 'bud' but NOT 'bed' or 'bad' | |
(...) | Treat the contents of (...) as a single unit Also stores the contents to be referred to later | band(stand)? | matches 'band' and 'bandstand' | |
Other | ||||
| | Separates alternate possibilities | jogg(ing|ed) | matches 'jogging' and 'jogged' | |
\ | Literal. When used before one of the special characters (above) it treats it as a literal | 1\+1 | matches '1+1' | |
\s | Whitespace characters (space, tab, line break, carriage return) | \s{2,} | matches 2 or more spaces, tabs or new lines | |
\d | Digits 0-9 | \d\d | matches any two digits, '00' through '99' | |
Examples
As you can see, when you combine the power of the above codes, you can do some amazing searches. For example:
- '\b\d{1,3}\b' matches any whole number of one to three digits (0 to 999)
- 'sep[ae]rate' finds seperate and separate
- '(\.|\?) [a-z]' finds sentences starting with a lowercase letter (the parentheses keep the | from splitting the whole pattern in two)
- ' {2,}' finds double (or more) spaces
For more information, there are many regular expression resources on the web. Please note that while different variations of regular expressions exist, they all basically share the same syntax.
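Patterns can be tried out locally before using them in a search. The sketch below uses Python's standard `re` module, which is an assumption: the engine behind the ComparaLex search is not named here and may differ in details (in Python, for instance, `.` does not match a newline by default). The helper name `matches` is hypothetical.

```python
import re


def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates in which the pattern is found anywhere."""
    return [text for text in candidates if re.search(pattern, text)]


print(matches(r"a+rt", ["art", "aart", "rt"]))                # ['art', 'aart']
print(matches(r"be?an", ["ban", "bean", "beean"]))            # ['ban', 'bean']
print(matches(r"\barm", ["arm", "army", "farm"]))             # ['arm', 'army']
print(matches(r"b[^ae]d", ["bid", "bud", "bed", "bad"]))      # ['bid', 'bud']
print(matches(r"jogg(ing|ed)", ["jogging", "jogged", "jogger"]))  # ['jogging', 'jogged']

# Greedy versus non-greedy: the ? after + stops at the first 'z'.
print(re.search(r".+z", "abzcz").group())   # abzcz
print(re.search(r".+?z", "abzcz").group())  # abz
```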
Bug/Feature Tracker
The Tracker is like a 'to do' list for ComparaLex. It's where you go when you want to report a problem or make a suggestion for a change/improvement to the site. It's where the developers go when they want to know what to work on next. It's also a place where you can discuss changes and offer comments on the development of ComparaLex.
Anyone can browse the list of tracker items and read the comments. Only registered users can contribute new items and make comments on others. Click here to register for a new account.
How does it work?
Here's a sample scenario of how the tracker feature should be used:
- Someone is using ComparaLex and encounters a bug or has a suggestion for change/improvement
- The person logs in and creates a new item in the tracker, filling out all the fields (see below)
- An email is automatically sent to the administrator, notifying them of the new tracker item
- The administrator checks over the tracker item and...
- edits it for clarity and accuracy
- emails the user for more info if needed
- assigns a priority for this item
Over time, the tracker will accumulate more and more items. This continues until a milestone is reached. Then a developer is contracted to implement the tracker items in order of priority. Once the item has been completed/implemented it will be marked as closed.
Tracker Fields
- Type - A tracker item can be one of five types:
- Bug - A problem, error, or crash has been found
- Change - A suggestion for a change in behaviour
- New Feature - A completely new feature that would be useful
- Appearance - A suggestion for change in appearance or wording
- Other - Anything else
- Title - A short summary/title of what you're suggesting
- Object - The page/part of ComparaLex that this tracker item applies to
- Description - A full description of what you're suggesting. Here are some tips:
- Please be as specific as possible. Don't use vague/general terms (e.g., "Once in a while when I click on it, it gives me an error message")
- Provide a URL to the page you're referring to (the http://www.comparalex.org... part in your browser).
- Provide exact steps to reproduce the problem. If you can't reproduce it, we probably can't either.
- Browser - The web browser you are using (this will usually be filled in automatically)
- Priority - The priority level is assigned by the ComparaLex administration team. There is no timeline here so the specifications are a bit vague.
- Urgent - Critical item that must be fixed ASAP
- Next minor release - Quite important item that's not too difficult to implement, but it can wait a while
- Next major release - Big changes that will take some time to work out
- Rainy day - Only if there's nothing else important to do
- Status - The status is assigned by the ComparaLex administration team. Can be one of two values:
- Open - This item is being considered and has not been implemented yet. This is the default for all new items
- Closed - This item is no longer being considered (either because
Please!
Before creating a new item, it would be helpful if you would browse the tracker items to see if someone else has already reported the same thing. This cuts down on our work.