After reading this document, you should understand the basics of VSCA. With that knowledge you should be able to use this program properly. We'll start out easy, but in the end you'll know how to write a VSCA soundchange rule, how to write exceptions for them, how to include optional characters in your rules, and more! Without further ado, let's just start.
The first thing you need to know is how VSCA really works. You need two files: a lexicon file and a ruleset file. VSCA assumes these files are encoded in UTF-8. If that is not what you want, you can tell VSCA to treat them as plain 8-bit text files instead (see Running VSCA for more info).
The lexicon file is a text file that contains the words of your original language, one word on each line. You can include comments too. Comments are lines that VSCA ignores, so that you can include text in your lexicon files that will not be soundchanged. Comments start with a #. An example lexicon file might look like this:
lector
doctor
focus
jocus
districtus
civitatem
adoptare
opera
secundus
# Some Latin words
# Original source: http://zompist.com/sounds.htm
In this example file, the lines that start with # are comments. Comments can occur on any line, not just at the end of the file.
Speaking of comments, they can also be placed on lines that list a word. This is called an inline comment. You can use it to give a definition for a word on the same line as the word itself. VSCA will not sound-change the comment, and unless otherwise specified (see the command line switches, especially those dealing with the output template) it is sent to the output file too.
For example:
lector # school's boss
doctor # person who heals
opera # musical performance
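To make the comment handling concrete, here is a minimal Python sketch of how a lexicon line could be split into a word and a comment. It only mirrors the behaviour described above; it is not taken from VSCA (which is written in Perl), and the function name parse_lexicon_line is made up for this illustration.

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (word, comment).

    Hypothetical helper, not part of VSCA. Returns (None, comment) for a
    comment-only line and (None, None) for a blank line.
    """
    text = line.rstrip("\n")
    # Everything after the first '#' is treated as the comment.
    word, hash_sign, comment = text.partition("#")
    word = word.strip()
    comment = comment.strip() if hash_sign else None
    return (word or None, comment)


lexicon = [
    "lector # school's boss",
    "opera",
    "# Some Latin words",
]
for entry in lexicon:
    print(parse_lexicon_line(entry))
```

Only the word part would be handed to the sound changes; the comment part would be carried along to the output unchanged.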
Use the -l command line switch to specify your lexicon file.
Here lies the real power of VSCA, the ruleset file. In this file you define the soundchanges that you want to apply to your original words, as listed in the lexicon file. The idea is simple: one rule per line. Comments work the same in the lexicon file and in the ruleset file. An example ruleset file could look like this:
# Latin to Spanish
# Variable declarations
# Vowels
V=aeiou
# Consonants
C=ptcqbdgmnlrhs
# Front vowels
F=ie
# Back vowels
B=ou
# Voiceless stops
S=ptc
# Voiced stops
Z=bdg
# And the changes
s//_#
m//_#
e//Vr_#
v//V_V
u/o/_#
[gn]/[nh]/_
S/Z/V_V
c/i/F_t
c/u/B_t
p//V_t
[ii]/i/_
e//C_rV
# Original source: http://zompist.com/sounds.htm
You can see that comments appear throughout the entire ruleset file. This is useful for explaining what you are doing in a language you prefer. VSCA's language is not very difficult, but it's (hopefully) not your first language, and just reading things in English might be easier for you as rulesets get more and more complex.
Use the -r command line switch to specify your ruleset file.
The output file is the file where VSCA will write the words after they have gone through all sound changes. It is not necessary to specify an output file - if you omit it, VSCA will write output to the terminal. The output file will look like this if you used the above examples:
leitor
doutor
fogo
jogo
distrito
cidade
adotar
obra
segundo
# Some Latin words
# Original source: http://zompist.com/sounds.htm
Use the -o command line switch to specify your output file, and the -c switch if you wish to send output to both the file and the terminal.
First download and install VSCA (and, if not yet installed, Perl; make sure Perl is in your system PATH).
I recommend downloading the source code if you don't know how to uncompress a compressed file. Otherwise, pick either the .zip or .tgz file - they include this documentation so that you can have a local copy.
After all that is done, let's see VSCA in action. Open Notepad or Vim or Emacs or whatever text editor you prefer, create the example lexicon and ruleset files shown above, and save them as latin.txt and latin2spanish.txt in the directory where you put VSCA.
Now open a terminal window, a DOS box, or whatever your system calls these things, and point it to the place where you installed VSCA.
For example, on Windows you could type this:
d:
cd d:\download\vsca
On Linux/UNIX it probably looks like:
cd ~/vsca
On Mac it would be something along the lines of:
cd Desktop/vsca
Now type this:
perl vsca.pl -l latin.txt -r latin2spanish.txt -o spanish.txt
Now open spanish.txt in your favourite text editor and voila, you just applied your first set of sound changes to some Latin words, in order to get their Spanish cognates!
Now that you know how this generally works, it is time to move on to actual ruleset crafting!
Let's start out simple. A rule always consists of three parts, separated by slashes. An example would be a/e/_i. This rule means: change "a" into "e" if it occurs before "i". Another rule: a/e/#_. This changes an "a" into an "e" at the beginning of the word. The "_" says: "symbol goes here". A "#" is a word delimiter. It can go at the beginning or end of the word.
These three parts are called fields, and each field has its own, sensible name. The first field is the Original, because here you write characters as they appear in the original word. The second field has the name New, because here you write what the character will be in the new word if the rule applies. And finally the last field is the Position, because here you tell VSCA in what position (for example before an "i" or at the beginning of the word) an Original should occur for the rule to apply.
These three names will come back throughout the entire documentation, so make sure you know them. Original/New/Position.
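If it helps to see the three fields in action, the following Python sketch splits a rule into Original/New/Position and applies it to a word. It is a deliberately simplified illustration, not VSCA's implementation: it assumes the Original is a single literal character and that the Position contains exactly one "_" plus literal characters and "#". The names split_rule and apply_simple_rule are invented for this example.

```python
def split_rule(rule):
    """Split 'Original/New/Position' into its three fields."""
    original, new, position = rule.split("/")
    return original, new, position


def apply_simple_rule(rule, word):
    """Apply a literal, single-character rule to a word (sketch only).

    Assumes a non-empty Original and a Position with exactly one '_'.
    """
    original, new, position = split_rule(rule)
    before, after = position.split("_")
    padded = "#" + word + "#"  # mark both word boundaries with '#'
    out = []
    i = 1
    while i < len(padded) - 1:
        here = padded.startswith(original, i)
        left_ok = padded[:i].endswith(before)
        right_ok = padded.startswith(after, i + len(original))
        if here and left_ok and right_ok:
            out.append(new)  # an empty New deletes the symbol
            i += len(original)
        else:
            out.append(padded[i])
            i += 1
    return "".join(out)


print(apply_simple_rule("a/e/_i", "mai"))   # mei
print(apply_simple_rule("a/e/#_", "ama"))   # ema
print(apply_simple_rule("s//_#", "focus"))  # focu
```

The sketch checks the context against the original word, so one change never feeds another within the same rule. Whether VSCA behaves the same way in every corner case is not something this example claims.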
When you first heard or read about the VSCA, you probably learned that one of VSCA's strengths was the way it deals with polygraphs. You might wonder what it means, though, so let me explain.
First, you'll need to understand what monographs are, and what they mean to VSCA. Monographs and polygraphs are both called "symbols" in VSCA jargon. Create a lexicon file and a ruleset file like this:
Lexicon file:
ab
ad
ag
Ruleset file:
bdg/ptk/_#
Now run VSCA on these files and look at the output.
You can see that the simple rule in that ruleset changes the three voiced stops into their voiceless counterparts at the end of the word. The rule really reads "change every b, d, or g into a p, t, or k, at the end of a word".
Those six letters are monographs, and when VSCA tries to apply your rule to a word, it has a look at each symbol in the Original. If the Original appears in the right position (as defined in the Position field), VSCA replaces the symbol with the accompanying symbol in the New field.
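The symbol-by-symbol pairing can be pictured with a few lines of Python. This is only a sketch of the idea for the rule bdg/ptk/_#, assuming every symbol is a single character; the function name devoice_final is hypothetical and nothing here comes from VSCA's source.

```python
def devoice_final(word, original="bdg", new="ptk"):
    """Sketch of bdg/ptk/_# : replace a word-final Original symbol
    with the symbol at the same index in New."""
    mapping = dict(zip(original, new))  # b->p, d->t, g->k
    if word and word[-1] in mapping:
        return word[:-1] + mapping[word[-1]]
    return word


for word in ["ab", "ad", "ag"]:
    print(word, "->", devoice_final(word))
```

The important point is the pairing by index: the first symbol of the Original maps to the first symbol of the New, the second to the second, and so on.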
Now let's assume that your original language has a digraph, like "dh", and you want to change that into "th" at the end of a word. You'll understand that a rule like dh/th/_# will not do what you expect: this rule means "change every d or h into a t or h at the end of a word".
To tell VSCA that the "dh" and the "th" are polygraphs (digraphs, to be precise, but VSCA doesn't care what kind of n-graph you want to use), you surround them in square brackets:
[dh]/[th]/_#
You can mix polygraphs and monographs in a rule:
bdg[dh]/ptk[th]/_#
You can also replace a polygraph by a monograph. We already did this in our Latin-to-Spanish example. The reverse is also possible.
[ii]/i/_
i/[ii]/_
An important thing to note is that the number of symbols in the Original and New fields must be the same. There is one exception, and we have already seen it: you can leave the New field empty if you want sounds to disappear in the daughter language.
In this version you can't suddenly have symbols appear between two other symbols. The rule /h/a_a, which would insert an "h" between two "a"s, does not (yet) work. You could work around this with a rule like [aa]/[aha]/_.
In Position fields you don't need to tell VSCA it's dealing with polygraphs. For example, the rule a/e/_i# means "change an a into an e if it appears before a word-final i".
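A short Python sketch may make the symbol counting clearer. It splits a field into symbols (a bracketed group counts as one symbol) and pairs the Original symbols with the New ones. The helper names tokenize and pair_symbols are made up for this illustration, and the error message is not VSCA's.

```python
def tokenize(field):
    """Split a rule field into symbols; text inside [ ] is one polygraph."""
    symbols = []
    i = 0
    while i < len(field):
        if field[i] == "[":
            close = field.index("]", i)  # ValueError if the bracket is unclosed
            symbols.append(field[i + 1:close])
            i = close + 1
        else:
            symbols.append(field[i])
            i += 1
    return symbols


def pair_symbols(original, new):
    """Pair each Original symbol with its replacement in New."""
    old_symbols, new_symbols = tokenize(original), tokenize(new)
    if not new_symbols:  # empty New field: every symbol is deleted
        new_symbols = [""] * len(old_symbols)
    if len(old_symbols) != len(new_symbols):
        raise ValueError("Original and New need the same number of symbols")
    return dict(zip(old_symbols, new_symbols))


print(pair_symbols("dh", "th"))            # two monographs each
print(pair_symbols("[dh]", "[th]"))        # one digraph each
print(pair_symbols("bdg[dh]", "ptk[th]"))  # mixed
```

Under these assumptions dh/th pairs d with t and h with h, while [dh]/[th] pairs the digraph dh with the digraph th, which is the difference described above.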
No proper SCA could go without variables. A variable is simply one symbol that represents a whole number of symbols. For example, if your language has quite a number of fricatives, it would get tedious to have to write out all those fricatives in each rule that does something with them. It would be much more convenient if you could just use one symbol that stands for all fricatives at once.
Well, you're lucky, because variables do just that.
Let's discuss assignment syntax before we get into more detail. Much as I like the phrase "without further ado," we do have some more ado here. Sorry about that, I know you would rather go and type up all those cool soundchange rules right now.
An assignment consists of three parts: the variable name, the assign operator (=), and the symbols you wish to assign to that variable.
A variable name consists of at least one capital letter, optionally followed by more capital letters and digits.
For completeness' sake, the allowed characters in a variable name are A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9
The lines below assign a number of plosives to two variables:
P=ptk
B=bdg
Assignments can include variables, too! Imagine you want a variable to replace all stops, but you also want a variable for the voiced variants and one for the unvoiced variants.
P=ptk
B=bdg
S=PB
Variables can include polygraphs too.
F=sf[sh][th]
So far, variables were singlechar variables, that is, their name consisted of only one character.
That doesn't need to be the case per se. If you want to use more descriptive names for voiced stops, voiceless stops, and just stops, you can do that. More descriptive usually means more characters, hence their name multichar variables.
# VLS stands for voiceless. By the way, this is a comment line.
VLS=ptk
# VS stands for voiced stop.
VS=bdg
# S stands for Santa. Err, I mean, stop.
S=<VLS><VS>
You see those angle brackets, right? < and >? Well, those are for variable names what [ and ] are for polygraphs. I think an example here explains more than a thousand words:
# V stands for vowel
V=aeiou
# L stands for liquid
L=ljw
# PVL stands for Plosive VoiceLess
PVL=ptk
# PV stands for Plosive Voiced
PV=bdg
# Now we want a variable for all plosives. Let's see.
# P stands for plosive
P=PVLPV
# Ouch. Wrong.
# This assigns the values of P, V, L, P (again) and V (again)
# to P.
P=<PVL><PV>
# Ah, right: assigns the voiceless and voiced plosives to P
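The difference between PVLPV and <PVL><PV> can be shown with a small Python sketch of variable interpolation. It assumes that a bare capital letter is always read as a single-character variable name and that angle brackets delimit a multichar name, as described above. It treats values as plain strings and does not handle variables inside polygraphs, and the function names are invented here, not taken from VSCA.

```python
import re

# A reference is either <NAME> or one bare capital letter.
REFERENCE = re.compile(r"<([A-Z][A-Z0-9]*)>|([A-Z])")


def interpolate(text, variables):
    """Replace variable references in text with their stored values."""
    def lookup(match):
        name = match.group(1) or match.group(2)
        return variables[name]  # KeyError for an undefined name
    return REFERENCE.sub(lookup, text)


def read_assignments(lines):
    """Collect NAME=value lines, skipping comments (sketch only)."""
    variables = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        variables[name] = interpolate(value, variables)
    return variables


variables = read_assignments([
    "# V stands for vowel",
    "V=aeiou",
    "L=ljw",
    "PVL=ptk",
    "PV=bdg",
    "P=<PVL><PV>",
])
print(variables["P"])                 # ptkbdg
print(interpolate("PVL", variables))  # values of P, V and L in a row
```

With the brackets, <PVL> is looked up as one name. Without them, PVL is read as three separate variables, which is exactly the mistake the comments in the ruleset above warn about.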
Phew. Now that we've had all that, it's time to include variables in our rules! Remember the sound change we pulled off earlier where we replaced the voiced stops with their voiceless variants?
Without variables, this looked like
bdg/ptk/_#
Now we assign those stops to variables.
# Good. First the voiced stops
B=bdg
# And the voiceless ones. P stands for B's voiceless cousin
P=ptk
# Then now comes the rule:
B/P/_#
That wasn't too hard to grasp, was it? Now let's add some fricatives to the thing. To be precise, the voiced fricatives v, z, dh, and zh.
# Let's do stops first.
B=bdg
P=ptk
# And fricatives. Same naming convention.
V=vz[dh][zh]
F=fs[th][sh]
# And the rule again:
VB/FP/_#
It's time to tell a bit more about multichar variables again, namely how to include them in a rule. Well, you just do it the same way you put them in assignments. So if there is a variable PVL for voiceless plosives, a variable PV for voiced plosives, and a variable V for vowels, and you want to voice plosives intervocalically, here is your rule:
<PVL>/<PV>/V_V
And near the end of this chapter we also learn something else about variables: they can appear in positions too.
There is another place where variables can occur, and I already mentioned that in the previous chapter: inside polygraph symbols.
They take an interesting role there, but I think an example still explains best. So. In this imaginary language we have some aspirated voiceless stops, namely ph, th, and kh. However, over time they change into the bilabial, coronal, and velar voiceless fricatives. Meanwhile there are also the unaspirated voiceless stops p, t, and k, and they get voiced intervocalically. Here's your ruleset:
PVL=ptk
PV=bdg
FVL=fsx
V=aeiou
# I proudly present.... *drumrolls*... the rules TADAAAA
[<PVL>h]/<FVL>/_
<PVL>/<PV>/V_V
I will ask something from your imagination again. This time imagine a language that has a bunch of geminate stops (pp, kk, tt, and so on) but they change into normal, single stops.
Your first impulse might be to write a rule like this:
P=ptk
P//_P
Well, that looks interesting, but note that this will also remove the first stop in a cluster like "kt". That's not what we need.
Fine, how about this?
P=ptk
PP/P/_
Sorry. Numbers of symbols in New and Original fields are not equal.
Ok, ok, you got that, sorry for underestimating you, that was bad. Of course you knew to write it as this:
P=ptk
[PP]/P/_
But again I will have to disappoint you. This creates a polygraph that will just match against two subsequent stops, so the first stop in clusters like "pt" and "kp" will be removed here too.
Running out of clues yet? Did you get to the point already where you say, "fine, then I'll write a separate rule for each stop"?
[pp]/p/_
[tt]/t/_
[kk]/k/_
Congratulations, you found something that works! But what if there are, say, 36 different consonants (including a number of allophones, that is) and you want to remove any geminate cluster? You really wouldn't want to write 36 rules that are all essentially the same, would you? Well... you don't have to.
I have teased you enough. Here's how you'd solve the geminate stops:
P=ptk
[P+]/P/_
So easy it ain't fun anymore, right? So let me quickly explain what that "+" sign does.
Pluses are parsed when variables are interpolated. Variable interpolation is the process of replacing variable names with their values. While VSCA is doing this, it sees if the variable name is followed by one or more pluses. If it is, then VSCA knows to put in each symbol of the variable's value more than once (namely, 1+n, where n is the number of pluses).
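Here is a rough Python sketch of that expansion for a single polygraph body such as P+ or <PVL>h. It assumes the body contains at most one variable reference and that variable values are given as lists of symbols. The name expand_polygraph is hypothetical and the code is an illustration of the 1+n idea, not VSCA's parser.

```python
import re

# <NAME> or a bare capital letter, followed by zero or more pluses.
VARIABLE = re.compile(r"(?:<([A-Z][A-Z0-9]*)>|([A-Z]))(\+*)")


def expand_polygraph(body, variables):
    """Expand a polygraph body into one polygraph per variable symbol.

    Each symbol is repeated 1 + n times, where n is the number of pluses.
    """
    match = VARIABLE.search(body)
    if match is None:
        return [body]  # plain polygraph such as 'dh'
    name = match.group(1) or match.group(2)
    repeat = 1 + len(match.group(3))
    prefix, suffix = body[:match.start()], body[match.end():]
    return [prefix + symbol * repeat + suffix for symbol in variables[name]]


stops = {"P": ["p", "t", "k"], "PVL": ["p", "t", "k"]}
print(expand_polygraph("P+", stops))      # ['pp', 'tt', 'kk']
print(expand_polygraph("<PVL>h", stops))  # ['ph', 'th', 'kh']
```

Seen this way, [P+]/P/_ behaves like the three separate rules [pp]/p/_, [tt]/t/_ and [kk]/k/_, because each expanded polygraph lines up with the matching symbol of P in the New field.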
Pluses work in Position fields too.
P=ptk
# Short vowels
SV=aeiou
# Long vowels -- yes, this way of writing that down actually works
LV=[<SV>:]
<LV>/<SV>/_P+
# Change a long vowel into a short vowel before a geminate stop
There are two restrictions here. First, pluses only work on variables, not on raw symbols. Second, because of the nature of the plus (it is always part of a polygraph), it only works within Position (and therefore Exception) fields, and within [polygraph blocks].
Ahhh... here we come to another true power of VSCA. Exceptions.
At the time you reach this chapter, you should understand how Position fields work, because I kinda covered that at the top of the document. However, just to freshen things up a bit (because I know it has been a long read), I'll demonstrate again.
Remove any vowel between a b and an r:
V//b_r
Remove all vowels in a word if the word ends in a d (don't ask how realistic it is, it's just an example):
V//d#
Replace any r with an l if the word has another l somewhere:
r/l/l
Replace any l with an r if the word starts with an r:
l/r/#r
Replace any l with an r if it appears right before an r:
l/r/_r
Replace any long o with a short one before nk:
[o:]/o/_nk
Remove any h before a consonant cluster:
h//_CC
Remove any h before a consonant cluster at the end of a word:
h//_CC#
Ok, that should be enough. This is starting to look like a chapter on Positions, though I promised to talk about Exceptions now. The reason that I threw so many examples there, however, is that Exceptions just work like Positions, except they are their negative counterparts.
For example, if you want to replace a long o with a short one before a consonant cluster, but not before nk, your rule will look like this:
[o:]/o/_CC UNLESS _nk
You can combine multiple exceptions in one rule. Same example, except that you don't want to change the long o before a geminate cluster either:
[o:]/o/_CC UNLESS _nk OR _C+
And even this crazy thing works:
a/e/_# UNLESS i_ OR s_ OR mb_ OR #t
Let me break that down, just to be clear. This rule says: "replace a word-final a with an e, unless it is preceded by an i or an s or by the cluster mb or unless the word starts with a t".
All that, in just one simple rule.
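To show how such a line breaks down, here is a small Python sketch that separates the rule from its exceptions. It assumes the keywords are written exactly as " UNLESS " and " OR " with surrounding spaces, and the name parse_rule is invented for this example. It only parses the line; it does not apply the rule, and it is not how VSCA itself is implemented.

```python
def parse_rule(line):
    """Split a rule line into its three fields and a list of exceptions."""
    rule, _, exception_part = line.partition(" UNLESS ")
    original, new, position = rule.strip().split("/")
    if exception_part:
        exceptions = [item.strip() for item in exception_part.split(" OR ")]
    else:
        exceptions = []
    return {
        "original": original,
        "new": new,
        "position": position,
        "exceptions": exceptions,
    }


parsed = parse_rule("a/e/_# UNLESS i_ OR s_ OR mb_ OR #t")
print(parsed["position"])    # _#
print(parsed["exceptions"])  # ['i_', 's_', 'mb_', '#t']
```

A rule would then apply at a given spot when the Position matches there and none of the listed exceptions does.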
Important: in the current version, the words UNLESS and OR must be written in capital letters. As a Perl programmer myself, I accidentally wrote "OR" as "or" a couple of times, but that just doesn't work. Whenever I get myself to tidying up the monster a bit, this will probably get fixed, though.
As of version 0.4, you can specify optional symbols in Position fields (and therefore in Exception fields, and technically also in polygraphs in variable values, Original fields, and New fields, although I don't know what the purpose of that would be since you have alternatives anyway).
For example, the following ruleset nasalises vowels before an "n" or an "mn":
V=aeiou
NV=[<V>~]
V/<NV>/_(m)n
Note that pluses inside parens don't work. I see no reason why you would need them, but for completeness' sake I just want to cover this fact.
However, pluses inside polygraphs inside parens work fine, so this ruleset does what you'd expect it to do.
V=aeiou
NV=[<V>~]
N=mn
V/<NV>/_([N+])n
That immediately answers another question: you can put variables inside parens, yes.
V=aeiou
NV=[<V>~]
N=mn
V/<NV>/_(N)n
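One way to think about optional symbols is that a Position with parens stands for several plain Positions at once. The Python sketch below lists them for a Position with literal, non-nested paren groups. It does not expand variables or polygraphs inside the parens, the name expand_optional is made up, and VSCA may well handle this differently internally.

```python
import re
from itertools import product


def expand_optional(position):
    """List every Position obtained by dropping or keeping each (group)."""
    # re.split with a capture group alternates: literal, group body, literal, ...
    parts = re.split(r"\(([^()]*)\)", position)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            choices.append(["", part])  # optional: absent or present
        else:
            choices.append([part])      # literal text: always present
    return ["".join(combo) for combo in product(*choices)]


print(expand_optional("_(m)n"))  # ['_n', '_mn']
```

So the rule V/<NV>/_(m)n can be read as "apply before n, or before mn".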
Also note that this feature is pretty experimental. I've tested it a bit but I can't guarantee that it works entirely properly. Anyway, feel free to play around with it and tell me if something's wrong. I did not receive any complaints about the implementation in 0.4, and 0.5 still uses the same implementation, but I still expect to see misbehaviour in this part of VSCA in the near future. I can't really understand why it would work perfectly :)
VSCA - Anything. Anywhere. Anytime