Friday, October 10, 2008

Oh, Canada!

I'm interested in Canada and Canadians. It's a little difficult to explain exactly why. I can't remember when I first got interested in them, or why.

I can tell you that I like their dry sense of self-irony. I like their nature. I like their literature. I like their music. But Canadians themselves find it difficult to define themselves. Usually it's by defining what they are not - at least not Americans. So maybe it isn't so surprising that I can't put my finger on why they fascinate me.

I can truly appreciate the fact that their government and everything about their country is so well documented. They have all these wonderful sources online, available free for anyone to use! There are archival materials on pretty much anything you might be interested in.

The annoying thing is, they're usually in pdf files, as you might expect. A colleague of mine (I don't want to describe people I know in too much detail so I'll use such a grand term) complained in his dissertation that there is a worrying tendency to upload texts online as pdf files rather than transcribing them.

It's something only a corpus linguist would complain about, but I completely sympathize now. For anyone but a corpus linguist, pdfs are usually good enough. But they're simply not viable for inclusion in language corpora!

I got hold of this splendid corpus of present-day Canadian English. But since I'm going to have a diachronic dimension in my dissertation, it would make sense to have historical Canadian English as well. Sadly, there is only one such historical corpus in existence so far, and even that isn't available to anyone but its creators.

Since the aforementioned colleague collected a corpus of his own for his dissertation project, and since I helped categorize and update it, I'm not too shy about the idea of compiling a corpus of my own.

But finding old Canadian English online in a reliable format is such a hopeless task. If you find something that has been transcribed, there's always the concern about whether it has been modernized or not. If a text is only available as a pdf file, it's pretty much of no use, unless I transcribe it myself.

I'm not really afraid of doing a lot of work for my project. I wouldn't mind transcribing texts in principle. The problem is that I know it would probably postpone the gathering of my actual research data too far into the future.

If I were to do so much work, I would have to take into account so many issues related to corpus compilation. There are numerous different views on how one should compile a corpus.

Some are willing to overlook any bias in the selection of the texts, in their length and text type, register, time of publication, anything that might affect the language of a text in relation to any other texts.

Others believe that a diachronic corpus in particular should be carefully constructed so that the researcher doesn't have to worry about distortions in their data sets. Personally I think that such corpora may lull the researcher too much into believing that whatever the corpus throws up, it must be the final truth.

On the other hand, it isn't entirely straightforward to take into account everything by yourself, especially if you want to create quantitative illustrations of your data. It's particularly annoying when you're using many different corpora that were compiled according to completely different parameters, yet you'd like to compare them.
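
For what it's worth, the one part of that comparison problem that is easy to handle is sheer corpus size. Here is a minimal sketch of the usual trick, converting raw hits into a rate per million words so that a big corpus and a small one sit on the same scale. The corpus names and all the figures are invented purely for illustration, and this does nothing about differences in text type, register or sampling.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Return a raw hit count as a rate per one million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


# Hypothetical figures, made up for this example only.
corpora = {
    "present_day": {"hits": 450, "words": 1_500_000},
    "historical": {"hits": 60, "words": 125_000},
}

# Normalize every corpus to the same per-million-words scale.
normalized = {
    name: per_million(c["hits"], c["words"]) for name, c in corpora.items()
}

for name, rate in normalized.items():
    print(f"{name}: {rate:.1f} per million words")
```

With these made-up numbers the smaller corpus has far fewer raw hits but the higher normalized rate, which is exactly the kind of thing raw counts would hide.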

So I suppose there isn't much I can do except try my luck with getting my hands on that already existing historical corpus. Always worth a try.
