Category Archives: Humanities computing

“Rome Wasn’t Digitized in a Day”: Building a Cyberinfrastructure for Digital Classicists

September 10th, 2011 by Simon Mahony

A web-only publication by Alison Babeu with good coverage of the Stoa and the Digital Classicist. Published under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

 

The author provides a summative and recent overview of the use of digital technologies in classical studies, focusing on classical Greece, Rome, and the ancient Middle and Near East, and generally on the period up to about 600 AD. The report explores what projects exist and how they are used, examines the infrastructure that currently exists to support digital classics as a discipline, and investigates larger humanities cyberinfrastructure projects and existing tools or services that might be repurposed for the digital classics.

(Council on Library and Information Resources)

 

via The Stoa Consortium » Blog Archive » “Rome Wasn’t Digitized in a Day”: Building a Cyberinfrastructure for Digital Classicists.

I am quite happy that my One Era’s Nonsense, Another’s Norm article is quoted in this report. If nothing else, interdisciplinary work is getting noticed.

 

The TLG and copyright

TLG Pop-up

Originally uploaded by notiX.

This window comes up when you search for a word or browse a text on the Thesaurus Linguae Graecae website. I haven’t visited the site for a while, so I can’t really tell when this was introduced. I am just wondering whether it is technically possible to prevent copying of the text while browsing the TLG. Colleagues have reported that the TLG suspended their access to the site due to suspicious browsing behavior, but I have no clear picture of how these mechanisms work or whether they really exist in the first place.

My issue is this: Maybe I am wrong, but the introduction of this pop-up window shows that users do try to copy the texts digitised by the TLG for their own use and that the TLG-Project is trying to secure its rights to the electronic texts it makes available. It would be interesting to know why users try to do this. If it is because they want to use other digital tools that the TLG doesn’t offer, why not let them do so? If it’s a cost-related issue, why not introduce a download fee or something similar? Or do a user survey and try to build the tools users really want. Why not allow users to pass the text to other concordancers available on the net, like the Voyeur tools? There must be a way to combine the sustainability of the TLG-project with the actual needs of the user community… What do you think?

Unicode Code Converter

http://rishida.net/tools/conversion/

You can paste in text in Unicode, the tool will convert to several different codes that you might need for development.

This is the above text in decimal code points:
89 111 117 32 99 97 110 32 112 97 115 116 101 32 105 110 32 116 101 120 116 32 105 110 32 85 110 105 99 111 100 101 44 32 116 104 101 32 116 111 111 108 32 119 105 108 108 32 99 111 110 118 101 114 116 32 116 111 32 115 101 118 101 114 97 108 32 100 105 102 102 101 114 101 110 116 32 99 111 100 101 115 32 116 104 97 116 32 121 111 117 32 109 105 103 104 116 32 110 101 101 100 32 102 111 114 32 100 101 118 101 108 111 112 109 101 110 116 46
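If you would rather do this locally, the same conversion takes a few lines of Python. This is only a minimal sketch and says nothing about how the online tool is implemented; the function names are made up for illustration.

```python
def to_decimal_code_points(text: str) -> str:
    """Return the decimal Unicode code points of text, separated by spaces."""
    return " ".join(str(ord(ch)) for ch in text)


def from_decimal_code_points(codes: str) -> str:
    """Turn a space-separated list of decimal code points back into text."""
    return "".join(chr(int(n)) for n in codes.split())


print(to_decimal_code_points("You can"))  # 89 111 117 32 99 97 110
print(from_decimal_code_points("89 111 117"))  # You
```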

The tool can be extremely useful if you receive Unicode text and want to know exactly how it is encoded. If you are dealing with extended Greek, it can help you find accented letters that are encoded in two different blocks: Greek and Coptic, and Greek Extended. For instance:

μ U+03BC: GREEK SMALL LETTER MU (Greek and Coptic)
α U+03B1: GREEK SMALL LETTER ALPHA (Greek and Coptic)
λ U+03BB: GREEK SMALL LETTER LAMDA (Greek and Coptic)
α U+03B1: GREEK SMALL LETTER ALPHA (Greek and Coptic)
κ U+03BA: GREEK SMALL LETTER KAPPA (Greek and Coptic)
ί U+03AF: GREEK SMALL LETTER IOTA WITH TONOS (Greek and Coptic)
ε U+03B5: GREEK SMALL LETTER EPSILON (Greek and Coptic)
ς U+03C2: GREEK SMALL LETTER FINAL SIGMA (Greek and Coptic)

versus

μ U+03BC: GREEK SMALL LETTER MU (Greek and Coptic)
U+0020: SPACE (Basic Latin)
α U+03B1: GREEK SMALL LETTER ALPHA (Greek and Coptic)
U+0020: SPACE (Basic Latin)
λ U+03BB: GREEK SMALL LETTER LAMDA (Greek and Coptic)
U+0020: SPACE (Basic Latin)
α U+03B1: GREEK SMALL LETTER ALPHA (Greek and Coptic)
U+0020: SPACE (Basic Latin)
κ U+03BA: GREEK SMALL LETTER KAPPA (Greek and Coptic)
U+0020: SPACE (Basic Latin)
ί U+1F77: GREEK SMALL LETTER IOTA WITH OXIA (Greek Extended)
U+0020: SPACE (Basic Latin)
ε U+03B5: GREEK SMALL LETTER EPSILON (Greek and Coptic)
U+0020: SPACE (Basic Latin)
ς U+03C2: GREEK SMALL LETTER FINAL SIGMA (Greek and Coptic)

The difference is in the GREEK SMALL LETTER IOTA, which has two accented variants in Unicode: WITH TONOS (which is what most operating systems enter from the keyboard when a Greek layout is selected) and WITH OXIA, which is found in some electronic texts on the Web. Most fonts display both with the same glyph, so confusion is the order of the day for those who are not aware of the problem. You can read more in Nick Nicholas’s pages about Unicode. Limiting yourself to the fonts Nick suggests is rather impractical, and in practice the WITH TONOS characters are the ones generally in use.
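The two code points are canonically equivalent in Unicode, so NFC normalization folds WITH OXIA into WITH TONOS. A minimal Python sketch of how one might detect and unify them with the standard `unicodedata` module follows; the helper names `describe` and `find_oxia` are invented for this example.

```python
import unicodedata

TONOS = "\u03af"  # GREEK SMALL LETTER IOTA WITH TONOS (Greek and Coptic)
OXIA = "\u1f77"   # GREEK SMALL LETTER IOTA WITH OXIA (Greek Extended)


def describe(text):
    """List each character with its code point and Unicode name."""
    return [f"{ch} U+{ord(ch):04X}: {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def find_oxia(text):
    """Return (position, character) pairs for characters named '... WITH OXIA'."""
    return [(i, ch) for i, ch in enumerate(text)
            if unicodedata.name(ch, "").endswith("WITH OXIA")]


print(describe(TONOS + OXIA))
print(TONOS == OXIA)                                 # False
print(unicodedata.normalize("NFC", OXIA) == TONOS)   # True
print(find_oxia("κα" + OXIA))
```

Normalizing both the corpus and the search string to NFC before comparing is usually enough to make the two variants match.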

Herbert Weir Smyth: A Greek Grammar for Colleges

A Greek Grammar for Colleges

I found this site very useful (I have to check Smyth’s grammar for reference purposes and this site is better than Smyth@Perseus). It’s the same XML-text, I suppose, but presented differently.

Update 16/09/2011: The link is now password-protected. A PDF-Version of the actual book can be found here.

Tools for converting Beta code to Unicode

Beta Code description:

http://www.tlg.uci.edu/BetaCode.html

 


 

Online tools:

1. Sean Redmond’s Greek Font to Unicode converter: http://www.jiffycomp.com/smr/unicode/

CGI-based conversion tool; supports copy and paste.

2. Cental (Centre du traitement automatique du langage) Beta Code to Unicode Converter: http://130.104.253.20/beta2uni/

Lets you upload and convert whole files from the TLG CD ROM to Unicode.

3. Michael Neuhold’s greekconverter: http://members.aon.at/neuhold/antike/grkconv.html (inactive?)

Java applets and downloadable Java classes for converting between Beta Code and other encodings.

Applications, Java classes, etc.

1. Epidoc collaborative: Transcoder: http://sourceforge.net/projects/epidoc

Java-based converter for plain text files.

2. Lucius Hartmann’s BetaCodeConverter and GreekKeysConverter (Mac OS): http://www.lucius-hartmann.ch/programme/

Mac OS application; converts RTF and TXT files to and from many encodings.

3. Antioch classical languages utility by Ralph Hancock: http://www.users.dircon.co.uk/~hancock/antioch.htm

VBA-based conversion utility.

4. Burkhard Meißner’s View and Find: http://www2.hsu-hh.de/hisalt/projects/viewfind.htm

View & Find is a MS-DOS program to interact with, decode, extract, search and automatically index the beta code files on the Thesaurus Linguae Graecae E and Packard Humanities Institute #5.3 and #7 CD ROMs. (thanks to B. Meißner for the info)

5. betautf8 – a fast, flexible beta code to unicode (utf8) file converter: http://www2.hsu-hh.de/hisalt/projects/betautf8.htm (thanks to B. Meißner for the info)
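To give a sense of what such converters do, here is a minimal Python sketch. It is not the algorithm of any of the tools listed above. It assumes only the basic Beta Code conventions (Latin letters for Greek letters, `*` before a capital, `)` `(` `/` `\` `=` `+` `|` for the diacritics), ignores all other Beta Code escape codes, and the name `beta_to_unicode` is made up for illustration.

```python
import re
import unicodedata

# Beta Code letter -> lowercase Greek letter
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic -> Unicode combining mark
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta: str) -> str:
    """Convert a basic Beta Code string to precomposed (NFC) Unicode Greek."""
    out = []
    upper = False   # True after '*' until the capital letter arrives
    pending = ""    # diacritics written between '*' and the capital letter
    for ch in beta.upper():
        if ch == "*":
            upper = True
        elif ch in MARKS:
            if upper:
                pending += MARKS[ch]
            else:
                out.append(MARKS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            if upper:
                letter = letter.upper() + pending
                upper, pending = False, ""
            out.append(letter)
        else:
            out.append(ch)  # spaces, punctuation and unsupported codes pass through
    text = unicodedata.normalize("NFC", "".join(out))
    # A sigma that is not followed by another letter becomes a final sigma.
    return re.sub(r"σ(?!\w)", "ς", text)


print(beta_to_unicode("MH=NIN A)/EIDE QEA/"))
print(beta_to_unicode("*)AXILLEU/S"))
```

Because the result is NFC-normalized, an acute on a plain vowel comes out as the WITH TONOS code point from the Greek and Coptic block, not as WITH OXIA.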

TLG and PHI search engines supporting Unicode

 

1. Diogenes: http://www.dur.ac.uk/p.j.heslin/Software/Diogenes/index.php

Perl-based, cross-platform search engine for the PHI and TLG CD-ROMs.

2. Workplace Pack from SilverMountain Software: http://www.silvermountainsoftware.com/workpack.html

Unicode aware search engine program for the TLG CD ROM.

Concordancers and alternatives for Mac OS X

In my present job I am heavily involved with language description: reading through loads of texts, identifying interesting linguistic features, and storing them in a custom-built database. That works well for phenomena that you cannot easily identify by other means, but sometimes you just have to use the computer and scan a large amount of text for an ending or some other easily identifiable pattern. That’s where you need a concordance program, and that’s where I have a problem with Mac OS X.

There simply isn’t a decent concordancing program that runs natively on Mac OS X; if there is one and I’ve missed it, please let me know! Yes, there are some decent concordancing programs for Windows and yes, I could use them with Parallels or dual booting. It’s just that a) I am not prepared to purchase a Windows license just to run a concordance program and b) it wouldn’t integrate well with my current workflow. I still haven’t tried every concordancing program under WINE derivatives or CrossOver (a first try with AntConc for Windows didn’t really work).

In the days of Mac OS 9, life was much easier. Conc (from SIL) was brilliant (even my wife enjoyed using it); it still runs on some Macs that support Classic, but it doesn’t support Unicode, so it is not an option (and my main Mac in the office runs Mac OS X 10.5).

Ideally the perfect concordancing program would

  • fully support Unicode
  • operate on multiple files
  • be aware of a referencing scheme (so that you know where in your texts the string you have identified occurs)

Conc could do the last two. Why doesn’t someone at SIL rewrite it for Intel?

So what now? What are the options for someone like me who desperately needs to search a large corpus (about 2 million words) of Medieval Greek texts? (I do this kind of job to cover the necessities of life; I enjoy it sometimes, but not always…) Here is a list of the programs I am currently using:
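The three requirements in the list can be sketched in a few lines of Python: a regular-expression search over several UTF-8 files that reports each hit with its file name and line number as a very simple referencing scheme. This is only an illustration under those assumptions, not a replacement for a real concordancer; `concordance` and its parameters are hypothetical names.

```python
import re
import tempfile
import unicodedata
from pathlib import Path


def concordance(paths, pattern, width=30):
    """Yield (file name, line number, left context, match, right context).

    Lines are NFC-normalized first, so the pattern should be in NFC as well.
    """
    regex = re.compile(pattern)
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for line_no, line in enumerate(text.splitlines(), start=1):
            line = unicodedata.normalize("NFC", line)
            for m in regex.finditer(line):
                left = line[max(0, m.start() - width):m.start()]
                right = line[m.end():m.end() + width]
                yield Path(path).name, line_no, left, m.group(), right


# Small demonstration on a throwaway file: find words ending in -ουσιν.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("λέγουσιν οἱ ἄνθρωποι\nκαὶ γράφουσιν\n", encoding="utf-8")
    for name, line_no, left, hit, right in concordance([sample], r"\w+ουσιν\b"):
        print(f"{name}:{line_no}\t{left}[{hit}]{right}")
```

Writing the tuples out with the `csv` module instead of printing them would give structured output that can be loaded into a database or spreadsheet.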

Laurence Anthony’s AntConc

AntConc is written in Perl and runs on a Mac under Apple’s X11. Installation is straightforward; performance is not brilliant but marginally acceptable. AntConc is brilliant as a concordancer (click here for a review) and covers all my needs (regular expressions, word lists, normal and reverse sorting).

There is one serious flaw though, which relates to X11: improper support for Mac OS X keyboard layouts. I can’t input accented Greek characters in AntConc’s search box; there are workarounds (like using the list of words to search in advanced mode) but nothing very straightforward for a not-very-organized person in a hurry like myself (who relies heavily on regular expressions because of the diversity of spelling conventions in his corpus). If I can’t integrate a tool into my workflow, I tend not to use it… A further weak point of AntConc is the export of found datasets: you get flat text files with no structure at all.

Having said that, AntConc is a great concordancing tool and it’s free. It’s great on the PC, but on the Mac it is quite time-consuming to use.

Jedit: a programmer’s text editor

Well, this is not a concordancer but a text editor written in Java. It’s free and highly configurable and of course supports Unicode and regular expressions. What makes it interesting for my needs is what it calls a “Hyper Search”: you perform a search and you get the results presented in a separate window, one line of text at a time. You click on a result and you are taken to the actual passage in the file. Perfect for my needs, almost like a concordance program. It also works with multiple files, and I can type my Greek with as many accents as I like in the search box. It operates on text files, so HTML and XML files are covered. Jedit was my favorite for a long time, until I found something that does exactly the same but with added features. Interestingly enough, it has an almost identical name: Jedit X.

Jedit X from Artman 21

Jedit X has the same functionality as Jedit but with two major improvements: it’s a native Cocoa application, optimized for Leopard, and it supports a huge range of formats, including RTF and RTFD (RTF with pictures), MS Word, Open Office and others. This means that you can use Jedit X to perform regular-expression searches in a bunch of Word documents: your search results are displayed in a separate window, you click on a result and the file opens in Jedit X with the search result highlighted. Not bad at all for a shareware program that will cost you $29! It integrates perfectly with Leopard, can display HTML files as rich text documents and performs multi-file searches with a very well designed interface. In short, the perfect tool for a not-so-techie person who works with Word files and wants to search for information in them. Perfect for my needs (I am a techie person but love the easy solution).

Conclusion:
Only AntConc is a full-featured concordance program; Jedit and Jedit X are just alternatives. If your needs are covered by an application that performs regular-expression searches across multiple files and your source files are something other than flat text files (or, like me, you don’t always bother converting everything to XML, or are not prepared to spend half your life explaining to your colleagues why they should convert everything to XML), Jedit X is the perfect solution. Until someone rewrites Conc for Intel-based Macs, that is.

Update:

Casual conc by Yasu Imao

Casual conc is a native, Unicode-compliant concordancer for Mac OS X 10.5, built with Ruby and RubyCocoa.

My first impressions are entirely positive: it handles Greek very well, sorting is fine, and all major functions are available. It would be even better if it could handle XML files directly, but I am sure that the application will only improve in the future.
