Project Methodology of the Data Input of Tibetan Medical Texts at ITTM

10 steps to produce a final electronic work

  1. The work is typed twice (two different operators) in ACIP transliteration.
  2. The text files are compared electronically and mistakes are corrected.
  3. The text is cleaned up electronically (merging all files, cleaning, normalizing lines, etc.). We developed scripts which replace any common errors found in the text with the correct transliteration and/or add a comment in some cases to ease the next steps of correction, particularly the proofreading.
  4. A tool (developed in Java) checks the page numbers, the number of lines of each page, and characters not allowed in the ACIP transliteration scheme. It generates a report with the page, line number, and a short description of the problem.
  5. Jskad (developed in Java by THDL) converts ACIP transliteration to a Unicode text file. The software reports errors and warnings in the generated file; warnings can be set to one of four levels (none, some, more, all). All errors are corrected in the ACIP text file. A higher warning level could help detect more errors before the proofreading stage, but working through the warnings requires a lot of work. Note: The complete proofreading of the gso rig sman gyi khog 'bugs showed that the quality of the ACIP⇒Unicode conversion is very good.
  6. A PDF version in the Tibetan script is produced from the Unicode text file with NisusWriter Express running on Mac OS X, using the TibetanNew font by XenoType. (This Unicode text file can also be opened in MS Word; see below.)
  7. A hard copy version of the text is printed, then proofread by a qualified Tibetan doctor. All errors and printing issues are checked. (The proofreader writes the corrections in Tibetan U-can and/or ACIP transliteration to reduce the time spent in the next step.)
  8. Mistakes detected during proofreading are corrected in the ACIP electronic text version, and checked against the original text in case of doubt.
  9. The ACIP version is converted into Wylie with Jskad. This step requires two internal conversions: ACIP to TibetanMachineWeb (TMW), then TMW to Wylie. The resulting text file contains the Wylie version, with an error for every stack missing from the TMW fonts, so the remaining transliteration has to be done manually and the generated file corrected. During this step, there were about 30 such issues in the gso rig sman gyi khog 'bugs. We added the Wylie transliteration in the form of comments in the ACIP version. Another verification has shown that the ACIP⇒TMW converter works well.
  10. The final version (PDF and text files) is prepared, with the table of contents.
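
Steps 2 and 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ITTM tooling (the actual checker is written in Java). The function names, the page marker pattern (a line such as @001A) and the allowed character set are hypothetical and would have to be adapted to the real ACIP files.

```python
import difflib
import re
import string

# Assumption: an illustrative character set, NOT the official ACIP scheme.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + " ',;*/@[]-")
# Assumption: a page starts with a marker line such as "@001A" or "@001B".
PAGE_MARKER = re.compile(r"^@(\d+[AB])\s*$")


def compare_keyings(first, second):
    """Step 2: list the places where the two typed versions differ.

    Returns tuples (line number in first, lines in first, lines in second).
    """
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return diffs


def check_text(text, lines_per_page):
    """Step 4: report disallowed characters and unexpected line counts.

    Returns tuples (page, line number or None, short description).
    """
    report = []
    page = None
    count = 0
    for raw in text.splitlines():
        marker = PAGE_MARKER.match(raw)
        if marker:
            # A new page begins: verify the line count of the previous one.
            if page is not None and count != lines_per_page:
                report.append(
                    (page, None, f"expected {lines_per_page} lines, found {count}")
                )
            page, count = marker.group(1), 0
            continue
        if not raw.strip():
            continue  # blank lines are not counted
        count += 1
        bad = sorted(set(raw) - ALLOWED_CHARS)
        if bad:
            report.append((page, count, "disallowed characters: " + " ".join(bad)))
    # Verify the last page as well.
    if page is not None and count != lines_per_page:
        report.append((page, None, f"expected {lines_per_page} lines, found {count}"))
    return report
```

Each difference returned by compare_keyings would be resolved by an operator against the original text, and each entry returned by check_text corresponds to one line of the report described in step 4 (page, line number, short description).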

Problems to solve

  1. We have to be aware that the ACIP and Unicode versions are the versions closest to the original. The Wylie version may contain mistakes that do not appear in the ACIP or Unicode/PDF version. (We have not yet checked the Wylie results in depth.)
  2. We used Jskad to convert the ACIP text file into TMW and then into Wylie. While correcting gso rig sman gyi khog 'bugs, the first step of the conversion (ACIP to TMW) generated 28 errors, mostly unsupported Sanskrit glyphs. We then ran the second conversion (TMW to Extended Wylie) and corrected those 28 mistakes manually. This is the version we currently have. When a new version of the converter becomes available, it might be easier to convert to EWTS.
  3. The conversion of ACIP to Sambhota (Esama font) with AWconv version 1.1 generates mistakes due to unsupported stacks, mainly in Sanskrit words. (We keep looking for new software to solve this problem.)

Future Options

  1. The PDF version in Tibetan U-can should follow Tibetan text formatting rules. This could be done with LaTeX, specifically with XeTeX, which supports Unicode and OpenType or AAT font features.
  2. Prepare an archive containing the Unicode version and the font needed to display the text on MS Windows. (This requires MS Office 2003 under Windows XP or 2000.)

Links