अनिल एकलव्य ⇔ Anil Eklavya

June 4, 2010

Shooting Oneself in the Foot

A few years ago I received some feedback from someone about a research paper that I was going to submit to a major conference. To paraphrase the feedback (repeating the exact words, even with the reference, would be copying, wouldn’t it?), I was told that there was something I had put in the paper which, if I insisted on retaining it, might make the reviewer look at my paper in a negative light. So, if I didn’t remove that part, I would be shooting myself in the foot.

This is beside the point, but I thought what I had added was correct and so I retained it. The paper was rejected, but I would like to believe that the reason for rejection was not that I had shot myself in the foot.

Getting back to the point, this is an expression that I have come across innumerable times, mostly directed at others, but sometimes directed at me. As a person who claims to be a writer and translator as well as a researcher in a language-related discipline (among other things), I can’t help obsessing about how such expressions are used and what they mean, what they show and what they hide.

But I am not interested in writing an academic paper about that. So I write something here. And you are not supposed to review this piece when I submit the next Computational Linguistics paper which might come to you for review. (See the comment functionality below?).

Recently, Chomsky used this expression in a speech, saying ‘those who are being harmed are shooting themselves in the foot’. Now, most of the time that I have come across this expression, I have thought it was being used cynically to show something which wasn’t there and to hide something that was there. Or for some other questionable purposes. However, the people using this expression were mostly respectable, well-meaning people. Most probably they hadn’t thought about this expression in the way that I had done. Maybe because if they were to do it, they would be shooting themselves in the foot.

But when Chomsky uses this expression, I can’t but believe that he is using it to mean something sensible, not cynical (if this last part looks strange to you, look up the meanings and histories of these two words, especially the second one).

I do believe that what Chomsky said was basically correct. That is, there are some people who are being harmed and they are indeed shooting themselves in the foot (I am not sure whether I am one of them or not).

The reason I am writing this is that I also believe (based on evidence, not on faith) that such people are (relatively) so few that ridiculing them or offering them advice is hardly going to matter. I must add here that Chomsky did actually caution against ridiculing such people (who have realized that they are being systematically harmed). He only expressed his disappointment that instead of doing something to stop this systematic harming, they are shooting themselves in the foot.

You see, there are also people who are being harmed and are shooting themselves in the head (or ‘consuming pesticide’). You might say that they belong to the same category because the expression is metaphorically wide enough to cover them. That might be true. But then there are also a far larger number of people who are being harmed and they are doing something very different.

They are not shooting themselves in the foot (or in the head). They are shooting others (who are also being harmed) in the foot*. Often they are also shooting others (who are also being harmed) in the head. Sometimes they are doing it for a few extra peanuts, sometimes just for the fun of it and sometimes because they have been led to believe that these targets are their enemies (or the enemies of the nation, or the enemies of the society, or of the religion, or of the community etc.). And since doing it openly is a bit problematic (not cool anymore, baby!), they often have to make it appear as if their target shot himself in the foot (or in the head), whether deliberately or accidentally.

* Perhaps they are programmed in Concurrent Euclid.

So, my take on the matter is that we should be talking about people who are being harmed and who are (literally or metaphorically) shooting others who are also being harmed, whether in the foot or in the head. Because without them, the whole shooting machinery probably won’t be able to operate. In fact, to visualize a grisly scenario, if all such people stopped shooting others (who are being harmed) and started only to shoot themselves in the foot, even then the shooting machinery would probably become dysfunctional. Fortunately, most people will not be interested in shooting themselves in the foot (or in the head) if they are just able to find any feasible alternative. Unfortunately, no one from above can tell a person what such an alternative means in practical terms in that person’s circumstances, and it’s very hard to find it out for oneself. It’s very hard to even be sure that such an alternative exists. If it does, it’s very hard to translate it into any meaningful action. Compared to a few decades earlier, it is infinitely harder now, given the extraordinary consolidation of the global power structure (going far beyond what Foucault had studied up to his time), to a great extent due to the techno-administrative ‘advances’ (mostly in the name of security).

There are, surely, people who are being harmed but are not shooting others (being harmed or not being harmed). I won’t say anything about them right now.

(To academic busybodies and surface-style junkies: don’t bother to count the number of times the said expression has been used in this short piece: it has been done very deliberately. Perhaps the author was trying to shoot …).

 

 

For having read the above, here is a bonus link: Fascism then. Fascism now?

April 4, 2010

Ptypho

At the time I had only recently joined the center. As is quite fashionable (it wasn’t when I did my graduation at some other institution), the young members of the center decided to have T-shirts made with the center’s name. The student who took up the responsibility of preparing the design for the T-shirts had earlier been associated with the center but had shifted to some other more respectable center.

The design was created, T-shirts were made and they were paid for and worn by almost all the members of the center. The text on them said ‘The Langauge Cookers’ or ‘The Lagnuage Cookers’ (more likely the latter), with the Language part in a very large size.

One day I was returning to the lab, along with a couple of other graduate students. An undergraduate student (most probably from a more respectable center) came from the opposite side and stopped. He stood in front of the one who was wearing that newly made T-shirt. He put his finger on the misspelled text on the T-shirt and said the following in a tone that is used to point out the incredible stupidity of someone:

– You know that this spelling is wrong?

He was from a center not dealing in mere language.

The T-shirt wearer couldn’t say anything because he hadn’t realized that there was a spelling error. I had noticed the error and had thought that the designer of the T-shirt had chosen a smart and humorous way to say something positive about the mission of the center and the discipline. I was too shocked to reply immediately, but I found the words in time:

– It’s deliberate.

Now it was his turn to be dumbstruck.

– It’s deliberate?

– Yeah, of course it’s deliberate.

I couldn’t resist being scornful. He was still dumbstruck.

– But why?

I didn’t have time to formulate a reply because he left soon after that.

I narrated the incident once or twice to others and they seemed to share my feelings.

Well, time passed (as they say), and I came to know that there were many others in the center who had not noticed the spelling error recreated in such a large size. Or they hadn’t thought about it.

Then I found out that the general consensus outside the center was that the designer of the T-shirt (along with others) had great fun at the expense of the whole center and that the typo was indeed deliberate (what else could it be?), but the designer had wanted to say something very different from what I had imagined.

He was a well liked member of the center and later moved to an Ivy League U.S. university. He remained a well liked (albeit former) member.

My head still hurts from thinking about it. But I can’t escape it because every day something reminds me of this, especially in academics.

Do I hear someone saying that there really are some typos in many posts on this blog?

February 9, 2010

The Fundoo Funda

‘Funda’ is a Hindi word (or, more accurately, an Indian word, as it is also used in other Indian languages, including English), which is a short form of the English word fundamental. The same is the case with the word ‘fundoo’, except that it is an adjective derived from ‘funda’ according to Hindi derivational morphology. The adjective has two senses. One of these is the sense familiar to a select group of people, the kind who are educated in colleges like New Delhi’s St. Stephen’s and have a circle made up almost exclusively of people from a similar background. For this group of people, the word ‘fundoo’ means fundamentalist. And nothing else.

Thus, for them, ‘fundoo’ (the noun version) basically means a person from the Sangh Parivar. And since they (not the Sangh Parivaris) are mainly ‘secular’, it is a term of derision. Just like the other n-term they have for the Sangh Parivaris.

I first became familiar with this word when I entered an engineering college for my bachelor’s degree. In that college, the word was heavily used. It meant someone whose fundamentals (as in Thermodynamics or Theory of Machines) were very strong, i.e., who was very good at something. It could also be used with some metaphorical extension to mean high praise (with regard to anything) for someone or something. It might sound strange to many, but at that time I somehow thought that this word (and the word ‘funda’) were slang words only used in that particular college.

Later I found out that these two words are among the most heavily used words as far as the young (school or college) generation is concerned.

Being called fundoo can be a big compliment, though the overuse of the term means that the compliment could be highly diluted.

I didn’t become familiar with the other sense of the word (fundamentalist) till much later, when I read a particular instalment of one of the most popular columns in the Indian press, written by Khushwant Singh. I had no idea that the word was also used in this sense. But what was more surprising, almost astonishing to me, was the fact that Khushwant Singh similarly seemed to have no idea that there was another sense in which this word was used.

By the way, I wrote ‘one of the most popular columns in the Indian press’ instead of ‘in the National’ or ‘in the English’ press because this particular column is syndicated by many Indian language newspapers and they publish a translation.

As I then read all kinds of magazines and newspapers etc., I found out that there were others like Khushwant Singh for whom too the word only meant one thing: fundamentalist. What was common among all these people was that they were from the select group that I mentioned in the beginning.

I have spent various periods of time in many educational institutions of India and have lived in many cities and towns and have kept my eyes and ears open, especially to language-related things. Nowhere except in the writings of this group of people have I found anyone using the word ‘fundoo’ in the sense that they use. And as I said earlier, it is one of the most heavily used words and therefore I keep hearing it much too regularly.

I am aware that there might, in fact, be some other people outside this group who use the word in that rare sense. And I am not sure about the origin of the word either. It could very well be that the word was initially used in the first sense. But I have heard no one using it in that sense. Not a single person.

To repeat once more to make the point clear, the second sense of the word is used so heavily that I find it hard to believe that if you live in an Indian city or even a small town (and know either English or an Indian language), you could remain oblivious to the second sense of the word. But you could easily be unaware of the first sense because it is used so rarely. The only way this can happen is if that group of people has somehow cut itself off from the life around it and is not much in touch with it.

This cut-off has to be fairly radical, because according to many yardsticks, I myself am quite cut off.

But I know the second sense. As well as the first. I knew them long before I started studying Linguistics or related fields.

Or perhaps they are words from two different languages, the first spoken by the top caste and the second by the lesser mortals.

May 29, 2009

Milk as Karma

Someone called someone milk
Milk as noun or milk as verb?
Milk as the subject or milk as the object?
Milk as the karta or milk as the karma?

The answer appears as a vision
Of huge torrents of something
(It could very well be milk
Of, you know, something)
Flowing from one end
Of the Zipf’s Law curve
To the other end

May 22, 2009

How Many Grams?

There is an automatically (intelligently) generated blog which I have read recently.

It appears to be (let’s give ‘seems’ some rest) quite a popular one in a certain section.

I know the corpus on which it was trained.

And the corpus on which it was retrained.

(Including most of the quotes and the comments, especially the long ones).

But I wonder whether the order of n-grams was five or six.

It is definitely better than four grams.

It could even be Se7en.

This brings up a new idea.

What about writing a paper on automatically guessing the order of n-grams, given some generated text?

It may be difficult in the general case, but in our case we know the corpus on which it was trained.
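For whoever takes this up, here is a minimal sketch of one naive way to make the guess, assuming the generator only ever emits word sequences it has actually seen: find the largest n for which every n-gram of the generated text also occurs in the training corpus. The function names (`ngrams`, `guess_order`), the whitespace tokenization and the toy sentences are all made up for illustration; a smoothed model, or a long overlap that happens by chance, would fool it.

```python
def ngrams(tokens, n):
    """Return the set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def guess_order(corpus_tokens, generated_tokens, max_n=10):
    """Largest n such that every generated n-gram also occurs in the corpus.

    Returns 0 if even some single token of the generated text is unseen.
    """
    best = 0
    for n in range(1, max_n + 1):
        generated = ngrams(generated_tokens, n)
        if not generated:                 # generated text is shorter than n
            break
        if generated <= ngrams(corpus_tokens, n):
            best = n                      # full coverage at this order
        else:
            break                         # some n-gram was never in the corpus
    return best


# Toy data, invented for this example.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
generated = "the dog sat on the mat".split()
print(guess_order(corpus, generated))
```

On this toy data the guess is 4: every 4-gram of the generated sentence occurs in the corpus, but the 5-gram ‘dog sat on the mat’ does not.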

Any takers?

April 16, 2009

Accepted, but not Published

Academicians or researchers list their publications prominently on their home pages. After all, the list is supposed to represent the best of their work. They also quite often (especially those who have a large number of publications) categorize them according to some criteria like the venue (workshop, conference, journal or book: in the reverse order of prominence) or peer review (unrefereed and refereed).

In this post we propose that there should be a new category of publications. This category is needed because a lot of researchers (for good or for bad) now come from underprivileged countries. For most of these researchers, traveling abroad to attend a conference, even if their paper has been accepted, is very hard to do. In some sense it is even harder than getting a paper accepted, which is itself relatively harder for them, given the lack of certain privileges (whether you like the word or not) such as generous research grants, infrastructure and language resources, combined with the prejudice (it is there: I am not inventing it, whoever might be blamed for it). To these problems can be added the problem of compulsory attendance at a conference or a workshop. It is partly these conditions which have prompted suggestions from certain quarters that researchers from these countries should concentrate on journal papers (never mind the delay and difficulties involved or the unfairness of the proposition, even though it has some practical justification).

But you can never be sure while submitting that you certainly won’t be able to attend. Also, hope is said to be a good thing. Therefore, the event of a researcher submitting a paper and hoping to attend but not being able to attend cannot be ruled out.

This brings us to the proposal mentioned earlier. One solution to this problem is that there should be another category of papers: accepted but not published, because the author couldn’t afford to attend the conference or the workshop. (By the way, workshops are the most happening places nowadays: more on that later).

The author of this post must know because he has authored more than one such publication.

Of course, the condition will be that if and when such a paper is resubmitted (with or without modifications, but without any substantial new work), accepted again and finally published, the entry marked as ‘accepted’ should be removed and replaced by an entry marked as ‘published’.

After all, if we are serious about research, then the work (which has been peer reviewed and accepted) should be given somewhat more importance than some pages printed in some proceedings (or attendance in a conference for that matter).

This, of course, doesn’t mean that you can get basically the same thing published (or accepted) in more than one place.

(Sorry for the Gory Details)

P.S.: Maybe there is no need for the above apology as the depiction of the Gory Details of the Indian Reality is now getting multiple Oscars (The Academy Awards: the keyword is Academy). But maybe there is, because some researchers have a more (metaphorically) delicate constitution which can be hurt by the Gory Details.

Queen’s P.S.: Off with his head!

January 12, 2009

Picture of the Future

Orwell described a picture of the future rather bleakly as:

There will be no curiosity, no enjoyment of the process of life. All competing pleasures will be destroyed. But always—do not forget this, Winston—always there will be the intoxication of power, constantly increasing and constantly growing subtler. Always, at every moment, there will be the thrill of victory, the sensation of trampling on an enemy who is helpless. If you want a picture of the future, imagine a boot stamping on a human face … forever. (1984 by George Orwell: Part III, Chapter III)

This, I believed, was a dystopian picture. I still do. I have my own picture of the future, which has remained almost unchanged for the last decade (at least). Three recent events somehow seem to me to be describing my picture of the future.

The picture is mine, but the future need not necessarily be mine.

But it can very well be.

The first is the unbelievably and blatantly criminal assault by Israel on all Palestinians: man, woman and child. I won’t give references for this. It’s there prominently even in the mainstream media and has been there for some time now.

The second is a recent call by the Andhra Pradesh Human Rights Commission chief (Chairman) for “legislation to prosecute parents with diseases such as tuberculosis, HIV, leprosy and dyslexia should they, knowing that they have the disease, have children”.

Inhuman Rights Commission?

The third is the news, or rather the lack of it, about the recent death of a Hindi writer living in Jaipur (yes, the connection with ‘your’ places does make it worse) Lavleen (लवलीन) who was relatively young. She had a reputation as a ‘bold’ writer and woman. She hadn’t really established herself as a great writer, but she was known among the Hindi literary circles. Let alone the Indian English media, (it has been pointed out) even the ‘biggest Hindi daily’ Dainik Bhaskar didn’t report it, even after many requests. And even the small but very vibrant and inter-connected world of Hindi blogging (which is very enthusiastic about events like the wedding of someone’s relative among them) mostly ignored it, though they are trying very hard to find out who ‘the real Tau’ (असली ताऊ) is. Like a lot of other writers, she died with the dream of some day writing a masterpiece.

(But still, I came to know about this from a Hindi writer’s blog).

And, no, I didn’t personally know her. Nor do I know the A. P. Human Rights Commission Chairman. Nor have I ever been to Israel, though a large percentage of the people (in History) I admire happen to be Jewish and most of them (I am sure) would be, or would have been, horrified by what Israel is doing.

I don’t know why but these three events (or should I say sets of events: being a ‘professional’ practitioner of language sciences, crafts and arts is tough when it comes to writing anything) somehow represent for me the picture of the future.

This picture is not quite as horrible as that painted by Orwell (actually, by O’Brien the character, whether or not by the author).

But it doesn’t seem very pleasant.

October 28, 2008

Computational Linguistics

As I wrote in the previous entry (can the Hindi word ‘pravishti’ be used for ‘post’?), over the next few weeks I am going to write about Sanchay.

But since Sanchay has been built especially (apart from ordinary users) for researchers in computational linguistics or linguistics, it is as well to make clear what computational linguistics or linguistics means, or, if you already know what they mean, what I mean by them. I add this second point because, while there are all sorts of misunderstandings among the general public about what these disciplines are, even the researchers in these disciplines do not agree on their definition.

The truth is that in the Hindi world most people still take linguistics to mean the kind of study it was taken to mean at the beginning of the last century. But I would rather not take the discussion in that direction right now, because there is so much to say about it that the present purpose would be left behind.

There is also a great deal to say about the definition of computational linguistics or linguistics, or about their boundaries, but for now a little will have to do.

So, put briefly, linguistics is the field of research or study in which one studies not just the grammar of some one language, but natural or human (that is, not artificial) language, scientifically. The notion is now widely accepted that the structure of the human brain is directly related to the structure of language and, since the structure of the brain is basically the same in all humans, all natural or human languages are also alike in everything except their surface features. That is why, as is well known in the modern literature of these fields, if an American’s infant is adopted right after birth by a Chinese family and grows up in China, the child will learn to speak Chinese as easily as any child of a Chinese family. There are lots of other such things, but the main point is that linguistics is the scientific study of natural or human language.

At least the attempt is that the study should remain scientific; whether it actually manages to do so is a matter of debate.

Coming now to computational linguistics, in this discipline our attention shifts from humans to the computer, but the earlier condition still applies: the scientific study of natural or human language. The difference is that our aim now becomes to make the computer capable of understanding natural or human language and of using it. Obviously this is still a very distant prospect, and that should be no surprise, because in linguistics itself (even after the extraordinary achievements of the last century) scientists are stuck among a great many obstacles.

Still, a good deal has already become possible in computational linguistics and a good deal more may become possible (in the near future). But this does not include the computer speaking and understanding language like a human. What it does include are technologies that can search documents better, summarize them, translate them to some extent, and so on.

But in the Indian context the trouble is that we have not yet even reached the stage where we can easily use the computer as just a better typewriter. There have been some achievements in this direction, but compared with English or the major European languages we are nowhere. As most of you already know, this is a long story which is best left alone for now.

But Sanchay has been developed in precisely this context, which we will talk about later.

October 26, 2008

An Introduction to Sanchay

In the previous post (I am ashamed to say that I cannot find a suitable Hindi word for ‘post’) I wrote (in English) about the new version of Sanchay. The funny thing is that I have hardly written anything about Sanchay in Hindi so far. In an attempt to correct this lapse, I have now thought of writing something about Sanchay over the next few weeks.

So who is Sanchay? Or what is Sanchay?

The answer to the first question (in American terminology) is that Sanchay is a single parent child who is not getting the benefit of any welfare but who has a lot of responsibilities.

The answer to the second question is that Sanchay is a free (as in freedom, though you can also say free of cost) and open source collection of computational tools useful for researchers working in computational linguistics or linguistics. But in particular it can be of use to anyone who uses Indian languages on a computer. One of its features is that new languages and encodings can be added to it easily. Almost all the major Indian languages are already included, and to use them in Sanchay you do not depend on the operating system, although if the operating system does include any such language you can use that facility in Sanchay too. Not only that, a single version of Sanchay works on both Windows and Linux/Unix, provided you have the JDK (Java Development Kit) installed. Even the font for your language need not be installed in the operating system.

The current version of Sanchay is 0.3.0. The biggest difference between this version and the previous one is that all of Sanchay’s tools can now be used from one place; there is no need to remember the names of separate scripts. In all, twelve tools (applications) have been included, which are:

  1. Sanchay Text Editor
  2. Table Editor
  3. Find-Replace-Extract Tool
  4. Word List Builder
  5. Word List Analyzer and Visualizer
  6. Language and Encoding Identification tool
  7. Syntactic Annotation Interface
  8. Parallel Corpus Annotation Interface
  9. N-gram Language Modeling Tool
  10. Discourse Annotation Interface
  11. File Splitter
  12. Automatic Annotation Tool

If you can’t make head or tail of most of these, wait a little. I will try to give more information about them later.

Perhaps there is no harm in adding that Sanchay is the result of this humble person’s stubborn resolve over the last few years, with contributions from some other people too, even if small ones. The names of all of them will soon be visible on the Sanchay website. Almost all of them are (or were) students who worked, or are working, on some project under my ‘guidance’.

It is hoped that the next version of Sanchay after this one will be out in a few months and will have even more tools and facilities.

October 5, 2008

Good News and Bad News on the CL Front

First, as the saying goes, the bad news. We had submitted a proposal for the Second Workshop on NLP for Less Privileged Languages for the ACL-affiliated conferences. That proposal has not been accepted. A total of 41 proposals were submitted and 34 of them were accepted. Ours was among the not-accepted seven (euphemisms can be consoling).

Was it that bad? I hope not.

Don’t those capital letters look silly in the name of a rejected proposal?

Now the good news. The long awaited new version of Sanchay has been released on Sourceforge. (Well, at least I was awaiting it.) This version has been named (or numbered?) 0.3.0.

The new Sanchay is a significant improvement over the last public version (0.2). It now has one main GUI from which all the applications can be controlled. There are twelve (GUI based) applications which have been included in this version. These are:

  • Sanchay Text Editor that is connected to some other NLP/CL components of Sanchay.
  • Table Editor with all the usual facilities.
  • A more intelligent Find-Replace-Extract Tool (can search over annotated data and allows you to see the matching files in the annotation interface).
  • Word List Builder.
  • Word List FST (Finite State Transducer) Visualizer that can be useful for anyone working with morphological analysis etc.
  • One of the most accurate Language and Encoding Identifier that is currently trained for 54 language-encoding pairs, including most of the major Indian languages. (Yes, I know there is a number agreement problem in the previous sentence).
  • A user friendly Syntactic Annotation Interface that is perhaps the most heavily used part of Sanchay till now. Hopefully there will be an even more user friendly version soon.
  • A Parallel Corpus Annotation Interface, which is another heavily used component. (Don’t take that ‘heavily’ too seriously).
  • An N-gram Language Modeling Tool that allows you to compile models in terms of bytes, letters and words.
  • A Discourse Annotation Interface that is yet to be actually used.
  • A more intelligent File Splitter.
  • An Automatic Annotation tool for POS (Part Of Speech) tagging, chunking and Named Entity Recognition. The first two should work reasonably well, but the last one may not be that useful for practical purposes. This is a CRF (Conditional Random Fields) based tool and it has been trained for Hindi for these three purposes. If you have annotated data, you can use it to train your own taggers and chunkers.
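To make the ‘bytes, letters and words’ item in the list above a little more concrete, here is a rough sketch of counting n-grams at those three levels. This is not Sanchay’s code: the function names (`units`, `ngram_counts`), the use of UTF-8, whitespace tokenization and the choice of treating a ‘letter’ as one Unicode code point are all assumptions made only for this illustration.

```python
from collections import Counter


def units(text, level):
    """Split text into the units that n-grams are built from."""
    if level == "byte":
        return list(text.encode("utf-8"))   # integers in 0..255
    if level == "letter":
        return list(text)                   # Unicode code points (an assumption)
    if level == "word":
        return text.split()                 # whitespace-separated tokens
    raise ValueError(f"unknown level: {level}")


def ngram_counts(text, n, level):
    """Count every n-gram of the given level in the text."""
    seq = units(text, level)
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


sample = "का का"  # a Hindi word repeated twice, separated by one space
for level in ("byte", "letter", "word"):
    print(level, len(units(sample, level)), ngram_counts(sample, 2, level))
```

The same short string is 13 bytes, 5 code points and 2 words, so the three kinds of model see quite different sequences even for identical text.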

All these components use the customizable language-encoding support, especially useful for South Asian languages, that doesn’t need any support from the operating system or even the installation of any fonts, although these can still be used inside Sanchay if they are there.

More information is available at the Sanchay Home.

The capitals don’t look so bad for a released version.

The downside of even this good news is that my other urgent (to me) work has got delayed as I was working almost exclusively on bringing out this version for the last two weeks or so.

But then you need a reason to wake up and Sanchay is one of my reasons. And I can proudly say that a half-hearted attempt to generate funding for this project by posting it on Micropledge has generated $0.

Sanchay is still alive as a single parent child without any welfare but with a lot of responsibilities.

Now I can have nightmares about the bugs.
