अनिल एकलव्य ⇔ Anil Eklavya

April 4, 2010

Ptypho

I had then recently joined the center. As is quite fashionable (it wasn’t when I did my graduation at some other institution), the young members of the center decided to have T-shirts made with the center’s name. The student who took up the responsibility of preparing the design for the T-shirts was earlier associated with the center but had shifted to some other more respectable center.

The design was created, T-shirts were made and they were paid for and worn by almost all the members of the center. The text on them said ‘The Langauge Cookers’ or ‘The Lagnuage Cookers’ (more likely the latter), with the Language part in a very large size.

One day I was returning to the lab, along with a couple of other graduate students. An undergraduate student (most probably from a more respectable center) came from the opposite side and stopped. He stood in front of the one who was wearing that newly made T-shirt. He put his finger on the misspelled text on the T-shirt and said the following in a tone that is used to point out the incredible stupidity of someone:

– You know that this spelling is wrong?

He was from a center not dealing in mere language.

The T-shirt wearer couldn’t say anything because he hadn’t realized that there was a spelling error. I had noticed the error and had thought that the designer of the T-shirt had chosen a smart and humorous way to say something positive about the mission of the center and the discipline. I was too shocked to reply immediately, but I found the words in time:

– It’s deliberate.

Now it was his turn to be dumbstruck.

– It’s deliberate?

– Yeah, of course it’s deliberate.

I couldn’t resist being scornful. He was still dumbstruck.

– But why?

I didn’t have time to formulate a reply because he left soon after that.

I narrated the incident once or twice to others and they seemed to share my feelings.

Well, time passed (as they say), and I came to know that there were many others in the center who had not noticed the spelling error recreated in such a large size. Or they hadn’t thought about it.

Then I found out that the general consensus outside the center was that the designer of the T-shirt (along with others) had great fun at the expense of the whole center and that the typo was indeed deliberate (what else could it be?), but the designer had wanted to say something very different from what I had imagined.

He was a well liked member of the center and later moved to an Ivy League U.S. university. He remained a well liked (albeit former) member.

My head still hurts from thinking about it. But I can’t escape it because every day something reminds me of this, especially in academics.

Do I hear someone saying that there really are some typos in many posts on this blog?

May 29, 2009

Milk as Karma

Someone called someone milk
Milk as noun or milk as verb?
Milk as the subject or milk as the object?
Milk as the karta or milk as the karma?

The answer appears as a vision
Of huge torrents of something
(It could very well be milk
Of, you know, something)
Flowing from one end
Of the Zipf’s Law curve
To the other end

May 22, 2009

How Many Grams?

There is an automatically (intelligently) generated blog which I have read recently.

It appears to be (let’s give ‘seems’ some rest) quite a popular one in a certain section.

I know the corpus on which it was trained.

And the corpus on which it was retrained.

(Including most of the quotes and the comments, especially the long ones).

But I wonder whether the order of n-grams was five or six.

It is definitely better than four grams.

It could even be Se7en.

This brings up a new idea.

What about writing a paper on automatically guessing the order of n-grams, given some generated text?

It may be difficult in the general case, but in our case we know the corpus on which it was trained.

Any takers?
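
A rough sketch of how such a guess might be made, under a strong assumption that is not in the post: the generator is an unsmoothed word n-gram model, so every n-gram it emits must already occur in the training corpus, while longer sequences will sometimes be unseen. The function names (`ngrams`, `coverage`, `guess_order`) and the toy data are made up for illustration. With only a little generated text, the result is better read as an upper bound on the order than as the answer.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage(generated, corpus, n):
    """Fraction of distinct n-grams in the generated text that also occur in the corpus."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(corpus, n)) / len(gen)


def guess_order(generated, corpus, max_n=8, threshold=1.0):
    """Largest n (up to max_n) whose coverage meets the threshold, or 0 if none does."""
    best = 0
    for n in range(1, max_n + 1):
        if coverage(generated, corpus, n) >= threshold:
            best = n
    return best


# Toy example (hypothetical data): every bigram of the generated text is in
# the corpus, but the trigram "on the cat" is not.
corpus = "the cat sat on the mat and the cat ran".split()
generated = "on the cat sat".split()
print(guess_order(generated, corpus))  # prints 2
```

Lowering `threshold` a little below 1.0 would be one way to tolerate smoothing or post-editing in the generated text, at the cost of a less clear-cut answer.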

April 16, 2009

Accepted, but not Published

Academicians or researchers list their publications prominently on their home pages. After all, these are supposed to represent the best of their work. They also quite often (especially those who have a large number of publications) categorize them according to some criteria like the venue (workshop, conference, journal or book: in the reverse order of prominence) or peer review (unrefereed and refereed).

In this post we propose that there should be a new category of publications. This category is needed because a lot of researchers (for good or for bad) now come from underprivileged countries. For most of these researchers, traveling abroad to attend a conference, even if their paper has been accepted, is very hard to do. In some sense it is even harder than getting a paper accepted, which is itself harder for them, given the lack of certain privileges (whether you like the word or not) such as generous research grants, infrastructure and language resources, combined with the prejudice (it is there: I am not inventing it, whoever might be blamed for it). To these problems can be added the problem of compulsory attendance at a conference or a workshop. It is partly these conditions which have prompted suggestions from certain quarters that researchers from these countries should concentrate on journal papers (never mind the delay and difficulties involved or the unfairness of the proposition, even though it has some practical justification).

But you can never be sure while submitting that you certainly won’t be able to attend. Also, hope is said to be a good thing. Therefore, the event of a researcher submitting a paper and hoping to attend but not being able to attend cannot be ruled out.

This brings us to the proposal mentioned earlier. One solution to this problem is that there should be another category of papers: accepted but not published, because the author couldn’t afford to attend the conference or the workshop. (By the way, workshops are the most happening places nowadays: more on that later).

The author of this post must know because he has authored more than one such publication.

Of course, the condition will be that if and when such a paper is resubmitted (with or without modifications, but without any substantial new work), accepted again and finally published, the entry marked as ‘accepted’ should be removed and replaced by an entry marked as ‘published’.

After all, if we are serious about research, then the work (which has been peer reviewed and accepted) should be given somewhat more importance than some pages printed in some proceedings (or attendance at a conference for that matter).

This, of course, doesn’t mean that you can get basically the same thing published (or accepted) in more than one place.

(Sorry for the Gory Details)

P.S.: Maybe there is no need for the above apology as the depiction of the Gory Details of the Indian Reality is now getting multiple Oscars (The Academy Awards: the keyword is Academy). But maybe there is because some researchers have a more (metaphorically) delicate constitution which can be hurt by the Gory Details.

Queen’s P.S.: Off with his head!

October 28, 2008

Computational Linguistics

As I wrote in the previous entry (can the word ‘प्रविष्टी’ be used for ‘post’?), I am going to write about Sanchay over the next few weeks.

But since Sanchay has been made especially (apart from general users) for researchers in computational linguistics or linguistics, it would be good to make clear what computational linguistics or linguistics means, or, if you already know what they mean, what I mean by them. The latter because not only does the general public have all sorts of misconceptions about the meaning of these subjects (computational linguistics or linguistics), but even the researchers in these subjects do not agree on their definition.

The truth is that in the Hindi world most people still take linguistics to mean the kind of study it was taken to mean at the beginning of the last century. But I would not like to take the debate in that direction right now, because there is so much to say about it that the present purpose would be left behind.

There is also a great deal to say about the definition of computational linguistics or linguistics and about their boundaries, but for now a little will have to do.

So, put briefly, linguistics is the field of research or study in which one does not study only the grammar of some one language, but studies natural or human (that is, not artificial) language scientifically. The idea is now widely accepted that the structure of the human brain is directly related to the structure of language, and since the structure of the brain is basically the same in all humans, all natural or human languages are also the same in everything except surface features. That is why, as is well known in the modern literature of these subjects, if an American’s infant is adopted by a Chinese family right after birth and the child grows up in China, the child will learn to speak Chinese as easily as a child of a Chinese family. There are plenty of other such things, but the main point is that linguistics is the scientific study of natural or human language.

At least the attempt is for the study to remain scientific; whether it actually manages to is a matter of debate.

Coming now to computational linguistics: in this subject our attention shifts from humans to the computer, but the previous condition still applies: the scientific study of natural or human language. The difference is that our aim now becomes to make the computer capable of understanding natural or human language and of using it. Obviously this is still a very distant prospect, and that should be no surprise, because in linguistics itself (even after the extraordinary achievements of the last century) scientists are still stuck among a great many obstacles.

Even so, a lot has become possible in computational linguistics and a lot more may become possible ahead (in the near future). But this does not include the computer speaking and understanding language like a human. What it does include are technologies that can search documents better, summarize them, translate them to some extent, and so on.

But in the Indian context the trouble is that we have not yet reached even the stage where we can easily use the computer as just a better typewriter. There have been some achievements in this direction, but compared with English or the major European languages we are nowhere. As most of you already know, this is a long story that is best left aside for now.

But Sanchay has been developed in this very context, which we will talk about later.

October 26, 2008

An Introduction to Sanchay

In the last post (I am ashamed to say that I cannot find a suitable word for ‘post’) I had written (in English) about the new version of Sanchay. The funny thing is that I have hardly written anything about Sanchay in Hindi so far. In an attempt to correct this lapse, I have decided to write something about Sanchay over the next few weeks.

So who is Sanchay? Or what is Sanchay?

The answer to the first question (in American terminology) is that Sanchay is a single parent child who gets no welfare but has a lot of responsibilities.

The answer to the second question is that Sanchay is a free (as in freedom, though you can also say free of cost) and open source collection of computational tools useful for researchers working in computational linguistics or linguistics. But in particular it can be of use to anyone who uses Indian languages on a computer. One of its features is that new languages and encodings can be added to it easily. Almost all the major Indian languages are already included, and to use them in Sanchay you do not depend on the operating system, although if the operating system supports any such language you can use that facility in Sanchay too. Moreover, the same version of Sanchay works on both Windows and Linux/Unix, provided you have the JDK (Java Development Kit) installed. Even the font for your language need not be installed in the operating system.

The current version of Sanchay is 0.3.0. The biggest difference from the previous version is that all of Sanchay’s tools can now be used from one place; there is no need to remember the names of separate scripts. In all, twelve tools (applications) are included, which are:

  1. Sanchay Text Editor
  2. Table Editor
  3. Find-Replace-Extract Tool
  4. Word List Builder
  5. Word List Analyzer and Visualizer
  6. Language and Encoding Identification
  7. Syntactic Annotation Interface
  8. Parallel Corpus Annotation Interface
  9. N-gram Language Modeling Tool
  10. Discourse Annotation Interface
  11. File Splitter
  12. Automatic Annotation Tool

If you can’t make head or tail of most of these, wait a little. I will try to give more information about them later.

Perhaps there is no harm in adding that Sanchay is the result of this humble person’s stubborn resolve over the last few years, with contributions from some other people too, even if small ones. The names of all those people will soon be visible on the Sanchay website. Almost all of them are (or were) students who worked, or are working, on some project under my ‘guidance’.

Hopefully the next version of Sanchay after this one will arrive in a few months and will have even more tools and features.

October 5, 2008

Good News and Bad News on the CL Front

First, as the saying goes, the bad news. We had submitted a proposal for the Second Workshop on NLP for Less Privileged Languages for the ACL-affiliated conferences. That proposal has not been accepted. In total, 41 proposals were submitted and 34 of them were accepted. Ours was among the not-accepted seven (euphemisms can be consoling).

Was it that bad? I hope not.

Don’t those capital letters look silly in the name of a rejected proposal?

Now the good news. The long awaited new version of Sanchay has been released on Sourceforge. (Well, at least I was awaiting it). This version has been named (or numbered?) 0.3.0.

The new Sanchay is a significant improvement over the last public version (0.2). It now has one main GUI from which all the applications can be controlled. There are twelve (GUI based) applications which have been included in this version. These are:

  • Sanchay Text Editor that is connected to some other NLP/CL components of Sanchay.
  • Table Editor with all the usual facilities.
  • A more intelligent Find-Replace-Extract Tool (can search over annotated data and allows you to see the matching files in the annotation interface).
  • Word List Builder.
  • Word List FST (Finite State Transducer) Visualizer that can be useful for anyone working with morphological analysis etc.
  • One of the most accurate Language and Encoding Identifier that is currently trained for 54 language-encoding pairs, including most of the major Indian languages. (Yes, I know there is a number agreement problem in the previous sentence).
  • A user friendly Syntactic Annotation Interface that is perhaps the most heavily used part of Sanchay till now. Hopefully there will be an even more user friendly version soon.
  • A Parallel Corpus Annotation Interface, which is another heavily used component. (Don’t take that ‘heavily’ too seriously).
  • An N-gram Language Modeling Tool that allows you to compile models in terms of bytes, letters and words.
  • A Discourse Annotation Interface that is yet to be actually used.
  • A more intelligent File Splitter.
  • An Automatic Annotation tool for POS (Part Of Speech) tagging, chunking and Named Entity Recognition. The first two should work reasonably well, but the last one may not be that useful for practical purposes. This is a CRF (Conditional Random Fields) based tool and it has been trained for Hindi for these three purposes. If you have annotated data, you can use it to train your own taggers and chunkers.

All these components use the customizable language-encoding support, especially useful for South Asian languages, that doesn’t need any support from the operating system or even the installation of any fonts, although these can still be used inside Sanchay if they are there.
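
To make the ‘bytes, letters and words’ part of the N-gram Language Modeling Tool a little more concrete, here is a minimal sketch of counting n-grams over those three kinds of units. It is not Sanchay’s code (Sanchay is written in Java) and the names `units` and `ngram_counts` are hypothetical. It also assumes UTF-8 for the byte level and treats a ‘letter’ as a Unicode code point, which for Indian scripts is not the same as a written syllable.

```python
from collections import Counter


def units(text, level):
    """Split text into the units over which n-grams are counted."""
    if level == "byte":
        return list(text.encode("utf-8"))  # integers 0-255 (assumes UTF-8)
    if level == "letter":
        return list(text)                  # Unicode code points, not syllables
    if level == "word":
        return text.split()                # whitespace-separated tokens
    raise ValueError(f"unknown level: {level}")


def ngram_counts(text, n, level="word"):
    """Count every contiguous run of n units in the text."""
    seq = units(text, level)
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


# The same short string gives very different sequences at each level.
word = "भाषा"
print(len(units(word, "byte")), len(units(word, "letter")), len(units(word, "word")))
```

Byte-level counts need no knowledge of the encoding at all, which is presumably why they suit a task like language and encoding identification, while word-level counts are the usual choice for modeling text.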

More information is available at the Sanchay Home.

The capitals don’t look so bad for a released version.

The downside of even this good news is that my other urgent (to me) work has got delayed as I was working almost exclusively on bringing out this version for the last two weeks or so.

But then you need a reason to wake up and Sanchay is one of my reasons. And I can proudly say that a half-hearted attempt to generate funding for this project by posting it on Micropledge has generated 0$.

Sanchay is still alive as a single parent child without any welfare but with a lot of responsibilities.

Now I can have nightmares about the bugs.

April 23, 2008

Network Goons Pay Tribute

Some time ago I had written about the wireless notwork. Apart from the genuine technical problems, there are network goons out there who make sure that the network becomes the notwork.

The people in charge, who implement ridiculous rules and block sites for no apparent reason and take action against people (who get caught) for the smallest and the silliest reason, are apparently powerless against these network goons. If the statement sounds hyperbolic, let me mention just a few facts:

  • The URL www.cs.rochester.edu has remained blocked for around two years now. The only reason (if it can be called that) seems to be that this sub-domain has a page where NLP and Computational Linguistics conferences are listed.
  • So is the India Together site which publishes articles by people like P. Sainath.
  • For some time, even the site of the national newspaper The Hindu was blocked.
  • Many other sites are blocked at one time or another, such as YouTube.

Just a few days ago I checked the network activity on my system and found that many other systems were connected to my laptop, even though there was no reason for them to be and I had even switched on the Windows firewall. This stopped happening after I did a few things, such as blocking connections on the netbios-ssn port.

Why am I writing this post instead of talking to the people in charge? Because I don’t really think anything is going to come out of that. This rant was provoked by a particularly bad network notworking day.

Another thing that has happened is that the goons who are forming the private network and thereby causing problems for the others, have named their network with my initials:

Goons on the Wireless Notwork

I take it as a tribute. The people who hate you and create problems for you for no reason (whom you don’t even know) pay tribute in this way. It is one of the best tributes one can have.

Of course, there are the side effects, but, as they say, no free lunches.

Except perhaps for those who already have a lot of purchasing power.

The more, the better.

The more, the free-er.

The more, the more.

April 13, 2008

Two Laws of Reviewing

After a few years in research, I have discovered two laws which the process of reviewing (of research papers) follows. Not very original, but here they are:

  1. You can always find some reasons for accepting any paper.
  2. You can always find some reasons for rejecting any paper.

April 11, 2008

Patent Madness

So we have one more reason in support of the idea that patents are a bad idea. The latest is the news that a company called Digital Reasoning has been awarded a patent on what looks like contextual similarity. What the ‘news report’ says includes:


This breakthrough patent grants broad protection for how artificial intelligence, including neural networks, genetic algorithms, and vector space models can be used to learn the meanings of symbols – such as words, categories, or numerical values. Understanding the subtle meaning of terms in context has been one of the “Holy Grails” of artificial intelligence. Not only is Digital Reasoning® fully able to accomplish this feat, it is now patented.

Here is one comment about this:

Anyone from the ACL/ML/AI community can immediately recognize this and start citing their favorite papers on these topics starting from at least a decade ago. A promotional video from the company on YouTube can be found here. Excerpt from the video: “… We treat the text representation of human language as a signal … “.

I think everyone should stop taking patents seriously. Wishful thinking?

Here is another:

Do the people ‘in-charge’ have any clue about the previous/current reseach done in the related field? How can they accept such stuff? Doesn’t make any sense, whatsoever.

But then they had accepted patents on haldi, neem and basmati. I am worried about jal jeera and pani poori.

Also, ganne ka ras.

Madness.

No need for me to say more as so many others have already talked about this:

In August last year there was a news item about Yoga devices being patented in the US. Small mercy that the Government of India succeeded in cautioning the U.S. Government against granting patents to Yoga postures (asanas).

There was a time (in India) when patents were awarded on processes, not products. That meant that even if some company had patented a method for producing a particular medicine, someone else could come along and find a better way and sell the medicine cheaper. Now, since the patents are granted on products, under orders from the empire that rules the world, that kind of thing can’t happen.

It can be a matter of life and death for millions of people.

I look forward to the day when self-respecting researchers won’t proudly list the patents they have been able to obtain.

Patents are among the most evil inventions of humankind.
