Generalizing programming language patterns in CodeBlocks

beqroson · November 11, 2013, 06:53:36 PM

Quote from: dmoore on November 11, 2013, 06:22:47 PM
With both this stuff about comments and your UTF-82 talk, I think you are WAY overcomplicating things.

Overcomplicating... nope, I am not

Quote from: dmoore on November 11, 2013, 06:22:47 PM
To me, the potential "win" here is to create a set of standardized translation tables

Standardized translation? Noooo...

Quote from: dmoore on November 11, 2013, 06:22:47 PM
Comments, especially the doc strings for toolkits like wxWidgets, would be nice, but they aren't necessary to get a program to compile and dealing with them in the right way has to be part of a much larger translation effort.

Agree, comments need more effort from the developer than from any translation tool creator.

Quote from: dmoore on November 11, 2013, 06:22:47 PM
To reiterate, you don't really need to integrate this into C::B to make your proof of concept.

No, that is true, to the no...

Quote from: dmoore on November 11, 2013, 06:22:47 PM
And you shouldn't, because if it is useful to C::B users it will be useful to programmers more generally.

No, I should not. But can I keep my sticky fingers from it? Nope.

Quote from: dmoore on November 11, 2013, 06:22:47 PM
Why don't you start by writing a simple tool that takes the user's foreign-language source files (UTF-8) and a specified translation table, and outputs the English programming language equivalent (and vice versa). From there it would be easy enough to integrate into the GCC and other toolchains. Then turn it into a library and IDEs will be able to take advantage of it too.

No, that I will do. So, no. Wtf, I mean yes, YES.

beqroson · November 11, 2013, 07:28:46 PM

Quote from: dmoore on November 11, 2013, 06:22:47 PM
Comments, especially the doc strings for toolkits like wxWidgets, would be nice, but they aren't necessary to get a program to compile and dealing with them in the right way has to be part of a much larger translation effort.

I think you are reading too fast. My point all along was that I will not touch comments; the translator will just skip over them.

beqroson · November 11, 2013, 07:56:06 PM

Quote from: dmoore on November 11, 2013, 06:22:47 PM
your UTF-82 talk

Well, you are probably correct about overcomplicating that stuff. I hope the UTF82 is just a temporary hiccup. If I decide to implement it, it will take far more time than I have available, not to mention being unnecessarily complex.

beqroson · November 11, 2013, 10:18:24 PM

Quote from: beqroson on November 05, 2013, 08:18:20 PM
The general idea is to make it possible to use any programming language in the Code::Blocks IDE with very little preparation. I am not talking about just C++, but ANY language. By this I mean compiling, code completion, highlighting, everything that a developer needs.

Wtf!! What the hell was I talking about? My mouth must be running at a higher clock speed than my mind. I mean, OK for an idea, but it is not easy to implement in an afternoon. I knew that already, but I talk too much!

beqroson · November 11, 2013, 10:42:34 PM

However, I am looking to create the algorithm for the translation phase. This is what I came up with so far:

In order to translate terms in the source document, the basic algorithm is as follows:

Use a source string and a destination string for the pass.
Scan the document for comments and skip every comment by copying it directly to the destination.
Create a hash for each term found by doing the following:
- For each byte that is not within a comment, check whether it is equal to or greater than 0x80; if so, it is a character to check.
- If it is less than 0x80, use a lookup table to determine whether it is a character to check, e.g. [A-Z, a-z, 0-9].
- For all characters to check, create a hash value by combining the bytes round robin into the eight byte positions of a uint64_t.
- When a byte that is not a character to check is encountered, stop creating the hash and save the length of the byte sequence.
With the length of the string, go into a lookup array using the length as the index.
The lookup array yields two values, a lower bound and an upper bound, which delimit a subset of a long list of hashes.
We now know the range of the long hash table in which a match can be found.
Using a binary search, the upper and lower bounds converge until a definite match is found.
If the hash value is found in the list, there is a high probability that it is the correct match, but that should probably be confirmed by double checking.
Using the index of the matched hash, enter the same index into another lookup table to retrieve another index and a length that point into a compact table holding the replacement string, then write that string to the destination.
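
The steps above could be sketched roughly as follows. This is only a minimal illustration in Python, not a finished design. Several things are assumptions of mine: the bytes are combined with XOR (the description only says "round robin"), comments are C++ style (`//` and `/* */`), the length lookup is a dict instead of an array, string literals are not treated specially, and all function names (`round_robin_hash`, `build_tables`, `lookup`, `translate`) are made up for the example.

```python
import bisect

# Lookup table: bytes >= 0x80 (UTF-8 multi-byte) and ASCII [A-Za-z0-9] are term bytes.
TERM_BYTE = [b >= 0x80 or chr(b).isalnum() for b in range(256)]

def round_robin_hash(term: bytes) -> int:
    # Assumption: byte i is XORed into byte position (i mod 8) of a 64-bit value.
    h = 0
    for i, b in enumerate(term):
        h ^= b << (8 * (i % 8))
    return h

def build_tables(table: dict):
    # Entries are sorted by (length, hash) so each length owns one contiguous range.
    pairs = ((k.encode("utf-8"), v.encode("utf-8")) for k, v in table.items())
    entries = sorted((len(s), round_robin_hash(s), s, d) for s, d in pairs)
    hashes = [e[1] for e in entries]
    bounds = {}  # length -> (lower bound, upper bound) into `hashes`
    for i, entry in enumerate(entries):
        lo, _ = bounds.get(entry[0], (i, i))
        bounds[entry[0]] = (lo, i + 1)
    return hashes, bounds, entries

def lookup(term: bytes, hashes, bounds, entries):
    lo, hi = bounds.get(len(term), (0, 0))
    h = round_robin_hash(term)
    i = bisect.bisect_left(hashes, h, lo, hi)  # binary search inside the range
    while i < hi and hashes[i] == h:
        if entries[i][2] == term:              # the "double check"
            return entries[i][3]
        i += 1
    return None

def translate(source: bytes, tables) -> bytes:
    out = bytearray()
    i, n = 0, len(source)
    while i < n:
        if source.startswith(b"//", i):        # line comment: copy through unchanged
            end = source.find(b"\n", i)
            end = n if end < 0 else end
            out += source[i:end]
            i = end
        elif source.startswith(b"/*", i):      # block comment: copy through unchanged
            end = source.find(b"*/", i + 2)
            end = n if end < 0 else end + 2
            out += source[i:end]
            i = end
        elif TERM_BYTE[source[i]]:             # collect one term and try to replace it
            j = i
            while j < n and TERM_BYTE[source[j]]:
                j += 1
            term = source[i:j]
            replacement = lookup(term, *tables)
            out += term if replacement is None else replacement
            i = j
        else:                                  # any other byte is copied as-is
            out.append(source[i])
            i += 1
    return bytes(out)

tables = build_tables({"funzione": "function", "se": "if"})
print(translate(b"funzione f() { se (x) y; } // se funzione\n", tables).decode("utf-8"))
```

Note that with the character class [A-Z, a-z, 0-9] taken literally, an underscore ends a term, so an identifier such as `my_var` is looked up as two separate terms.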

beqroson · November 11, 2013, 11:31:23 PM

The question is whether the found hash value needs to be double checked at all. Since the hash value is equal to the byte sequence for the first eight bytes, all short strings will have an exact hash match. For any string of nine bytes or more, the probability of a hash collision must be very small. Doing the double check will take a relatively large share of the processing time. The shortest string of eight bytes, however, is two UTF-8 characters with very high Unicode code points. The question is also how often those will show up. More commonly, UTF-8 strings of eight bytes will consist of four or more characters.

An example where a hash collision will occur is the terms "templates" and "semplatet". Among anagrams of these nine-byte strings, the only ones that collide are those where the first and last letters are interchanged. The question is whether one could wait until a collision is detected by the user, and then let the user switch to a checked version of the hash. I know it is substandard to take shortcuts like this. That is why I am asking.
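
To make the collision concrete, here is a small sketch in Python. It assumes the round robin hash XORs byte i into byte position (i mod 8) of a 64-bit value; the posts do not say which operation combines the bytes, so XOR and the name `round_robin_hash` are my assumptions.

```python
def round_robin_hash(term: bytes) -> int:
    # Assumption: byte i is XORed into byte position (i mod 8) of a 64-bit value.
    h = 0
    for i, b in enumerate(term):
        h ^= b << (8 * (i % 8))
    return h

# Up to eight bytes the hash is just the bytes themselves, so it is exact.
short = b"template"
print(round_robin_hash(short) == int.from_bytes(short, "little"))  # True

# Nine bytes: the ninth byte wraps around and lands on the first one.
print(round_robin_hash(b"templates") == round_robin_hash(b"semplatet"))  # True

# Under the XOR assumption a non-anagram can collide as well,
# because 'a' ^ 'f' gives the same value as 't' ^ 's'.
print(round_robin_hash(b"templates") == round_robin_hash(b"aemplatef"))  # True
```

So with this particular hash, the unchecked version is only safe for terms of at most eight bytes.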

beqroson · November 12, 2013, 12:31:19 AM

Probably it is going to be checked; that is the only reasonable way. Besides, the search string to check can reside in the same table as the replacement string, right before it. Thus the term is first checked, then replaced by simply continuing in the same table, using two length values and one index value.
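
A minimal sketch of that table layout in Python, assuming each entry is described by one offset and two lengths. The names `pack` and `check_and_replace` and the record layout are hypothetical, chosen only for the illustration.

```python
def pack(pairs):
    # Store each search string immediately followed by its replacement string.
    blob = bytearray()
    records = []  # one (offset, search length, replacement length) per entry
    for search, replacement in pairs:
        records.append((len(blob), len(search), len(replacement)))
        blob += search + replacement
    return bytes(blob), records

def check_and_replace(term: bytes, blob: bytes, record):
    offset, search_len, replacement_len = record
    # First confirm the term against the stored search string...
    if blob[offset:offset + search_len] != term:
        return None
    # ...then keep going in the same table to read the replacement.
    start = offset + search_len
    return blob[start:start + replacement_len]

blob, records = pack([(b"funzione", b"function"), (b"se", b"if")])
print(blob)                                        # b'funzionefunctionseif'
print(check_and_replace(b"se", blob, records[1]))  # b'if'
```

The check and the replacement then touch neighbouring bytes, which is presumably the point of keeping them in one table.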

beqroson · November 12, 2013, 12:32:35 AM

Great, the inner loop in its most basic form seems to become somewhat defined. Now, the next item to solve in the list will be how to treat variable scope.

beqroson · November 12, 2013, 12:59:23 PM

One way to treat variable scope is to... not treat variable scope. To avoid variable scope is to avoid using parsing technology. To avoid using parsing technology is to create a one-for-all mechanism for the translation.

The idea is to use several versions of the source files as follows:

________________________________________________________
|
| Specific language cpp-source file, *.hscrp.h and *.cppscrp.cpp
|________________________________________________________

A
| Lossless bidirectional translation
V

_________________________________________________________
|
| Common language cpp-source file, *.hlang.h and *.cpplang.cpp
|________________________________________________________

| One way translation
V
___________________________________
|
| Normal cpp-source file, *.h and *.cpp
|__________________________________
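
For the "lossless bidirectional" step, a small sketch in Python of what losslessness requires from the translation table. It assumes a plain word-for-word table and uses a regular expression instead of the hash lookup, purely to keep the example short; `invert` and `translate_words` are made-up names.

```python
import re

def invert(table: dict) -> dict:
    # The reverse direction only exists if no two terms share a translation.
    inverse = {v: k for k, v in table.items()}
    if len(inverse) != len(table):
        raise ValueError("table is not one-to-one, so translation would lose information")
    return inverse

def translate_words(text: str, table: dict) -> str:
    # Replace every whole word that appears in the table, leave the rest alone.
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)

table = {"se": "if", "mentre": "while"}
specific = "mentre (x) { se (y) z(); }"
common = translate_words(specific, table)
print(common)                                               # while (x) { if (y) z(); }
print(translate_words(common, invert(table)) == specific)   # True
```

The round trip only holds as long as the original file does not already contain words from the target vocabulary (for example an identifier named `if`), since those would be translated on the way back.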

thomas · November 12, 2013, 07:18:34 PM

Quote from: beqroson on November 10, 2013, 04:55:51 PM
Yes, my definition was that both the programming language and the native language can be one. Such as if I write wholly in English, ie "function DoSomething()" or in Italian, ie "funzione FareQualcosa()", then both the native and the programming sentences could be categorized as "Italian" language.

Now, in the world of programming, I was thinking that the translation can be only to exchange words one by one, straight.

Don't get me wrong on that, but this is the most stupid idea I've heard in a while.

Not only that, but it also won't work. Languages do not translate word by word, and languages have grossly different grammar. Many languages have characters that do not exist in others. What if someone writes Tagalog or Chinese and you expect Italian or German? How is this supposed to work? Do you expect comments to be magically translated as well?
Quite a few terms translate in an awkward manner, to say the least, even when done by professional translators. I regularly have to stop and think what they're trying to say when I see IT translations from English to my native language done by professionals working for multi-million-dollar companies. Let alone word-by-word computer translation.

Plus, most people who are moderately familiar with programming are also proficient in English.

So much for natural languages, and as far as "any programming language" goes, I can think of at least 6 grossly different categories of languages, and these are certainly not all:

compiler based bytecode languages (e.g. Java)
compiler and linker based languages (e.g. C or C++)
interpreted/bytecode languages without explicit compiler (e.g. Python, Lua,... )
interpreted/bytecode embedded languages (e.g. AngelScript, Squirrel, and again Python, Lua)
interpreted/bytecode remote languages (e.g. PHP)
weirdo languages that do near unpredictable stuff (e.g. bash script, perl)
weirdo languages that nobody can understand (e.g. Lisp)

Some of these need a compiler invoked, some of them need the executable to be linked afterwards. Some need the binaries and resources packed in a zip file and a bytecode interpreter launched afterwards instead.
Some need an interpreter launched, some need a host application (including bindings).
Some need files to be uploaded to a different machine where an interpreter runs as a server process.
Some need ... something else.

All of these categories are so grossly different that it is hardly possible to pack them all into one unified build process or one unified notion of a "project".

beqroson · November 12, 2013, 09:08:18 PM

Quote from: thomas on November 12, 2013, 07:18:34 PM
Don't get me wrong on that, but this is the most stupid idea I've heard in a while.

Ok, I will not get you wrong.

Then it may also become the most stupid idea you have heard in a while that actually gets implemented.

beqroson · November 12, 2013, 09:57:10 PM

I think it was an interesting and well-written comment by thomas. But I do not know how to interpret the message. I will assume that the message was not directed at me, but instead intended for all the other forum members to get a more "healthy" viewpoint.

oBFusCATed · November 12, 2013, 10:05:38 PM

Quote from: beqroson on November 12, 2013, 09:57:10 PM
I will assume that the message was not directed at me, but instead intended for all the other forum members to get a more "healthy" viewpoint.

beqroson · November 12, 2013, 10:19:25 PM

Quote from: oBFusCATed on November 12, 2013, 10:05:38 PM
Quote from: beqroson on November 12, 2013, 09:57:10 PM
I will assume that the message was not directed at me, but instead intended for all the other forum members to get a more "healthy" viewpoint.

OK, if the idea is that bad, then I am ready to defend it politically also! So, if this has political implications, hit me! I am ready and will NOT get angry/backstabbing/sad.

Jenna · November 12, 2013, 10:43:36 PM

I think this topic has left the scope of our forum by far.

Note: this forum is dedicated to Code::Blocks and related themes, but this discussion has become much more general.

Please stop the discussion (or move it to another platform), or I will lock the topic.
