News:

As usual while waiting for the next release - don't forget to check the nightly builds in the forum.

Main Menu

Macro replacement in CC(tokenizer) suggestion

Started by ollydbg, June 29, 2009, 04:10:29 AM

Previous topic - Next topic

ollydbg

Hi, In the current architecture, the function ThisOrReplacement(m_Token) was only called in Tokenizer::GetToken()

wxString Tokenizer::GetToken()
{
   m_UndoTokenIndex = m_TokenIndex;
   m_UndoLineNumber = m_LineNumber;
   m_UndoNestLevel  = m_NestLevel;

   if(m_PeekAvailable)
   {
       m_TokenIndex = m_PeekTokenIndex;
       m_LineNumber = m_PeekLineNumber;
       m_NestLevel  = m_PeekNestLevel;
       m_Token   = m_PeekToken;
   }
   else
       m_Token = DoGetToken();

   m_PeekAvailable = false;

   return ThisOrReplacement(m_Token);
}



To accelerate this "Macro replacement", I think it should be moved to DoGetToken.

Here are the reasons:

1, ThisOrReplacement(m_Token) internally use a wxString--> wxString map container, so, it will use a search algorithm in this map(normally this will cause a search on a balanced BST), this will take a lot of time.

2, we can avoid many situations to call this function, for example, when m_Token is '{'  or wxEmptyString or many other string that shouldn't need macro expansion.

Any comments?

Thanks



If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

ollydbg

Also, I suggest that when the Tokenizer return a Token( wxString ), it should also combined with a "type", which means the parser can use this type information to do Syntax Analysis.

If I can remember, Ceniza call this a "Smart lexer"  :D
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

thomas

Quote from: ollydbg on June 29, 2009, 04:10:29 AMinternally use a wxString--> wxString map container, so, it will use a search algorithm in this map(normally this will cause a search on a balanced BST), this will take a lot of time.

2, we can avoid many situations to call this function, for example, when m_Token is '{'  or wxEmptyString or many other string that shouldn't need macro expansion.
I'd be careful with such an optimisation, since a map rarely needs to do more than 4-5 lookups in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

ollydbg

Quote from: thomas on June 29, 2009, 09:40:37 AM
I'd be careful with such an optimisation, since a map rarely needs to do more than 4-5 lookups in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
Not fully understand you comments :(

I mean if we return a wxString from DoGetToken(),  (for example '{') as we know '{' certainly don't need to do macro replacement, so ,we can avoid calling ThisOrReplacement('{');.

Also, there are many wxString like '{' :D



If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

thomas

Quote from: ollydbg on June 29, 2009, 09:57:35 AM
Quote from: thomas on June 29, 2009, 09:40:37 AM
I'd be careful with such an optimisation, since a map rarely needs to do more than 4-5 lookups in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
Not fully understand you comments :(
I understood that your idea is to catch special cases which cannot possibly be macros, so they need not looked up in the map<wxString,wxString>.

In other words, replace code that looks like:
return the_map.find(token);

with something like:
if(token.IsEmpty() || (token == one_constant) || (token == two_constant) || (token == three_constant))
    return token;
else
    return the_map.find(token);



My point is that maps have O(log(n)) lookup, so unless a source has 20 billion preprocessor defines, it is really nothing to worry about. For "normal" amounts, log(n) will be something like 4, maybe 5. Let's assume the worst case of 5. One "operation" is a compare and a branch.

Adding a line like the above will remove 5 operations done by the map lookup in the best case, at a cost of 1-4 additional operations (average 2). So we save 5-1 = 4 operations in the best, and 5-2 = 3 operations in the average case.
In the worst case, it will add 4 operations to the existing 5, almost doubling the work.
This scenario might still be advantageous, but it's not likely that it will be a big win.

Now, you were talking of "many" cases. Let's say "many" means 10. In this case, we will do 1-10 operations (5 average) to eliminate the 5 lookups done by the map.
So, on the average, we replace 5 operations with 5 operations (zero win, but more complicated code), and in the worst case, we add 10 operations, tripling the amount of work done.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

oBFusCATed

Another option is to replace std::map with std::tr1::unordered_map, it if is available (gcc > 4.x.y, not sure which is the minimal x thought)

std::tr1::unordered_map is hash map, so the lookup is O(1) operation.
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

ollydbg

Quote from: thomas on June 30, 2009, 12:23:17 PM

with something like:
if(token.IsEmpty() || (token == one_constant) || (token == two_constant) || (token == three_constant))
    return token;
else
    return the_map.find(token);



Hi, thomas, Thanks for the full explanation about your point.

But I think you still misunderstand my mind.  :(

Look at the code in DoGetToken(), I have made a pseudo code below. We can add a bool variable



bool NeedReplacement;

if (c == '_' || wxIsalpha(c))
    {         
         ..........
          NeedReplacement = true;
    }
    else if (wxIsdigit(CurrentChar()))
    {
          .........
          NeedReplacement = false;
    }
    else if (CurrentChar() == '"' || CurrentChar() == '\'')
    {
         ......
          NeedReplacement = false;
    }
    else if (CurrentChar() == '(')
    {
         ......
         NeedReplacement = false;
    }
    else
    {
            .......
    }

if(!NeedReplacement)
    return token;
else
    return the_map.find(token);


Because we have already know the information of the current token, we can still save quite a lot of time.  :D

If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

ollydbg

Quote from: oBFusCATed on June 30, 2009, 01:37:44 PM
Another option is to replace std::map with std::tr1::unordered_map, it if is available (gcc > 4.x.y, not sure which is the minimal x thought)

std::tr1::unordered_map is hash map, so the lookup is O(1) operation.
This is the great news.
If it is really true that hash map runs faster, I do suggest change to use std::tr1::unordered_map, thanks for the suggestion!!!
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Kazade

Quote from: oBFusCATed on June 30, 2009, 01:37:44 PM
Another option is to replace std::map with std::tr1::unordered_map, it if is available (gcc > 4.x.y, not sure which is the minimal x thought)

std::tr1::unordered_map is hash map, so the lookup is O(1) operation.

I personally have been using std::tr1::unordered_map for everything for ages (at least a year), I definitely think if it's available it should be used. Apparently GCC has a __GXX_EXPERIMENTAL_CXX0X__ macro. eg:

#ifdef __GXX_EXPERIMENTAL_CXX0X__
    typedef std::tr1::unordered_map CBMap;
#else
    typedef std::map CBMap;
#endif

CBMap<wxString, wxString> myMap;


or something like that.


oBFusCATed

Kazade: unordered containers do not require the experimental c++-0x, they are in TR1

(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]