News:

As usual while waiting for the next release - don't forget to check the nightly builds in the forum.

Main Menu

Code completion with partial matching

Started by dmoore, January 17, 2014, 11:12:24 PM

Previous topic - Next topic

dmoore

@alpha: This untested pseudo code (which looks a bit like C/C++) is roughly what I had in mind:


int find(const wxString &str, const wxString &substr)
{
    int pos = str.Find(substr);
    return pos == wxNOT_FOUND? str.Len() : pos;
}

wxString acronym(const wxString &str)
{
    wxString s = str.Mid(0,1);
    for (i=1;i<str.Len();++i)
    {
        n = str.Mid(i,1);
        bool underscore = false;
        while (n == _T("_"))
        {
            underscore=true;
            ++i;
            n = str.Mid(i,1);
        }
        if (n==wxEmptyString)
            break;
        if (underscore || n == n.upper())
            s<<n;
    }
    return s;
}

int min(int x, int y)
{
    return x<y? x:y;
}

int score(const wxString &a,const wxString &value)
{
    wxString al = a.Lower();
    wxString aa = acronym(a);
    wxString aal = aa.Lower();
    int pos_a = find(a, value)*4;
    int pos_al = find(al, value.Lower())*4+1;
    int pos_aa = find(aa, value)*4+2;
    int pos_aal = find(aal, value.Lower())*4+3;
    return min(min(pos_a,pos_al),min(pos_aa,pos_aal));
}


possible that your code does exactly the same thing, but I had trouble getting my head around the bit shifts. Would want to make sure the equal scored items are alphabetized as well.
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]

Alpha

I believe your pseudo code is fairly similar to what my code attempts.  It is definitely a lot more readable than my code, so if I am able to get it to function similarly, I will try to use it to replace my current implementation.

Sorry about the bit shifts; I used them in CalcPrefixValue() so that the output score would be in the same range as CalcValue().  (Using shifts in CalcValue() was (sort of) necessary because it attempts to quantize different states, and each of which must have greater impact on the output, than the combination of all following tests.)

(I apparently got lost in the bit shifting in my code as well, because the CalcPrefixValue() I posted has a fairly bad overflow error.  ...  I will try to post a full patch later today containing the combination of ranking systems.)

Assuming the code is working how it is supposed to, alphabetization is already handled; items of equal weight are ordered by index, and the indices are pre-sorted to be alphabetical.

dmoore

My pseudo code isn't completely right either. The "Find" calls that return wxNOT_FOUND should actually be ignored and not given a (albeit large) positive score.
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]

Alpha

Attached is the hybrid.  To test without adaption, put a return at the beginning of CCManager::DoUpdateAdaption().

What do you think of the current performance?  It *feels* acceptable to me right now (but I have not tested on multiple machines).

Alpha

Another thought: the adaption mechanism can be made more deterministic running it in 'learning mode' for a session.  Afterwards (in subsequent sessions), use the results as static data to run the heuristics.

dmoore

Seems to mostly work, but in python, with completion on "os." then type "mod" I get: "mknod, chmod, fchmod, removedirs" in that order

I expected to only see chmod and fchmod, because the other two aren't exact matches for mod or acronyms that include mod.
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]

dmoore

Also, with the disappointing lack of useful namespace usage in C++, I can see why you would want the "adaption" algorithm, but at least for python it seems less of a necessity. It would be good to cleanly decouple that functionality from the basic functionality, so that it can be easily switched off. (I haven't read your code closely so maybe that's already done.) Another approach would be to give priority to tokens that are used in the local file, but I'm guessing that would be quite a bit trickier to implement.

Another thing that occurrs to me is that we only use the YCM style filtering to limit the already created list of items that match the typed prefix before CC opens, not what is actually add to the list in the first place. This is probably for the best?

Quote from: Alpha on April 16, 2014, 04:17:07 AM
Another thought: the adaption mechanism can be made more deterministic running it in 'learning mode' for a session.  Afterwards (in subsequent sessions), use the results as static data to run the heuristics.

Seems like if you would use it at all that it may as well be always running. Over time your coding evolves... Maybe just need a way of limiting the size of the dictionary, or a way of making sure that stuff that hasn't been used in a while get de-prioritized?
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]

Alpha

Quote from: dmoore on April 17, 2014, 04:17:15 PM
Seems to mostly work, but in python, with completion on "os." then type "mod" I get: "mknod, chmod, fchmod, removedirs" in that order [...]
"mod" matches all of them because "mknod" and "removedirs".  "mknod" is first, because it starts with the same character that you first typed; "chmod" and "fchmod" are next because the contain an exact match of "mod".  This the (current) behaviour of the prioritization.

Quote from: dmoore on April 17, 2014, 05:15:31 PM
Also, with the disappointing lack of useful namespace usage in C++, I can see why you would want the "adaption" algorithm, but at least for python it seems less of a necessity. It would be good to cleanly decouple that functionality from the basic functionality, so that it can be easily switched off. (I haven't read your code closely so maybe that's already done.) Another approach would be to give priority to tokens that are used in the local file, but I'm guessing that would be quite a bit trickier to implement.
Not exactly clean :) , but the two easiest ways to disable are either comment out the line that searches the cache, or disable the function that adds items to the cache (or both).
Priority of local tokens would have to be done on the plugin end, however, the infrastructure is already supported.  If the plugin marks local tokens with a lower (meaning better) priority, the sorting algorithm boosts their score (assuming that these priority tokens account for less than 1/4 of the total tokens).

Quote from: dmoore on April 17, 2014, 05:15:31 PM
Another thing that occurrs to me is that we only use the YCM style filtering to limit the already created list of items that match the typed prefix before CC opens, not what is actually add to the list in the first place. This is probably for the best?
Sort of; CCManager sends the plugins a "fake" context, that is at most one character long.  That seemed to provide an acceptable balance showing applicable results, and cost of building/filtering a large list.

Quote from: dmoore on April 17, 2014, 05:15:31 PM
Seems like if you would use it at all that it may as well be always running. Over time your coding evolves... Maybe just need a way of limiting the size of the dictionary, or a way of making sure that stuff that hasn't been used in a while get de-prioritized?
Currently, the dictionary is not saved on close.  However, when making it persistent, I think saving only the 64 (or some other magic constant) most recently (or perhaps frequently?) used entries to the config file, would be a good option.

dmoore

#38
Quote from: Alpha on April 23, 2014, 04:47:36 PM
"mod" matches all of them because "mknod" and "removedirs".  "mknod" is first, because it starts with the same character that you first typed; "chmod" and "fchmod" are next because the contain an exact match of "mod".  This the (current) behaviour of the prioritization.

I understood that, I just think it's bad. At a minimum, the exact matches should get priority over inexact matches, but I'm not really convinced that you should match imperfect matches at all unless they fit the acronym rules. (That's what my pseudocode was attempting to do)
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]

Alpha

What about, for example: clanccgcext -> clang_codeCompleteGetContexts() (from libClang)?  Imperfect matches allow typing a longer name by (mentally) skipping through, typing only the parts where it differentiates from the other items listed.
I find my typing habit often is: type a prefix (to bring up autocomp), enter acronym for the remaining parts, (if choice is still "far") enter suffix.

(I still am trying to get used to this; it is a completely different way of thinking.)

dmoore

Quote from: Alpha on April 23, 2014, 09:17:44 PM
What about, for example: clanccgcext -> clang_codeCompleteGetContexts() (from libClang)?  Imperfect matches allow typing a longer name by (mentally) skipping through, typing only the parts where it differentiates from the other items listed.
I find my typing habit often is: type a prefix (to bring up autocomp), enter acronym for the remaining parts, (if choice is still "far") enter suffix.

I guess I can see that, but how about making the inexact matches lowest priority. So an ordering like

exact match (ordered by position of first matching char, then alphabetical)
exact abbreviation match (ordered by position of first matching char, then alphabetical)
inexact matches (ordered by position of first matching char, then alphabetical)

Quote
(I still am trying to get used to this; it is a completely different way of thinking.)

Agree and me too.
Python plugins: [url="https://github.com/spillz/codeblocks-python"]https://github.com/spillz/codeblocks-python[/url]
Code::Blocks Daily Builds -- Ubuntu PPA: [url="https://launchpad.net/~damien-moore/+archive/codeblocks"]https://launchpad.net/~damien-moore/+archive/codeblocks[/url]