This post was written in a hurry. Please feel free to contact me (see the bottom of the page).

The idea here is to build a graph of synonyms and use it to simplify a word cloud (wordle). The word cloud is simplified by merging all words whose distance on the graph is below an adjustable threshold.

You can find an example implementation here.

The first step is to construct a graph of synonyms from a database. A recursive function does this: it takes a word as an argument, retrieves its synonyms, then the synonyms of those synonyms, and so on. The recursion depth is adjustable; here depth is set to 1 to speed up the download.

get_distance(a, b) then returns the distance between two words on the graph, that is, the number of edges on the shortest path between them, or -1 if no path exists.

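import socket

import networkx as nx
import requests
import socks
from bs4 import BeautifulSoup


def get_synonyms(word, depth):
    """Add the synonyms of `word` to the global graph G, recursing `depth` levels."""
    print(word)
    if depth == 0:
        return
    url = 'http://www.crisco.unicaen.fr/des/synonymes/%s'
    r = requests.get(url % word)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Synonyms are the link texts of anchors pointing to other synonym pages;
    # the `x and` guard skips anchors that have no href.
    synonyms = [a.text.strip() for a in soup.find_all(
        'a', href=lambda x: x and x.startswith('/des/synonymes/'))]
    for w in synonyms:
        G.add_edge(word, w)  # no-op if the edge already exists
        get_synonyms(w, depth - 1)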

# Route all connections through a local SOCKS5 proxy (9050 is Tor's default port).
socket.socket = socks.socksocket
socks.setdefaultproxy(proxy_type=socks.PROXY_TYPE_SOCKS5, addr="127.0.0.1", port=9050)

G = nx.Graph()
for word in unique_words:
    get_synonyms(word, depth=1)


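def get_distance(a, b):
    """Return the number of edges between a and b on G, or -1 if unreachable."""
    try:
        return nx.shortest_path_length(G, a, b)
    except (nx.NetworkXException, KeyError):  # no path, or a word is not in G
        return -1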
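
To experiment with the depth-limited recursion and the distance function without downloading anything, the same idea can be run against a small in-memory dictionary. This is only a sketch under stated assumptions: the TOY_SYNONYMS table is made-up data, add_synonyms and graph_distance are hypothetical names, and a plain breadth-first search over a dict of sets stands in for networkx so that the snippet needs only the standard library.

```python
from collections import deque

# Hypothetical, hand-written synonym table standing in for the online database.
TOY_SYNONYMS = {
    'big': ['large', 'huge'],
    'large': ['big', 'wide'],
    'huge': ['big', 'enormous'],
    'wide': ['large'],
    'enormous': ['huge'],
    'cat': ['feline'],
    'feline': ['cat'],
}


def add_synonyms(graph, word, depth):
    """Link word to its synonyms, recursing up to `depth` levels."""
    if depth == 0:
        return
    for synonym in TOY_SYNONYMS.get(word, []):
        # Undirected edge: store it in both directions.
        graph.setdefault(word, set()).add(synonym)
        graph.setdefault(synonym, set()).add(word)
        add_synonyms(graph, synonym, depth - 1)


def graph_distance(graph, a, b):
    """Breadth-first search: edges on the shortest path, or -1 if none."""
    if a not in graph or b not in graph:
        return -1
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return -1


toy_graph = {}
for start in ['big', 'cat']:
    add_synonyms(toy_graph, start, depth=2)

print(graph_distance(toy_graph, 'big', 'large'))      # direct synonyms
print(graph_distance(toy_graph, 'wide', 'enormous'))  # linked through 'big'
print(graph_distance(toy_graph, 'big', 'cat'))        # unrelated words
```

With depth=2 the recursion reaches the synonyms of the synonyms of each starting word, which is why 'wide' and 'enormous' end up connected through 'big' even though neither is a starting word.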

This distance function is then used to merge words that lie within a user-defined threshold distance of each other.

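import json
from collections import Counter

# Count only the words that made it into the graph, most frequent first.
counts = Counter(w for w in words_singular if w in G)
words = [w for w, _ in counts.most_common()]

wordles = []
for max_dist in range(10):
    wordle = []
    merged = set()  # words already absorbed into an earlier family
    for i, word in enumerate(words):
        if word in merged:
            continue
        # Family = this word plus every later (less frequent) word closer than max_dist.
        distances = ((w, get_distance(w, word)) for w in words[i:])
        word_family = [w for w, d in distances if 0 <= d < max_dist]
        # Built-in sum keeps the weight a plain int, which json can serialize.
        weight = sum(counts[w] for w in word_family)
        merged.update(word_family)
        wordle.append({'word': word, 'weight': weight})
    wordles.append(wordle)

# max_dist = 0 matches nothing (every weight is 0), so that first entry is skipped.
with open('data.json', 'w') as outfile:
    json.dump(wordles[1:], outfile)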
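
To see how the threshold changes the result, the merging step can be tried on a handful of words. The following is a minimal sketch, not the code used for the post: the counts, the TOY_DISTANCES table, and the names toy_distance and merge_words are all hypothetical, and the distances are hard-coded instead of being read from the synonym graph.

```python
from collections import Counter

# Hypothetical word counts and pairwise graph distances (made-up data).
counts = Counter({'big': 5, 'cat': 4, 'large': 3, 'huge': 2, 'feline': 1})
TOY_DISTANCES = {
    frozenset(['big', 'large']): 1,
    frozenset(['big', 'huge']): 1,
    frozenset(['large', 'huge']): 2,
    frozenset(['cat', 'feline']): 1,
}


def toy_distance(a, b):
    """Symmetric lookup: 0 for the same word, -1 when the words are unrelated."""
    if a == b:
        return 0
    return TOY_DISTANCES.get(frozenset([a, b]), -1)


def merge_words(counts, distance, max_dist):
    """Merge each word with the less frequent words closer than max_dist."""
    words = [w for w, _ in counts.most_common()]  # most frequent first
    absorbed = set()
    cloud = []
    for i, word in enumerate(words):
        if word in absorbed:
            continue
        family = [w for w in words[i:] if 0 <= distance(w, word) < max_dist]
        absorbed.update(family)
        cloud.append({'word': word, 'weight': sum(counts[w] for w in family)})
    return cloud


print(merge_words(counts, toy_distance, max_dist=1))  # every word stays separate
print(merge_words(counts, toy_distance, max_dist=2))  # direct synonyms are merged
```

With max_dist=1 only a distance of 0 qualifies, so no merging happens; with max_dist=2 'large' and 'huge' are folded into 'big', and 'feline' into 'cat', each family keeping the label of its most frequent word.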