This post was written in a hurry. Please feel free to contact me using the details at the bottom of the page.

Last week I attended a talk by Michalis Vazirgiannis about graphs of words. He specifically referred to two papers:

I took the opportunity to test this modeling approach through this mini project. You can find the output here.

Overview

I started with a list of customer comments, both positive and negative, that I retrieved using import.io. The idea was then to extract the main themes, associate each keyword / keyword group with a sample of comments, and gather everything in a small exploration dashboard.

The script is divided into three parts:

  • Corpus pre-processing (tokenizing, tagging, stemming)
  • Graph of words construction
  • Graph mining: keyword extraction, keyword clustering and text retrieval

It has four parameters:

  • window: the size of the sliding word window used for graph construction
  • quantile: controls the number of extracted keywords
  • num_posts and num_words: the maximum number of posts and words displayed in the d3.js wordle
window = 4
quantile = 50
num_posts = 3
num_words = 30

Corpus pre-processing

Standard operations with conventional libraries (nltk and the Stanford POS tagger). The only originality is that the documents analyzed are in French. The POS tagger is used to keep only the nouns and verbs in the graph of words.

# imports from nltk (POSTagger is the Stanford tagger wrapper)
from nltk.tag.stanford import POSTagger
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

MODEL_PATH = 'stanford-postagger-full-2014-08-27/models/french.tagger'
JAR_PATH = 'stanford-postagger-full-2014-08-27/stanford-postagger.jar'

reg_words = r"(?x)"
reg_words += r"\d+(\.\d+)?\s*%"  # percentages
reg_words += r"| \w'"            # contractions d', l', ...
reg_words += r"| \w+"            # plain words
reg_words += r"| [^\w\s]"        # punctuation
pos_tagger = POSTagger(MODEL_PATH, JAR_PATH, encoding='utf8')
tag_selection = ['NC', 'V', 'VINF', 'VPP']  # common nouns and verb forms
raw_stopword_list = stopwords.words('french')
stemmer = FrenchStemmer()

def txt_to_words(txt):
    tokens = RegexpTokenizer(reg_words).tokenize(txt.lower())
    tags = pos_tagger.tag(tokens)
    tag_filtered = [x[0] for x in tags if x[1] in tag_selection]
    stop_filtered = [x for x in tag_filtered if x not in raw_stopword_list]
    words = [stemmer.stem(x) for x in stop_filtered]
    return words
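
As an illustration, assuming the raw comments scraped with import.io are stored in a list raw_comments (a hypothetical name), the document structure used in the rest of the script can be built like this:

# raw_comments is a hypothetical list of raw comment strings retrieved with import.io
txt_and_words = [{'words': txt_to_words(txt)} for txt in raw_comments]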

Construction of the graph of words

The classic networkx library is used for graph analysis. We build an undirected graph (see Vazirgiannis for benchmarks of directed vs. undirected graphs). The code below simply implements the following procedure: for each word, build edges in the graph with the three (window - 1) preceding words:


Note that my words are stored in a list of documents of the form: [{'words': [...]}, {'words': [...]}, ...].

import networkx as nx

G = nx.Graph()  # may be replaced by nx.DiGraph for a directed graph of words

for x in txt_and_words:
    G.add_nodes_from(x['words'])
    # link each word to the (window - 1) words that precede it
    for i in range(window - 1, len(x['words'])):
        for j in range(1, window):
            G.add_edge(x['words'][i - j], x['words'][i])

G.remove_edges_from(G.selfloop_edges())  # drop self-loops created by repeated words

Graph mining

k-core extraction

The k-core of a graph is the maximal subgraph whose nodes all have degree at least k. In a graph of words, where the degree of a node can be seen as a measure of the centrality of a word in the text, the k-core is a proxy for the words that form the main topics of the text. We use this approach to isolate the keywords of the corpus: we take the k-core where k is set to a quantile of the node degrees.


import numpy as np

# the threshold k is the chosen quantile of the degree distribution
g_core = nx.k_core(G, k=np.percentile(nx.degree(G).values(), quantile))

Clustering Communities

The idea is to split the keywords of the corpus (the k-core nodes) into natural clusters of keywords that statistically appear together in the corpus. The hope is that they highlight the main themes of customer satisfaction / complaint. To do this we use a graph community detection algorithm that computes a partition of the graph nodes maximizing the modularity, using the Louvain heuristic. Then, for each community, we extract the num_words highest-degree nodes. These are the words that we plot in each wordle.

import community  # python-louvain package

partition = community.best_partition(g_core.to_undirected())
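
The extraction of the highest-degree words per community is not shown above. A minimal sketch, assuming partition and g_core as defined above, could look like this:

from collections import defaultdict

# group the k-core nodes by their Louvain community
communities = defaultdict(list)
for node, com in partition.items():
    communities[com].append(node)

# for each community, keep the num_words nodes with the highest degree
top_words = {com: sorted(nodes, key=lambda n: g_core.degree(n), reverse=True)[:num_words]
             for com, nodes in communities.items()}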

Text retrieval

The idea is to compute, for each document and each keyword, a term frequency / inverse document frequency score. Then, given a keyword or group of keywords, this scoring function is used to retrieve the posts with the most meaningful content. The only originality is that we use the TW-IDF score following the Vazirgiannis paper Graph-of-word and TW-IDF: New Approach to Ad Hoc IR.

b = 0.003
avdl = np.mean([len(x['words']) for x in txt_and_words])  # average document length
N = len(txt_and_words)

def twidf(d, t):
    # document frequency of term t
    dft = np.sum([t in x['words'] for x in txt_and_words])
    if dft == 0:
        return 0.0
    d_len = len(txt_and_words[d]['words'])
    # graph of words of the single document d
    dG = nx.DiGraph()  # may be replaced by Graph
    dG.add_nodes_from(txt_and_words[d]['words'])
    for i in range(window - 1, len(txt_and_words[d]['words'])):
        for j in range(1, window):
            dG.add_edge(txt_and_words[d]['words'][i - j], txt_and_words[d]['words'][i])
    # term weight = in-degree of t in the document graph
    twtd = dG.in_degree(t) if t in dG.nodes() else 0  # may be replaced by degree
    return twtd / (1 - b + b * d_len / avdl) * np.log((N + 1.0) / dft)
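
The retrieval step itself is not shown. A minimal sketch, where retrieve_posts is a hypothetical helper ranking the documents by their summed TW-IDF score over a group of keywords, could be:

# retrieve_posts is a hypothetical helper: rank the documents by the sum of the
# TW-IDF scores of the requested keywords and return the num_posts best indices
def retrieve_posts(keywords, n=num_posts):
    scores = [(d, sum(twidf(d, t) for t in keywords)) for d in range(N)]
    scores.sort(key=lambda s: s[1], reverse=True)
    return [d for d, _ in scores[:n]]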