This post was written in a hurry. Please feel free to contact me via the link at the bottom of the page.

Principle

Some time ago, I came across the concept of RPA, which stands for Robotic Process Automation. This is a new type of robot that uses artificial intelligence techniques to learn by itself: instead of a programmer telling it what to do through lines of code instructions, the robot learns by observing and imitating a human operator. It is a very interesting technology because it is much more flexible. An example is the self-driving car built by George Hotz, which uses the same principle to learn how to drive.

I am very interested in this kind of technology myself, and I wondered how I could have heard about it earlier. I then noticed that the corresponding Wikipedia article only dated from last summer.

I then had the idea of parsing Wikipedia automatically to list new articles on a topic, since a new article can potentially indicate a new technology, a new ecosystem actor…

The first step of the project is to download two Wikipedia dump extracts taken a few months apart. Dumps can be found at this URL: https://dumps.wikimedia.org/enwiki/.
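As a small sketch of this step, the URLs for the two dated dumps used later in this post can be built as below. The YYYYMMDD directory layout is an assumption on my part; check the index page above for the dates actually available.

```python
# Build download URLs for the two dump extracts (dates from this post's example).
# The /enwiki/YYYYMMDD/ directory layout is an assumption; verify on the index page.
dates = ('20150403', '20151201')
urls = ['https://dumps.wikimedia.org/enwiki/%s/enwiki-%s-pages-meta-current.xml.bz2'
        % (d, d) for d in dates]
for u in urls:
    print(u)
```

The files themselves can then be fetched with wget or curl; they are several gigabytes each.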

The second step of this project is to parse both Wikipedia dumps and collect the titles of the articles speaking about a chosen theme, expressed as a sequence of keywords; in the example below, deep learning.
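This step can be sketched in miniature on a single page element; the XML fragment below is a made-up toy version of one page from a dump.

```python
import re
import xml.etree.ElementTree as ET

# Made-up miniature of a single <page> element from a dump
page = """<page>
  <title>Deep learning</title>
  <revision><text>Deep learning is a branch of machine learning.</text></revision>
</page>"""

root = ET.fromstring(page)
text = root.find('revision').find('text').text
if re.findall('deep learning', text.lower()):
    print(root.find('title').text)  # Deep learning
```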

The last step is to isolate the new article names, i.e. those not present in the older dump.
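This is a plain set difference; with hypothetical title lists from the two dumps, it amounts to:

```python
# Hypothetical article titles extracted from the two dumps
prev = {'Machine learning', 'Neural network'}
curr = {'Machine learning', 'Neural network', 'Deep learning'}

new = [t for t in curr if t not in prev]
print(new)  # ['Deep learning']
```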

Example

Around the topic of deep learning, between 2015-04-03 and 2015-12-01, the following articles emerged or started speaking about deep learning:

Code

The parsing is very basic. The state of the art in text mining offers much more sophisticated tools than a simple findall.
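As one small illustration of the difference, a word-boundary regex already avoids matching the keywords inside longer words, which a plain substring findall does not:

```python
import re

plain = re.compile('deep learning')          # plain substring, as in the script below
strict = re.compile(r'\bdeep learning\b')    # whole-word match only

text = 'two deep learnings were had'
print(bool(plain.search(text)))   # True: matches inside "learnings"
print(bool(strict.search(text)))  # False: "learnings" is not the exact word
```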

A classification of articles by category (people / company / technology / …) could be implemented. In this example we only remove Wikipedia special pages: new_articles = [x for x in new_articles if ':' not in x].

import bz2
import re
import xml.etree.ElementTree as ET


def process_buffer(buf, regexp):
    # Parse one buffered <page> element; return its title if the article
    # text matches every keyword regexp, else None
    root = ET.fromstring(buf)
    text = root.find('revision').find('text').text
    if text and all(len(r.findall(text.lower())) for r in regexp):
        return root.find('title').text
    return None

def list_articles(filename, query):
    # Stream the compressed dump line by line, buffering each <page> element
    articles = []
    regexp = [re.compile(q) for q in query]
    with bz2.BZ2File(filename, 'rb') as inputfile:
        append = False
        for raw in inputfile:
            line = raw.decode('utf-8')
            if '<page>' in line:
                inputbuffer = line
                append = True
            elif '</page>' in line:
                inputbuffer += line
                append = False
                article = process_buffer(inputbuffer, regexp)
                if article is not None:
                    articles.append(article)
                    print(article)
            elif append:
                inputbuffer += line
    return articles


query = ['deep learning']
filename_curr = 'enwiki-20151201-pages-meta-current.xml.bz2'
filename_prev = 'enwiki-20150403-pages-meta-current.xml.bz2'
articles_curr = list_articles(filename_curr, query)
articles_prev = list_articles(filename_prev, query)


def diff(a, b):
    # Elements of a that are not in b
    b = set(b)
    return [aa for aa in a if aa not in b]


new_articles = diff(articles_curr, articles_prev)
# Drop Wikipedia special pages (titles carrying a namespace prefix)
new_articles = [x for x in new_articles if ':' not in x]
print(new_articles)