The content of this post is written in a hurry. Please feel free to contact me at the bottom of page.

The authorship of the idea of ‚Äč‚Äčthis post excellent belongs to Andrej Karpathy who proposed to use RNN as a generator of original startup names.

We make a recurrent neural network (RNN) read a list of startup names. Then proposing him a first letter (eg A) he will extrapolate what he has read in giving names that go through his head.

The output of this post is a list of original startup names.

Learning set

It turns out that I had a dump of [AngelList (http://angel.co/), ie a list of about 400k startups. I removed the names with special characters, bringing the list to 100k startups.

company_name = filter(lambda x:re.search(r'^[a-zA-Z]\*$', x), company_name)

Then I built a learning set by concatenating the names separated by a special character.

text = '|'.join(company_name)

Learning

The resulting text can then be sent character by character as the input of a RNN built with the framework [mxnet] (https://github.com/dmlc/mxnet/tree/master/example/rnn). After a few iterations of learning the generated names are recovered by the inverse operation:

start = 'A'
seq_len = 10000
X_input_batch = np.zeros((1,1), dtype="float32")
X_input_batch[0][0] = dic[start]
out = lstm.sample_lstm(sampler, X_input_batch, seq_len)

chars = [lookup_table[int(out[i][0])] for i in range(seq_len)]
rslt = (start + "".join(chars)).split('|')
print len(new_startup_names) # => 1104

We verify their originality.

company_name_low = [x.lower() for x in company_name]
original_startup_names = filter(lambda x: x != '' and x.lower() not in company_name_low, new_startup_names)
original_startup_names = np.unique(original_startup_names)
print len(original_startup_names) # => 1063

Availability of the domain name

We want to verify that the associated domain name is available.

original_domains = [x.lower() + '.com' for x in original_startup_names]

For this we scrap https://www.whoisxmlapi.com, with two small tips:

  • Use PhantomJS to retrieve pure HTML source code of the page (thus parsable) instead of Javascript code.
  • Launch requests through Tor to avoid having our IP being banished if by any chance the website control requests made by robots.
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver

def check_registred(domain):
    print domain
    driver.get(url % domain)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source)
    return np.any([span.text == 'expiresDate' for span in soup.findAll('span', attrs={'class': 'NodeName'})])

url = 'https://www.whoisxmlapi.com/domain-availability-api.php?domainName=%s&outputFormat=xml'
service_args = ['--proxy=127.0.0.1:9050',
                '--proxy-type=socks5']
driver = webdriver.PhantomJS(service_args=service_args)
registred_domains = [check_registred(d) for d in original_domains]
driver.quit()

df = pd.DataFrame({'name':original_startup_names, 'domain':original_domains, 'already_registred':registred_domains})
print df.ix[~df.already_registred, ['name', 'domain']].shape[0] # => 744

This gives a original list of startup names.