# word2vec-api

Simple web service providing a word embedding API. The methods are based on the Gensim Word2Vec implementation. Models are passed as parameters and must be in the word2vec text or binary format.
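For reference, "word2vec text or binary format" means a file that Gensim can load directly. A minimal sketch of such a load (assuming a recent Gensim; the exact call used inside the service may differ, and the file name below is only an example):

```python
from gensim.models import KeyedVectors

# Load a model in word2vec binary format; set binary=False for the text format.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(model.most_similar("restaurant", topn=5))
```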

* Install dependencies

```bash
pip2 install -r requirements.txt
```

* Launch the service

```bash
python word2vec-api.py --model path/to/the/model [--host host --port 1234]
```

or

```bash
python word2vec-api.py --model /path/to/GoogleNews-vectors-negative300.bin --binary BINARY --path /word2vec --host 0.0.0.0 --port 5000
```

* Example calls (quote the URLs so the shell does not interpret `&`)

```bash
curl "http://127.0.0.1:5000/word2vec/n_similarity?ws1=Sushi&ws1=Shop&ws2=Japanese&ws2=Restaurant"
curl "http://127.0.0.1:5000/word2vec/similarity?w1=Sushi&w2=Japanese"
curl "http://127.0.0.1:5000/word2vec/most_similar?positive=indian&positive=food[&negative=][&topn=]"
curl "http://127.0.0.1:5000/word2vec/model?word=restaurant"
curl "http://127.0.0.1:5000/word2vec/model_word_set"
```

Note: the "model" method returns a base64 encoding of the word's vector, and "model_word_set" returns a base64-encoded pickle of the model's vocabulary.
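A minimal Python client sketch for the calls above (assuming the `requests` and `numpy` packages; the base URL must match your `--host`, `--port`, and `--path` flags, and the response format and vector dtype below are assumptions that may vary with the model and service version):

```python
import base64

import numpy as np
import requests

BASE = "http://127.0.0.1:5000/word2vec"  # adjust to your launch flags

# Similarity between two words; the response body is the score.
sim = requests.get(f"{BASE}/similarity", params={"w1": "Sushi", "w2": "Japanese"})
print(sim.text)

# Nearest neighbours; passing a list repeats the query parameter
# (positive=indian&positive=food), as in the curl examples.
near = requests.get(f"{BASE}/most_similar",
                    params={"positive": ["indian", "food"], "topn": 5})
print(near.text)

# Raw vector for a word: the response is base64; here we assume it decodes
# to float32 bytes, which depends on the model that was loaded.
vec = requests.get(f"{BASE}/model", params={"word": "restaurant"})
print(np.frombuffer(base64.b64decode(vec.text), dtype=np.float32)[:10])
```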

## Where to get a pretrained model

If you do not have domain-specific data to train on, it can be convenient to use a pretrained model. Please feel free to submit additions to this list through a pull request.

| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training Algorithm | Context window size | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW, ~5 | link |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | link |
| Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
| DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
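Several of the models above are distributed in GloVe's plain-text format rather than the word2vec format the service expects, so a conversion step is usually needed first. A minimal sketch using Gensim's bundled converter (assuming a Gensim version that still ships `gensim.scripts.glove2word2vec`; the file names are only examples):

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Prepend the "<vocab size> <dimensions>" header line required by the
# word2vec text format; the raw GloVe files do not contain it.
glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.word2vec.txt")

# Sanity-check that the converted file loads, then point the service at it:
#   python word2vec-api.py --model glove.6B.300d.word2vec.txt
model = KeyedVectors.load_word2vec_format("glove.6B.300d.word2vec.txt", binary=False)
print(model.most_similar("restaurant", topn=3))
```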