Vector Search Engineering Lab — Untitled1

Paulo Maia
5 min read · Nov 7, 2022

Introduction

This blogpost summarizes our team’s work in the MLOps.Community hackathon (Oct 24 – Nov 4, 2022), organized by Redis, Saturn Cloud, MLOps Community, and NVIDIA Inception, and centered on vector search using the arXiv scholarly papers dataset.

Team

With a mix of academic backgrounds and experience in deployment, machine learning, and full stack development, we felt we had a team that could come up with something awesome! 🤩

Ideation Phase

Our team went through a couple of different ideas, some of which complemented each other:

  • Building a “Netflix”/“Goodreads” for academic papers, where users would input a list of papers they had read and get suggestions for similar papers to read next.
  • Summarizing the most similar papers to simplify academic researchers’ lives.
  • Mapping the articles in 3D to make search more interactive.

Overall, we ended up with a mix of all these ideas… Keep reading!

Untitled1 — The product

From the user’s perspective, you can input a set of abstracts and their titles. A multilabel classification model automatically predicts the categories present in those abstracts and pre-fills them for the user in the user interface.

The user can change this selection, keeping only the paper categories they’re interested in, and search for the papers that include all of these categories.

After the search button is clicked, we search for the top-K nearest papers based on the “average” representation of all the papers in the query, enabling the discovery of new scientific papers.

You can play with the demo at https://untitled1-vector-search.community.saturnenterprise.io/.

Let’s now go through the “gears” that make this a fully functioning product!

Vector indexation

For vector indexation, we used sentence-transformers/all-distilroberta-v1 to compute the embedding vectors. We then upserted all the embeddings into the Redis database.

The data was pre-processed by removing Unicode characters, punctuation, and newlines, lowercasing the text, and cleaning up spacing.
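A minimal sketch of that indexation flow, assuming redis-py and a hash-per-paper layout (the key names and toy abstract are illustrative, and creating the RediSearch index itself is omitted):

```python
import re
import string

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")
r = redis.Redis(host="localhost", port=6379)

def preprocess(text: str) -> str:
    """Lowercase, strip non-ASCII/punctuation/newlines, collapse whitespace."""
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.replace("\n", " ").lower()
    return re.sub(r"\s+", " ", text).strip()

abstracts = {"paper:0001": "We propose a novel method for ..."}  # toy example

for key, abstract in abstracts.items():
    vec = model.encode(preprocess(abstract))  # 768-dim float32 vector
    r.hset(key, mapping={
        "abstract": abstract,
        # RediSearch vector fields expect raw float32 bytes (a BLOB)
        "vector": np.asarray(vec, dtype=np.float32).tobytes(),
    })
```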

Training a multilabel model

We decided to train a multilabel classification model, as each paper can have more than one category (125 in total, a hard problem!). We fine-tuned a transformer model (bert-base-uncased) and tried it with and without the preprocessing step. From the outcomes, we saw that the preprocessing function (the same one used for indexation) improved the model results.
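In Hugging Face Transformers, multilabel fine-tuning mostly comes down to setting `problem_type`, which swaps the softmax loss for binary cross-entropy over all 125 labels. A rough sketch with a toy dataset (hyperparameters and data wiring are illustrative, not our exact training code):

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_LABELS = 125  # one output per arXiv category

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
)

class AbstractsDataset(Dataset):
    """Wraps (abstract, multi-hot label vector) pairs for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length", max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)  # floats for BCE
        return item

# Toy example: one abstract tagged with the first two of the 125 categories.
train_ds = AbstractsDataset(
    ["we study graph neural networks for molecules"],
    [[1.0, 1.0] + [0.0] * (NUM_LABELS - 2)],
)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```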

From an error analysis, we noted that the model sometimes confuses categories that are semantically similar, e.g. Classical Analysis and ODEs with Analysis of PDEs. With a few more training samples, and by experimenting with different models, we could perhaps have improved this. These models were trained in Jupyter notebooks on Saturn Cloud, and the pickles were uploaded to a public GCS bucket. With such a huge dataset and transformer models, training time is very long (half a day!).

Front end

We decided to tweak the existing frontend of redis-arXiv-search. We allowed the user to input multiple papers simultaneously (to search for “mean” papers) and to filter by categories in different modes (matching either all or any of the selected categories). We also made the text input field multiline for each paper, making it easier to work with pasted data. In addition, it is now possible to share the current results by copying the address string.

Back end

For the backend, we extended the Redis search query to join categories with “AND” in addition to the “OR” operator. Thus, one can now select only the paper categories they’re interested in and search for papers that include all of these categories, not just one.
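With RediSearch TAG fields, the difference between the two modes is just how the filter expression is built: alternatives inside one clause are OR-ed, while separate clauses are implicitly AND-ed. A hypothetical helper to illustrate:

```python
def category_filter(categories: list[str], match_all: bool) -> str:
    """Build a RediSearch TAG filter over a `categories` field.

    Note: characters like '.' in real arXiv categories (e.g. cs.LG)
    may need escaping in RediSearch query syntax.
    """
    if match_all:
        # AND: separate clauses must all match -> "@categories:{cs.LG} @categories:{q-bio.BM}"
        return " ".join(f"@categories:{{{c}}}" for c in categories)
    # OR: alternatives inside one clause -> "@categories:{cs.LG|q-bio.BM}"
    return f"@categories:{{{'|'.join(categories)}}}"

print(category_filter(["cs.LG", "q-bio.BM"], match_all=True))
```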

Besides this, we also enabled multi-input vector search, using HNSW with cosine similarity. We figured it might be useful for a scientist to know which papers lie “in between” papers A, B, and C, to ease multi-disciplinary article discovery. For that, we extended our web UI to accept multiple article abstracts as input (along with a single set of “years” and “categories” filters). The server encodes each abstract as a vector and then computes the mean of these vectors: for two input papers, this is the midpoint of the line segment between their coordinates; for three papers, it is the centroid of the triangle they form.

Thus, searching for similar articles around this point of interest can surface interesting results for paper discovery, helping to find multi-disciplinary papers related to the given ones. For example, a user could find papers on the use of ML in biology by searching with one article about ML and one about biology.
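A sketch of that multi-input query with redis-py, assuming an index named “papers” with a vector field called “vector” (both names are illustrative):

```python
import numpy as np
from redis.commands.search.query import Query

def multi_paper_search(r, model, abstracts, k=10):
    """KNN search around the mean embedding of several abstracts."""
    vecs = model.encode(abstracts)                             # shape (n, 768)
    mean_vec = np.asarray(vecs, dtype=np.float32).mean(axis=0)

    q = (
        Query(f"*=>[KNN {k} @vector $vec AS score]")  # '*' = no extra filters
        .return_fields("title", "score")
        .sort_by("score")   # cosine distance: smaller = more similar
        .dialect(2)
    )
    return r.ft("papers").search(q, query_params={"vec": mean_vec.tobytes()})
```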

Deployment

The application was developed using FastAPI + Uvicorn. We created a managed Redis database and connected to it using static credentials mounted to the server/Jupyter environment via Saturn Cloud secrets.

We also adjusted the routes from the example app to fit our needs:

  • /predict-categories — for multilabel classification
  • /vectorsearch/text/user — adjusted to process multiple articles. The route also accepts optional filters by years and categories, which adjust the Redis query. For each request, we calculate the average embedding across all input articles and run a single search, replacing one prediction per input (see the sketch below).
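A simplified sketch of that multi-article route (the request schema is an assumption, and `redis_client`, `model`, and `multi_paper_search` are the pieces sketched in the earlier sections):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MultiSearchRequest(BaseModel):
    abstracts: list[str]
    years: list[str] = []        # optional year filter
    categories: list[str] = []   # optional category filter
    match_all_categories: bool = False

@app.post("/vectorsearch/text/user")
def vector_search(req: MultiSearchRequest):
    # One mean embedding (and one Redis query) per request,
    # reusing multi_paper_search from the backend sketch above.
    results = multi_paper_search(redis_client, model, req.abstracts, k=15)
    return {"total": results.total, "papers": [doc.__dict__ for doc in results.docs]}
```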

Our learnings

During this project, we had the opportunity to work with several great technologies, and develop our skills in each of them:

  • With Saturn Cloud, we learned how to deploy Machine Learning solutions and run Machine Learning experiments in a remote Jupyter environment over SSH.
  • With Redis, we learned to use the vector search capability to speed up our searches at large scale!
  • Hugging Face allowed us to build state-of-the-art NLP models quickly and iterate on the results.

Ideas for next steps

Unfortunately, we had tons of ideas we didn’t have time to fully develop, but we leave them in this blogpost for other people to iterate upon:

  • Implementing a hybrid search, using lexical and deep models, as suggested in this talk.
  • Fine-tuning a sentence transformer so that the embeddings better represent our domain (scientific articles). For this, we hypothesized training a siamese network with a triplet loss: the network would learn that papers with more categories in common should have a smaller embedding distance than papers with fewer categories in common (see the sketch after this list). More on this.
  • Using a dimensionality reduction algorithm (PCA or t-SNE) to map the embedding space to 3D so users can visualize it, making the results more appealing.
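For the fine-tuning idea above, a rough sketch of what the triplet setup could look like with sentence-transformers (entirely hypothetical; we did not build this):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# Toy triplet: anchor and positive share categories; the negative does not.
train_examples = [
    InputExample(texts=[
        "an abstract about graph neural networks for molecules",       # anchor
        "another molecular GNN abstract with overlapping categories",  # positive
        "an abstract about medieval history with no shared category",  # negative
    ]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=1)
loss = losses.TripletLoss(model)  # pull positives closer than negatives by a margin

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```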

Overall, it was a great learning experience — thanks to the organizers, to Redis and to Saturn Cloud!

Fun fact to end this: we named our team “Untitled 1” because it looks like we forgot (or were too lazy) to name a Jupyter notebook!
