The database for data scientists working with HuggingFace

For data scientists and NLP professionals, managing large datasets and corpora can be a hassle. If you’ve ever struggled to load large datasets, search through your data, or simply explore it efficiently, NucliaDB is the tool for you.

Why a DB for data scientists and NLP professionals?

As a data scientist or NLP practitioner, your hard drive is probably full of datasets and corpora. If you have ever crashed a notebook trying to load something too big with Pandas, run one shell shuffle too many just to peek at your data, or wondered how to search through a dataset properly, this is a tool for you.

What you can do with NucliaDB

▪️ Compare vectors from different models with ease.

▪️ Store text, files, vectors, labels and annotations.

▪️ Access and modify your resources efficiently.

▪️ Annotate your resources.

▪️ Perform full-text searches: given a word or set of words, retrieve the resources in your database that contain them.

▪️ Perform semantic searches with vectors: given a set of vectors, return the closest matches in your database. In an NLP use case, this lets you find similar sentences without being constrained by exact keywords.

▪️ Export your data in a format compatible with most NLP pipelines (HuggingFace datasets, PyTorch, etc.).
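Semantic search boils down to comparing vectors; the example results later in this guide report a COSINE score type. As a minimal, NucliaDB-independent illustration of what cosine similarity measures:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Sentence encoders map semantically similar sentences to nearby vectors, so a high cosine similarity between two sentence embeddings suggests similar meaning.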

Get started!

Getting started with NucliaDB is easy. You can install it locally using Docker or pip, and once it’s up and running, you can start using it by installing the nucliadb-dataset and nucliadb-sdk libraries.

1. Install NucliaDB and run it locally

    pip install nucliadb
    nucliadb

2. Create your first KnowledgeBox

A KnowledgeBox is a data container in NucliaDB.

To help us interact with NucliaDB, let’s install our SDK first:

    pip install nucliadb_sdk
Then, with just a few lines of code, we can start filling NucliaDB with data:

    from nucliadb_sdk import create_knowledge_box

    my_kb = create_knowledge_box("my_new_kb")

3. Upload data

To help us upload data, we’ll also use the `sentence_transformers` Python package:

    pip install sentence_transformers

Now, we can use it to insert some vectors:

    from nucliadb_sdk import KnowledgeBox, File, Entity
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    resource_id = my_kb.upload(
        key="mykey1",
        binary=File(data=b"asd", filename="data"),
        text="I'm Sierra, a very happy dog",
        labels=["emotion/positive"],
        entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
        vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
    )
    my_kb["mykey1"]
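A note on the `positions=[(4, 9)]` argument in the upload above: in this example it matches the inclusive character offsets of "Sierra" in the text (indices 4 through 9). Assuming that inclusive-offset convention, a small hypothetical helper can compute such spans:

```python
def entity_span(text, value):
    # Hypothetical helper: (start, end) character offsets of `value` in `text`,
    # with `end` inclusive, matching the (4, 9) example above. The inclusive
    # convention is an assumption inferred from the example, not NucliaDB API.
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return start, start + len(value) - 1

print(entity_span("I'm Sierra, a very happy dog", "Sierra"))  # (4, 9)
```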
Let’s insert more data to improve our search index:

    sentences = ["She's having a terrible day", "what a delightful day",
                 "Dog in catalan is gos", "he is heartbroken",
                 "He said that the race is quite tough", "love is tough"]
    labels = ["emotion/negative", "emotion/positive", "emotion/neutral",
              "emotion/negative", "emotion/neutral", "emotion/negative"]
    for sentence, label in zip(sentences, labels):
        resource_id = my_kb.upload(
            text=sentence,
            labels=[label],
            vectors={"all-MiniLM-L6-v2": encoder.encode([sentence])[0]},
        )

4. Search

4.1. Semantic search

    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    query_vectors = encoder.encode(["To be in love"])[0]

    results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)

Iterate over the results:

    for result in results:
        print(f"Text: {result.text}")
        print(f"Labels: {result.labels}")
        print(f"Score: {result.score}")
        print(f"Key: {result.key}")
        print(f"Score Type: {result.score_type}")
        print("------")

Results:

    Text: love is tough
    Labels: ['negative']
    Score: 0.4688602387905121
    Key: a027ee34f3a7489d9a264b9f3d08d3a5
    Score Type: COSINE
    ------
    Text: he is heartbroken
    Labels: ['negative']
    Score: 0.27540814876556396
    Key: 25bc7b22b4fb4f64848a1b7394fb69b1
    Score Type: COSINE
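Since each result exposes `text`, `labels` and `score`, you can also post-process matches client-side, for instance grouping them by label. A sketch using stand-in objects (in practice you would iterate the results returned by `my_kb.search`):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FakeResult:
    # Stand-in for the result objects my_kb.search yields.
    text: str
    labels: list
    score: float

results = [
    FakeResult("love is tough", ["negative"], 0.47),
    FakeResult("he is heartbroken", ["negative"], 0.28),
]

# Group matched texts under each of their labels.
by_label = defaultdict(list)
for result in results:
    for label in result.labels:
        by_label[label].append(result.text)

print(dict(by_label))  # {'negative': ['love is tough', 'he is heartbroken']}
```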

4.2. Full-text search

    results = my_kb.search(text="dog")
    for result in results:
        print(f"Text: {result.text}")
        print(f"Labels: {result.labels}")
        print(f"Score: {result.score}")
        print(f"Key: {result.key}")
        print(f"Score Type: {result.score_type}")

Results:

    Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
    Text: Dog in catalan is gos
    Score type: BM25
    Score: 0.8871671557426453
    Labels: ['neutral']
    Resource key: 665e85f0fb2e4b2fbde8b4957b7462c1
    Text: I'm Sierra, a very happy dog
    Score type: BM25
    Score: 0.7739118337631226
    Labels: ['positive']
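The BM25 score type above refers to the standard lexical ranking function used by full-text engines. As a rough illustration of the idea (not NucliaDB's actual implementation), a single-term BM25 score combines inverse document frequency with a saturating, length-normalized term frequency:

```python
import math

def bm25_score(term, doc, corpus, k1=1.2, b=0.75):
    # Toy single-term BM25: idf * saturated, length-normalized term frequency.
    # k1 and b are the usual free parameters; corpus is a list of token lists.
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

corpus = [
    "dog in catalan is gos".split(),
    "i'm sierra a very happy dog".split(),
    "love is tough".split(),
]
scores = [bm25_score("dog", doc, corpus) for doc in corpus]
# Both documents containing "dog" score above the one that doesn't,
# and the shorter match ranks higher -- the same ordering as the results above.
```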

4.3. Search by label

    results = my_kb.search(filter=["emotion/positive"])
    for result in results:
        print(f"Resource key: {result.key}")
        print(f"Text: {result.text}")
        print(f"Labels: {result.labels}")

Results:

    Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
    Text: I'm Sierra, a very happy dog
    Labels: ['positive']
    Resource key: b445359d434b47dfb6a37ca45c14c2b3
    Text: what a delightful day
    Labels: ['positive']

Related resources and tutorials

NucliaDB is an open-source, multi-model, cloud-native database. It stores vectors, text and entities with high read performance, making it ideal for applications that require fast, efficient access to large amounts of data.

This article has walked you step by step through the installation and main functionalities to help you get started with NucliaDB.
