
Text Processing


The TextProcessing Gem enables text data preparation for machine learning in two different ways:

  1. Load web URLs and extract text.
  2. Split text data into equal chunks.

Follow along to see how to use the TextProcessing Gem. For an example set of Pipelines that uses this Gem to create a Generative AI Chatbot, see this guide.

1. Load web URLs and Extract Text

Given a column with web URLs, the Load web URLs operation scrapes the content from each URL and outputs it either in binary format or as human-readable text, depending on the operation type selected. The figure below shows the Load web URLs and extract text operation.

Figure: Overview of the Load web URLs and extract text operation.

1a. Configure web scrape

Figure: Configuring the web scrape operation.

Configure the (1) Operation Type to Load url (web scrape), optionally with text extraction. Specify which input (2) Column name contains the web URLs. If the extract text operation is selected, the content is converted from binary to a human-readable format. When would you want the binary format? Binary web scraping is useful for downloading content such as images or archived documents.

1b. Input

| Parameter | Description | Required |
|-----------|-------------|----------|
| Column name (string with urls) | string - the input column which contains the web URLs | True |

1c. Output

| Parameter | Description |
|-----------|-------------|
| Result content (Load url (web scrape)) | binary - the contents of each web page |
| Result content (Load url (web scrape) and extract text) | string - the contents of each web page, converted from binary to human-readable text |
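
If you keep the binary output but later need readable text, one option (a hedged sketch, not part of the Gem itself) is Spark's built-in decode function, assuming the pages are UTF-8 encoded and the output column is named result_content as in the generated code below:

from pyspark.sql import functions as F

# scraped_df is a hypothetical DataFrame produced by the web scrape operation.
# decode() converts the binary page contents into a readable string column.
readable_df = scraped_df.withColumn(
    "result_text", F.decode(F.col("result_content"), "UTF-8")
)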

1d. Generated Code

def scrape_pages(spark: SparkSession, in0: DataFrame) -> DataFrame:
    from pyspark.sql.functions import expr, array, struct
    from spark_ai.webapps import WebUtils
    WebUtils().register_udfs(spark)

    return in0.withColumn("result_content", expr("web_scrape(loc)"))
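
As a rough usage sketch (not part of the generated code), the function above could be exercised with a small DataFrame of URLs. The input column is named loc here because that is the column the generated expression references:

# Hypothetical input: a one-column DataFrame of URLs named "loc".
urls_df = spark.createDataFrame([("https://example.com",)], ["loc"])

scraped_df = scrape_pages(spark, urls_df)
scraped_df.printSchema()  # result_content is a binary column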

2. Split text data into equal chunks

Sometimes you'd like to send text data to a foundation model or store it in a vector database, but the text is too long. In that case, split the text into "chunks" of characters.
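
Conceptually, the operation does roughly the equivalent of the following plain-Python sketch (the Gem itself runs this logic as a registered Spark UDF):

def split_into_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]

# A 4,500-character document with size=1000 yields five chunks:
# four of 1,000 characters and a final chunk of 500.
chunks = split_into_chunks("some long article text ...", size=1000)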

Figure: Overview of the split text into equal chunks (Chunkify) operation.

2a. Configure text splitting

Given a text input, the Split data operation separates the input column entries into chunks of a specified size.

Figure: Configuring the text splitting operation.

Set the (1) Operation type to split text into equal chunks. Specify which input (2) Column name contains the relevant content. Specify an integer (3) Size (the number of characters per chunk) appropriate for your generative AI use case.

2b. Input

| Parameter | Description | Required |
|-----------|-------------|----------|
| Column name | string - the text content to split into equal chunks | True |
| Size | integer - the number of characters per chunk. Example: 1000 | True |

2c. Output

| Parameter | Description |
|-----------|-------------|
| result_chunks | array(string) - an array of strings, each representing one chunk of the larger text content |
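
Because the output is an array column, a common next step (sketched here with a hypothetical chunked_df produced by this Gem) is to explode the array so each chunk becomes its own row before embedding or loading into a vector database:

from pyspark.sql import functions as F

# Each element of result_chunks becomes a row; chunk_index keeps its position.
chunk_rows_df = chunked_df.select(
    "*", F.posexplode("result_chunks").alias("chunk_index", "chunk_text")
)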

2d. Generated code

def Chunkify(spark: SparkSession, web_bronze_content: DataFrame) -> DataFrame:
    from pyspark.sql.functions import expr, array, struct
    from spark_ai.files.text import FileTextUtils
    FileTextUtils().register_udfs(spark)

    return web_bronze_content.withColumn("result_chunks", expr("text_split_into_chunks(content, 1000)"))
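
As a rough usage sketch, the function above expects a DataFrame with a content string column, since that is the column named in the generated expression:

# Hypothetical input: a DataFrame with a "content" column of long text.
docs_df = spark.createDataFrame([("a very long article ...",)], ["content"])

chunked_df = Chunkify(spark, docs_df)
chunked_df.select("result_chunks").show(truncate=80)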

FAQ

How does this Gem fit into the bigger picture of building a generative AI application?

For an example set of Pipelines that uses this Gem to create a Generative AI Chatbot, see this guide. Feel free to reach out and explore your use case with us.

Troubleshooting

Select a chunk size according to the limitations of your vector database index.
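
One quick sanity check (a sketch with hypothetical DataFrame and column names) is to confirm the longest chunk stays within your index's per-document limit:

from pyspark.sql import functions as F

# Explode the chunks, then measure the largest one in characters.
(
    chunked_df.select(F.explode("result_chunks").alias("chunk"))
    .agg(F.max(F.length("chunk")).alias("max_chunk_chars"))
    .show()
)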