LLM Deploy on x86 Tutorial


The following example demonstrates uploading and deploying a 4-bit quantized Llama 3 8B Instruct model, quantized with llamacpp, on x86 processors.

For access to these sample models and a demonstration on using LLMs with Wallaroo:

  • Contact your Wallaroo Support Representative OR

  • Schedule Your Wallaroo.AI Demo Today

This example uses the following settings:

  • The Hugging Face LLM file packaged as a Wallaroo BYOP framework in the file llama_byop_llama3_llamacpp.zip, with the framework set to wallaroo.framework.Framework.CUSTOM. This LLM leverages the llamacpp library.

  • The LLM model name in Wallaroo will be llama3-instruct-8b.

  • The input schema:

    • text type String.

  • The output schema:

    • generated_text type String.

  • The deployment configuration will allocate:

    • 32 CPUs to the LLM with 40 Gi RAM

First we upload the model via the Wallaroo SDK.

import wallaroo
import pyarrow as pa

# connect to Wallaroo

wl = wallaroo.Client()

# set the input and output schemas
input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

# upload the model and save the model version to the variable `model`
llm_model = wl.upload_model('llama3-instruct-8b', 
    'llama_byop_llama3_llamacpp.zip',
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
display(llm_model)
  
Name          llama3-instruct-8b
Version       a3d8e89c-f662-49bf-bd3e-0b192f70c8b6
File Name     llama_byop_llama3_llamacpp.zip
SHA           b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5190
Architecture  x86
Acceleration  none
Updated At    2024-28-May 21:00:08
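The SHA shown above is the SHA-256 checksum of the uploaded model file, which can be verified locally before or after upload. A minimal sketch using Python's standard hashlib, assuming the zip file is in the working directory:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the SHA reported by display(llm_model):
# file_sha256("llama_byop_llama3_llamacpp.zip")
```

A match confirms the file Wallaroo registered is the same artifact that was uploaded.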

Once uploaded, we create our pipeline and add the LLM as a pipeline step.

llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)

We build the deployment configuration with 32 CPUs and 40 Gi RAM allocated to the LLM. Only 0.5 CPU and 2 Gi RAM are allocated to the Wallaroo Native Runtime to minimize that runtime's resources, since it hosts no models in this example. Once the deployment configuration is set, the pipeline is deployed with it.

# set the deployment config with the following:
# Wallaroo Native Runtime:  0.5 cpu, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed:  32 CPUs and 40 Gi RAM
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 32) \
    .sidekick_memory(llm_model, '40Gi') \
    .build()

llm_pipeline.deploy(deployment_config)

Once deployed, we can check the LLM's deployment status via the wallaroo.pipeline.Pipeline.status() method.

llm_pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.124.6.17',
   'name': 'engine-77b97b577d-hh8pn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
      'status': 'Running',
      'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
   'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
      'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
      'status': 'Running',
      'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
 'engine_lbs': [{'ip': '10.124.6.16',
   'name': 'engine-lb-767f54549f-gdqqd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.124.6.19',
   'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
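For a quick programmatic health check, the status dictionary above can be scanned to confirm that the pipeline and every engine, load balancer, and sidekick report Running. A sketch assuming the dictionary shape shown above; the helper name is illustrative:

```python
def all_running(status: dict) -> bool:
    """Return True if the top-level status and every engine, engine load
    balancer, and sidekick in a pipeline status dict report 'Running'."""
    components = (
        status.get("engines", [])
        + status.get("engine_lbs", [])
        + status.get("sidekicks", [])
    )
    return status.get("status") == "Running" and all(
        c.get("status") == "Running" for c in components
    )

# Example usage against a live pipeline:
# all_running(llm_pipeline.status())
```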

With the LLM deployed, it is ready to accept inference requests through the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table. The example below passes a pandas DataFrame and receives the results in the same format.

import pandas as pd

data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(data)
result["out.generated_text"][0]

'LinkedIn is a social networking platform designed for professionals and businesses to connect, share information, and network. It allows users to create a profile showcasing their work experience, skills, education, and achievements. LinkedIn is often used for:\n\n1. Job searching: Employers can post job openings, and job seekers can search and apply for positions.\n2. Networking: Professionals can connect with colleagues, clients, and industry peers to build relationships and stay informed about industry news and trends.\n3. Personal branding: Users can showcase their skills, expertise, and achievements to establish themselves as thought leaders in their industry.\n4. Business development: Companies can use LinkedIn to promote their products or services, engage with customers, and build brand awareness.\n5. Learning and development: LinkedIn offers online courses, tutorials, and certifications to help professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for professionals to build their professional identity, expand their network, and advance their careers.'
