LLM Deploy on GPU Tutorial


The following example demonstrates uploading and deploying a Llama 3 8B Instruct model on NVIDIA GPUs.

For access to these sample models and a demonstration on using LLMs with Wallaroo:

The model has the following features:

  • The Hugging Face LLM is packaged as a Wallaroo Custom Model (also known as BYOP) in the file llama_byop_llama3_instruct_8b.zip, with the framework set to wallaroo.framework.Framework.CUSTOM.
  • The LLM model name in Wallaroo will be llama3-instruct-8b.
  • The input schema:
    • text type String.
  • The output schema:
    • generated_text type String.
  • The deployment configuration will allocate to the LLM:
    • 2 CPUs
    • 40 Gi RAM
    • 1 GPU
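
The input schema above means each inference request row carries a single string field named text. The following is a minimal sketch, in plain Python, of checking prompts against that shape before they are sent for inference. The helper name build_records is hypothetical and is not part of the Wallaroo SDK, and the empty-prompt check is an assumption of this sketch rather than a Wallaroo requirement.

```python
from typing import Dict, List


def build_records(prompts: List[str]) -> List[Dict[str, str]]:
    """Return one {'text': prompt} record per prompt, rejecting invalid input."""
    records = []
    for i, prompt in enumerate(prompts):
        # The input schema declares `text` as a String, so reject other types.
        if not isinstance(prompt, str):
            raise TypeError(f"prompt {i} is {type(prompt).__name__}, expected str")
        # Assumption of this sketch: an empty prompt is treated as an error.
        if not prompt.strip():
            raise ValueError(f"prompt {i} is empty")
        records.append({"text": prompt})
    return records


records = build_records(["Summarize what LinkedIn is"])
print(records)
```

The resulting list of records can be handed to the pandas DataFrame constructor to produce a DataFrame with a single text column.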

First we upload the model via the Wallaroo SDK.

import wallaroo
import pyarrow as pa

# connect to Wallaroo
wl = wallaroo.Client()

# set the input and output schemas
input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

# upload the model and save the model version to the variable `llm_model`
llm_model = wl.upload_model('llama3-instruct-8b',
    'llama_byop_llama3_instruct_8b.zip',
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
display(llm_model)
  
Name          llama3-instruct-8b
Version       a3d8e89c-f662-49bf-bd3e-0b192f70c8b6
File Name     llama_byop_llama3_instruct_8b_new.zip
SHA           b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5190
Architecture  x86
Acceleration  none
Updated At    2024-28-May 21:00:08

Once uploaded, we create our pipeline and add the LLM as a pipeline step.

llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)

We build the deployment configuration with 2 CPUs, 1 GPU, and 40 Gi RAM allocated to the LLM. The Wallaroo Native Runtime is allocated only 0.5 CPU and 2 Gi RAM to minimize its resource use, since it hosts no models in this example. Once the deployment configuration is built, the pipeline is deployed with it.

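# set the deployment config with the following:
# Wallaroo Native Runtime:  0.5 CPU, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed:  2 CPUs, 1 GPU, and 40 Gi RAM
# note: `deployment_label` is not defined in this example; set it beforehand
# to the label of the GPU nodepool in your cluster
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 2) \
    .sidekick_memory(llm_model, '40Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label(deployment_label) \
    .build()

# pass the configuration by keyword so it is not mistaken for another positional argument
llm_pipeline.deploy(deployment_config=deployment_config)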
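
As a rough sanity check on the allocation above, the requests for the two runtimes can be tallied and compared with what a GPU node offers. The sketch below is plain Python and makes no Wallaroo calls. The node capacity figures are hypothetical, and the comparison assumes, for simplicity, that both runtimes are scheduled onto a single node with no resources reserved for the system.

```python
def parse_gi(value: str) -> float:
    """Convert a memory string such as '40Gi' to a number of GiB."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported memory unit in {value!r}")
    return float(value[:-2])


# Requests taken from the deployment configuration in this tutorial.
requests = {
    "native_runtime": {"cpus": 0.5, "memory": "2Gi", "gpus": 0},
    "llm_container": {"cpus": 2, "memory": "40Gi", "gpus": 1},
}

totals = {
    "cpus": sum(r["cpus"] for r in requests.values()),
    "memory_gi": sum(parse_gi(r["memory"]) for r in requests.values()),
    "gpus": sum(r["gpus"] for r in requests.values()),
}
print(totals)  # {'cpus': 2.5, 'memory_gi': 42.0, 'gpus': 1}

# Hypothetical node capacity, for illustration only.
node = {"cpus": 8, "memory_gi": 64, "gpus": 1}
fits = all(totals[key] <= node[key] for key in totals)
print("fits on node:", fits)
```

In practice a node usually has less allocatable capacity than its nominal size, so it is worth leaving some headroom beyond these totals.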

Once deployed, we can check the LLM's deployment status via the wallaroo.pipeline.Pipeline.status() method.

llm_pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.124.6.17',
   'name': 'engine-77b97b577d-hh8pn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
      'status': 'Running',
      'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
   'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
      'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
      'status': 'Running',
      'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
 'engine_lbs': [{'ip': '10.124.6.16',
   'name': 'engine-lb-767f54549f-gdqqd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.124.6.19',
   'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
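
The status dictionary above reports a state for the pipeline as a whole and for each engine, load balancer, sidekick, and model. The following sketch shows one way to scan such a dictionary for anything that is not yet Running. The function name not_running is hypothetical, and the sample dictionary is a trimmed copy of the output above rather than a live response.

```python
from typing import Any, Dict, List


def not_running(status: Dict[str, Any]) -> List[str]:
    """List the components in a status dict whose status is not 'Running'."""
    problems = []
    if status.get("status") != "Running":
        problems.append(f"pipeline: {status.get('status')}")
    # Engines, load balancers, and sidekicks each report their own status.
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group) or []:
            if component.get("status") != "Running":
                problems.append(
                    f"{group}/{component.get('name')}: {component.get('status')}"
                )
    # Models are reported inside each engine entry.
    for engine in status.get("engines") or []:
        models = (engine.get("model_statuses") or {}).get("models") or []
        for model in models:
            if model.get("status") != "Running":
                problems.append(f"model/{model.get('name')}: {model.get('status')}")
    return problems


# Trimmed copy of the status output shown in this tutorial.
sample_status = {
    "status": "Running",
    "engines": [{
        "name": "engine-77b97b577d-hh8pn",
        "status": "Running",
        "model_statuses": {"models": [
            {"name": "llama3-instruct-8b", "status": "Running"},
        ]},
    }],
    "engine_lbs": [{"name": "engine-lb-767f54549f-gdqqd", "status": "Running"}],
    "sidekicks": [{
        "name": "engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj",
        "status": "Running",
    }],
}

print(not_running(sample_status))  # []
```

An empty list means every reported component is Running. Any entries returned name the component that is still starting or has failed.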

With the LLM deployed, it is ready to accept inference requests through the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table. The example below submits a pandas DataFrame and receives the results as a pandas DataFrame.

import pandas as pd

data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(data)
result["out.generated_text"][0]

'LinkedIn is a social networking platform designed for professionals and businesses to connect, share information, and network. It allows users to create a profile showcasing their work experience, skills, education, and achievements. LinkedIn is often used for:\n\n1. Job searching: Employers can post job openings, and job seekers can search and apply for positions.\n2. Networking: Professionals can connect with colleagues, clients, and industry peers to build relationships and stay informed about industry news and trends.\n3. Personal branding: Users can showcase their skills, expertise, and achievements to establish themselves as thought leaders in their industry.\n4. Business development: Companies can use LinkedIn to promote their products or services, engage with customers, and build brand awareness.\n5. Learning and development: LinkedIn offers online courses, tutorials, and certifications to help professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for professionals to build their professional identity, expand their network, and advance their careers.'
