LLM Deploy on ARM Tutorial
The following example demonstrates uploading and deploying a Llama v3 8B Instruct model quantized with llamacpp (a 4-bit quantized Llama 3 model) on ARM processors.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
The model has the following features:
- The Hugging Face LLM file packaged as a Wallaroo Custom Model (aka BYOP) framework in the file llama_byop_llama3_llamacpp.zip, with the framework set to wallaroo.framework.Framework.CUSTOM. This LLM leverages the llamacpp library.
- The LLM model name in Wallaroo will be llama3-instruct-8b.
- The input schema: the field text of type String.
- The output schema: the field generated_text of type String.
- The deployment configuration will allocate:
- 32 ARM CPUs to the LLM with 40 Gi RAM
First we upload the model via the Wallaroo SDK.
import pyarrow as pa
import wallaroo

# connect to Wallaroo
wl = wallaroo.Client()

# set the input and output schemas
input_schema = pa.schema([
    pa.field("text", pa.string())
])
output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

# upload the model and save the model version to the variable `llm_model`
llm_model = wl.upload_model('llama3-instruct-8b',
                            'llama_byop_llama3_llamacpp.zip',
                            framework=wallaroo.framework.Framework.CUSTOM,
                            input_schema=input_schema,
                            output_schema=output_schema,
                            arch=wallaroo.engine_config.Architecture.ARM
                            )
display(llm_model)
Name | llama3-instruct-8b |
Version | a3d8e89c-f662-49bf-bd3e-0b192f70c8b6 |
File Name | llama_byop_llama3_llamacpp.zip |
SHA | b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5190 |
Architecture | arm |
Acceleration | none |
Updated At | 2024-28-May 21:00:08 |
Once uploaded, we create our pipeline and add the LLM as a pipeline step.
llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)
We build the deployment configuration with 32 CPUs and 40 Gi RAM allocated to the LLM. Only 0.5 CPU and 2 Gi RAM are allocated to the Wallaroo Native Runtime to minimize that runtime's resources, since it has no models in this example. Once the deployment configuration is set, the pipeline is deployed with that deployment configuration.
The deployment configuration inherits the LLM's architecture setting, so this deployment will automatically use nodes from the nodepool with ARM processors.
from wallaroo.deployment_config import DeploymentConfigBuilder

# set the deployment config with the following:
# Wallaroo Native Runtime: 0.5 cpu, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed: 32 CPUs and 40 Gi RAM
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 32) \
    .sidekick_memory(llm_model, '40Gi') \
    .build()

# pass the configuration by keyword so it is not mistaken for another positional argument
llm_pipeline.deploy(deployment_config=deployment_config)
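As a quick sizing check, the resource requests in this deployment configuration can be added up to see the minimum the ARM nodepool has to provide. This is only a sketch of the arithmetic: the dictionary layout and key names are illustrative, and it assumes nothing about additional overhead that the cluster itself may require.

```python
# Values copied from the deployment configuration in this tutorial.
# The dictionary structure is illustrative and not part of the Wallaroo SDK.
requests = {
    "Wallaroo Native Runtime": {"cpus": 0.5, "memory_gi": 2},
    "LLM container (llama3-instruct-8b)": {"cpus": 32, "memory_gi": 40},
}

# Sum the CPU and memory requests across both runtimes.
total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gi = sum(r["memory_gi"] for r in requests.values())

print(f"Total requested: {total_cpus} CPUs, {total_memory_gi} Gi RAM")
# prints: Total requested: 32.5 CPUs, 42 Gi RAM
```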
Once deployed, we can check the LLM's deployment status via the wallaroo.pipeline.Pipeline.status() method.
llm_pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.124.6.17',
'name': 'engine-77b97b577d-hh8pn',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
'status': 'Running',
'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
'status': 'Running',
'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
'engine_lbs': [{'ip': '10.124.6.16',
'name': 'engine-lb-767f54549f-gdqqd',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.124.6.19',
'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
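The status output is a plain Python dictionary, so it can also be checked programmatically before sending inference requests. The sketch below uses a hypothetical helper, all_running, which is not part of the Wallaroo SDK. It assumes the dictionary keeps the layout shown in this tutorial (a top-level status plus engines, engine_lbs, and sidekicks lists), and it runs against a trimmed copy of that output rather than a live deployment.

```python
def all_running(status: dict) -> bool:
    """Return True when the pipeline and every listed component report 'Running'."""
    if status.get("status") != "Running":
        return False
    # Component groups as they appear in the status output of this tutorial.
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group, []):
            if component.get("status") != "Running":
                return False
    return True


# Trimmed copy of the status output from this tutorial, used as sample data.
sample_status = {
    "status": "Running",
    "details": [],
    "engines": [{"name": "engine-77b97b577d-hh8pn", "status": "Running"}],
    "engine_lbs": [{"name": "engine-lb-767f54549f-gdqqd", "status": "Running"}],
    "sidekicks": [
        {
            "name": "engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj",
            "status": "Running",
        }
    ],
}

print(all_running(sample_status))  # prints: True
```

Against a live deployment, the same helper could be given the return value of llm_pipeline.status() instead of the sample dictionary.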
With the LLM deployed, it is ready to accept inference requests through the method wallaroo.pipeline.Pipeline.infer, which accepts either a pandas DataFrame or an Apache Arrow table. The example below submits a pandas DataFrame and receives the results as a pandas DataFrame.
import pandas as pd

data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(data)
result["out.generated_text"][0]
'LinkedIn is a social networking platform designed for professionals and businesses to connect, share information, and network. It allows users to create a profile showcasing their work experience, skills, education, and achievements. LinkedIn is often used for:\n\n1. Job searching: Employers can post job openings, and job seekers can search and apply for positions.\n2. Networking: Professionals can connect with colleagues, clients, and industry peers to build relationships and stay informed about industry news and trends.\n3. Personal branding: Users can showcase their skills, expertise, and achievements to establish themselves as thought leaders in their industry.\n4. Business development: Companies can use LinkedIn to promote their products or services, engage with customers, and build brand awareness.\n5. Learning and development: LinkedIn offers online courses, tutorials, and certifications to help professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for professionals to build their professional identity, expand their network, and advance their careers.'