Nitish Maheshwari

Monday, March 22, 2010

Performance Testing Best Practices !!!!

By

Nitish Maheshwari

After working in a virtualized performance test bed for a certain project, I observed that quite a few professionals follow a conspicuously incorrect testing process in this context, so I would like to jot down my words of wisdom on the topic.

When I discuss virtualizing the performance test bed with performance engineers, the common reaction is “It sounds cool.” I wouldn’t disagree with that. Never. Running a virtual guest operating system on a host operating system obviously sounds cool. You capitalize on ‘idle’ resources by pooling them and making them shared. But what engineers forget is that virtualization works best only with certain types of workload. To be specific, it works best with diverse workloads. In a load testing bed, however, the load injector (load generator) machines always tend to run a fixed workload during execution, with resource utilization almost reaching the threshold level while emulating higher load levels.

So, when you plan to create a test environment with load generators running on a virtualized platform, the first thing you ought to know is that VMware, or any virtualization software for that matter, carries many overheads: excessive context switching, too many interrupts, shared network resources, high IO activity and so on. All these factors add up to an “irrepressible effect” on your virtualized load generator machines. The whole virtualization paradigm causes processes to generally run slower than on a physical machine, and hence causes Vusers to run slower than usual. Because of all this, the integrity of your test results will be questionable.
Having said that, I feel one should put load generators on virtual machines only if it is unavoidable (due to a stubborn client, maybe!).

If you do move ahead accepting this digression from best practice, then be wary of the limitations, document them, and act carefully to minimize the shortcomings of the approach.

Best practices:

Create multiple virtual LoadGenerator machine instances.

Having multiple virtualized load generators on a host machine is better than having a single virtual load generator, because the hypervisor’s scheduler works better with a diverse workload than with a homogeneous one. Remember that virtualization software like VMware, whether it is a ‘Type-1’ or a ‘Type-2’ hypervisor, is designed to support as many virtual instances as possible on a single physical machine. A simple illustration: a load generator on a single virtual machine emulating 100 Vusers is likely to use more CPU than the same 100 Vusers split across 2 virtual machines, i.e. 2 load generators.

Have an eye on the CPU utilization.
Ensure that all your virtualized load injector boxes run below a sensible CPU utilization watermark (say, within 80%). This avoids annoying issues in the recorded response times, such as negative response times for transactions, which is an infamous problem in this context! The issue arises because the VMware guest OS clock is synchronized with the host operating system time (which in turn depends on your physical clock). If the CPU is 100% busy while the guest operating system is attempting to sync its clock, the sync process is placed in the processor's run queue, causing clock drift in the guest operating system, which in turn messes up the transaction response times recorded by LoadRunner.
Tap the resource metrics of the load generator machines while executing tests and make sure none of them is starved of resources.
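
A minimal sketch of the watermark check described above, assuming the CPU samples of each load generator have already been exported (for instance from perfmon) into plain lists. The 80% limit is the figure suggested above; the function name and host names are hypothetical.

```python
from statistics import mean

CPU_WATERMARK = 80.0  # percent; the "within 80%" guideline from the text


def generators_over_watermark(samples_by_host, watermark=CPU_WATERMARK):
    """Return {host: (average, peak)} for hosts whose peak CPU% exceeds the watermark."""
    flagged = {}
    for host, samples in samples_by_host.items():
        if not samples:
            continue  # no data collected for this host
        peak = max(samples)
        if peak > watermark:
            flagged[host] = (mean(samples), peak)
    return flagged


# Hypothetical CPU% samples collected during a test run.
samples = {
    "loadgen-vm1": [42.0, 55.5, 61.0],
    "loadgen-vm2": [70.0, 88.0, 97.5],
}
print(generators_over_watermark(samples))
```

Any host reported here is a candidate for negative or skewed response times, so results recorded on it deserve a second look.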

Never forget to add think time and pacing.
Also make sure that you strictly follow your business requirement rather than simply firing requests, which would be more like doing a stress test. Incorporate realistic think time in your scripts with appropriate pacing values. Since CPU is a shared resource in a virtualized environment, this ensures that no virtual OS instance is unnecessarily deprived of CPU at any point in time.
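
To make the difference between think time and pacing concrete, here is a rough sketch in plain Python. It is not LoadRunner's API (there, both are configured in the script and run-time settings); the function names are hypothetical and only illustrate the logic: think time is a pause after an action, while pacing fixes the interval between the starts of successive iterations.

```python
import time


def remaining_pacing(iteration_started, pacing_seconds, now):
    """Seconds still to wait so that iterations start pacing_seconds apart."""
    elapsed = now - iteration_started
    return max(0.0, pacing_seconds - elapsed)


def run_paced(action, iterations, pacing_seconds, think_seconds):
    """Run action repeatedly with think time after it and fixed pacing between starts."""
    for _ in range(iterations):
        started = time.monotonic()
        action()
        time.sleep(think_seconds)  # emulated user pause after the action
        # Wait out whatever is left of the pacing interval, if anything.
        time.sleep(remaining_pacing(started, pacing_seconds, time.monotonic()))
```

With a pacing of 30 seconds and an iteration that took 12 seconds including think time, the virtual user idles for the remaining 18 seconds instead of hammering the server (and the shared CPU) again immediately.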

Network Performance Testing

Most often it is not necessary to run a dedicated network test per se. I would recommend that engineers perform their routine tests on the AUT (Application Under Test), such as load, stress and endurance tests, with the client-specified workload, but pay attention to network monitoring during the course of these tests. There are tools specifically meant for monitoring network resources, such as netstat and network protocol analyzers (Wireshark), which I would recommend you use in addition to your load testing tool.

On the other hand, if the client has a scenario where a massive batch run or backup software moves a huge amount of data across the network for long hours, then I would recommend performing network tests specifically to emulate such scenarios.

For a network test you can set up a Goal-Oriented scenario with a certain throughput as the target. Monitor the Throughput and Time to First Buffer graphs to find network-related issues, along with netstat and network protocol analyzers.

For a LoadRunner user the important counters are total bytes/sec, server bytes/sec and connections established; but note that it is even more important to understand how to use these counters to identify bottlenecks. Go through the documentation and get a clear understanding of them up front. At a high level, be it LoadRunner, IxChariot or any other tool you use to monitor and collect server metrics, the most important metrics to make sure you have are:
1. Data volume: the amount of data sent across the network.
2. Throughput: the rate at which data is sent through the network.
3. Data error rate: a large number of network errors that require retransmission of data will slow down throughput and degrade application performance.
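
The three metrics above can be derived from a handful of raw counters. The sketch below assumes you already have the byte and packet totals for a test run from whatever monitoring tool you use; the function and field names are hypothetical, and the error rate is approximated here as retransmitted packets over packets sent.

```python
def network_summary(bytes_sent, duration_seconds, packets_sent, packets_retransmitted):
    """Summarize data volume, throughput and retransmission rate for one test run."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    throughput_bps = bytes_sent * 8 / duration_seconds  # bits per second
    error_rate = packets_retransmitted / packets_sent if packets_sent else 0.0
    return {
        "data_volume_mb": bytes_sent / 1_000_000,
        "throughput_mbps": throughput_bps / 1_000_000,
        "error_rate_pct": error_rate * 100,
    }


# Hypothetical totals for a 60-second run.
print(network_summary(750_000_000, 60, 500_000, 5_000))
```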

When you’re trying to break down a performance bottleneck, the first step is to write down a matrix of traffic type against layer/tier. This will help you isolate the part of the network that is causing the problem, and then tune it, without resorting to dart-throwing as you struggle to understand your web site's bottlenecks.

Example:

Client-server communications
Traffic:
- User HTTP requests
- Server HTML responses
- HTML page elements, such as GIFs, JPEGs, Flash objects, etc.

Server-to-server communications (middle tier)
Traffic:
- HTTP session data sharing within a cluster
- Application database transfers
- Traffic to services nodes (web services)
- Traffic to mail or messaging services
- DNS traffic, etc.

Backend communications
Traffic:
- Database transfers
- Database to application traffic
- Etc.

During the analysis process, isolate the network portion which has the problem from the above layers and then find a way to troubleshoot the issue by correlating with the type of traffic which is observed in that layer.
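
One lightweight way to keep the traffic-against-tier matrix at hand during analysis is a simple lookup table. This is only an illustrative sketch built from the example lists above; the dictionary and function names are hypothetical.

```python
TRAFFIC_BY_TIER = {
    "client-server": [
        "user HTTP requests",
        "server HTML responses",
        "HTML page elements (gifs, jpegs, flash objects)",
    ],
    "server-to-server (middle tier)": [
        "HTTP session data sharing within a cluster",
        "application database transfers",
        "traffic to services node (web services)",
        "traffic to mail or messaging services",
        "DNS traffic",
    ],
    "backend": [
        "database transfers",
        "database to application traffic",
    ],
}


def tiers_carrying(keyword):
    """Return the tiers whose traffic list mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        tier
        for tier, traffic in TRAFFIC_BY_TIER.items()
        if any(needle in item.lower() for item in traffic)
    ]


print(tiers_carrying("database"))
```

If a protocol analyzer shows a suspicious amount of database traffic, the lookup tells you which tiers to inspect first.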

Commonly observed network performance bottlenecks:
1) Faulty network component causing packet storms in the network.
2) Improperly configured NICs: a node with multiple NICs in particular can suffer from improper NIC binding, leaving a few NICs overutilized and the others underutilized.
3) Improperly configured load balancer. Example: affinity routing (routing by client IP) applied to requests that arrive from behind a proxy, so they all land on the same server.
4) Insufficient bandwidth (easily detected from the throughput graph when it goes flat).
5) The firewall can be a major performance bottleneck: I would recommend first testing the application system without the firewall so that at least one variable is eliminated.
6) Duplex mismatch: this occurs when one of two communicating elements operates in full duplex while the other operates in half duplex. Unlike the full-duplex element, the half-duplex element can either send or receive data packets at a given moment but cannot do both at once, which causes heavy packet loss and slows the overall transfer speed.
7) Excessively chatty protocols: too many handshake signals are an overhead not just on the application front but also on the network resource front. A protocol analyzer can be very handy for detecting such issues.

Myths and wrong practices related to analysis of CPU utilization

When it comes to analyzing and interpreting CPU metrics, many myths, wrong practices and misconceptions prevail. In this post I will touch upon a few critical issues and try to explain the facts and best practices in detail.

A wrong practice
Fetching the system-wide CPU utilization of a server node which is a symmetric multiprocessing system (SMP: a node with multiple cores/CPUs sharing the same memory, bus and IO).

For most performance engineers or testers, the common practice is to fetch just the system-wide CPU utilization of a multi-core server node during a load test and then start analyzing the result. Capturing only system-wide CPU with tools like LoadRunner on your UNIX/Linux server nodes is highly misleading: it can lead to a wrong interpretation of the cause of the bottleneck, and from there to wrong solutions and futile efforts to fix it.

Different software, such as databases or middleware, has its own proprietary algorithms for dealing with multiple cores and CPUs, so when you observe system-wide CPU utilization under the threshold limit during a load test, it does not always mean that each individual core is equally loaded. Let me illustrate: if a load testing tool shows CPU utilization of 25% on a server node with a quad-core processor, it does not mean that all the cores are equally loaded with processes/threads. A single core might be utilized up to 100% while the rest sit 100% idle.
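
The quad-core illustration is easy to verify with a few lines of arithmetic. In this sketch the per-core figures are the hypothetical ones from the example, and the 80% threshold is an assumed value, not a rule.

```python
from statistics import mean


def summarize_cores(per_core_pct, threshold=80.0):
    """Return the system-wide average and the indexes of cores at or above threshold."""
    system_wide = mean(per_core_pct)
    hot_cores = [i for i, pct in enumerate(per_core_pct) if pct >= threshold]
    return system_wide, hot_cores


# One core pegged, three idle: the average hides the saturated core.
system_wide, hot_cores = summarize_cores([100.0, 0.0, 0.0, 0.0])
print(system_wide, hot_cores)  # 25.0 [0]
```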

Whether each individual core can be loaded equally or proportionately depends a lot on the underlying parallelism of the application, and also on whether the software under test can perform intra-query parallelization across the cores available in the SMP node. For example, Oracle supports intra-query parallelization (i.e. splitting the work of one single, independent query across two or more cores of the SMP node), whereas the MySQL database does not support this feature; as a result an independent query is processed using only one core, even if the utilization of that core/CPU exceeds the threshold.

Another example: you might have observed while performing load tests that a few transactions associated with slow queries mysteriously fail at the database node even though CPU utilization is within the threshold. If you dig deeper into those slow queries and correlate them with the CPU utilization of each individual core, you may see that a few cores are maxing out at 100% while the rest are well below the threshold, which is the root cause of the problem.

Tools that I recommend for measuring the utilization of individual cores/CPUs:

prstat (option -m or -mL): Available on the Solaris operating system; allows you to measure CPU utilization on a per-thread basis.

mpstat: Available on Linux/UNIX-based operating systems (note: mpstat comes as part of the sysstat package of tools).

System Monitor: GUI-oriented tool readily available on Red Hat Linux operating systems.
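
If none of these tools is at hand on a Linux node, per-core utilization can also be derived from two snapshots of `/proc/stat`, which is broadly what tools like mpstat do. The sketch below is Linux-specific and makes some assumptions: it treats idle plus iowait as "not busy", ignores the guest fields (they are already counted in user/nice), and the function names are hypothetical.

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like 'cpu0 user nice system idle iowait ...';
        # the aggregate line is named just 'cpu' and is skipped here.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:9]]  # user .. steal
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        counters[fields[0]] = (total - idle, total)
    return counters


def per_core_utilization(before_text, after_text):
    """Return {'cpuN': busy %} between two /proc/stat snapshots."""
    before = parse_proc_stat(before_text)
    after = parse_proc_stat(after_text)
    result = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        delta_total = total_after - total_before
        if delta_total > 0:
            result[cpu] = 100.0 * (busy_after - busy_before) / delta_total
    return result
```

To use it, read `/proc/stat` twice a few seconds apart during the load test and pass both strings to `per_core_utilization`.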

A misconception

Getting misled by 100% CPU utilization shown by the monitoring tool.
You may be scratching your head, and probably even banging it against the monitor, for not being able to find a feasible fix for the high CPU utilization observed on some of the server nodes. Your nodes might be showing 100% utilization even under an unimaginably light load.
Well, the good news is that 100% utilization doesn’t always mean that the CPU is a performance bottleneck. On a UNIX or Linux operating system in particular, the CPU is not the bottleneck unless the ‘r’ value (run queue) in the vmstat output exceeds the total number of CPUs in the SMP server node; i.e. if r=5 but your SMP node has only 4 CPUs or cores, then there is a bottleneck for sure.
The whole idea behind queuing is that if a CPU is free when a thread is put into the run/processor queue (r), the thread is immediately executed by that CPU. But if all of the available CPUs are busy executing threads, incoming threads have to wait in the run queue until a CPU becomes available.
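
The run-queue rule of thumb can be checked mechanically against captured vmstat output. This sketch assumes the usual Linux vmstat layout, where ‘r’ is the first column of each data row; the function names and the sample output are hypothetical.

```python
def run_queue_from_vmstat(output):
    """Extract the 'r' column (first field of each data row) from vmstat output."""
    values = []
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():  # skips the two header lines
            values.append(int(fields[0]))
    return values


def saturated_samples(run_queue_samples, cpu_count):
    """Return the 'r' samples that exceed the number of CPUs/cores."""
    return [r for r in run_queue_samples if r > cpu_count]


sample = """procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812340  10240 402112    0    0     1     3  120  340 35  5 60  0  0
 5  0      0 811900  10240 402200    0    0     0     8  410  990 92  8  0  0  0
"""
print(saturated_samples(run_queue_from_vmstat(sample), cpu_count=4))  # [5]
```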

That being said, the next time you see an alarmingly high value on the CPU metrics, ask yourself this question: “What is CPU utilization?”
Answer: CPU utilization = 100% – (% of time spent in idle task)
If you have understood the above equation, you will never again misinterpret a CPU utilization metric.
When it comes to judging the severity of a CPU-related resource crunch, many engineers forget that the fundamental purpose of an operating system's process dispatcher or scheduler is to keep CPU utilization high in time of need. High CPU utilization doesn’t always slow down the transaction rate; there are systems on the market, such as ‘IBM System z’, that are capable of working in a 100% busy CPU state by exploiting the presence of diverse workloads. The type of workload arriving at the CPU queue largely decides the severity of a high CPU utilization scenario. For instance, short-lived, high-priority processes triggered by hundreds of concurrent users can affect the responsiveness of the application more than heterogeneous processes with different priorities arriving at the processor queue. A CPU operates at a certain clock speed (frequency) and processes a unit of work at a certain speed. Whether the CPU is 10% or 100% utilized, the processor should ideally deliver the unit of work at the same speed, but since all processes share the same physical resources, such as CPU buses, caches and CPs, the CPU time per transaction increases as CPU utilization gets high. The speed of thread/process execution is affected only when the total number of processes/threads waiting for CPU exceeds the total number of processors available in the SMP server node. Again, do not forget to watch the “r” (run queue) value closely during your next load/performance test.

Is training the only solution to handle poor technical competency of a tester?

Let us try looking at this issue from another perspective. If training were the sole solution to this issue, the industry would never have faced the problem called “deficit of skill/incompetency”, which unarguably persists not just in a specific country but in every geographical area of the world. I have seen situations where, in spite of massive training programs given to professionals before deployment to projects, resources have still shown a severe lack of aptitude and skill when put to work.

In my view, any technical field is very capricious per se and new technologies crop up every day; at such a juncture, I feel it is every professional's responsibility to align his/her skills with evolving technologies through self-learning. Today, LoadRunner statistically has a market share of just over 63%, with other, smaller players like Compuware, Borland, RadView etc. sharing the rest of the market, and it wouldn’t surprise me if in another 5 years LoadRunner's market share dwindled to less than 50%. Does this mean that all professionals need to be officially trained by their employers on every competing tool that enters the industry? I work in a company where different clients prefer different tools, ranging from completely open source tools like Grinder and JMeter to enterprise-level tool sets from HP, IBM etc. If every resource were to be trained on all these tools before being deployed to projects, the employer would never get his ROI; not to forget that these days ‘Training Programs’ are priced exorbitantly high.

The solution is in the hands of every professional; if a professional has the right kind of approach in problem solving and a good interest level in the subject then he/she would learn any tool/technology without any hassle.

I recently registered with the Google and Yahoo LoadRunner groups and saw a disturbing habit among several members of posting basic LoadRunner questions like “What is LoadRunner?”. This is ridiculous! If people are capable of registering with Google/Yahoo groups with the intention of learning LoadRunner from the members, then I think they are definitely also capable of registering on the http://itrc.hp.com portal, which is a gateway to tons of LoadRunner resources, and these resources are undoubtedly the best anyone can ever get.

If recruiters paid a little more attention to a candidate’s interest level in the job profile, his/her true problem-solving capability, smartness and general attitude, rather than just looking at the “University” he/she graduated from and his/her GPA, then I would say they would hire a candidate who is already 90% qualified to do the job, and the remaining 10% would come automatically with experience. “Training Programs”, I feel, contribute more to refining the capabilities of a resource than to shaping one.

Performance Target: Concurrent users (active) vs. Connected users (passive)

In my observation, there has persistently been a lack of awareness of the difference between “simultaneously connected active users” and “simultaneously connected passive users”. This problem is commonly seen on the executive stakeholder side as well as on the performance engineering/testing team side.

Not understanding these concepts makes a big difference when it comes to meeting the actual performance test requirement, and can hugely affect the legitimacy of your load test report.

For your information: simultaneously connected active users can also be termed “active concurrent application users”, and simultaneously connected passive users can be termed “connected users in passive mode”.

Connected users in passive mode:
I will quickly explain the differences and nuances with an example. Imagine that “Facebook” is your AUT (Application Under Test). Many end users of Facebook.com set www.facebook.com as their default browser page, so when the browser is opened, a connection (which may not be persistent) is maintained between the end user and the Facebook.com server. The user may leave the Facebook.com session idle without firing any query such as searching for a friend or adding a friend. This is a typical example of a connected user in a passive state.
Facebook.com might have hundreds of thousands of users in this state, and it is important to simulate this condition during load tests, because even passive sessions or connections hold certain server resources, have an impact on the overall thread pool size, and can affect various tunable configuration parameters of the web/app/db servers.

Active concurrent application users:
Continuing with the same Facebook.com example, there are also users who are not just connected to Facebook.com but are actively using its features, like adding a friend, poking a friend, updating status, uploading photos etc. It is in this kind of user connection state that the actual payload is affected heavily. The higher the rate of user activity, the higher the transactional throughput.
Care should be taken while setting the test execution window. A client might say that Facebook.com has 200,000 concurrent active connections in a day, with the user base executing 9,600,000 transactions daily (roughly 10 million transactions such as adding a friend, poking a friend, photo upload etc.), but when you break the transactions down to fit your test window, it is easy to forget the persistence duration of users per login-logout session relative to that window.
What I mean is this: suppose you are executing a 60-minute load test and, for scenario configuration, you break the daily stats down to fit the 60-minute window by dividing the total users per day by 24 (approx. 8,300 virtual users). You also calculate the transactions fired per user per hour, which comes to 2 (9.6 million divided by 200,000 = 48 transactions/user/day; 48/24 = 2 transactions/user/hour). If you ignore the persistence of transactions, you may face a situation where all virtual users execute those 2 transactions (say, adding a friend and poking a friend) within seconds, leaving the rest of the test window void of transactions; this burst will surely over-stress the server and lead to inaccurate results. At the other extreme, not setting virtual user transaction persistence thoughtfully can cause a situation, specifically with a ‘login, perform a transaction, logout’ scenario type, where one user finishes a cycle quickly before the next one enters; this leads to lower active user concurrency and, again, inaccurate results.
The solution for maintaining realistic transaction persistence is to make virtual users stay in the connected session with proper “pacing” between transactions and “realistic think time”, so that transactions are optimally spaced (try to follow the actual/live pattern) within the test execution window.
Emulating realistic user concurrency at the application usage level is the key to understanding true application behavior during a load test.
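
The arithmetic in the Facebook example can be written down as a small helper. It assumes, as the example does, that daily load is spread evenly across 24 hours, and it adds one further simplifying assumption of my own: that each user's transactions are spaced evenly within the hour, which gives a starting value for pacing. The function name is hypothetical.

```python
def hourly_workload(daily_users, daily_transactions):
    """Scale daily figures to a 60-minute test window, assuming an even 24-hour spread."""
    virtual_users = round(daily_users / 24)
    tx_per_user_per_day = daily_transactions / daily_users
    tx_per_user_per_hour = tx_per_user_per_day / 24
    # Spacing that spreads each user's transactions evenly over the hour.
    pacing_seconds = 3600 / tx_per_user_per_hour
    return virtual_users, tx_per_user_per_hour, pacing_seconds


print(hourly_workload(200_000, 9_600_000))  # (8333, 2.0, 1800.0)
```

A pacing of about 1800 seconds keeps each virtual user's 2 transactions spread over the hour rather than fired within the first few seconds; a real workload profile would replace the even spread with the live usage pattern.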


Many people say that process-oriented work is the key to success. I strongly support that statement too, but I always wonder why people do not give prominence to the 7 basic, quintessential points of performance test activity which can truly make or break a project.

When a resource is allocated to a performance test project, the first thing the project manager asks him/her to do is to start preparing a test plan. But during the test plan preparation phase, it is very often seen that test managers and their subordinates heedlessly prepare the document with most of its contents copied from previous projects or from an arbitrary template available on the internet, and proceed to the next phase without giving any prominence to the requirements/specifications mentioned in the test plan. It is an undeniable fact that the test plan is always prepared as a formality imposed by the company’s process standards (like ISO TickIT) and, as a consequence, the document is never really used in conjunction with project execution.

The time invested in test plan preparation should, I would say, be a significant part of the total project execution time. Unfortunately all this is talked about more in a theoretical sense than in a professional one, and hence practitioners often fail to bind their activities strictly to the test plan: they are intimidated by the large process overhead associated with it and believe it would slow them down.

This is undoubtedly a bad practice, but even if you neglect the project planning phase, I feel we can still expect light at the end of the tunnel if the test engineers religiously incorporate the 7 basic, quintessential facets of performance test activity during the project execution phase.

The 7 tenets are:
1. Know what the SLA states.
2. Understand the real user usage patterns.
3. Know how to load the server.
4. Know how much to load the server.
5. Know what type of load needs to be induced on the AUT.
6. Know your test tool well to maximize on its capabilities.
7. Know your test environment.

The best part of these tenets is that you just have to make sure they are part of your professional conscience. You need not document them, nor do you have to present them to the client; you just have to practice them. In the end, the client does not care how good or bad your load test plan was, nor which standards you followed; the client cares about the accuracy of the results you present and how much these statistics help to improve the performance of the application. And trust me, knowing each of the aspects mentioned in the tenets and acting upon them carefully can definitely help achieve a positive outcome for the load test project.
Conclusion: if you fail to comply with the standard processes of performance testing, at least incorporate the 7 tenets of performance testing, which can help you compensate effectively for those digressions from best practices.

How to efficiently utilize the resources of load generators?

“How do I calibrate the total number of load generators required to generate a load of 300 virtual users if each load generator is equipped with 3 GB of RAM and a 3.3 GHz CPU?”

My answer:

This is a highly debatable topic, I must say, and very subjective. The memory footprint which HP provides as a vendor is not very accurate in a general sense. The impact of the memory and CPU footprint of any protocol on a load generator also depends greatly on the length of the business flow and on the workload profile which the client expects the system to achieve. Example: hypothetically, if in a SAP GUI application you have 100 virtual users just entering an employee name and number in a field and submitting it to the server, iterating only 2 times in a half-hour period (if that is your test duration), then I would say you may not need more than 1 load generator sized with 3 GB of RAM and a 3.3 GHz dual-core Intel processor.

Adding to the above, HP does not provide any kind of CPU footprint report. Hence I generally tell people to calculate the heaviness/payload of the LoadRunner process on their own. This can be done by measuring the resource consumption of 1 virtual user and theoretically determining the total resource needed by multiplying that observed consumption by the expected number of concurrent users.

Using the Windows Task Manager (Processes tab) while executing a single Vuser will help you determine the memory consumed by a single ‘MDRV’ process, which, multiplied by the expected concurrent virtual user count, gives you the total memory required to run a load test. Similarly, to estimate the CPU resource for a certain protocol, I would recommend monitoring and recording a few CPU counters of your load generator through ‘perfmon’ while deriving the baseline result, and multiplying those metrics by the target Vuser count mentioned in the workload profile.

Ideally you should have as many boxes as the theoretical requirement calls for, but as we all know, one of the many challenges a performance tester faces is the lack of sufficient load generator machines. Hence, to save time and to avoid further pestering from your boss to somehow resolve the issue, try utilizing your load generator boxes as efficiently as possible:
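
The multiplication described above can be sketched as follows. Everything numeric here is an assumption for illustration: the per-Vuser memory and CPU figures would come from your own single-Vuser baseline, the 75% usable-RAM headroom is a hypothetical safety margin, and the model assumes resource use scales linearly with the Vuser count, which is only a theoretical first estimate.

```python
import math


def generators_needed(mem_per_vuser_mb, cpu_per_vuser_pct, target_vusers,
                      generator_ram_mb, ram_headroom=0.75, cpu_limit_pct=80.0):
    """Estimate how many load generators are needed, assuming linear scaling."""
    total_mem_mb = mem_per_vuser_mb * target_vusers
    total_cpu_pct = cpu_per_vuser_pct * target_vusers
    # Only part of the RAM is usable for Vusers; the OS needs the rest.
    by_memory = math.ceil(total_mem_mb / (generator_ram_mb * ram_headroom))
    # Keep each generator below the CPU watermark.
    by_cpu = math.ceil(total_cpu_pct / cpu_limit_pct)
    return max(1, by_memory, by_cpu)


# Hypothetical baseline: 5 MB and 0.5% CPU per Vuser, 300 Vusers, 3 GB boxes.
print(generators_needed(5, 0.5, 300, 3072))  # 2
```

In this made-up case memory alone would fit on one box, but keeping CPU under the watermark calls for two, which is why both counters are worth baselining.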

1. Log only when an error occurs.
2. Precisely calculate think time between transactions. This helps exert less load on the load generators. Strictly follow real user screen pause patterns (note this point as very important!).
3. Uninstall any daemon applications running on the load generators which are not part of the AUT.
4. Avoid excessive use of the “Show Vuser” option during test execution.
5. Avoid declaring variables with a large size; this directly helps conserve overall memory. Consult the development team if you're unable to estimate the right size (this is a major concern in Java Vuser and VB Vuser scripts).
6. Never over-iterate a script.

Performance Testing with WANem (Network Emulation Tool).

Why use WAN Emulation using WANem?

WANem is free software developed by TCS. It can be modified or redistributed under the terms of the GNU General Public License version 2, as published by the Free Software Foundation.

Rigorous performance testing and optimization is a critical factor in the successful delivery of any business application. Yet frequently the performance of deployed applications doesn’t live up to business requirements or end-user expectations. One reason behind these unpleasant “surprises” is the fact that most performance staging labs only test the application with local users (in a local area network (LAN) environment), while the fully deployed application is used by a variety of end-users, some local and others accessing the application remotely over different network links. The different network conditions that exist between end-users and application servers have a tremendous effect on the overall performance that remote end-users experience. This deviation from performance test results obtained in the lab is further exacerbated for N-Tier applications where each tier may reside in a different geographical location with its own unique set of network conditions.

Network Emulation tools can be used to accurately replicate existing or projected conditions in the distributed production environment – including infrastructure, application traffic and the distribution of end-users.

To sum up at a high level, the benefits of using WAN emulation tools are:
- WANem itself is free software
- mitigate application deployment risk
- find errors before deployment
- test new WAN topologies and technologies
- emulate the remote user experience
- stress models of the network to find vulnerabilities

2.1 System Requirements:
At minimum, an i386-based PC with 1 CPU, 512 MB RAM and 1 network interface card of 100 Mbps (preferably 1 Gbps).

2.2 Applications supported by WANem (but not limited to)
1. Web applications
2. Video streaming
3. Interactive applications, e.g. telnet-like applications

2.3 Setting up WANem
WANem is distributed in the form of a bootable CD with the Linux Knoppix O/S. This CD comes with WANem preinstalled. There are no installation steps. When an i386 architecture based PC is booted with the CD, WANem is ready for use.

*Refer to the WANem user guide for more information on launching and using WANem.

Drawbacks of WANem:

1. Multiple NICs are needed when distributed load generation is required.
2. A dedicated PC is required for the WANem setup.
3. Network address translation needs to be done when a client application is running on a different network.
4. Cannot be integrated with LoadRunner.
5. Installation is relatively difficult.

Conclusion

WANem is free software and hence cost effective. The advanced network options present in WANem keep the trust and usability of the tool on the same level as any other contemporary network emulation tool.

How to learn those rarely used protocols of LoadRunner?

Recently a newbie to Loadrunner asked me the following question:

“I have seen you somewhere in the loadrunner groups. I like to know if you
are aware of some sites where we can have a look at with different protocols
that uses.

For example, I like to see that the applications that uses different
protocol.

Web services
Oracle NCA ,
SAP GUI
DB

Can we use any of the client S/W like, SQL query analyzer, SQL developer,
TOAD any of such tools to connet to DB and work on DB protocol with LR?
I have good knowledge & experience on Load Runner with Web protocol.

But Like to know more about other protocols.”

My answer to the above question in a general sense:
Let me tell you one thing frankly. In my view LoadRunner is a small tool, but the range of technologies it supports is vast. I haven’t come across any website comprehensive enough to help people deal with rarely used protocols like Oracle 2-tier or Tuxedo, for instance. When people ask me whether it is possible to connect directly to the database and work on it using Toad or a SQL query analyzer, they put a smile on my face. People, you can certainly connect to the database using the ODBC protocol or the Oracle 2-tier protocol, but you do not need any SQL query analyzer or Toad to do that. VuGen alone is sufficient.

If you really want to learn the different protocols of LoadRunner, start by learning the technologies on which these protocols work. For example, if it is SAP GUI, start googling information on SAP GUI and keep your basic understanding of it good enough that if you are challenged to work on it tomorrow, you will at least have a fair idea of where to start.

Most LR professionals, I would say, are self-taught. You need the aptitude and drive to perform R&D on something you feel is challenging you. In this particular case I would suggest you install Oracle DB on your machine and try connecting to the DB using the ODBC protocol of LR; once you establish a connection, write queries to fetch and insert data. You would love the way you learn new things by this method. Believe me, it is a tried and tested method of learning LoadRunner. It just works.

When should we use .NET Vuser protocol in LoadRunner?

NOV 21ST

Basically, when programmers go for .NET/VB/Java applet driven development, the application most often does not use open standards like HTTP, and hence LoadRunner's recording protocols fail to capture any of the messages sent through the application layer.

Developers generally use custom messaging formats when they develop an application using .NET/VB/Java applets. This means that when the client and server are speaking with each other, none of the LoadRunner protocols can identify the messages, because the format hasn’t been predefined.

Solution: since the conventional record-replay option is ruled out, you can use decompilers to make the DLLs of your application more readable, and you should also consider getting help from the developers to build workflow scripts in .NET Vuser or VB Vuser using the source code of the application itself. This task can be challenging if the developers have intentionally shrouded/obfuscated the code for security purposes, hence I strongly suggest you get the developers’ support during script preparation.
This tool can come in handy for your application: http://www.remotesoft.com/salamander/

Worst case, if nothing works, then to save time you may choose the following approach, but only if your workload scenario is lightweight, with 10-50 virtual users: use the RDP (analog recording) protocol, which records keyboard inputs and mouse clicks. You may need higher-end load generator hardware if you're targeting a higher number of virtual users.

You can also try the Winsock protocol. You may be lucky if the application buffer passing through the sockets is small, in which case your Winsock script might be more readable and relatively more feasible than any of the above-mentioned methods (again, it is subjective).