
Based on the architecture and the purpose of the app, without even looking, I can say fairly conclusively that it is the database. A 10-hour processing backlog signals a database bottleneck. We aren't doing intense computation (such as modeling, simulations, or rendering video/audio), so that rules that out. We are simply matching records in a database. I have seen this in similar apps: database optimizations and query tuning generally yielded the biggest performance improvements.
But sure, let's put an APM tool on it, like New Relic or AppDynamics, to diagnose where it is, if there isn't one already. That will tell us conclusively. AppDynamics (now a Cisco product) is my personal favorite, as I used to speak at their conferences (AppJam) and was one of their early customers, but New Relic also works well, and I've used both, so I can attest to their effectiveness. If we already have an APM, that should be the first place to look, so we don't keep guessing.
73
Ria, N2RJ
On Mon, Nov 30, 2020 at 5:36 PM Mickey Baker <fishflorida@gmail.com> wrote:
Since you're going deep, the first step is to measure the resources on whatever platform it is today. I believe I've heard that it isn't virtualized, and that the database is (was?) an ancient database from SAP's R2 era that is no longer supported...
I'd like to see the resource starvation that is causing 10-hour turnaround times. In December 2012 I worked for the storage vendor where LoTW was hosted and communicated with Keane: the application had saturated the storage system, and he had ordered new servers and solid-state storage, even though the storage system was BROKEN, causing I/O delays and several days' loss of uploads. It was not under contract for repair, but I was authorized to offer a no-charge repair. He didn't take me up on it.
At the time, I told him that I thought that this was a temporary fix and that he needed to take a closer look at the architecture so that all these lookups were using a more efficient algorithm.
I'm not prepared to judge which resource is the constraint. I would suggest some simple measurements, taken now, while we're seeing this, so that we can work from valid assumptions.
This is not a predictable "big shop" environment like a bank or a worldwide television network. We need to grab whatever data we can, when we can. If this isn't MySQL, Oracle, or MS SQL Server, we won't be able to simply purchase database-server capacity, but an early step would be to get everything onto one of those databases.
... and I've seen issues like "scratchpad" or journaling IO cause delays where fixing the DB server might not help.
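To make the "simple measurements" concrete, here is a minimal sketch, assuming Python and the psutil library are available on the LoTW host (the sampling interval, file name, and fields are illustrative assumptions, not a prescription), that samples CPU, memory, and disk I/O so we can see whether the box is I/O-bound during a backlog:

```python
# Hypothetical resource sampler - illustrative only; interval and fields are assumptions.
import csv
import time
from datetime import datetime

import psutil  # pip install psutil

INTERVAL_SECONDS = 5  # assumed sampling interval

with open("lotw_resource_samples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_pct", "mem_pct",
                     "disk_read_mb", "disk_write_mb"])
    prev = psutil.disk_io_counters()
    while True:
        time.sleep(INTERVAL_SECONDS)
        cur = psutil.disk_io_counters()
        writer.writerow([
            datetime.now().isoformat(timespec="seconds"),
            psutil.cpu_percent(),                       # CPU busy % since last call
            psutil.virtual_memory().percent,            # memory in use %
            (cur.read_bytes - prev.read_bytes) / 1e6,   # MB read during the interval
            (cur.write_bytes - prev.write_bytes) / 1e6, # MB written during the interval
        ])
        f.flush()
        prev = cur
```

If disk writes spike while CPU stays modest during the post-contest backlog, that points at storage/database I/O rather than compute.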
What usually happens in small shops is that performance demand outstrips capacity and requests queue up. As the queues grow, response times grow with them (Little's Law), upstream requests time out, and then things fail.
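As a back-of-the-envelope illustration of Little's Law (L = λW), with made-up numbers rather than actual LoTW measurements:

```python
# Little's Law back-of-the-envelope: L = lambda * W.
# All numbers below are made up for illustration.
arrival_rate = 50.0   # uploads arriving per minute during a contest surge (assumed)
service_time = 2.0    # minutes to fully process one upload at current capacity (assumed)

work_in_system = arrival_rate * service_time  # average uploads in flight or queued
print(f"~{work_in_system:.0f} uploads in the system on average")

# If capacity only clears 40 uploads/minute, the other 10/minute pile up:
clear_rate = 40.0
backlog_per_hour = (arrival_rate - clear_rate) * 60
print(f"backlog grows by ~{backlog_per_hour:.0f} uploads per hour until the surge ends")
```

The exact numbers don't matter; the point is that once arrivals exceed service capacity, the queue, and the turnaround time with it, grows without bound until the surge ends.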
Where?
Can you send me or point me to the IT Committee reports?
Thanks,
Mickey Baker, N4MB
Palm Beach Gardens, FL
“The servant-leader is servant first… It begins with the natural feeling that one wants to serve, to serve first. Then conscious choice brings one to aspire to lead.” Robert K. Greenleaf
On Mon, Nov 30, 2020 at 4:15 PM rjairam@gmail.com <rjairam@gmail.com> wrote:
When I did this sort of thing (elections news coverage) we scaled up the day before and set thresholds for automatic dynamic scaling. Right now I actually do things like this for Black Friday and Cyber Monday, when billions of credit card swipes take place, but using private cloud.
However we needn’t go that far, YET.
An easy win would be to put the database on a managed database service like AWS RDS. The web server isn’t usually the bottleneck. The database is. The application server isn’t even to blame because in our app we aren’t doing intensive compute. We are mostly doing database inserts and selects.
A managed database handles capacity and scaling completely transparently, and you don't even have to think about it. Amazon just scales the database using its theoretically infinite resources.
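As a rough sketch of what that could look like, assuming the database had already been migrated to an RDS-supported engine, and using boto3 with placeholder identifiers, instance classes, and region:

```python
# Hypothetical sketch: bump an RDS instance class before a contest weekend
# and shrink it afterwards. Identifiers, classes, and region are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

def resize(instance_id: str, instance_class: str) -> None:
    """Request a new instance class; RDS handles the underlying hardware."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,  # apply now rather than at the next maintenance window
    )

# Before CQ WW: scale up.  Once the post-contest queue drains: scale back down.
resize("lotw-db", "db.r5.4xlarge")   # placeholder identifier and class
# resize("lotw-db", "db.r5.xlarge")  # later, after the backlog clears
```

The point is that a resize becomes an API call against managed infrastructure rather than a hardware purchase.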
Hopefully our attitude about cloud computing can change. When I mentioned cloud to Mike Keane in 2019 he said that we shouldn’t bother because we had “sunk cost” into things like HVAC in HQ. Dave AA6YQ also said we should be investing in programmers and not hardware/cloud.
I did recommend a cloud database for LoTW in one of the IT infra committee reports. Hopefully we can at least do that. I can almost guarantee that it will alleviate 90% of the post-contest crashiness.
73
Ria N2RJ
On Mon, Nov 30, 2020 at 1:46 PM Mickey Baker <fishflorida@gmail.com> wrote:
One of the issues David Minster hit on during the A&F meeting corresponded to some posts I made years ago in the LoTW online group.
There is no reason that this system should NOT be running on a cloud-based virtualized server. Even if portions of the design are single-threaded, the ability to add processing power (memory, CPU, I/O bandwidth) on the fly, and in anticipation of these peak workloads, is essential and CAN BE DONE QUICKLY without a total system re-write.
System capacity could then likely be bumped up 4x-10x around contest weekends without purchasing gear.
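One hedged sketch of how the "in anticipation of these peak workloads" part could be automated, assuming (hypothetically) the application tier ran in an EC2 Auto Scaling group; the group name, sizes, and schedules below are placeholders, not a design:

```python
# Hypothetical sketch: pre-scale an application tier for a contest weekend.
# Group name, sizes, and cron schedules are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale out Friday evening before the contest...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="lotw-app",        # placeholder group name
    ScheduledActionName="contest-scale-up",
    Recurrence="0 22 * * FRI",              # Fridays 22:00 UTC (illustrative)
    MinSize=4,
    MaxSize=10,
    DesiredCapacity=8,
)

# ...and back in on Tuesday once the post-contest backlog has drained.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="lotw-app",
    ScheduledActionName="contest-scale-down",
    Recurrence="0 6 * * TUE",               # illustrative
    MinSize=1,
    MaxSize=2,
    DesiredCapacity=1,
)
```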
Sticking it on a physical server in a data center may seem prudent, but it is not smart with a workload that varies like this.
Corporations do this routinely with variable workloads, such as month-end/year-end closings, trial balances, etc.
I have the skills needed to design a plan to do this and will offer, again, to DONATE time to analyze our current state and develop a confidential plan.
One of the reasons I ran for this position is that the prior director was clueless about the League's IT challenges. I know this is one of many, but it is "low-hanging fruit" and can be addressed before a design for LoTW 2.0 can be agreed upon.
73,
Mickey N4MB
On Mon, Nov 30, 2020 at 1:30 PM Roderick, Rick, K5UR via arrl-odv <arrl-odv@reflector.arrl.org> wrote:
It's backlogged due to all of the downloads from the CQWW contest this past weekend. It was running 10 hours behind earlier this morning.
Rick
-----Original Message-----
From: Minster, David NA2AA (CEO) <dminster@arrl.org>
To: Tharp, Mark, KB7HDX (VD, NW) <kb7hdx@gmail.com>; arrl-odv <arrl-odv@arrl.org>
Sent: Mon, Nov 30, 2020 12:26 pm
Subject: [arrl-odv:31410] Re: LOTW Broke?
I just jumped on to see if I could replicate. Went through ARRL network and cellular network. Went into logs and awards. No errors.
I will forward it on to the team in case this is the tip of the iceberg.
David
From: arrl-odv <arrl-odv-bounces@reflector.arrl.org> On Behalf Of Mark J Tharp
Sent: Monday, November 30, 2020 1:22 PM
To: arrl-odv <arrl-odv@arrl.org>
Subject: [arrl-odv:31409] LOTW Broke?
Third time this morning I have seen this error.
Is it just me?
Mark, HDX
--
“Ends and beginnings—there are no such things. There are only middles.” Robert Frost
_______________________________________________
arrl-odv mailing list
arrl-odv@reflector.arrl.org
https://reflector.arrl.org/mailman/listinfo/arrl-odv