FIX: Files of traversed directories are added to transfer queue in reversed order

Moderator: Project members

LauXjpn
500 Command not understood
Posts: 3
Joined: 2016-12-11 19:37

FIX: Files of traversed directories are added to transfer queue in reversed order

#1 Post by LauXjpn » 2016-12-11 20:04

This issue has been reported multiple times over the last 10 years. It can be found as trac feature request #4416, though there are more entries about it.

The fix is quite simple:

Code:

src\interface\remote_recursive_operation.cpp:
302: -    for (int i = pDirectoryListing->GetCount() - 1; i >= 0; --i) {
302: +    for (int i = 0; i < pDirectoryListing->GetCount(); ++i) {
There is no reason to add the files to the transfer queue in reverse order, so this behavior should be changed in favor of the server-returned file order, which is what users expect.
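For illustration, here is a small standalone sketch (plain std::vector and std::string, not FileZilla's actual listing types) showing how the loop direction alone decides whether the queue ends up in the server-returned order or reversed:

Code:

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Pretend this is the order the server returned the listing in.
    std::vector<std::string> listing = {"a.txt", "b.txt", "c.txt"};
    std::vector<std::string> queue;

    // Current behavior: reverse iteration -> queue holds c.txt, b.txt, a.txt
    for (int i = static_cast<int>(listing.size()) - 1; i >= 0; --i)
        queue.push_back(listing[i]);

    queue.clear();

    // Proposed behavior: forward iteration -> queue holds a.txt, b.txt, c.txt
    for (std::size_t i = 0; i < listing.size(); ++i)
        queue.push_back(listing[i]);

    for (auto const& name : queue)
        std::cout << name << '\n';
}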
I would have contributed the patch directly via SVN, but I could not find any information about the project-specific process for doing that and ultimately failed at the commit login prompt. :)

botg
Site Admin
Posts: 35555
Joined: 2004-02-23 20:49
First name: Tim
Last name: Kosse

Re: FIX: Files of traversed directories are added to transfer queue in reversed order

#2 Post by botg » 2016-12-11 20:16

The current order has been chosen for performance.

Have you measured the performance impact of your change? Try transferring complex directory hierarchies with millions of empty files in millions of directories.

LauXjpn
500 Command not understood
Posts: 3
Joined: 2016-12-11 19:37

Re: FIX: Files of traversed directories are added to transfer queue in reversed order

#3 Post by LauXjpn » 2016-12-11 23:58

Of course I have not. :)

I am running the numbers now. I created a file system hierarchy with each folder containing 10 empty files and 10 sub folders. The hierarchy is 6 levels deep and thus contains 1,000,000 directories and 10,000,000 empty files in total. Each run will take about 1000 minutes, so this will take a little while to complete.

I have to admit, though, that I don't think the results of this benchmark will show any significant performance difference. That is because it does not matter (performance-wise) in which order I queue or even download all files of a given directory. All commands and round trips are still the same, just in a different order. Also, having large, small or even empty files does not make any difference, nor does the number of directories or files.

What am I missing?

botg
Site Admin
Posts: 35555
Joined: 2004-02-23 20:49
First name: Tim
Last name: Kosse

Re: FIX: Files of traversed directories are added to transfer queue in reversed order

#4 Post by botg » 2016-12-12 07:26

Order matters. For example, appending an item to the end of a vector is amortized O(1), whereas inserting an item at the front is O(n); inserting n items that way costs O(n) versus O(n²) in total. The same goes for removal of items.
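As a rough standalone illustration of that cost difference (a generic std::vector, not FileZilla's internal containers):

Code:

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    constexpr int n = 100000;  // large enough to show the gap, small enough to finish quickly

    auto time_it = [](auto&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    };

    // Appending n items: each push_back is amortized O(1), so O(n) in total.
    double append = time_it([&] {
        std::vector<int> v;
        for (int i = 0; i < n; ++i) v.push_back(i);
    });

    // Prepending n items: each insert at the front shifts all existing
    // elements, so O(n) per insert and O(n^2) in total.
    double prepend = time_it([&] {
        std::vector<int> v;
        for (int i = 0; i < n; ++i) v.insert(v.begin(), i);
    });

    std::cout << "append:  " << append  << " s\n"
              << "prepend: " << prepend << " s\n";
}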

Please also test deletion of directories. It's an operation that updates cached listings which is sensitive to order.
LauXjpn wrote:
I am running the numbers now. I created a file system hierarchy with each folder containing 10 empty files and 10 sub folders. The hierarchy is 6 levels deep and thus contains 1,000,000 directories and 10,000,000 empty files in total. Each run will take about 1000 minutes, so this will take a little while to complete.
The number of entries in an individual directory is quite low.
LauXjpn wrote:
What am I missing?
What is the correct order? The directory listings are unsorted; they are in whatever order the server sends them to the client.

LauXjpn
500 Command not understood
Posts: 3
Joined: 2016-12-11 19:37

Re: FIX: Files of traversed directories are added to transfer queue in reversed order

#5 Post by LauXjpn » 2016-12-12 16:20

Alright, I tested both versions in two realistic environments: one from a web server to a web server (fast connection), and another from my laptop to a web server. In each test, the results of the reverse-sorted (original) vs. the unsorted (proposed fix) versions were basically identical: queuing and overall download time were each within a couple of seconds of each other over a runtime of about 5 hours (server-server) / 11 hours (client-server). So I guess I am measuring the wrong thing here.

So what exactly should I be measuring, and where should I expect a significantly different result?

What to test:
  1. the time it takes to queue all items to the transfer queue
  2. the time it takes to queue and download all items
How to test:
  a. counting network time
  b. ignoring network time
Because I already tested 1a and 2a, I guess it has to be one of the ignoring-network-time variants (1b or 2b), unless the test method I used (the file structure) was not good enough.

Since millions of directories with millions of files will take a long time to test, what is the minimal concrete environment that should show a significantly different result? And how much performance impact am I looking for?

Reading through your answers so far, I guess the performance scenario you are talking about is 1b (or 1a with the least possible amount of network time), with many files in fewer directories being more important than fewer files in more directories, in order to measure the actual time the code itself takes to run.
Is this the scenario I should test for?

Despite the testing here, which I am happy to do, I have to admit that I am not sure I see any real-world scenario being addressed so far. Performance optimization in general is of course always a goal for any application, but regarding the millions of empty files in millions of directories (with tests still remaining, of course), I assume we are talking about the icing on the cake here and not something a lot of users will notice at all.

On the other hand, the real-world scenario of adding a directory (or directory structure) containing large files to the queue and needing them downloaded in some particular order, so that some of them can be used while others are still downloading, is a common use case. And adding large quantities of large files manually to the queue has a cost (in user time) that can easily be measured in seconds to minutes, depending on the complexity of the file structure: it is spent browsing directories, manually creating local directories to mirror remote ones, sorting files in the remote directory and then adding them to the queue.

Sorting the files of a traversed directory in the order the current directory view is sorted is without doubt the behavior users actually expect when recursively adding files to the queue (which would of course have a minor performance impact, because of the sorting).
I know that adding files to the queue in the order they are returned by the server is not the same as explicitly sorting them and is in theory arbitrary. But practically speaking, the file order returned by FTP servers is usually ascending, which also happens to be the sorting order most commonly expected when recursively adding files to the queue.

So by changing just one line of code, this most common use case can be supported with little to no impact on the rest of the application and on performance (test results pending).
In the long run, explicit sorting should be implemented (there could be a preference for the method to use: current order (default), server default, performance optimized; though I would consider this unnecessary), but for now, using the server-returned order would make a huge difference and is the simplest possible solution.
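As a rough sketch of what explicit sorting could look like (Entry and QueueSorted are placeholder names, not FileZilla's actual types), sorting a copy of the listing by name before enqueuing:

Code:

#include <algorithm>
#include <string>
#include <vector>

// Placeholder for one entry of a directory listing.
struct Entry {
    std::string name;
    bool is_dir;
};

// Sorts a copy of the listing by name, then walks it in that order.
// Working on a copy keeps the cached (server-ordered) listing untouched.
void QueueSorted(std::vector<Entry> listing) {
    std::sort(listing.begin(), listing.end(),
              [](Entry const& a, Entry const& b) { return a.name < b.name; });

    for (auto const& e : listing) {
        if (!e.is_dir) {
            // enqueue e.name for transfer (stand-in for the real queue call)
        }
    }
}

int main() {
    QueueSorted({{"c.txt", false}, {"a.txt", false}, {"sub", true}});
}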
botg wrote:
Please also test deletion of directories. It's an operation that updates cached listings which is sensitive to order.
I would postpone that particular test until the performance impact of queuing for download is clear, so as not to complicate things further.
