Dependencies
- C17
- Access to Posix Library
Summary
- Recursively search through directories given a root directory, and wrap files within each directory according to specific guidelines
- Directory recrusion is done with multi threading, which is sepertaed by the “worker” threads that work on wrapping the files
Design
Data structures
- Two queues are globally maintained for storing files to be normalized and directories to be traversed. This uses a queue struct, which contain a mutex and cond for synchronizing, a value to tell whether the queue has been closed, and a start and end node.
- “Nodes” are a linked list structure which contain a pointer to a pathName struct and a pointer to the next node in the list.
- For ease of composing filenames when they need to be used, the data stored in our queues’ nodes are pathName structs.
- the pathName struct contains two strings which represent the “prefix” of a file’s path and the actual file’s name. For example, a file named “example.txt” inside of the directory “dir” will have a prefix of “dir” and a fileName of “example.txt”. Our program has helper functions to compose an “input name” and an “output name” from a given pathName struct. This basically adds the necessary slashes and combines these two strings into a single path to the file which the struct represents for use in opening files or directories. The only difference between an “input name” and an “output name” is that the filename has “wrap.” added before it.
- The flag struct is most easily described as an integer with a mutex. This is used for globally tracking the number of active directory threads, as well as globally tracking the status which the program should exit with.
Main method: Preparing the fileQueue and dirQueue for traversal
- We first ensure the inputs within our main method have the proper number for non-recursive or recursive mode. The arguments to the program are also checked for validity throughout, such as for “-rM,N” where M and N need to be positive integers.
- The program determines whether it is running in recursive mode or non-recursive mode by checking for the first argument to start with “-r”
- This is later used to ensure the recursive and non-recursive modes are not used incorrectly, and for running different parts of main depending on the application
- After the lineLength argument is tested for validity and converted from string to integer, the rest of the arguments are assumed to be files and directories (if they are not existing files/dirs, the program will detect it later on and report the issue). We initialize the global data structures (queues and flags) and then iteratively enqueue these files and directories to the fileQueue and dirQueue, respectively.
- If no files or directories are given as arguments and the program is not in recursive mode, it reads from STDIN and outputs to STDOUT, after which it exits with whatever status normalize() returns. If the program is in recursive mode and no files or directories are given, it reports the issue and exits with EXIT_FAILURE.
- If the program is being run in recursive mode, the wrapper threads and directory threads are created to deal with the enqueued files and directories in their respective queues. The worker threads are immediately joined to the main thread once they terminate, the rest of dynamically allocated data is freed, and the program exits with whatever status is stored in the global exitFlag.
- If the program is being run in non-recursive mode, main reuses the same basic machinery as recursive mode to process files. First, main dequeues all of the files given as arguments and outputs their wrapped version to STDOUT. Then, it dequeues all directory arguments and enqueues every file contained in only the top-level directory (and not in subdirectories). Lastly, it dequeues all remaining files in the file queue (those which were contained in directories) and outputs the wrapped version to “wrap.filename” wherever the original file is located. Main deallocates all dynamically allocated data and exits with the globally stored exitFlag.
Working with Threads
- In our main method, we first initialize the number of threads for the file worker method, which takes care of normalizing the files within a root directory and it’s subdirectories, and the number of threads for the directory worker method, which takes care of traversing the directories and sending files to the fileworker for it to work on.
- The way these two workers collaborate/communicate is with use of the fileQueue. Directory threads add to it and wrapper threads take from it.
- The directory workers contains a directory queue, which is locked so only one thread can enqueue and dequeue at a time.
- Whenever the directory worker encounters a directory file, it enqueue’s it onto the directory queue, for another thread to pick it up. If it encounters a regular file, it enqueues it onto the file queue.
- The file worker will then pick up any files from the fileQueue, one thread at a time, and begin to call the normalize method and place it’s output in the correct spot.
- The fileWorker threads will repeat this process until the fileQueue is empty AND the directory queue is closed, meaning that there are no more files coming into the queue.
- The dirQueue threads will repeat until the directoryQueue is empty AND there are no active directory threads currently traversing a directory, signaling that there are no more subdirectories/files to be read.
- In order to do this, we created a flag struct that keeps track of the number of current threads running. This is done by having an integer variable within that struct, which is incremented and decremented at the beginning and end respectively of the main loop of directory worker.
- This struct is initiated globally to be accessed by all threads.
- Therefore, this struct also contains a mutex lock so that only one thread can modify this value at a given time.
- Any given thread should initialize this variable when enqueuing a directory, and decrement when done reading through the directory it picked up.
- The locking mechanism for the queue struct lies within the queue’s enqueue and dequeue method. Since there were multiple parts of the directory and file worker that could be worked on asynchronous, it was better to leave the mutual exclusion logic within the enqueue and dequeue methods since they were responsible for accessing and modifying the queues in order to minimize lock and unlock calls.
- The amount of active directory threads is kept track by a global struct of type flag that lives within the directory worker method.
- The threads (both file and directory) are created and joined within the main method and do not return any values
- In order to catch the error flag that can be encountered in numerous different locations of our program, we create another struct of type flag globally called errorFlag.
- The integer value of this struct represents the error condition encountered within our program at any given time.
- Since any thread could encounter an error at any given type, its lock is also used to ensure only one thread can update the error flag’s condition
- Some sources of errorFlag being updated are: When normalize does not return successful, if a thread encounters any type of error opening a file, directory, or any general error condition that may prevent the program from completing such as malloc return types, etc.
Calling Normalize Method
- The normalize method is where our word wrapping algorithm exists. It contains three parameters as follows:
- int inputFD (the file descriptor for the input file)
- int lineLength (the size that our lines should be wrapped to)
- int outputID (the file descriptor of our output file)
- In the scenario where only the line length is provided to our program, we will make the following call to normalize: “normalize(0, argv[1], 1);”. The inputFD parameter is set to 0, as this refers to the file descriptor for standard input as a source. The line length parameter is set to be the second argument in our argv list (argv[1]), since argv[0] contains the name of our program. Finally, outputFD is set to 1, since this is the file descriptor for standard output.
- In the scenario where there are two arguments given to the program and the second argument is a regular file, we will make the following call to normalize: “normalize(fd, argv[1], 1);”. Before calling normalize(), open is called on the input file given in argv[2] and assigned to fd. This scenario still requires writing to standard output, so outputFD is set to 1.
- In the scenario, there are two arguments provided and the second argument is a directory file, we will make the following call to normalize: “normalize(fd, argv[1], outputFD)”. This is done serially within a while loop with each file in the directory provided by argv[2]. First, we create a directory pointer and assign it to dr using opendr. Then, while there are more directory entries to be read, use open on each filename if it does not start with “.” or “wrap.” (if it does, move on to the next filename). The file is then assigned a file descriptor fd using open, and then passed to the normalize function one at a time as above. The outputFD descriptor is assigned to a file with the same filename as the input file but with the prefix “wrap.”, using a helper function getOutputName(filename). The new output filename returned from getOutputName is passed to the open function using the O_CREATE flag (for the situation where it has not yet been made) to get the file descriptor passed to outputFD.
Normalize Method
- The general structure of normalize is to:
- Read the file’s contents to a buffer, BUFFER_SIZE bytes at a time
- Parse the buffer character-by-character and recognize special characters such as white space and newlines
- Store the non-whitespace characters in a write buffer called currWord (to allow storage of words which span calls of read())
- Each time a whitespace character is reached (i.e. the end of a “word”),
- Start a new paragraph if needed by writing two consecutive newlines to the output file (the need for this is kept track of using a variable newPG)
- For words not at the beginning of a new line, write a single space to output
- For words which will not fit at the end of the current output file line, write a single newline to output
- Finally, write the output buffer currWord to the output file
- Repeat until read returns no more bytes to process
- If there is a final word left in the currWord buffer, such as a file which ends in a non-whitespace character, write it to output with the necessary whitespace
- Return with EXIT_FAILURE if a word was too long to fit in a line or the input file was empty, return with EXIT_SUCCESS otherwise
- Our read buffer is set to be a default of 16, defined as a macro with the name BUFFER_SIZE, but this works with any size buffer
- Any whitespace written to output is always written immediately before a new word is written.
- This was a deliberate choice which makes it easier to prevent spaces without words after them or empty paragraphs from being written to the output file.
- Although the program moves through the buffer character-by-character, the necessary trackers are made to note the current spot in a given line in the output file, consecutive newline characters, or when a new paragraph needs to be started before the next write of a word. All of these are used to ensure text is properly normalized to the lineLength parameter and has spaces only between words and exclusively 1-2 newline characters between lines.
- We start off by first initializing/declaring variables such as currWord (the write buffer), trackers for important characters in input and current location in output, and a buffer to contain what read has gotten from input
- The currWord buffer is initially the size of line length + 1 (extra \0 character at the end to allow for strlen() functionality).
- currWord’s size is doubled to accomodate consecutive non-whitespace characters exceeding the length of a line
- therefore, size of currWord later denotes whether there was a word that exceeds the line length
- The currWord buffer is initially the size of line length + 1 (extra \0 character at the end to allow for strlen() functionality).
- An initial call to read is done first to see if there are any bytes to be read from input, which is then stored in the buffer
- We then enter our main loop that continues as long as read gets back more than 0 bytes
- The buffer is then iterated over one at a time, and each character is checked to be either a non-whitespace character or a whitespace character (with newlines as a special subset of whitespace)
- In the situation where we have a non-whitespace character, we will add this value to currWord
- In the situation where we have a whitespace character, we will write the contents of currWord to the file, preceded by a space character
- In the situation where we have a newline character, we will note that we have seen a new line character, and if the next character in the buffer/future buffer is also a new line, we recognize this as a paragraph scenario with the tracker newPG. Otherwise, we will ignore this new line character
- When there is another word to be written to output, newPG will signify to start a new paragraph before writing this new word.